Spam Analysis and Reputation Project

1. Server Based Approach: Email Source Analysis

Design

In the server-based approach, we created a mail server (a honeypot) for the purpose of attracting spam. Whenever a new email is received at the server, the parser parses the email header and stores the required parameters in a database. The 'message_id' parameter, retrieved from the email header, serves as the primary key that uniquely identifies each entry in the database. The controller passes the message_id to each module, and each module uses it to retrieve from the database the data it needs for its tests and analysis.
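The parsing step described above can be sketched roughly as follows. This is a minimal illustration, not the project's parser: the class name `HeaderParser` and its methods are hypothetical, and the sketch assumes the raw message is available as a string and that only the first occurrence of each header field is needed.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper for illustration only; not the project's actual parser.
class HeaderParser {

    // Reads the header block of a raw message into a map of
    // lower-cased field names to values. Parsing stops at the first
    // blank line, which separates the header from the body.
    static Map<String, String> parseHeaders(String rawMessage) {
        Map<String, String> headers = new LinkedHashMap<>();
        String currentName = null;
        for (String line : rawMessage.split("\r?\n")) {
            if (line.isEmpty()) {
                break; // end of header block
            }
            if (line.charAt(0) == ' ' || line.charAt(0) == '\t') {
                // Folded continuation line: append to the field being recorded.
                if (currentName != null) {
                    headers.merge(currentName, line.trim(), (a, b) -> a + " " + b);
                }
                continue;
            }
            int colon = line.indexOf(':');
            if (colon <= 0) {
                currentName = null; // malformed line, ignore it
                continue;
            }
            String name = line.substring(0, colon).trim().toLowerCase(Locale.ROOT);
            if (headers.containsKey(name)) {
                currentName = null; // assumption: keep only the first occurrence
                continue;
            }
            headers.put(name, line.substring(colon + 1).trim());
            currentName = name;
        }
        return headers;
    }

    // Returns the Message-ID without its angle brackets, or null if absent.
    // This value would play the role of the primary key described above.
    static String messageId(Map<String, String> headers) {
        String id = headers.get("message-id");
        if (id == null) {
            return null;
        }
        id = id.trim();
        if (id.startsWith("<") && id.endsWith(">")) {
            id = id.substring(1, id.length() - 1);
        }
        return id;
    }
}
```

A caller would pass the raw message text to `HeaderParser.parseHeaders` and store the resulting fields keyed by `HeaderParser.messageId`. A production parser would more likely rely on an established mail library than on hand-written splitting.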

The design of the server-based approach and the Email Source Analysis module can be represented as follows:

Figure 1.1: Server Based Design

As shown in Figure 1.1, whenever a new email is received, the parser passes the message_id of the received message to the Email Source Analysis module through the controller. Using that message_id, the module extracts from the database the sender's address that the parser stored. The friend list consists of all the addresses to which mail has been sent from the mail server; it is populated in the database from the IMAP sent folder. The module queries the friend list and compares the sender's address with every entry. If a match is found, the sender is classified as a friend of the server; otherwise, the sender is classified as a non-friend. The results are stored in the database, where they can be used for further analysis.
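The friend/non-friend decision can be illustrated with a small in-memory sketch. The class `FriendClassifier` and its methods are hypothetical names, the friend list is held in a `Set` rather than queried from the database, and addresses are compared case-insensitively, which is an assumption the text does not state.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical illustration of the friend-list comparison; the real module
// reads the friend list from the database rather than from memory.
class FriendClassifier {

    private final Set<String> friendList = new HashSet<>();

    // Builds the friend list from addresses the server has sent mail to,
    // for example the recipients found in the IMAP sent folder.
    FriendClassifier(Iterable<String> sentFolderRecipients) {
        for (String recipient : sentFolderRecipients) {
            friendList.add(normalize(recipient));
        }
    }

    // Reduces a header value such as "Bob <bob@example.com>" to the bare
    // address and lower-cases it (assumption: case-insensitive matching).
    static String normalize(String headerValue) {
        String s = headerValue.trim();
        int open = s.lastIndexOf('<');
        int close = s.lastIndexOf('>');
        if (open >= 0 && close > open) {
            s = s.substring(open + 1, close);
        }
        return s.trim().toLowerCase(Locale.ROOT);
    }

    // A sender is a friend if its address appears in the friend list.
    boolean isFriend(String senderHeaderValue) {
        return friendList.contains(normalize(senderHeaderValue));
    }
}
```

Holding the friend list in a hash set makes each lookup a constant-time membership test; an equivalent database query would typically filter on the address column instead of comparing against every row in application code.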

Implementation

The Email Source Analysis module is written in Java and uses JDBC to connect to the database. It uses the following database tables:

1. 'TO' is the table that stores the receiver's information retrieved from the message header. It is populated by the parser.

2. 'FROM' is the table that stores the sender's information retrieved from the message header. It is populated by the parser.

3. 'FRIEND_LIST' is the table that stores each address to which a mail has been sent, together with the server from which that mail was sent. It will be populated using IMAP retrieval at regular intervals.

4. 'FRIEND_RESULT' is the table where the results obtained from this test are stored. It contains information about each received mail regarding whether it is from a known sender or not.

The database schema can be found at: http://wiki.cs.columbia.edu:8080/display/spam/DatabaseSchema
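As a rough sketch of how the JDBC access to these tables might look, the following shows a lookup against FRIEND_LIST and an insert into FRIEND_RESULT. The class name `FriendResultDao` and the column names ADDRESS, MESSAGE_ID and IS_FRIEND are assumptions made for illustration; the actual columns are those documented on the schema page, and obtaining the `Connection` (driver, URL, credentials) is left out.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

// Hypothetical data-access sketch. Table names come from the text above;
// column names (ADDRESS, MESSAGE_ID, IS_FRIEND) are assumed, not confirmed.
class FriendResultDao {

    static final String SELECT_FRIEND_SQL =
            "SELECT 1 FROM FRIEND_LIST WHERE ADDRESS = ?";

    static final String INSERT_RESULT_SQL =
            "INSERT INTO FRIEND_RESULT (MESSAGE_ID, IS_FRIEND) VALUES (?, ?)";

    // Returns true if the address has at least one row in FRIEND_LIST.
    static boolean isInFriendList(Connection conn, String address) throws SQLException {
        try (PreparedStatement ps = conn.prepareStatement(SELECT_FRIEND_SQL)) {
            ps.setString(1, address);
            try (ResultSet rs = ps.executeQuery()) {
                return rs.next();
            }
        }
    }

    // Records the classification for one received message.
    static void storeResult(Connection conn, String messageId, boolean friend) throws SQLException {
        try (PreparedStatement ps = conn.prepareStatement(INSERT_RESULT_SQL)) {
            ps.setString(1, messageId);
            ps.setBoolean(2, friend);
            ps.executeUpdate();
        }
    }
}
```

Prepared statements are used here so that addresses taken from untrusted spam headers are bound as parameters rather than concatenated into the SQL text. Note that TO and FROM are reserved words in SQL, so queries against those two tables would generally need the table names quoted in whatever style the database in use requires.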

Problems and Solutions

Because the honeypot was very new, it was unable to attract enough spam. We tried to attract spam to the honeypot by publishing its address at various places on the Internet, such as blogs and personal websites. However, very little spam arrived. To obtain concrete results, we had to run this test on a larger set of messages, so we decided to switch to the standalone approach, which analyzes messages from existing mailboxes. An approach used by Phil Bradley[13] could also be used to publicize the server address.

Next: Standalone Approach


Last updated: 2008-08-19 by Nirav Shah