Spam Analysis and Reputation Project Report: Received Header vs From and Sent Header

By :

Preethi Narayan

Columbia University

Department of Computer Science

New York, NY 10027

pn2156@columbia.edu


Abstract
    

This project gathers statistics about trends in e-mail in order to analyse which characteristics lead an e-mail to be classified as “spam”. It does so by running a number of tests on e-mails from various sources and drawing conclusions from the results. The project is concerned with recognizing the patterns associated with e-mails classified as “spam” or “ham” (non-spam), not with deciding whether a given e-mail is itself spam. There are two basic approaches to deciding whether an e-mail is spam or ham. The first is to examine the body of the message and decide whether it is legitimate. The second is to examine the information carried in the e-mail headers. This project uses the second approach. E-mail headers contain a wide variety of information, which is used here to observe the behavior of both spam and non-spam e-mails. To gather statistics, an application containing the different tests is run on various IMAP e-mail accounts. The statistics are generated from the headers of each e-mail in the account, using tests such as the Blacklist or Friendlist comparison and the Received Header vs From and Sent Header comparison. The application considers three folders for each account: “sent mail”, “inbox” and “spam”, and the results are gathered by running the tests on these folders. The module covered in this report is Received Header vs From and Sent Header. Running the application on a large number of IMAP e-mail accounts helps in deciding whether the tests on the headers are good indicators. The main purpose of the application is to analyse which tests performed on e-mail headers are good indicators for recognising “non-spam” and “spam” messages.

1. Project Overview    

The project is divided into two parts. The first is a server-based approach; the second is a standalone statistics generator that can be run on individual IMAP mailboxes. In the first approach, a server is configured to receive e-mails from various sources. These e-mails are parsed into separate portions, which are stored in a database whose fields correspond to the different portions of the header. The checks are then performed on the e-mails using the information in the database. The problems faced in this approach led to the second approach.

The second approach deals with the development of a standalone statistics generator. This application is used for IMAP enabled e-mail accounts. This involves the retrieval of e-mail messages from the server where the messages are stored and running the tests on them. The results of running these tests are displayed in the graphical user interface. A separate module was developed for each of the tests to be made. These modules were integrated and used in both the approaches. The modules developed for the project are as follows:

Each of these modules operates on a different part of the headers and body of e-mails, and each was implemented by a different member of the team.

The design of the project is briefly described below:

2. Introduction

The modules that I implemented for the server-based approach and the standalone generator are:

The received header in e-mails belongs to the trace fields. The "Received:" field contains a (possibly empty) list of name/value pairs followed by a semicolon and a date-time specification. The first item of each name/value pair is defined by item-name, and the second item is either an address specification, an atom, a domain, or a message-id. The received field was chosen as a classifier because it records the route the e-mail has travelled from its origin. Each time an e-mail reaches a hop, a new received header is added to the list of headers, giving the domain of the current hop and the host from which the e-mail was received.

An example of a message header for an email sent from MrJones@emailprovider.com to MrSmith@gmail.com:

Delivered-To: MrSmith@gmail.com
Received: by 10.36.81.3 with SMTP id e3cs239nzb; Tue, 29 Mar 2005 15:11:47 -0800 (PST)
Return-Path: MrJones@emailprovider.com
Received: from mail.emailprovider.com (mail.emailprovider.com [111.111.11.111]) by mx.gmail.com with SMTP id h19si826631rnb.2005.03.29.15.11.46; Tue, 29 Mar 2005 15:11:47 -0800 (PST)
Message-ID: <20050329231145.62086.mail@mail.emailprovider.com>
Received: from [11.11.111.111] by mail.emailprovider.com via HTTP; Tue, 29 Mar 2005 15:11:45 PST
Date: Tue, 29 Mar 2005 15:11:45 -0800 (PST)
From: Mr Jones
Subject: Hello
To: Mr Smith

In the example, headers are added to the message three times:

1. When Mr. Jones composes the email
Date: Tue, 29 Mar 2005 15:11:45 -0800 (PST)
From: Mr Jones
Subject: Hello
To: Mr Smith

2. When the email is sent through the servers of Mr. Jones' email provider, mail.emailprovider.com
Message-ID: <20050329231145.62086.mail@mail.emailprovider.com>
Received: from [11.11.111.111] by mail.emailprovider.com via HTTP; Tue, 29 Mar 2005 15:11:45 PST

3. When the message transfers from Mr. Jones' email provider to Mr. Smith's Gmail address
Delivered-To: MrSmith@gmail.com
Received: by 10.36.81.3 with SMTP id e3cs239nzb; Tue, 29 Mar 2005 15:11:47 -0800 (PST)
Return-Path: MrJones@emailprovider.com
Received: from mail.emailprovider.com (mail.emailprovider.com [111.111.11.111]) by mx.gmail.com with SMTP id h19si826631rnb; Tue, 29 Mar 2005 15:11:47 -0800 (PST)

Below is a description of each section of the email header:
Delivered-To: MrSmith@gmail.com
The email address the message will be delivered to.

Received: by 10.36.81.3 with SMTP id e3cs239nzb;
Tue, 29 Mar 2005 15:11:47 -0800 (PST)
The time the message reached Gmail's servers.

Return-Path: MrJones@emailprovider.com
The address from which the message was sent.

Received: from mail.emailprovider.com
(mail.emailprovider.com [111.111.11.111])
by mx.gmail.com with SMTP id h19si826631rnb.2005.03.29.15.11.46;
Tue, 29 Mar 2005 15:11:47 -0800 (PST)
The message was received from mail.emailprovider.com, by a Gmail server on March 29, 2005 at approximately 3 pm.

Message-ID: 20050329231145.62086.mail@mail.emailprovider.com
A unique number assigned by mail.emailprovider.com to identify the message.

Received: from [11.11.111.111] by mail.emailprovider.com via HTTP;
Tue, 29 Mar 2005 15:11:45 PST
Mr. Jones used an email composition program to write the message, and it was then received by the email servers of mail.emailprovider.com.

Date: Tue, 29 Mar 2005 15:11:45 -0800 (PST)
From: Mr Jones
Subject: Hello
To: Mr Smith
The date, sender, subject, and destination -- Mr. Jones entered this information (except for the date) when he composed the email.
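
The "Received:" fields in the example above follow a regular enough shape that the hop information can be pulled out with pattern matching. The Java sketch below is illustrative only and is not the project's parser: the class ReceivedFieldSketch and its method names are hypothetical, and the patterns assume the simple "from ... by ..." layout shown in the example rather than every variation that mail servers produce.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: extracts hop information from the value of one "Received:" field. */
final class ReceivedFieldSketch {
    // "from <host>" clause; the host may also be a bracketed IP literal such as [11.11.111.111].
    private static final Pattern FROM = Pattern.compile("\\bfrom\\s+(\\[[^\\]]+\\]|\\S+)");
    // "by <host>" clause naming the server that accepted the message.
    private static final Pattern BY = Pattern.compile("\\bby\\s+(\\S+)");
    // First dotted-quad IPv4 address enclosed in square brackets.
    private static final Pattern IPV4 = Pattern.compile("\\[(\\d{1,3}(?:\\.\\d{1,3}){3})\\]");

    static Optional<String> fromHost(String field) {
        return first(FROM, field);
    }

    static Optional<String> byHost(String field) {
        return first(BY, field);
    }

    static Optional<String> ipAddress(String field) {
        return first(IPV4, field);
    }

    // Returns capture group 1 of the first match, or empty when the clause is absent.
    private static Optional<String> first(Pattern pattern, String field) {
        Matcher m = pattern.matcher(field);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }
}
```

Applied to the "Received: from mail.emailprovider.com ..." line of the example, fromHost yields mail.emailprovider.com, ipAddress yields 111.111.11.111 and byHost yields mx.gmail.com. The first "Received: by 10.36.81.3 ..." line has no "from" clause, so fromHost returns an empty result for it.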

The "Received:" header field can be used to count the e-mails received from known domains and to check whether they actually came from the domain they claim to have been sent from. E-mail headers may carry “From:” and “Sent:” fields, but these are not always present. To perform the test, the “From:” header is compared with the “Received:” header. Alternatively, if the “Sent:” header is present, the “Received:” and “Sent:” headers are compared. A number of existing spam filters, such as SpamAssassin, use the received header in the tests that decide the score assigned to a particular mail. The statistics were gathered by running the mail statistics generator on my own inbox and on the mail accounts of friends who have IMAP-enabled e-mail accounts.


3. Architecture

The architecture of the components involved in the module is described below:

3.1 Get Domain Name    

The “Received:” header field carries trace information about the mail hops, in the form of either domain names or the IP addresses of the hosts. In both the server approach and the standalone generator, the parsed e-mail header yields the domain name if one is found, and otherwise the IP address that appears in the “Received:” header. If only the IP address is obtained, the domain name is derived from it using a reverse DNS lookup. This is the first step in testing the “Received:” header against the “From:” and “Sent:” headers.
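
The reverse lookup step can be sketched with the standard java.net.InetAddress class. This is a minimal illustration under stated assumptions, not the project's Get Domain Name code: the class GetDomainNameSketch and its methods are hypothetical, only IPv4 literals are recognised, and the IP text is returned unchanged when no name can be found.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.Locale;

/** Hypothetical sketch of the "Get Domain Name" step. */
final class GetDomainNameSketch {
    /** Returns true if the text looks like a dotted-quad IPv4 literal. */
    static boolean isIpv4Literal(String host) {
        return host.matches("\\d{1,3}(\\.\\d{1,3}){3}");
    }

    /**
     * Returns a name unchanged (lower-cased); for an IPv4 literal, attempts a
     * reverse DNS lookup and falls back to the IP text when none succeeds.
     */
    static String toDomainName(String host) {
        if (!isIpv4Literal(host)) {
            return host.toLowerCase(Locale.ROOT);
        }
        try {
            // For an IP literal, getByName only validates the address format.
            InetAddress address = InetAddress.getByName(host);
            // getCanonicalHostName performs the reverse lookup; it returns the
            // textual IP address when the lookup fails or is not permitted.
            return address.getCanonicalHostName().toLowerCase(Locale.ROOT);
        } catch (UnknownHostException e) {
            return host;
        }
    }
}
```

Because each reverse lookup goes out to the network, the result for a given address depends on the resolver in use and can be slow, which is why repeated lookups of the same address are worth avoiding.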

3.2 Check For Domain    

This component performs the actual comparison of the domain names obtained from the “Received:” header with those from the “From:” and “Sent:” headers. Domain names consist of the ASCII letters "a" through "z" (case-insensitive), the digits "0" through "9", and the hyphen, with some other restrictions; examples are "imap.gmail.com" and "cs.mit.edu". Domain names are classified as:

The domain names are split into their component parts: top-level domain, second-level domain, third-level domain and subdomains. The corresponding parts of the two names are then compared.
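
The split-and-compare step could look roughly like the following Java sketch. It is an illustration rather than the project's Check For Domain class: the names CheckForDomainSketch, labelsFromTop and sameDomain are hypothetical, and the rule of comparing a fixed number of labels counted from the top-level domain is an assumption (it does not handle multi-part suffixes such as "co.uk").

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

/** Hypothetical sketch of comparing two domain names level by level. */
final class CheckForDomainSketch {
    /** Splits a domain into labels ordered from the top-level domain leftwards. */
    static List<String> labelsFromTop(String domain) {
        String cleaned = domain.trim().toLowerCase(Locale.ROOT);
        if (cleaned.endsWith(".")) {
            cleaned = cleaned.substring(0, cleaned.length() - 1);
        }
        List<String> labels = new ArrayList<>(Arrays.asList(cleaned.split("\\.")));
        Collections.reverse(labels); // "imap.gmail.com" becomes [com, gmail, imap]
        return labels;
    }

    /** Returns true when the first `levels` labels, counted from the top, are equal. */
    static boolean sameDomain(String a, String b, int levels) {
        List<String> x = labelsFromTop(a);
        List<String> y = labelsFromTop(b);
        if (x.size() < levels || y.size() < levels) {
            return false;
        }
        return x.subList(0, levels).equals(y.subList(0, levels));
    }
}
```

With levels set to 2, "mail.emailprovider.com" and "emailprovider.com" agree on their top-level and second-level domains and are treated as a match, while "mx.gmail.com" and "emailprovider.com" are not.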

3.3 Received Header Analysis

This component deals with tokenising the headers received. All the e-mail headers have to be tokenised so that only the required components are extracted. Once the IP addresses and domain names from the “Received:”, “From:” and “Sent:” headers have been obtained, they are compared by the CheckForDomain component of the module. If the domain names match, the sender's e-mail address corresponds to the domain from which the mail actually came.
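
As a rough illustration of the tokenising and matching described here, the Java sketch below extracts the domain from a "From:" (or "Sent:") header value and checks it against the hop domains taken from the "Received:" headers. The class HeaderMatchSketch and its methods are hypothetical, and the matching rule used (a hop name equal to the sender's domain, or a subdomain of it) is a simplifying assumption rather than the project's exact rule.

```java
import java.util.List;
import java.util.Locale;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch: does the sender's domain appear among the Received hops? */
final class HeaderMatchSketch {
    // Local part, "@", then the domain; tolerates a display name and angle brackets.
    private static final Pattern ADDRESS =
            Pattern.compile("[^\\s<>@\"]+@([A-Za-z0-9.-]+)");

    /** Extracts the lower-cased domain of the first e-mail address in a header value. */
    static Optional<String> senderDomain(String headerValue) {
        Matcher m = ADDRESS.matcher(headerValue);
        return m.find()
                ? Optional.of(m.group(1).toLowerCase(Locale.ROOT))
                : Optional.empty();
    }

    /** Returns true if any hop equals the sender's domain or is a subdomain of it. */
    static boolean matchesAnyHop(String headerValue, List<String> hopDomains) {
        Optional<String> sender = senderDomain(headerValue);
        if (!sender.isPresent()) {
            return false; // header carries no address, so nothing can be compared
        }
        String domain = sender.get();
        for (String hop : hopDomains) {
            String h = hop.toLowerCase(Locale.ROOT);
            if (h.equals(domain) || h.endsWith("." + domain)) {
                return true;
            }
        }
        return false;
    }
}
```

For a header value such as "Mr Jones <MrJones@emailprovider.com>" and the hops mx.gmail.com and mail.emailprovider.com, the check succeeds because mail.emailprovider.com lies under emailprovider.com.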


4. Design and Implementation

The design and implementation of the module for the received header check involves two parts: the design and implementation for the server-based approach, and the design and implementation for the standalone approach. The details of the design for each of the components of the module are described below.

4.1 Server Based Approach
4.2 Stand Alone Based Approach

4.1 Server Based Approach - Design and Implementation    

In this approach, a server was configured to send and receive e-mails. Whenever an incoming message is received, it goes through a parser module which breaks the message down into the header and the body. The message header is further broken down into individual components based on their fields. These parsed values are stored in the database. Each of the individual modules obtains its data from the database and performs its tests on them. The results of the tests are stored back in the database.
The class diagram for this approach is shown below.

Domain Data Flow Diagram
Figure 4.1

Figure 4.1 shows the class diagram for the design in the server-based approach. The class diagram has three classes, one for each component of this approach. The controller acts as an interface to all the modules and decides the order of operations. Each module is called independently to perform its analysis on each message. The controller holds a message id, which is unique for each message, and a vector of all the modules present in the system. It passes the message id to each module. Using this message id as a primary key into the database tables, the module retrieves the values it requires. The JDBC connection class is used to obtain a connection handle and establish the connection, through which the data is retrieved from the database; the connection is then closed. In the case of the Received Header module, the “Received:”, “From:” and “Sent:” headers are retrieved from the database, and the module performs its tests on them.
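
The controller arrangement described above can be sketched as follows. This is not the project's code: the interfaces HeaderStore and AnalysisModule and the class ControllerSketch are hypothetical names, and HeaderStore is an in-memory stand-in for the JDBC-backed tables so that the example needs no database.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for the database tables keyed by message id. */
interface HeaderStore {
    /** Returns the stored value of a header field for a message, or null if absent. */
    String header(int messageId, String field);
}

/** Hypothetical contract that every test module implements. */
interface AnalysisModule {
    String name();

    /** Runs this module's test on the message identified by messageId. */
    boolean run(int messageId, HeaderStore store);
}

/** Hypothetical controller: holds the modules and calls each one per message. */
final class ControllerSketch {
    private final List<AnalysisModule> modules = new ArrayList<>();
    private final HeaderStore store;

    ControllerSketch(HeaderStore store) {
        this.store = store;
    }

    void register(AnalysisModule module) {
        modules.add(module);
    }

    /** Passes the message id to every module, in registration order. */
    Map<String, Boolean> analyse(int messageId) {
        Map<String, Boolean> results = new LinkedHashMap<>();
        for (AnalysisModule module : modules) {
            results.put(module.name(), module.run(messageId, store));
        }
        return results;
    }
}
```

Keeping the modules behind one small interface means the controller does not need to know what each test does; a module only needs the message id and a way to fetch the fields it cares about.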

It compares the domain names of the “From:” field and the “Sent:” field with the domain names from the “Received:” field. The number of e-mails in which the domain names of the Received header and the From header match is stored in the database, as is the number of e-mails in which the Received header and the Sent header match.

The following shows the flow of information in the Received Header Analysis module.


Domain Data Flow Diagram
Figure 4.2

Figure 4.2 is the data flow diagram for the server-based approach. The messages are retrieved from the server and parsed, and the parsed contents are stored in the database. The handle of the controller is passed to the check for domain module, which gets the corresponding fields from the database and sends them to the received header check module. The results are stored in the database.

4.2 Stand Alone Approach - Design and Implementation    


Since the server we had configured did not attract enough e-mails to perform all the tests, a standalone application was developed. The application connects to IMAP servers and retrieves e-mails from them. Each message is separated into a message body and a message header, and the individual header values are then retrieved. In this approach, however, the header data is not fully parsed into its components, so a significant part of the module involves parsing the data.
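
The separation of a retrieved message into header and body, and of the header into individual fields, can be sketched as below. This is an illustration only: the class RawHeaderSketch is hypothetical, it works on the raw message text rather than on any particular mail library's objects, and it assumes the usual convention that the header ends at the first empty line and that continuation lines begin with a space or tab.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Hypothetical sketch: splits a raw message into header fields. */
final class RawHeaderSketch {
    /** Returns the header section: everything before the first empty line. */
    static String headerSection(String rawMessage) {
        String normalised = rawMessage.replace("\r\n", "\n");
        int blank = normalised.indexOf("\n\n");
        return blank < 0 ? normalised : normalised.substring(0, blank);
    }

    /** Groups header values by lower-cased field name, joining folded lines. */
    static Map<String, List<String>> parseHeaders(String rawMessage) {
        Map<String, List<String>> headers = new LinkedHashMap<>();
        List<String> current = null; // values of the most recently seen field
        for (String line : headerSection(rawMessage).split("\n")) {
            if (line.isEmpty()) {
                continue;
            }
            boolean continuation = line.charAt(0) == ' ' || line.charAt(0) == '\t';
            if (continuation && current != null) {
                // Folded line: append it to the last value of the current field.
                int last = current.size() - 1;
                current.set(last, current.get(last) + " " + line.trim());
                continue;
            }
            int colon = line.indexOf(':');
            if (colon <= 0) {
                continue; // not a "Name: value" line; skipped in this sketch
            }
            String name = line.substring(0, colon).trim().toLowerCase(Locale.ROOT);
            current = headers.computeIfAbsent(name, k -> new ArrayList<>());
            current.add(line.substring(colon + 1).trim());
        }
        return headers;
    }
}
```

A message usually carries several "Received:" fields, one per hop, so the values are kept as a list per field name in the order in which they appear.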

Domain Class
Figure 4.3

Figure 4.3 shows the class diagram for this approach. The classes defined are Received Header Analysis, Check For Domain and Get Domain Name. Received Header Analysis retrieves all the messages from a given folder. The folders considered are those holding “Ham” messages and “Spam” messages. The headers are extracted from each message in each folder.

The headers are parsed, and if only an IP address is present, the domain name for that IP address is retrieved using the Get Domain Name module.

Using the domain name retrieved from this module, the check for domain class compares the domain names. The domain names are broken down into individual components such as primary domain name, secondary domain name and so on. Each subset of the names is compared, and if a match is found a true value is returned.
In the received header class a hash map is maintained to store the domain names of IP addresses already seen in the messages, which speeds up the test. The module produces two results: the number of e-mails in which the received header and the from field matched, and the number of e-mails in which the received header and the sent field matched.
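
The hash map of already-seen IP addresses can be sketched as a small caching wrapper around whatever lookup is used. The class CachedResolverSketch is a hypothetical name, and the lookup is passed in as a function (an assumption made here so that the example runs without network access); in practice it would wrap the reverse DNS call.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical sketch: remembers the domain name found for each IP address. */
final class CachedResolverSketch {
    private final Map<String, String> cache = new HashMap<>();
    private final Function<String, String> lookup;
    private int lookups = 0;

    /** @param lookup the (slow) function that maps an IP address to a domain name */
    CachedResolverSketch(Function<String, String> lookup) {
        this.lookup = lookup;
    }

    /** Returns the cached name if the IP was seen before, otherwise looks it up once. */
    String resolve(String ip) {
        String cached = cache.get(ip);
        if (cached != null) {
            return cached;
        }
        lookups++;
        String name = lookup.apply(ip);
        if (name != null) {
            cache.put(ip, name);
        }
        return name;
    }

    /** Number of times the underlying lookup was actually invoked. */
    int lookupCount() {
        return lookups;
    }
}
```

Since many messages in one mailbox pass through the same few mail servers, most addresses repeat, and each repeat is answered from the map instead of the network.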

The results are of the format as shown below :

Inbox:

From match count : 1512/1631
Sender match count : 16/1631

Spam:

From match count : 345/976
Sender match count : 5/976

Here the total number of messages in the inbox (the ham messages) is 1631. The “received:” header matched the “from:” header in 1512 of them and matched the “sent:” header in 16. The same process is repeated for the spam messages.
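
Each result line can be turned into a percentage, which is the form used in the result tables. The Java sketch below is illustrative: the class MatchCountSketch is hypothetical, and treating an empty folder as 0% is an assumption made to avoid dividing by zero. For the sample output above, 1512 of 1631 is about 92.7% and 345 of 976 is about 35.3%.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch: converts a "matched/total" result line into a percentage. */
final class MatchCountSketch {
    // Matches the "<matched>/<total>" part of a line such as "From match count : 1512/1631".
    private static final Pattern COUNTS = Pattern.compile("(\\d+)\\s*/\\s*(\\d+)");

    /** Percentage of matched messages; an empty folder is reported as 0. */
    static double percentage(int matched, int total) {
        return total == 0 ? 0.0 : 100.0 * matched / total;
    }

    /** Parses one result line and returns its match percentage. */
    static double percentageOf(String resultLine) {
        Matcher m = COUNTS.matcher(resultLine);
        if (!m.find()) {
            throw new IllegalArgumentException("No matched/total pair in: " + resultLine);
        }
        return percentage(Integer.parseInt(m.group(1)), Integer.parseInt(m.group(2)));
    }
}
```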


5. Results and Analysis

Mail Boxes Considered

The sample data consists of 22 mailboxes on which the tests were performed. The details of these mailboxes are given in the following table:

Mailbox  Inbox mails  Spam mails  Total mails
1        435          186         621
2        1631         976         2607
3        133          145         278
4        1703         207         1910
5        1072         0           1072
6        857          0           857
7        365          61          426
8        212          137         349
9        358          61          419
10       566          141         707
11       2187         0           2187
12       351          0           351
13       202          1           203
14       151          0           151
15       352          0           352
16       1119         21          1140
17       1047         0           1047
18       1104         39          1143
19       1237         0           1237
20       416          73          489
21       1638         0           1638
22       1334         2           1336

Figure 5.0

    

5.1 RECEIVED HEADER MODULE RESULTS    

To compute the results for the Received Header module, the tests were run on two folders, ham and spam. For each folder, the number of e-mail messages in which the "From:" header field matched the "Received:" header was counted, as was the number in which the "Sent:" header field matched the "Received:" header.

Figure 5.1 below shows the statistics gathered by the Received Header Analysis for the ham message folder.
For each mailbox, the columns give the mailbox number, the number and percentage of e-mails in which the from header matches the received header, the number and percentage of e-mails in which the sent header matches the received header, and the total number of messages in the folder.

Mailbox  From matches  From match %  Sent matches  Sent match %  Total ham messages
1        255           59            16            4             435
2        1431          88            63            4             1631
3        101           76            1             1             133
4        1451          85            71            4             1703
5        738           69            27            3             1072
6        665           78            42            5             857
7        289           79            21            6             365
8        159           71            1             0             212
9        221           61            10            3             358
10       513           90            1             2             566
11       1897          87            15            1             2187
12       322           92            6             2             351
13       196           97            2             1             202
14       132           87            1             1             151
15       281           80            24            7             352
16       721           64            198           18            1119
17       551           52            245           23            1047
18       984           89            27            2             1104
19       757           61            160           13            1237
20       300           72            10            2             416
21       837           51            384           23            1638
22       919           69            282           21            1334

Figure 5.1

Figure 5.2 below shows the statistics gathered by the Received Header Analysis for the spam message folder.
For each mailbox, the columns give the mailbox number, the number and percentage of e-mails in which the from header matches the received header, the number and percentage of e-mails in which the sent header matches the received header, and the total number of messages in the folder.

Mailbox  From matches  From match %  Sent matches  Sent match %  Total spam messages
1        16            9             0             0             186
2        212           22            5             1             976
3        33            23            0             0             145
4        77            37            0             0             207
5        0             0             0             0             0
6        0             0             0             0             0
7        20            33            0             0             61
8        51            37            1             0             137
9        13            21            0             0             61
10       51            36            1             0             141
11       0             0             0             0             0
12       0             0             0             0             0
13       0             0             0             0             1
14       0             0             0             0             0
15       0             0             0             0             0
16       11            52            0             0             21
17       0             0             0             0             0
18       15            38            1             3             39
19       0             0             0             0             0
20       31            32            0             0             73
21       0             0             0             0             0
22       0             0             0             0             2

Figure 5.2

    

5.2 RECEIVED HEADER MODULE ANALYSIS    

From the results table we get the scatter plots indicating the match of from headers with the received headers. This is done for both the ham and spam message folders.


Figure 5.3

The plot shows that, for ham messages, the percentage match between the received header and the from header averages about 75% across mailboxes. For spam messages the match rate is much lower: in Figure 5.2, about 26% of all spam messages (530 of 2050) matched, and no mailbox exceeded 52%. However, not much can be concluded from this, as the number of spam messages in the mailboxes used was very small.

From the results table we get the scatter plots indicating the match of sent headers with the received headers. This is done for both the ham and spam message folders.


Figure 5.4

The plot shows that, for ham messages, the percentage match between the received header and the sent header is low: in Figure 5.1 it averages about 7% across mailboxes and never exceeds 23%. For spam messages it is below 5% in every mailbox. One of the main reasons for these low values is that very few messages carry the sent header. The check of the received header against the from header therefore provides better results than the check against the sent header.

Thus, as seen from the statistics generated for mailboxes of different sizes, the Received Header test can be considered a good test based on the following two properties:

  • The number of messages in the ham folder whose domains match the received header is high.

  • The number of messages in the spam folder whose domains match the received header is low.

It was also observed that this check fails for messages that originate from a mailing list, as the headers of mailing list messages differ from those of regular messages.

Result Summary:

  Received Header Module : This module can be used as a fairly good filter to understand and classify messages as spam or ham.


6. Problems and Solutions    


Some of the problems faced during the course of the project are listed below.

  • The server-based approach did not attract enough e-mails, so no analysis could be made with the modules in that approach. This in turn paved the way for the standalone mail statistics generator.

  • One of the main problems was the lack of rigid rules for the format of headers. Each e-mail service adds its own variations, which makes generalization difficult. This problem was mainly dealt with while parsing the header.

  • The second problem was the execution time of the module.

  • The execution time depends on the speed of the network used to retrieve the e-mails from the server.

  • Without optimization, the module took more than 20 minutes to analyze more than 1K e-mails. I therefore started caching each IP address once its lookup was done, so that if the same IP address was found again, the locally cached domain name was used instead of contacting the server for another lookup. This made the whole module much faster: it could process around 1K e-mails in about 2-3 minutes.

  • Another problem is that the number of mailboxes on which the standalone application was run was small. To decide how good the received header is as an indicator of whether e-mails are ham or spam, the standalone application has to be run on a larger number of mailboxes.


7. Tools Used
    

The tools used were as follows:

  • Navicat MySQL for creating the database for the Server based Approach

  • NetBeans IDE for writing the code for the modules, which was written in Java

  • OpenOffice and EditPlus HTML based tools for writing the report.

  • OpenOffice Excel draw for drawing the figures.

  • Open Office Drawing tool for making the class Diagrams

  • Excel to Html converter for converting result tables in excel into the html form


8. Appendix
    

The following link contains the source code for the individual modules:

Received Header vs From and Sent Module


9. References