Spam Analysis and Reputation Project : Email Encryption Headers and Database Schema

Swati Kumar

Columbia University

New York, NY 10027

USA

sak2144@columbia.edu

Abstract

The aim of this project is to gather statistical data about the various fields present in an email header or body that can help us differentiate between spam and non-spam (or ham) mails. If a particular field in the header or body of an email is a good indicator, the statistics gathered for it will differ for ham and spam mails. They should also be consistent over the sample dataset. To gather statistics we performed several checks on the email accounts of various users that formed the sample dataset and recorded data such as the total number of mails in ham and spam folders, number of spam and ham mails passing the test, number of spam and ham mails failing the test, number of mails for which the test is relevant and so on. After examining this data we determine if a particular field is good enough for classification purposes. The statistics are gathered for a substantial number of mailboxes to make our tests reliable and robust.

Table of Contents

i. Abstract

1. Project Overview

2. Introduction


3. Architecture

3.1 DKIM and SPF

3.1.1 DomainKeys Identified Mail
3.1.2 Sender Policy Framework Protocol Test

3.2 Dynamic Host Configuration Protocol and Digital Subscriber Line

3.3 Mailing List Headers

3.4 Reachable Hosts

4. Design and Implementation

4.1 Dynamic and Reachable Hosts
4.2 DKIM, SPF and Mailing Lists

5. Server Based Approach

5.1 Database Schema
5.2 Dynamic and Reachable Hosts
5.3 DKIM, SPF and Mailing Lists

6. Results

6.1 Results for DHCP, DSL and Reachable Hosts
6.2 Results for DKIM, SPF and Mailing Lists

7. Problems and Solutions

8. Tools Used

9. Appendix

10. References


1. Project Overview

This project determines whether certain parameters present in an email header or body are good enough to classify mails into ham and spam. A separate module was developed for each parameter and used to analyze the mails. The various modules are as follows:

  • Friend Check
  • Pingable hosts
  • Black Lists
  • Domain Check
  • In-reply-to
  • DKIM and SPF
  • Received Header
  • DHCP and DSL
  • Attachments
  • Getting hour, date and time information from the message
  • Whether the To field and the body contain the name of the person.
  • Columbia Internal mails
Some of the design features are given below:
  • All modules are implemented in Java.
  • The javamail-1.4 library is used extensively by all the modules.
  • A main class, MailStats.java, calls all the modules synchronously, one after the other.
  • MailStats connects to the user's account on an IMAP server and starts a basic user interface with which the user categorizes the mail folders into spam, ham and sent.
  • The GUI has a progress bar that indicates which module is currently running, giving the user feedback.
  • MailStats passes javax.mail.Message arrays containing the spam, ham and sent messages to all the modules, which compute statistics and print a result that can be used for analysis. The final output consists of the combined results of the individual modules.
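
The calling pattern described in the list above can be pictured with a small sketch. It is only an illustration under stated assumptions: the names StatsModule and ModuleRunner are hypothetical, and plain strings stand in for the javax.mail.Message objects that the real modules receive.

```java
import java.util.List;

/** Hypothetical stand-in for the project's module contract; not the actual interface. */
interface StatsModule {
    String name();

    /** Returns a printable result for the given spam and ham messages. */
    String run(List<String> spam, List<String> ham);
}

class ModuleRunner {
    /** Calls every module in order and concatenates the individual results. */
    static String runAll(List<StatsModule> modules, List<String> spam, List<String> ham) {
        StringBuilder combined = new StringBuilder();
        for (StatsModule module : modules) {
            // Synchronous: the next module starts only after this one returns.
            combined.append(module.name()).append('\n')
                    .append(module.run(spam, ham)).append('\n');
        }
        return combined.toString();
    }
}
```

Because the runner only sees the interface, a new check can be added by writing one more class, without touching the loop.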


2. Introduction

The modules discussed in this report are as follows:
  • Check the message headers for information about dynamic hosts that use DHCP (Dynamic Host Configuration Protocol) or DSL (Digital Subscriber Line).
  • Check if the hosts present in the message headers are reachable. (A host is reachable when it responds to a ping by sending an acknowledgment.)
  • Check the message headers for DomainKeys Identified Mail (DKIM) and Sender Policy Framework (SPF).
  • Check the message headers for mailing lists.
The fields above were chosen either because existing spam filters such as SpamAssassin already use them or because empirical evidence suggests that they may act as good classifiers. The data and statistics were gathered for a uniform population consisting of students at Columbia and other universities, so the data is not universal and lacks population diversity. Two approaches were used to gather the data, as given below. The initial, server based approach failed to attract enough email to conduct the tests, which led to the second, standalone approach.
  • Server Based Approach - Statistics were gathered using an IMAP based server that could be used as a honey pot to attract spam mails. The design consisted of parsing the incoming mail on the fly and then storing individual fields in different tables of a database. The testing modules used the data from the database and performed various analysis and checks on them.
  • Standalone Approach - Statistics were gathered using a standalone program that connected to already existing mail servers like cubmail and gmail for different users and the testing modules used these mails to perform the checks. The checks were performed on the current snapshot of a user's mailbox.
The report is organized as follows -
Section 3 discusses the architecture of the above mentioned modules, Section 4 specifies the design and implementation details for all the modules, Section 5 discusses the overall design and implementation of the server based approach, Section 6 gives the results gathered for all the modules and Section 7 lists the problems and their solutions.


3. Architecture    

The basic architecture of the modules and what each module does is described below.

3.1 DKIM and SPF    

3.1.1 DomainKeys Identified Mail - DKIM

DKIM lets an organization take responsibility for a message in transit. The domain owner generates one or more private/public key-pairs that will be used to secure messages originating from that domain. The domain owner places the public-key in his domain namespace (i.e., in a DNS record associated with that domain), and makes the private-key available to the outbound email system. When an email is submitted by an authorized user of that domain, the email system uses the private-key to digitally sign the email associated with the sending domain. The signature is added as a header to the email, and the message is transferred to its recipients in the usual way.

For example:

DomainKey-Signature: a=rsa-sha1; q=dns;
d=example.com;
i=user@eng.example.com;
s=jun2005.eng; c=relaxed/simple;
t=1117574938; x=1118006938;
h=from:to:subject:date;
b=dzdVyOfAKCdLXdJOc9G2q8LoXSlEniSb
av+yuU4zGeeruD00lszZVoG4ZHRNiYzR

Thus, mails that have been authenticated with the signature are expected to be present in the user's inbox. If a mail provider uses DKIM validation, ham mails will have authenticated signatures and spam mails will not. This is the basis for searching for “DomainKey-Signature” in the email header.
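
The tag=value layout of the signature header shown above lends itself to a small parser. The following is a minimal sketch rather than the project's code; the class name SignatureTags and its parse method are hypothetical, and the sketch assumes that whitespace inside the header value is only folding and can be dropped.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class SignatureTags {
    /** Splits a "tag=value; tag=value" header value into an ordered map. */
    static Map<String, String> parse(String headerValue) {
        Map<String, String> tags = new LinkedHashMap<>();
        // Folded headers contain line breaks and indentation; remove all whitespace.
        String unfolded = headerValue.replaceAll("\\s+", "");
        for (String part : unfolded.split(";")) {
            int eq = part.indexOf('=');
            if (eq > 0) {
                // Split at the first '=' only, so base64 padding in b= is preserved.
                tags.put(part.substring(0, eq), part.substring(eq + 1));
            }
        }
        return tags;
    }
}
```

With such a map, the signing domain is available as the d tag and the selector as the s tag, which is the information a verifier would need for the DNS lookup.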

3.1.2 Sender Policy Framework Protocol Test    

The domain owners may authorize hosts to use their domain name in the "MAIL FROM" or "HELO" identity. Compliant domain holders publish Sender Policy Framework (SPF) records specifying which hosts are permitted to use their names, and compliant mail receivers use the published SPF records to test the authorization of sending Mail Transfer Agents (MTAs) using a given "HELO" or "MAIL FROM" identity during a mail transaction.

A mail receiver can perform a set of SPF checks for each mail message it receives. An SPF check tests the authorization of a client host to send mail with a given identity. Typically, such checks are done by a receiving MTA and they result in adding a header in the email as “Received-SPF”.

The Received-SPF header field is followed by a result and a comment conveying supporting information for the result, such as <ip>, <sender> and <domain>. The values of the result field are:

  • Pass - the message meets the publishing domain's definition of legitimacy.
  • Fail - the message does not meet a domain's definition of legitimacy.
  • SoftFail - the message does not meet a domain's strict definition of legitimacy, but the domain cannot confidently state that the message is a forgery.
  • Error - indicates an error during lookup.
  • Unknown - indicates incomplete processing.
  • Neutral - the SPF client must proceed as if a domain did not publish SPF data.
Mails in the ham folder should typically have the result "pass" or "neutral", while mails in the spam folder should typically have the result "fail", "softfail" or "neutral".

3.2 Dynamic Host Configuration Protocol and Digital Subscriber Line

Hosts that obtain their address through the Dynamic Host Configuration Protocol (DHCP) or that connect over a Digital Subscriber Line (DSL) typically have dynamically assigned IP addresses.

The Dynamic Host Configuration Protocol automates the assignment of IP addresses. When a DHCP-configured client connects to a network, it sends a broadcast query requesting necessary information from a DHCP server. The DHCP server manages a pool of IP addresses and information about client configuration parameters such as the default gateway, the domain name, the DNS servers, other servers such as time servers, and so forth. Upon receipt of a valid request the server will assign the computer an IP address, a lease (the length of time for which the allocation is valid), and other TCP/IP configuration parameters, such as the subnet mask and the default gateway. Thus, mails that come from a dynamic host cannot be verified based on its host name.

DSL is a family of technologies that provide digital data transmission over the wires of a local telephone network. The customer end of the connection consists of a DSL modem, which converts data from the digital signals used by computers into a voltage signal of a suitable frequency range that is then applied to the phone line. DSL providers commonly assign addresses to subscribers dynamically, so a permanent IP address will generally not be available for DSL hosts.

The email headers do not directly contain this information, but by analyzing the headers we can find out if the sender's server was using DHCP or DSL. The DHCP and DSL module looks at the email headers and generates statistics for senders and mail servers that use DHCP and DSL. These statistics are interesting because we may be able to establish a relationship between senders of spam mails and hosts whose IP address is assigned dynamically.

3.3 Mailing List Headers    

The mailing list headers are List-ID, List-Subscribe and so on. They provide information about the corresponding mailing list, which can be used to find out whether mails in the ham and spam folders come from mailing lists. Generally, mails received from a mailing list to which the user has subscribed will not be spam.

3.4 Reachable Hosts

This module checks whether the host names in the from and by fields of a mail's Received header can be pinged. If a host can be pinged, the corresponding internet address exists and can accept requests. An authentic mail server should be reachable because it will be up and running most of the time and should be able to accept TCP/IP packets. The “by” field of the first Received header of the trace gives information about the sender. Consider for example:

Received: from [192.168.123.110] (user-387gp1m.cable.mindspring.com [208.120.100.54])
        (user=sak2144 mech=PLAIN bits=0)
        by serrano.cc.columbia.edu (8.14.1/8.14.1) with ESMTP id lBJ7Ww2E028346
        (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NOT);
        Wed, 19 Dec 2007 02:32:58 -0500 (EST)
Message-ID: <4768C929.3070105@columbia.edu>

Here the first Received header contains a “from” field indicating the sender machine's IP address, and the “by” field contains the name of the server that received the mail from the sender. The by field may contain the name of the sender's mail server or of another mail server; in both cases, it should be reachable. The “from” host may or may not be reachable, because it depends on the sender's computer. Thus, as ham mails use authentic mail sending agents, the by host name should generally be reachable.
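
Extracting the two hosts from a Received header like the one above can be sketched with two regular expressions. This is an illustrative sketch, not the project's parser: the class ReceivedHosts is hypothetical, and the patterns assume that the first "from" and "by" words in the header are the trace keywords, which a comment containing the word "by" could violate.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ReceivedHosts {
    // "from" is followed by a bracketed address literal or by a host name.
    private static final Pattern FROM =
            Pattern.compile("\\bfrom\\s+\\[?([^\\s\\]]+)\\]?");
    // "by" is followed by the name of the receiving server.
    private static final Pattern BY = Pattern.compile("\\bby\\s+([^\\s;]+)");

    static String fromHost(String received) {
        return firstGroup(FROM, received);
    }

    static String byHost(String received) {
        return firstGroup(BY, received);
    }

    /** Returns the first captured group, or null when the pattern does not occur. */
    private static String firstGroup(Pattern pattern, String text) {
        Matcher m = pattern.matcher(text);
        return m.find() ? m.group(1) : null;
    }
}
```

Applied to the example header, fromHost yields 192.168.123.110 and byHost yields serrano.cc.columbia.edu.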


4. Design and Implementation    

This section details the design and implementation of the modules for the standalone program. All the modules implement the Module interface, which the main class uses to call the individual modules and to pass data to them. The main class connects to a particular user's mail account on an IMAP enabled server and caches the mail headers required by each module for the ham and spam folders. The output of the program is a combined result that can be stored and later used for analysis. The checks for DHCP, DSL and pingable hosts were combined in a single module called the Dynamic and Reachable Hosts Module. The checks for DKIM, SPF and mailing lists were combined in another module called the DKIM, SPF and Mailing List Module. The detailed design and implementation of both modules is given below:

4.1 Dynamic and Reachable Hosts - Design and Implementation

Design

The design of this module is object oriented where a single class represents the entire module. The dynamic and reachable hosts use the same data for getting the statistics, hence they could be combined together in the same class. The class diagram is shown below:




Figure 4.1

Figure 4.1 shows the classes of the module. There is one main class DhcpDslPing that inherits from the base class Module and a helper class called PingLookup that is called from DhcpDslPing class to ping a given hostname. The DhcpDslPing class is the main class that calls the ping() and parseReceivedforDhcp() methods to gather data for reachable and dynamic hosts for a given set of messages belonging to a particular folder like spam or ham. Some of the design decisions made were as follows:

  • A helper class was used for pinging hosts because ping is a time consuming process; with a separate class the pings can be performed asynchronously.
  • The hosts that have already been pinged can be stored, along with their results, in a data structure so that if a host is repeated, the result can be retrieved from the data structure, thus eliminating the need to ping again.
  • The number of mails for ham and spam folders for a user can be quite large, and to reduce the amount of processing, dynamic and reachable hosts have been combined into a single module.

The fundamental design for the module is to get each message, parse the Received header from the trace field of each message header and then check if the parsed host name is reachable or dynamic.

Implementation

Dynamic Hosts

The received field is checked for DHCP and DSL hosts using the method parseReceivedforDhcp(). In this method, the first Received field from each message is parsed and the from and by domains are extracted. These domains are then checked for the following:

  • String dhcp - Checks whether the string “dhcp” is part of the from and by fields of the first Received header.
  • String dsl - Checks whether the string “dsl” is part of the from and by fields of the first Received header.
  • String dclient - Checks whether the string “dclient” is part of the from and by fields of the first Received header.
  • String cable - Checks whether the string “cable” is part of the from and by fields of the first Received header.
  • IP address separated by dashes - The IP address appears separated by dashes and then again separated by dots, for example [192-34-45-66] followed by [192.34.45.66].

Mails that satisfy any of the above checks are considered to have come from dynamic hosts. The method keeps a counter that it increments every time a match takes place, and the final counter value is added to the result. For the string checks, Java's indexOf(string) method is used: if the string is present in the domain name, the returned index is greater than -1. For the IP address separated by dashes and then by dots, the regular expression from.*((\\d+-){3}\\d+) is used.
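
The checks listed above can be sketched as follows. This is a simplified illustration and not the original parseReceivedforDhcp() method: the class DynamicHostCheck is hypothetical, and for brevity it searches the whole first Received header instead of the separately extracted from and by domains. The regular expression is the one quoted in the text.

```java
import java.util.List;
import java.util.regex.Pattern;

class DynamicHostCheck {
    private static final String[] KEYWORDS = {"dhcp", "dsl", "dclient", "cable"};
    // An address written with dashes somewhere after "from", e.g. 192-34-45-66.
    private static final Pattern DASHED_IP = Pattern.compile("from.*((\\d+-){3}\\d+)");

    /** True when the header contains a keyword or a dash-separated address. */
    static boolean looksDynamic(String firstReceived) {
        String lower = firstReceived.toLowerCase();
        for (String keyword : KEYWORDS) {
            if (lower.indexOf(keyword) > -1) {
                return true;
            }
        }
        return DASHED_IP.matcher(lower).find();
    }

    /** Counts the headers that look as if they came from a dynamic host. */
    static int countDynamic(List<String> firstReceivedHeaders) {
        int count = 0;
        for (String header : firstReceivedHeaders) {
            if (looksDynamic(header)) {
                count++;
            }
        }
        return count;
    }
}
```

Plain substring matching is cheap but coarse: any host name that happens to contain "dsl" or "cable" is counted as dynamic, which is worth keeping in mind when reading the statistics.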

Reachable Hosts

The Received header needs to be parsed differently for the ping module. The host names of the from and by fields are extracted and then given to the ping() method. Since a host may appear in more than one mail, the host names and their results are stored in a hashmap, and only hosts that have not been seen before are pinged. The result for a host name is "true" if the host is reachable and "false" if it is not; if the host was reachable, a counter is incremented. After the ping method returns, the host name and its result are stored in the hashmap. If the host name appears again, it is first looked up in the hashmap, and the actual ping command is executed only if it is not found. This improves the efficiency of the module, because a “ping” is a time consuming operation: the fewer the pings, the faster the program. To further increase efficiency, the domain names in the from and by parts of the Received fields are all extracted first and stored in an arraylist. This eliminates the need to parse the Received header separately for from hosts and by hosts.

For example:
Received: from [128.59.21.187] (photon.win.cs.columbia.edu [128.59.21.187])
(user=skn3 mech=PLAIN bits=0)
by serrano.cc.columbia.edu (8.14.1/8.14.1) with ESMTP id m0HMnJdt006183
(version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NOT)
for <sak2144@columbia.edu> Thu, 17 Jan 2008 17:49:19 -0500 (EST)

The from host is extracted as 128.59.21.187 and stored in an arraylist, and the by host is extracted as serrano.cc.columbia.edu and stored in another arraylist. Once both lists have been populated, they are passed one by one to the ping() method, which uses a Java thread pool to ping the hosts in the list. The result of each ping is stored with the host name in a hashmap, and this hashmap is checked before pinging. The helper class PingLookup performs the actual ping. It implements the "Runnable" interface, so each ping blocks only the pool thread that runs it.
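
The combination of a result cache and a thread pool described above might look like the following sketch. It is not the project's DhcpDslPing or PingLookup code: the class CachedPinger is hypothetical, and the actual ping is replaced by a Predicate supplied by the caller, so that the caching logic can be exercised without network access.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.function.Predicate;

class CachedPinger {
    private final Map<String, Boolean> cache = new ConcurrentHashMap<>();
    private final Predicate<String> probe; // stand-in for the real ping (assumption)

    CachedPinger(Predicate<String> probe) {
        this.probe = probe;
    }

    /** Probes each not-yet-cached host once, then counts the reachable list entries. */
    int countReachable(List<String> hosts, int threads) {
        Set<String> pending = new LinkedHashSet<>(hosts); // drops repeated hosts
        pending.removeAll(cache.keySet());                // drops hosts already probed
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (String host : pending) {
            pool.execute(() -> cache.put(host, probe.test(host)));
        }
        pool.shutdown();
        try {
            pool.awaitTermination(10, TimeUnit.MINUTES);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt for the caller
        }
        int hits = 0;
        for (String host : hosts) {
            if (Boolean.TRUE.equals(cache.get(host))) {
                hits++;
            }
        }
        return hits;
    }
}
```

Hits are counted per list entry, not per distinct host, so a reachable host that appears in two mails contributes two hits while being probed only once.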

The ping() method is implemented using the operating system (OS) ping command. It checks which OS is running on the machine where the standalone program runs and forms the ping command accordingly.
For example, ping -n 3 -w 200 <some host name> is the ping command for Windows, where n gives the number of packets to be sent and w is the wait time in milliseconds.
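
Choosing the command by operating system can be sketched as below. The Windows form is the one given in the text; the -c form for other systems, the class name PingCommand and the use of the exit status as the reachability signal are assumptions of this sketch, not details from the original program. Redirect.DISCARD requires Java 9 or later.

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

class PingCommand {
    /** Builds the argument list for the ping program of the given OS. */
    static List<String> build(String osName, String host) {
        if (osName.toLowerCase().contains("windows")) {
            // -n: number of echo requests, -w: timeout per reply in milliseconds
            return Arrays.asList("ping", "-n", "3", "-w", "200", host);
        }
        // Linux and macOS use -c for the count; timeout flags differ, so none is set.
        return Arrays.asList("ping", "-c", "3", host);
    }

    /** Runs ping and treats exit status 0 as "reachable" (an assumption). */
    static boolean isReachable(String host) throws IOException, InterruptedException {
        List<String> command = build(System.getProperty("os.name"), host);
        Process process = new ProcessBuilder(command)
                .redirectErrorStream(true)
                .redirectOutput(ProcessBuilder.Redirect.DISCARD)
                .start();
        return process.waitFor() == 0;
    }
}
```

Passing the host as a separate list element, rather than concatenating it into one command string, keeps a host name taken from a mail header from being interpreted by a shell.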

After the module has been executed for all the messages in a particular folder such as inbox or spam, the result string is formed and returned to Module. This result, which consists of the number of dynamic hosts, the number of ping hits for “by” and the number of ping hits for “from”, is displayed on the GUI.

An example of a result is as follows:


DHCP and Ping module
Non-Spam
    Dynamic senders : 187/749
    Ping hits for trace field By : 504/749
    Ping hits for trace field From : 179/749
Spam
    Dynamic senders : 0/0
    Ping hits for trace field By : 0/0
    Ping hits for trace field From : 0/0


4.2 DKIM, SPF and Mailing Lists Module - Design and Implementation

Design

The class diagram, representing the design of this module is shown below:




Figure 4.1.1

Figure 4.1.1 shows the class diagram that gives the design of the module. For every mailbox, the set of messages is iterated over, and the DomainKey-Signature, Received-SPF and List-ID headers are parsed for each message and analyzed for the DKIM, SPF and mailing list checks respectively. The module treats the presence of the DKIM header as an indication that the sender is authentic: a recipient can verify the signature by querying the signer's domain directly to retrieve the appropriate public key, thereby confirming that the message was attested to by a party in possession of the private key for the signing domain. If the Received-SPF header is present, it carries one of the result values given in section 3.1.2, and depending on the value we confirm the authenticity of the sender. The List-ID header, which provides an identifier for an email distribution list, indicates that the message came from a mailing list.

For DKIM and mailing lists, checking the header field suffices, but not all receiving mail servers perform the SPF check. Thus, for mails that do not contain a Received-SPF header, the module performs the SPF check itself using information from the email headers: the MAIL FROM host name, the HELO identity domain name or the MAIL FROM domain name, and the IP address of the sender.
Domain owners wishing to be SPF compliant must publish SPF records for the hosts that are used in the "MAIL FROM" and "HELO" identities. The check_host() function uses the following arguments:

  • <ip> - the IP address of the SMTP client that is emitting the mail.
  • <domain> - the domain that provides the sought-after authorization information; initially, the domain portion of the "MAIL FROM" or "HELO" identity.
  • <sender> - the "MAIL FROM" or "HELO" identity.

The domain portion of <sender> will usually be the same as the <domain> argument when check_host() is initially evaluated.

Implementation

The module consists of just one method, DkimSpf(), which gets the Received-SPF, DomainKey-Signature and List-ID fields for each message of a mailbox. The mails from each mailbox are read by the base class Module and stored in arrays of the Message class. The message array is iterated over and each message is checked for the three fields. If the Received-SPF field is present, the module also gets the SpfResult for that message, stores it in the final result and updates the corresponding result counter. Thus, if the result is Fail, the counter for Fail is incremented by one, and likewise for the other results Pass, Error, Neutral and SoftFail. If the SPF check was not done by the Mail Transfer Agent (MTA), the jSPF library is used to perform it. The jSPF library has a checkSPF() method that takes three parameters: the MAIL FROM host name, the HELO identity domain name and the client IP address. These three parameters are obtained from the message headers.

The parameters are extracted using regular expressions and string manipulation. When they are given to the checkSPF function, it returns a result based on the SPF entry in the DNS; the jSPF library handles this internally. The result obtained is used to increment the respective counters for pass, fail, error, neutral and softfail.
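
The counter bookkeeping for results that are already present in a Received-SPF header can be sketched as follows. The sketch does not call jSPF; the class SpfTally and its enum are hypothetical, and mapping the "temperror" and "permerror" spellings onto the Error counter is an assumption added here for headers written by receivers that use those names.

```java
import java.util.EnumMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

class SpfTally {
    enum Result { PASS, FAIL, SOFTFAIL, NEUTRAL, ERROR, UNKNOWN }

    /** The result is the first token of the header value, e.g. "pass (comment)". */
    static Result parse(String receivedSpf) {
        String token = receivedSpf.trim().split("[\\s(]", 2)[0].toLowerCase(Locale.ROOT);
        switch (token) {
            case "pass": return Result.PASS;
            case "fail": return Result.FAIL;
            case "softfail": return Result.SOFTFAIL;
            case "neutral": return Result.NEUTRAL;
            case "error":
            case "temperror":  // assumption: folded into the Error counter
            case "permerror": return Result.ERROR;
            default: return Result.UNKNOWN;
        }
    }

    /** Returns one counter per result, including results that never occurred. */
    static Map<Result, Integer> tally(List<String> headerValues) {
        Map<Result, Integer> counts = new EnumMap<>(Result.class);
        for (Result r : Result.values()) {
            counts.put(r, 0);
        }
        for (String value : headerValues) {
            counts.merge(parse(value), 1, Integer::sum);
        }
        return counts;
    }
}
```

Initializing every counter to zero keeps the printed result in a fixed shape, matching the way the module reports "Fail : 0" even when no message failed.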

While performing the SPF check, a special case arises when the domain of the sender is the same as that of the receiver. For example, if a mail is sent from one Columbia server to another, all IP addresses will begin with 128.59, so we cannot determine where the actual hop, that is, the transfer of the mail from the sender's mail transfer agent to the receiver's mail server, takes place. This holds true for internal mails on all mail servers. Also, internal mails do not require any checking as they come from a reputed and trusted mail server. Hence, we ignore the internal mails and keep a separate count for them.

The DKIM check is simply checking for the presence or absence of the DomainKey-Signature header and if the header is present, a DKIM counter is incremented by one. Similarly for the mailing lists, presence or absence of List-ID header gives us the information if the message is from a mailing list or not. If the message is from a mailing list, a mailing list counter is incremented by one.
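
The presence checks described above amount to counting messages that carry a given header. A minimal sketch follows, assuming each message's headers are available as a name-to-value map; the project itself reads them through javax.mail, and the class HeaderPresence is hypothetical.

```java
import java.util.List;
import java.util.Map;

class HeaderPresence {
    /** Counts messages carrying the named header; names are compared case-insensitively. */
    static int count(List<Map<String, String>> messages, String headerName) {
        int found = 0;
        for (Map<String, String> headers : messages) {
            for (String name : headers.keySet()) {
                if (name.equalsIgnoreCase(headerName)) {
                    found++;
                    break; // count each message at most once
                }
            }
        }
        return found;
    }
}
```

Calling count(messages, "DomainKey-Signature") and count(messages, "List-ID") gives the numerators of the two ratios shown in the result below.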

Finally the result string is updated with all this information and is displayed on the Graphical User Interface (GUI). The result is as follows:
DKIM AND SPF module
Non-Spam
 Dkim encryption : 211/749
 Mailing Lists : 65/749
 Spf Result for 348 out of 748 mails
 Fail : 2
 Pass : 238
 Error : 0
 Neutral : 104
 SoftFail : 4
internal mails found = 400
Spam
 Dkim encryption : 0/0
 Mailing Lists : 0/0
 Spf Result for 0 out of 0 mails
 Fail : 0
 Pass : 0
 Error : 0
 Neutral : 0
 SoftFail : 0
internal mails found = 0
The progress bar is updated each time a message has been parsed and its result has been added to the result string. This is to give the user continuous feedback about the module's progress.



5. Server Based Approach
    

The server based approach consisted of a mail server on which all the mails addressed to this server would be stored. When a mail was received, the parser was invoked; it parsed the message into header names and their values, which were then stored in the database. The message was parsed so that the other modules could retrieve the information from the database. The database had all the fields based on the RFC 2821/2822 architecture.

The Dynamic and Reachable Hosts, DKIM, SPF and Mailing List modules were implemented for the server based approach along with the database schema. The DKIM, SPF and mailing list modules were a part of the parser itself, since they needed to retrieve the header values directly present in the email and find the total number of messages for which these fields were present. The dynamic and reachable hosts module was implemented in a manner similar to the standalone program. The internal working of the modules did not differ much between the two approaches, except that in the server based implementation the values of message headers were queried from the database, whereas in the standalone approach they were parsed for each message on the fly.
The Server Based Approach was used for the following modules:

5.1 Database Schema and Database Connectivity class - Design and Implementation: A Server Based Approach

For the server based approach, the following database schema was used. The database system used was MySQL.

Database schema

TABLE 1 - MESSAGE:
create table message (
msgcount int NOT NULL, -- integer key; this column is assumed from the PRIMARY KEY clause
`message-id` varchar(50) NOT NULL,
date datetime NOT NULL,
sender varchar(50),
`return-path` varchar(50),
list_subscribe tinyint(1),
subject longtext,
body blob,
PRIMARY KEY (msgcount),
UNIQUE KEY (`message-id`) -- lets other tables reference the message-id
);
Explanation of the fields:
msgcount - The integer key through which the other tables refer to a message.
message-id - This is the unique id assigned to every message. It consists of characters and numbers and is stored as a string.
date - Gives the date and time. It will be stored as yyyy-mm-dd hh:mm:ss.
sender - This field is sometimes used even when multiple "from" fields are not present,
for example when a gmail account is used to send a message to Columbia mail.
return-path - Same as the from/sender field for ham mails; it represents the MAIL FROM identity of the sender.
list_subscribe - It represents the mailing list, if present. It is for our convenience to know if there is a mailing list present.
subject - It is the subject in the mail header. It will be stored as longtext, which is similar to clob.
body - The entire body of the message will be stored as a blob object, since it may contain characters that need to be escaped.

TABLE 2 - IN-REPLY-TO:
create table `in-reply-to` (
`parent-msg-id` varchar(50) NOT NULL, -- column definitions assumed from the constraints below
msgcount_id int NOT NULL,
CONSTRAINT `in-reply-to_fk` FOREIGN KEY (`parent-msg-id`)
REFERENCES message(`message-id`),
CONSTRAINT message_count_in_reply_to_fk FOREIGN KEY (msgcount_id)
REFERENCES message(msgcount),
PRIMARY KEY (`parent-msg-id`)
);
Explanation of fields:
in-reply-to - This field may be used to identify the message (or messages) to which the new message is a reply. Because it can have more than one value, it is stored in a separate table that has a one-to-many relationship with the message table.
Since in-reply-to denotes the message id of an earlier message, that message id should already be present in the database; thus, parent-msg-id is a foreign key from the message table.
parent-msg-id - Gives the message-id of the parents, that is, the earlier messages of the thread to which the current message replies.
msgcount_id - Identifies the current message and is a foreign key representing the one-to-many relationship.

TABLE 3 - REFERENCES:
create table `references` (
`thread-msg-id` varchar(50) NOT NULL, -- column definitions assumed from the constraints below
msgcount_id int NOT NULL,
CONSTRAINT references_fk FOREIGN KEY (`thread-msg-id`)
REFERENCES message(`message-id`),
CONSTRAINT message_count_references_fk FOREIGN KEY (msgcount_id)
REFERENCES message(msgcount),
PRIMARY KEY (`thread-msg-id`)
);
Explanation of fields:
thread-msg-id - Contains all the message ids belonging to a particular thread. Similar to parent-msg-id of the previous table.
msgcount_id - Links the thread to a particular message.


TABLE 4 - FROM
create table `from` (
msgcount_id int NOT NULL, -- column assumed from the foreign key below
from_display_name varchar(50),
from_addr_spec varchar(50) NOT NULL,
CONSTRAINT message_count_from_fk FOREIGN KEY (msgcount_id)
REFERENCES message(msgcount),
PRIMARY KEY (msgcount_id, from_addr_spec)
);
Explanation of fields:
from_display_name - Gives the optional display name; for example, in "Swati" <sak2144@columbia.edu> we store Swati.
from_addr_spec - Gives the second part, i.e. sak2144@columbia.edu.
The display name is optional but the addr-spec is not. So, the primary key combines the key that associates the entries with a particular message and the addr-spec. Thus, the primary key is a combination of the foreign key and the unique identifier.

TABLE 5 - TO
create table `to` (
msgcount_id int NOT NULL, -- column assumed from the foreign key below
to_display_name varchar(50),
to_addr_spec varchar(50) NOT NULL,
CONSTRAINT message_count_to_fk FOREIGN KEY (msgcount_id)
REFERENCES message(msgcount),
PRIMARY KEY (msgcount_id, to_addr_spec)
);
It represents the "to" field of the message header.
Explanation of fields:
to_display_name - Gives the optional display name; for example, in "Swati" <sak2144@columbia.edu> we store Swati.
to_addr_spec - Gives the second part, i.e. sak2144@columbia.edu.
The display name is optional but the addr-spec is not. So, the primary key combines the key that associates the entries with a particular message and the addr-spec.

TABLE 6 - REPLY-TO
create table `reply-to` (
msgcount_id int NOT NULL, -- column assumed from the foreign key below
`reply-to-name` varchar(50),
`reply-to_addr_spec` varchar(50) NOT NULL,
CONSTRAINT message_count_reply_to_fk FOREIGN KEY (msgcount_id)
REFERENCES message(msgcount),
PRIMARY KEY (msgcount_id, `reply-to_addr_spec`)
);
Explanation of fields:
The fields name and addr-spec are the same as for the to and from tables. The reply-to table stores the address of the mailbox(es) to which the reply is to be sent. If this is not present, all the replies will be sent to the "from" field mailbox.


TABLE 7 - HTTP_LINKS
create table http_links (
url varchar(50),
link_id int NOT NULL AUTO_INCREMENT, -- integer type assumed; the listing gave none
msgcount_id int NOT NULL, -- column assumed from the foreign key below
CONSTRAINT message_count_http_links_fk FOREIGN KEY (msgcount_id)
REFERENCES message(msgcount),
PRIMARY KEY (msgcount_id, link_id),
KEY (link_id) -- an AUTO_INCREMENT column must lead an index in InnoDB
);
Explanation of fields:
These fields are used for storing urls found in the message body. The link_id is just to uniquely determine the link.
msgcount_id associates the url with the message.


TABLE 8 - TRACE
create table trace (
msgcount_id int NOT NULL, -- column assumed from the foreign key below
tracecount INT NOT NULL,
received_from_host varchar(50),
received_from_addr varchar(50),
received_by_host varchar(50),
received_by_addr varchar(50),
via varchar(10),
`with` varchar(10),
id varchar(50),
for_display_name varchar(50),
for_addr_spec varchar(50),
CONSTRAINT message_count_trace_fk FOREIGN KEY (msgcount_id)
REFERENCES message(msgcount),
PRIMARY KEY (msgcount_id, tracecount)
);
Explanation of fields:
received_from_host - Gives the host name present in the "from" sub-field of the Received fields.
received_from_addr - Specifies the IP address in the "from" sub-field.
received_by_host and received_by_addr - Similarly represent the host name and IP address in the "by" sub-field.
via - Gives the protocol, for example TCP.
with - Gives additional details, for example with ESMTP.
id - Each trace may or may not contain a (unique) id.
for - name and addr-spec. eg: for "Swati"


TABLE 9 - MAILBOX
create table mailbox (
msgcount_id int NOT NULL, -- column assumed from the foreign key below
mailbox_display_name varchar(50),
mailbox_addr_spec varchar(50),
CONSTRAINT message_count_mailbox_fk FOREIGN KEY (msgcount_id)
REFERENCES message(msgcount),
PRIMARY KEY (mailbox_addr_spec, msgcount_id)
);
Explanation of fields:
Each message is linked to the person/mailbox for which it is meant. This is a convenience table meant for fast sorting of messages
for a particular mailbox, for example: sak2144@columbia.edu received messages with msgcount 1,2,4,10,14.
The primary key is a combination of mailbox_addr_spec and msgcount_id.
Using this table, we can get all the other details of the messages for a particular mailbox.

The database was populated by a database connectivity class that uses a JDBC connection and is called by the parser every time a new message arrives. The implementation of the database, the database connectivity class and the parser is detailed below.

Modules implemented

Database and Database Connectivity class:


Figure 5.1

The data flow diagram (DFD) in Figure 5.1 shows the parser and the DB Connectivity class. Insert_DB is a function inside the DB Connectivity class that calls the appropriate function for inserting the values into the database. The flow of information between modules is as follows:
The parser parses the message and calls the DB Connectivity class. This class uses the parser's getter methods to obtain all the parsed values. For example, the database connectivity class obtains the From field value through the getFrom() method of the parser and then calls the insert_from() method to insert the value into the FROMS table of the database. Also, each message has a unique identifier called Message-ID, through which all the tables and fields of the database can be accessed. The parser passes it to the Controller class, which uses it while calling the other modules.
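
The flow between the parser and the DB Connectivity class can be sketched roughly as follows. This is only an illustration under stated assumptions, not the project's code: ParsedMessage, DbConnectivity, insertDb(), insertFrom() and fromsFor() are hypothetical names modelled on getFrom(), Insert_DB and insert_from() from the text, and an in-memory map stands in for the FROMS table so that the sketch runs without a database. A real version would presumably issue a JDBC INSERT where the map is updated.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for the parser: exposes parsed values through getters. */
final class ParsedMessage {
    private final String messageId;
    private final String from;

    ParsedMessage(String messageId, String from) {
        this.messageId = messageId;
        this.from = from;
    }

    String getMessageId() { return messageId; }
    String getFrom() { return from; }
}

/** Hypothetical DB connectivity class; a map replaces the real FROMS table. */
final class DbConnectivity {
    // Message-ID -> From values (assumption: stands in for rows of FROMS).
    private final Map<String, List<String>> fromsTable = new LinkedHashMap<>();

    /** Entry point called once a message has been parsed (cf. Insert_DB). */
    void insertDb(ParsedMessage msg) {
        insertFrom(msg.getMessageId(), msg.getFrom());
        // Inserts for the other tables would follow the same pattern.
    }

    /** Stores one From value under its Message-ID (cf. insert_from()). */
    private void insertFrom(String messageId, String from) {
        fromsTable.computeIfAbsent(messageId, k -> new ArrayList<>()).add(from);
    }

    /** Looks up everything stored for a Message-ID. */
    List<String> fromsFor(String messageId) {
        return fromsTable.getOrDefault(messageId, List.of());
    }
}
```

The Message-ID is the only key passed around, which mirrors how the text describes every table being reachable through it.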

Dynamic and Reachable Hosts - Design and Implementation: A Server Based Approach

Design

The design for dynamic and reachable hosts is object oriented, with the module being invoked by a Controller class. The Controller class invokes the modules whenever a new message is received on the SARP server. The class diagram of the module is shown below:

Figure 5.2.1

As shown in Figure 5.2.1, the parser calls the Controller class and populates the database. The Controller class calls the module, which uses the populated data in the database to perform the check. The module talks to a database connectivity class because database querying is involved.
The design is kept as modular as possible: querying the database is separated from the business logic of performing the check and finding the results. The results are found for each mail, whose message-id (a unique identifier of the mail) is passed to the module by the Controller.
The advantage of this approach is that a change in the database schema, or in the database itself, does not affect the actual implementation of the module.
The Controller is used to invoke the modules because the mail server is always up and running. If one of the modules has to be modified, the Controller can be configured not to invoke that module until the change has been completed. Thus, the server need not be stopped and restarted to recompile the code; the module can be recompiled remotely and added back to the list of modules called by the Controller.

Implementation

The Controller class keeps the list of modules to be invoked in an array called moduleArray. It reads the array and invokes each module by calling its run() method.
The run() method of the DhcpDslPing class holds an object of the database connectivity class. When the getBy() and getFrom() methods are called, the database is queried and the "by" and "from" hosts of the trace field for that message are returned. These are used by the parseReceivedforPing() and parseReceivedforDhcp() methods in the same way as described in section 4.1 for the standalone program. The getResult() method returns the result of the check, which is then stored in a result file.
The JdbcConnection class is the database connectivity and querying class. It has methods such as getBy() and getFrom() that serve as the interface for retrieving the "by" and "from" hosts to be checked.
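
A minimal sketch of the Controller idea, assuming a simple module interface, might look like this. Only Controller, moduleArray and run() are taken from the text; CheckModule, EchoModule, disable(), enable() and onNewMessage() are hypothetical names introduced for illustration, and the way a module is switched off here (a set of disabled names) is an assumption rather than the project's actual configuration mechanism.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical module contract: each check runs for one message. */
interface CheckModule {
    String name();
    void run(String messageId);
}

/** Trivial example module that only records the message ids it was given. */
final class EchoModule implements CheckModule {
    private final String name;
    private final List<String> seen = new ArrayList<>();

    EchoModule(String name) { this.name = name; }

    @Override public String name() { return name; }
    @Override public void run(String messageId) { seen.add(messageId); }

    List<String> seen() { return seen; }
}

/** Sketch of a controller that invokes every enabled module for a new message. */
final class Controller {
    private final CheckModule[] moduleArray;
    // Modules named here are skipped, e.g. while they are being modified.
    private final Set<String> disabled = new LinkedHashSet<>();

    Controller(CheckModule... modules) {
        this.moduleArray = modules.clone();
    }

    void disable(String moduleName) { disabled.add(moduleName); }
    void enable(String moduleName) { disabled.remove(moduleName); }

    /** Called for each new message; returns the names of the modules that ran. */
    List<String> onNewMessage(String messageId) {
        List<String> invoked = new ArrayList<>();
        for (CheckModule module : moduleArray) {
            if (disabled.contains(module.name())) {
                continue; // left out without stopping the server
            }
            module.run(messageId);
            invoked.add(module.name());
        }
        return invoked;
    }
}
```

Because the Controller only depends on the interface, a module can be taken out of rotation and put back without touching the other modules.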

The DFD for Dynamic and Reachable hosts is given as follows:

Figure 5.2.2

As shown in figure 5.2.2, the dynamic and reachable hosts module queries the FROMS and TRACE tables to get the host names. It then stores the results of its processing back to the database in a result file. The message id is given to it by the Controller, which also calls the module.

The concept behind the module is the same for the standalone approach and the server based approach, so the server based approach can be used in future if the honey-pot attracts spam and gathers enough data for analysis. In this approach, we are not limited by the population size or type, and data can be stored permanently for further analysis. Thus, some of the drawbacks of the standalone program can be overcome by using the server based approach.

DKIM, SPF and Mailing List - Design and Implementation: A Server Based Approach

Design

The DKIM, SPF and Mailing List checks consist of extracting fields from the mail header, so they could be included in the parser. This is done because the parser is used to split the message header into its parts and store them in the database. The module for DKIM, SPF and Mailing List just checks for the DomainKey-Signature, Received-SPF and List-ID fields. The values that are parsed and stored in the database for DKIM, SPF and mailing lists are as follows:
DKIM - The header name is DomainKey-Signature, the corresponding value is the signing domain. Refer to section 3.1.1 for a detailed explanation of the working of DKIM.
SPF - The header name is Received-SPF, the corresponding value is the result, such as pass, fail, softfail, neutral or error. The domain for which the SPF result was obtained is also stored.
Mailing List - The header name is List-ID, the corresponding value is the domain name of the mailing list to which the user has subscribed.

Implementation

The implementation is based on parsing the appropriate header fields. This is done using the getHeader(header_name) method, which returns the entire value of the header; to parse this value we define methods like parseDkim(), parseSpf() and parseList(). The parsing is performed using string manipulation and regular expressions. After the entire message has been parsed, the values are stored in local variables of the parser class. The database connectivity class retrieves these values using getter methods and stores them in the database. The parser roughly implements a bean structure for passing data to and from other classes, as it uses getter and setter methods for all the fields and values parsed.
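
As a rough illustration of this kind of regex-based parsing, the sketch below extracts the three values from raw header values. The method names parseDkim(), parseSpf() and parseList() come from the text, but their signatures, the HeaderParsers class and the patterns are assumptions: the sketch assumes that the signing domain appears as a d= tag in DomainKey-Signature, that the first word of Received-SPF is the result, and that the list identifier in List-ID is enclosed in angle brackets.

```java
import java.util.Locale;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helpers that pull one value out of a raw header value. */
final class HeaderParsers {
    // d=<domain> tag of a DomainKey-Signature value (tags are ';'-separated).
    private static final Pattern DKIM_DOMAIN =
            Pattern.compile("(?:^|;)\\s*d\\s*=\\s*([^;\\s]+)");
    // The first word of a Received-SPF value is taken as the result.
    private static final Pattern SPF_RESULT = Pattern.compile("^\\s*([A-Za-z]+)");
    // The list identifier is assumed to sit between angle brackets.
    private static final Pattern LIST_ID = Pattern.compile("<([^<>]+)>");

    private HeaderParsers() { }

    /** Signing domain of a DomainKey-Signature value, if a d= tag is present. */
    static Optional<String> parseDkim(String value) {
        return first(DKIM_DOMAIN, value);
    }

    /** Lower-cased SPF result such as pass, fail, softfail or neutral. */
    static Optional<String> parseSpf(String value) {
        return first(SPF_RESULT, value).map(s -> s.toLowerCase(Locale.ROOT));
    }

    /** Mailing list identifier of a List-ID value. */
    static Optional<String> parseList(String value) {
        return first(LIST_ID, value);
    }

    // Returns capture group 1 of the first match, or empty when nothing matches.
    private static Optional<String> first(Pattern pattern, String value) {
        if (value == null) {
            return Optional.empty();
        }
        Matcher m = pattern.matcher(value);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }
}
```

Returning Optional makes the "header absent or malformed" case explicit, which matters here because the checks are about presence or absence of these headers.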

The DKIM header is added to the mail by the sender's mail server, but the SPF check is done at the receiving server. This was implemented at the SARP mail server by using policyd-spf, a tool that performs SPF checking and adds the Received-SPF header, containing the result and the domain name, to the email. The List-ID header is present only if the mail was sent by a mailing list.

The concept behind the module is the same for the standalone approach and the server based approach, so the server based approach can be used in future if the honey-pot attracts spam and gathers enough data for analysis. In this approach, we are not limited by the population size or type, and data can be stored permanently for further analysis. Thus, some of the drawbacks of the standalone program can be overcome by using the server based approach.



6. Results
    

Sample Data Set

The sample data set consists of 22 mailboxes on which the tests were performed. The details of these mailboxes are given in the following table:

Mailbox #   Mailbox Name              Ham Mails   Spam Mails   Total Mails
1           aditi_columbia            1818        0            1818
2           aditi_gmail               497         96           593
3           Deepti_columbia           1174        0            1174
4           Deepti_gmail              576         65           641
5           dhrumin_gmail             5002        103          5105
6           pinank_gmail              1418        264          1682
7           Preetinarayan_columbia    1230        0            1230
8           Preetinarayan_gmail       1788        204          1992
9           sneha_gmail               133         227          360
10          spinank_gmail             524         355          879
11          vasa_columbia             168         0            168
12          dms2169_columbia          1301        21           1322
13          nirav_gmail               1360        48           1408
14          nns_2108                  934         0            934
15          manish_gmail              414         45           459
16          pragni_gmail              1999        184          2183
17          preetimalik_columbia      527         0            527
18          preetimalik_gmail         380         0            380
19          sak2144                   749         0            749
20          shradha_columbia          140         0            140
21          shradha_gmail             1151        371          1522
22          vasa_gmail                2367        890          3257

Table 6.0

6.1 Dynamic and Reachable Hosts

The check for dynamic hosts and reachable hosts is performed for both ham and spam mails of a sample mailbox. There are cases when no spam messages exist for the mailbox and in this case the check is performed for ham mails only.
The results are represented using a scatter graph. The x axis represents the mailbox number and the y axis represents the percentage of mails that pass the check among the ham and spam mails of each sample mailbox.

Dynamic Hosts Check

The table and description of each table column is shown below.

  • Column 1 - corresponds to the mailbox number
  • Column 2 - gives the percentage of mails among the ham mails for each mailbox that were sent from a dynamic host
  • Column 3 - gives the percentage of mails among the spam mails for each mailbox that were sent from a dynamic host. If the column value is No Spam, it means that there were no spam mails.

Mailbox #   % Dynamic Hosts for Ham   % Dynamic Hosts for Spam
1           28                        No Spam
2           1                         10
3           24                        No Spam
4           1                         20
5           2                         16
6           1                         10
7           32                        No Spam
8           4                         7
9           0                         15
10          0                         0
11          11                        No Spam
12          36                        0
13          12                        0
14          37                        No Spam
15          0                         9
16          1                         1
17          19                        No Spam
18          2                         No Spam
19          25                        No Spam
20          5                         No Spam
21          1                         6
22          0                         7

Table 6.1

The scatter graph representing the above data is given below:

dynamic host graph

Figure 6.1

The graph in Figure 6.1 indicates that for most of the mailboxes, the percentage of mails sent from dynamic hosts is higher for spam than for ham. With the exception of mailboxes 12 and 13 (and of mailboxes 10 and 16, where the two values in Table 6.1 are equal), the red dots are above the blue dots. This tells us that spam mails are more likely than ham mails to be sent from dynamic hosts. The percentage of dynamic hosts present also determines whether this check can be used as an effective filter. All the figures for both ham and spam are below 40%, implying that not many mails are sent from dynamic senders. Thus, the dynamic host check can be used as a filter, but with a lower weight.

Reachable Hosts Check

The host names in both the "From" and "By" fields of the email header were checked for reachability. The statistics for the two fields were gathered separately.

The table for reachable hosts in "From" field and its column description is given below:
Column 1 - corresponds to the mailbox number
Column 2 - gives the percentage of mails among the ham mails for each mailbox that responded to a ping request sent to the host of their "From" field and were reachable.
Column 3 - gives the percentage of mails among the spam mails for each mailbox that responded to a ping request sent to the host of their "From" field and were reachable. If the column value is No Spam, it means that there were no spam mails.

Mailbox #   % Reachable Hosts in "From" field - Ham   % Reachable Hosts in "From" field - Spam
1           28                                        No Spam
2           21                                        18
3           30                                        No Spam
4           12                                        29
5           27                                        26
6           33                                        20
7           22                                        No Spam
8           22                                        15
9           7                                         22
10          30                                        31
11          18                                        No Spam
12          39                                        81
13          12                                        2
14          34                                        No Spam
15          7                                         18
16          19                                        6
17          0                                         No Spam
18          5                                         No Spam
19          24                                        No Spam
20          6                                         No Spam
21          6                                         14
22          30                                        25

Table 6.2

The scatter graph representing the above data is given below:

reachable hosts from field

Figure 6.2

The graph in Figure 6.2 shows that the blue and red dots are randomly scattered and do not show any consistent pattern. For some mailboxes the red dots lie above the blue dots, indicating that a higher percentage of hosts could be reached by pinging for spam mails than for ham mails. There can be many reasons for this behavior. The "From" field of the email header indicates the host machine of the sender, which may be a laptop or desktop and is not always reachable. Also, the machine may be situated in a secure network and cannot be pinged. Thus, the host names in the "From" field should not be used as a classifier for spam filtering, as they lead to inconsistent results.

The table for reachable hosts in "By" field and its column description is given below:
Column 1 - corresponds to the mailbox number
Column 2 - gives the percentage of mails among the ham mails for each mailbox that responded to a ping request sent to the host of their "By" field and were reachable.
Column 3 - gives the percentage of mails among the spam mails for each mailbox that responded to a ping request sent to the host of their "By" field and were reachable. If the column value is No Spam, it means that there were no spam mails.

Mailbox #   % Reachable Hosts in "By" field - Ham   % Reachable Hosts in "By" field - Spam
1           79                                      No Spam
2           42                                      11
3           79                                      No Spam
4           48                                      20
5           62                                      10
6           52                                      24
7           85                                      No Spam
8           53                                      33
9           30                                      25
10          90                                      30
11          79                                      No Spam
12          24                                      5
13          48                                      79
14          78                                      No Spam
15          21                                      9
16          64                                      32
17          0                                       No Spam
18          32                                      No Spam
19          68                                      No Spam
20          46                                      No Spam
21          52                                      22
22          75                                      28

Table 6.3

The scatter graph representing the above data is given below:

reachable hosts by

Figure 6.3

The graph in Figure 6.3 shows that the hosts in the "By" field are reachable far more often for ham mails than for spam mails. The blue dots, representing the reachable hosts for ham mails, are concentrated in the upper portion of the graph, and there is a distinct gap between the scatter of ham and spam mails that were found to be reachable. With the exception of mailbox 13, where the percentage for spam mails is greater than that for ham mails, the mailboxes display consistent results. Sometimes a mail server that is reachable does not respond to ping requests in order to protect itself from ping attacks, and this could be the reason why fewer ham mail hosts pass the reachable hosts check in mailbox 13. As the test displays consistent results with just one exception, it can be used as a strong filter to determine spam mails.
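
The "one exception" reading of Table 6.3 can be checked mechanically. The sketch below is only an illustration: the class and method names are hypothetical, the two arrays are the ham and spam columns of Table 6.3 for mailboxes 1 to 22, and -1 is an assumed marker for "No Spam".

```java
import java.util.ArrayList;
import java.util.List;

final class ReachableByComparison {
    static final int NO_SPAM = -1; // assumed marker for mailboxes without spam mails

    // Table 6.3, mailboxes 1..22: % reachable "By" hosts for ham and for spam.
    static final int[] HAM = {79, 42, 79, 48, 62, 52, 85, 53, 30, 90, 79,
                              24, 48, 78, 21, 64, 0, 32, 68, 46, 52, 75};
    static final int[] SPAM = {NO_SPAM, 11, NO_SPAM, 20, 10, 24, NO_SPAM, 33, 25, 30, NO_SPAM,
                               5, 79, NO_SPAM, 9, 32, NO_SPAM, NO_SPAM, NO_SPAM, NO_SPAM, 22, 28};

    private ReachableByComparison() { }

    /** Returns the 1-based numbers of mailboxes whose spam percentage is not below the ham percentage. */
    static List<Integer> exceptions(int[] ham, int[] spam) {
        List<Integer> out = new ArrayList<>();
        for (int i = 0; i < ham.length; i++) {
            if (spam[i] != NO_SPAM && spam[i] >= ham[i]) {
                out.add(i + 1);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(exceptions(HAM, SPAM)); // prints [13]
    }
}
```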

6.2 DKIM, Mailing List and SPF Check

DKIM Check

The DKIM test simply checks the ham mails for the presence of the DomainKeys Signature header and the spam mails for its absence. This is because, if the DomainKeys Signature header is present in the mail, it has already been authenticated and there is no need for further checks. Thus, DKIM provides a very strong filter and should be used early on for segregating ham and spam mails. If an email passes the DKIM test, it is very likely ham: in Table 6.4, at most 5% of the spam mails in any mailbox carry the header. Only if the header is not present should the mail be subjected to further tests to classify it. The table for DKIM test data is given below:

Mailbox #   % Ham Mails with DKIM Header   % Spam Mails with DKIM Header
1           17                             No Spam
2           31                             0
3           13                             No Spam
4           23                             0
5           70                             1
6           56                             5
7           11                             No Spam
8           44                             2
9           5                              0
10          12                             0
11          5                              No Spam
12          4                              1
13          33                             0
14          8                              No Spam
15          28                             0
16          27                             0
17          5                              No Spam
18          25                             No Spam
19          28                             No Spam
20          3                              No Spam
21          21                             2
22          78                             4

Table 6.4

The graph of the percentage of ham mails that contain DomainKeys signature and are authenticated is shown below. This graph helps us to determine the popularity of the DKIM method as it depends on the domains that have published their public key.

Ham mails with DKIM signature

Figure 6.4

The graph in Figure 6.4 shows that only 3 mailboxes have more than 50% of ham mails satisfying the DKIM check. On average, about 25% of the ham mails have the DomainKeys header. To further increase the scope of DKIM, more domains need to register and publish their public key.

Mailing List Check

The ham and spam mails for all the mailboxes were checked for the mailing list headers. If the mail has a mailing list header, then it is not spam as the user has subscribed to the mailing list and thus, the mailing list domain is a known domain for the user. This can also be used as a strong filter to categorize the mails into ham and spam.
The table containing mailing list data is given below:

Mailbox #   % Ham Mails with Mailing List Header   % Spam Mails with Mailing List Header
1           36                                     No Spam
2           12                                     0
3           30                                     No Spam
4           0                                      0
5           66                                     0
6           40                                     0
7           38                                     No Spam
8           42                                     0
9           0                                      0
10          2                                      0
11          8                                      No Spam
12          36                                     0
13          6                                      0
14          30                                     No Spam
15          0                                      0
16          22                                     0
17          10                                     No Spam
18          16                                     No Spam
19          10                                     No Spam
20          0                                      No Spam
21          0                                      0
22          38                                     0

Table 6.5

Thus, we can see that none of the spam mails came from a mailing list. The table also indicates the percentage of mails sent from a mailing list for an average user, which helps in determining the scope of the mailing list check. On average, 20% of the ham mails are sent from a mailing list. However, the standard deviation is about 19, showing that the percentage of mails sent from mailing lists differs widely from one mailbox to the next and depends on the user. Nonetheless, the mailing list check can be implemented as a good filter to classify mails.
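
The mean and standard deviation quoted above can be reproduced from the ham column of Table 6.5. The sketch below is an illustration with hypothetical names; it assumes the sample standard deviation (dividing by n - 1), which gives roughly 18.7 and is consistent with the "about 19" in the text, while the mean comes out at roughly 20.1.

```java
import java.util.Locale;

final class MailingListStats {
    // % ham mails with a mailing list header, mailboxes 1..22 (Table 6.5).
    static final int[] HAM_PERCENT = {36, 12, 30, 0, 66, 40, 38, 42, 0, 2, 8,
                                      36, 6, 30, 0, 22, 10, 16, 10, 0, 0, 38};

    private MailingListStats() { }

    /** Arithmetic mean of the values. */
    static double mean(int[] xs) {
        double sum = 0;
        for (int x : xs) {
            sum += x;
        }
        return sum / xs.length;
    }

    /** Sample standard deviation (n - 1 in the denominator); needs at least two values. */
    static double sampleStdDev(int[] xs) {
        double m = mean(xs);
        double squares = 0;
        for (int x : xs) {
            squares += (x - m) * (x - m);
        }
        return Math.sqrt(squares / (xs.length - 1));
    }

    public static void main(String[] args) {
        // prints mean=20.1 sd=18.7
        System.out.println(String.format(Locale.ROOT, "mean=%.1f sd=%.1f",
                mean(HAM_PERCENT), sampleStdDev(HAM_PERCENT)));
    }
}
```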

SPF Check

The SPF test was performed on both ham and spam mails, but mails sent from the same domain as the recipient were not included in the test, as it is not possible to determine the sender and receiver hosts after the mail has been received. These mails are known as domain internal mails. Also, the chance of a domain internal mail being spam is minimal.
The table for the SPF results and column description is shown below:
Column 1: Mailbox number
Column 2: The total number of ham mails in each mailbox
Column 3: The number of mails on which the SPF check was performed. (excludes domain internal mails)
Column 4: The number of mails that produced the result "Fail" for the SPF check
Column 5: The number of mails that produced the result "Pass" for the SPF check
Column 6: The number of mails that produced the result "Error" for the SPF check
Column 7: The number of mails that produced the result "Neutral" for the SPF check
Column 8: The number of mails that produced the result "Softfail" for the SPF check
Column 9: The number of domain internal mails on which the SPF check was not performed

SPF Results for Ham mails:
Mailbox #   Total ham mails   Mails for SPF check   Fail   Pass    Error   Neutral   Softfail   Internal
1           1818              420                   0      393     0       25        2          1398
2           497               288                   5      221     0       49        13         209
3           1174              234                   0      194     0       38        2          940
4           576               373                   6      239     0       126       2          203
5           5002              4759                  18     4468    1       189       83         243
6           1418              1314                  3      1219    2       65        25         104
7           1230              192                   0      179     0       13        0          1038
8           1788              1259                  14     1108    0       129       8          529
9           133               51                    0      27      1       22        1          82
10          524               521                   7      426     10      76        2          3
11          168               48                    0      39      0       8         1          120
12          1301              237                   0      147     0       69        21         1064
13          1360              1154                  13     865     0       258       18         206
14          934               269                   0      208     0       57        4          665
15          414               290                   80     240     0       31        1          124
16          1999              1527                  86     1282    2       153       4          472
17          527               113                   0      98      0       13        2          414
18          380               173                   8      147     0       18        0          207
19          749               349                   2      239     0       104       4          400
20          140               33                    0      18      0       14        1          107
21          1151              649                   12     296     2       322       17         502
22          2367              2282                  56     1129    4       1093      0          85
TOTAL ->    25650             16535                 310    13182   22      2872      211        9115

Table 6.6

The pie chart is used to represent the fraction of mails that are fail, pass, error, neutral and softfail. The percentages that are found using the column totals are

  • Fail - 2%
  • Pass - 80%
  • Error - 0%
  • Neutral - 17%
  • Softfail - 1%

spf result fractions for ham mails

Figure 6.5

Thus, from Figure 6.5 we can see that very few ham mails fail the SPF test. The results for SPF checking are consistent and SPF check can be used as a good classifier for ham mails.

The table for SPF results on spam mails is given below:
Mailbox #   Total spam mails   Mails for SPF check   Fail   Pass   Error   Neutral   Softfail   Internal
1           0                  0                     0      0      0       0         0          0
2           96                 96                    7      0      8       81        0          0
3           0                  0                     0      0      0       0         0          0
4           65                 65                    14     0      1       48        2          0
5           103                103                   5      7      2       86        3          0
6           264                260                   46     9      11      183       11         4
7           0                  0                     0      0      0       0         0          0
8           204                204                   22     8      5       164       5          0
9           227                227                   29     1      11      172       14         0
10          355                355                   0      312    4       39        0          0
11          0                  0                     0      0      0       0         0          0
12          21                 21                    0      1      0       0         20         0
13          48                 48                    10     0      2       34        2          0
14          0                  0                     0      0      0       0         0          0
15          45                 45                    10     0      0       33        2          0
16          184                184                   15     45     1       117       6          0
17          0                  0                     0      0      0       0         0          0
18          0                  0                     0      0      0       0         0          0
19          0                  0                     0      0      0       0         0          0
20          0                  0                     0      0      0       0         0          0
21          371                369                   29     69     5       255       11         2
22          890                890                   43     491    31      285       40         0
TOTAL ->    2873               2867                  230    943    81      1497      116        6

Table 6.7

The pie chart is used to represent the fraction of mails that are fail, pass, error, neutral and softfail. The percentages that are found using the column totals are

  • Fail - 8%
  • Pass - 33%
  • Error - 3%
  • Neutral - 52%
  • Softfail - 4%

spf result fractions for spam mails

Figure 6.6

The chart indicates that SPF check is not sufficient to categorize spam mails. Only 8% of the mails fail the SPF check, while 33% and 52% of the mails give pass and neutral as their result. Thus, we can use SPF check as a filter with medium weight. It should be used in conjunction with other filters for better performance.
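
The pie chart percentages follow from the TOTAL rows of Tables 6.6 and 6.7 when each result count is divided by the number of mails on which the SPF check was performed. The sketch below illustrates that arithmetic; the class and method names are hypothetical, and rounding to the nearest whole percent is an assumption that happens to reproduce the listed values.

```java
import java.util.Arrays;

final class SpfFractions {
    private SpfFractions() { }

    /** Rounded percentage of each result count relative to the number of mails checked. */
    static long[] percentages(long checked, long... counts) {
        if (checked <= 0) {
            throw new IllegalArgumentException("no mails were checked");
        }
        long[] out = new long[counts.length];
        for (int i = 0; i < counts.length; i++) {
            out[i] = Math.round(100.0 * counts[i] / checked);
        }
        return out;
    }

    public static void main(String[] args) {
        // Order of counts: Fail, Pass, Error, Neutral, Softfail (TOTAL rows).
        long[] ham = percentages(16535, 310, 13182, 22, 2872, 211);
        long[] spam = percentages(2867, 230, 943, 81, 1497, 116);
        System.out.println(Arrays.toString(ham));  // prints [2, 80, 0, 17, 1]
        System.out.println(Arrays.toString(spam)); // prints [8, 33, 3, 52, 4]
    }
}
```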

Result Analysis Summary

  • Dynamic Hosts Check: The dynamic hosts check can be used as a weak filter.
  • Reachable Hosts Check for the "From" field: This cannot be reliably used as a filter due to inconsistent results.
  • Reachable Hosts Check for the "By" field: This can be used as a good filter.
  • DKIM Header Check: This is a strong filter with limited scope.
  • Mailing List Header Check: This is a strong filter subjective to the user.
  • SPF Check: This can be used as a boost for ham mails and as a medium weight filter in conjunction with other filters.


7. Problems Faced and their Solutions    

One of the major problems faced with the server based approach was that it did not attract any spam mails, and hence there was no sample data to work with and gather statistics from. Thus, the entire approach shifted to a standalone module where, instead of using a new server or honeypot to attract mails, already existing IMAP or POP enabled mail accounts on servers like Columbia or Gmail were used.

For a standalone program, IMAP based or POP based mail servers that are commonly used by people were needed. The limitation was in our ability to find a lot of spam mails, as Columbia has an effective spam filter in place. Also, the population for which the data was gathered did not vary and represented a certain set of people, namely university students. The solution is straightforward: distribute the standalone program to a larger and more diverse set of people.

Another major problem faced during the final stages of implementation was with MailStats.java. It was throwing MailBoxClosed exception for mailboxes containing more than 1000 mails. The solution was to open the mailbox only when the messages were to be read and not before that. After this bug was fixed, data could be gathered easily, since the mailbox size was not a limitation.

There were some problems with threads being used in the main module, due to which it seemed that the individual modules were not working, but later on this bug was fixed and threads could be used in parallel to update the progress bar as well as in individual module implementation.

In the standalone program the biggest concern was the amount of time the ping module took to complete, because each message needed to be parsed and the host name extracted from it was then pinged. This took about 200 ms – 1 s for each message. To overcome this problem, a thread pool was used.
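
A thread pool of this kind might be sketched as follows. This is not the project's ping module: PingPool, checkAll() and the pluggable probe parameter are hypothetical, the pool size and the one second timeout are arbitrary, and the default probe relies on InetAddress.isReachable(), which may use a TCP echo probe instead of ICMP when the process lacks the privileges for ICMP.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Predicate;

final class PingPool {
    private PingPool() { }

    /** Default probe: true if the host answers within one second. */
    static boolean ping(String host) {
        try {
            return InetAddress.getByName(host).isReachable(1000);
        } catch (IOException e) {
            return false; // unknown host or network error counts as unreachable
        }
    }

    /** Probes every distinct host in parallel and returns host -> reachable. */
    static Map<String, Boolean> checkAll(List<String> hosts, Predicate<String> probe, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            Map<String, Future<Boolean>> pending = new LinkedHashMap<>();
            for (String host : hosts) {
                if (!pending.containsKey(host)) { // probe each host only once
                    Callable<Boolean> task = () -> probe.test(host);
                    pending.put(host, pool.submit(task));
                }
            }
            Map<String, Boolean> results = new LinkedHashMap<>();
            for (Map.Entry<String, Future<Boolean>> entry : pending.entrySet()) {
                results.put(entry.getKey(), entry.getValue().get());
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for probes", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("probe failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

A call such as PingPool.checkAll(hosts, PingPool::ping, 10) would then probe up to ten hosts at a time, so the per-message wait overlaps instead of adding up.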

Lastly, the user needed to be given constant feedback and for this a progress bar was implemented that showed the progress of the entire program as well as individual modules.


8. Appendix

The link to the source code is SpamTestLatest.zip. The files for the first module consist of the code. There are 3 files namely, DhcpDslPing.java, DkimSpfMailList.java and PingLookup.java.

The link to the result set page where all the cumulative results can be accessed is http://wiki.cs.columbia.edu:8080/display/spam/Resultset


9. Tools used

The tools used were as follows:

  • Navicat MySQL for creating the database

  • Netbeans IDE for writing code for the modules

  • OpenOffice for writing the report

  • Visual Paradigm for UML diagrams

  • OpenOffice excel for drawing graphs and representing data

  • Concept Draw Pro for DFD

  • Convert Excel Spreadsheet to HTML


10. References

RFC 2822 [http://tools.ietf.org/html/rfc2822]

RFC 2821 [http://tools.ietf.org/html/rfc2821]

JavaMail API [http://java.sun.com/products/javamail/]

Spam Analysis and Reputation Project [http://wiki.cs.columbia.edu:8080/display/spam/Home]

SARP Modules [http://wiki.cs.columbia.edu:8080/display/spam/IMAP+analyzer+modules]

Adrian Frei - Spam Analysis and Reputation Project: DNS Blacklists

Preethi Narayan - Spam Analysis and Reputation Project : Received Header Vs Sent and From Header

Tejas Nadkarni – Parser and Standalone Framework

Aditi Rajoriya - Spam Analysis and Reputation Project: IMAP Retrieval and To/Body Module.

Dhrumin Shah - Spam Analysis and Reputation Project: Domain Check and Image Analysis Modules.

Nirav Shah - Spam Analysis and Reputation Project: Email Source, Date/Time and Attachment Analysis

Wikipedia – SPF and DKIM

Professor Henning G. Schulzrinne – Project Advisor and Mentor
