TABLE 1 - MESSAGE:
create table message (
message-id varchar2(50) NOT NULL,
date datetime NOT NULL,
sender varchar2(50),
return-path varchar2(50),
list_subscribe tinyint(1),
subject longtext,
body blob,
PRIMARY KEY (msgcount)
);
Explanation of the fields:
message-id - The unique id assigned to every message. It consists of letters and numbers and is stored as a string.
date - Gives the date and time, stored as yyyy-mm-dd hh:mm:ss.
sender - This field is sometimes present even when there are not multiple "from" fields,
for example when a Gmail account is used to send a message to Columbia mail.
return-path - Represents the MAIL FROM identity of the sender. For ham mails it is the same as the from/sender field.
list_subscribe - Indicates whether a mailing list is present. It is stored for our convenience, so that mailing list mails can be identified quickly.
subject - The subject from the mail header. It is stored as longtext, which is similar to a CLOB.
body - The entire body of the message is stored as a blob, since it may contain characters that would otherwise need to be escaped.

TABLE 2 - IN-REPLY-TO:
create table in-reply-to (
CONSTRAINT in-reply-to_fk FOREIGN KEY(parent-msg-id)
REFERENCES message(message-id),
CONSTRAINT message_count_fk FOREIGN KEY(msgcount_id)
REFERENCES message(message-id),
PRIMARY KEY (parent-msg-id)
);
Explanation of fields:
in-reply-to - This field identifies the message (or messages) to which the new message is a reply. As it can have more than one value, it is stored in a separate table, giving a one-to-many relationship with the message-id of the given message.
Also, since in-reply-to denotes the message-id of earlier messages, that message-id should already be present in the database. Thus, parent-msg-id is a foreign key from the message table.
parent-msg-id - Gives the message-id of the parents, i.e. the earlier messages in the thread of the current message.
message-id - Gives the current message and is a foreign key representing the one-to-many relationship.

TABLE 3 - REFERENCES:
create table references (
CONSTRAINT references_fk FOREIGN KEY(thread-msg-id)
REFERENCES message(message-id),
CONSTRAINT message_count_fk FOREIGN KEY(msgcount_id)
REFERENCES message(message-id),
PRIMARY KEY (thread-msg-id)
);
Explanation of fields:
thread-msg-id - Contains all the message ids belonging to a particular thread. Similar to parent-msg-id of previous table.
message-id - links the thread to a particular message.

TABLE 4 - FROM
create table from (
from_display_name varchar2(50),
from_addr_spec varchar2(50) NOT NULL,
CONSTRAINT message_count_from_fk FOREIGN KEY(msgcount_id)
REFERENCES message(message-id),
PRIMARY KEY(msgcount,from_addr_spec)
);
Explanation of fields:
from_display_name : Gives the optional display name, e.g. in "Swati" <sak2144@columbia.edu>, we store Swati.
from_addr_spec : Gives the second part, i.e. sak2144@columbia.edu.
The display name is optional but the addr-spec is not. So, the primary key is the combination of the message-id, which associates the entries with a particular message, and the addr-spec. Thus, the primary key is a combination of the foreign key and the unique identifier.
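The split into display name and addr-spec described above can be sketched as follows. This is only an illustration, not the project's parser: the class name MailboxSplitter and the regular expression are assumptions, and it handles just the simple "name" <addr> and bare-address forms.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not from the project source): splits a mailbox such as
// "Swati" <sak2144@columbia.edu> into its optional display name and addr-spec.
final class MailboxSplitter {
    // Group 1: optional display name (quotes optional); group 2: addr-spec inside angle brackets.
    private static final Pattern NAME_ADDR =
            Pattern.compile("^\\s*\"?([^\"<]*?)\"?\\s*<\\s*([^<>\\s]+)\\s*>\\s*$");

    /** Returns {displayName, addrSpec}; displayName is null when it is absent. */
    static String[] split(String mailbox) {
        Matcher m = NAME_ADDR.matcher(mailbox);
        if (m.matches()) {
            String name = m.group(1).trim();
            return new String[] { name.isEmpty() ? null : name, m.group(2) };
        }
        // No angle brackets: treat the whole value as a bare addr-spec.
        return new String[] { null, mailbox.trim() };
    }
}
```

A null display name maps naturally onto the nullable from_display_name column, while the addr-spec is always present. In code that already depends on JavaMail, the InternetAddress class (getPersonal() and getAddress()) is likely a more robust choice than a hand-written pattern.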

TABLE 5 - TO
create table to (
to_display_name varchar2(50),
to_addr_spec varchar2(50) NOT NULL,
CONSTRAINT message_count_from_fk FOREIGN KEY(msgcount_id)
REFERENCES message(message-id),
PRIMARY KEY(msgcount,to_addr_spec)
);
It represents the "to" field of the message header.
Explanation of fields:
to_display_name : Gives the optional display name, e.g. in "Swati" <sak2144@columbia.edu>, we store Swati.
to_addr_spec : Gives the second part, i.e. sak2144@columbia.edu.
The display name is optional but the addr-spec is not. So, the primary key is the combination of the message-id, which associates the entries with a particular message, and the addr-spec.

TABLE 6 - REPLY-TO
create table reply-to (
reply-to-name varchar2(50),
reply-to_addr_spec varchar2(50) NOT NULL,
CONSTRAINT message_count_from_fk FOREIGN KEY(msgcount_id)
REFERENCES message(message-id),
PRIMARY KEY(msgcount,reply-to_addr_spec)
);
Explanation of fields:
The name and addr-spec fields are the same as for the to and from tables. The reply-to table stores the address of the mailbox(es) to which replies are to be sent. If it is not present, all replies are sent to the mailbox in the "from" field.

TABLE 7 - HTTP_LINKS
create table http_links (
url varchar2(50),
link_id INT NOT NULL AUTO_INCREMENT,
CONSTRAINT message_count_from_fk FOREIGN KEY(msgcount_id)
REFERENCES message(message-id),
PRIMARY KEY(message-id,link_id)
);
Explanation of fields:
These fields are used for storing urls found in the message body. The link_id is just to uniquely determine the link.
msgcount associates the url with the message.
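As a rough sketch of how the urls for this table could be collected from a message body, the following uses a deliberately simple regular expression. The class name LinkExtractor and the pattern are assumptions made for illustration; the project's own extraction logic is not shown in this report.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: collects http/https urls from a message body in order of appearance.
final class LinkExtractor {
    // A url runs until whitespace, a quote or an angle bracket (a simplification).
    private static final Pattern URL = Pattern.compile("https?://[^\\s\"'<>]+");

    static List<String> extract(String body) {
        List<String> urls = new ArrayList<>();
        Matcher m = URL.matcher(body);
        while (m.find()) {
            urls.add(m.group());
        }
        return urls;
    }
}
```

Each returned url would become one row, with link_id distinguishing several links of the same message. Note that the url column is declared with a length of 50, so longer urls would need a wider column or truncation.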

TABLE 8 - TRACE
create table trace (
tracecount INT NOT NULL,
received_from_host varchar2(50),
received_from_addr varchar2(50),
received_by_host varchar2(50),
received_by_addr varchar2(50),
via varchar2(10),
with varchar2(10),
id varchar2(50),
for_display_name varchar2(50),
for_addr_spec varchar2(50),
CONSTRAINT message_count_from_fk FOREIGN KEY(msgcount_id)
REFERENCES message(message-id),
PRIMARY KEY(message-id,tracecount)
);
Explanation of fields:
received_from_host - Gives the hostname present in the "from" sub-field of the received fields.
received_from_addr - Specifies the IP address
Similarly received_by_host and received_by_addr represent the hostname and IP address in the "by" sub-field.
via - gives the protocol, e.g. TCP
with - gives additional details, e.g. with ESMTP
id - each trace may or may not contain a (unique) id
for - name and addr-spec, e.g. for "Swati"
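To make the mapping from a Received header to the trace columns concrete, here is a minimal sketch. The class name ReceivedFieldParser and all patterns are assumptions; real Received headers vary a lot between mail servers (for example, the word "by" may occur inside a comment), so this only covers the common "from host (name [ip]) by host with proto id x for <addr>" layout.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: maps one Received header value onto the trace column names.
final class ReceivedFieldParser {
    private static final Pattern FROM_HOST = Pattern.compile("^from\\s+([^;\\s]+)");
    // IP address in square brackets inside the comment that follows the "from" host.
    private static final Pattern FROM_ADDR =
            Pattern.compile("^from\\s+\\S+\\s+\\([^)]*?\\[([0-9A-Fa-f:.]+)\\]");
    private static final Pattern BY_HOST = Pattern.compile("\\sby\\s+([^;\\s]+)");
    private static final Pattern VIA = Pattern.compile("\\svia\\s+([^;\\s]+)");
    private static final Pattern WITH = Pattern.compile("\\swith\\s+([^;\\s]+)");
    private static final Pattern ID = Pattern.compile("\\sid\\s+([^;\\s]+)");
    private static final Pattern FOR = Pattern.compile("\\sfor\\s+<?([^<>;\\s]+)>?");

    /** Returns column name -> value; a value is null when the sub-field is missing. */
    static Map<String, String> parse(String headerValue) {
        // Unfold the header: collapse line breaks and runs of whitespace.
        String v = headerValue.replaceAll("\\s+", " ").trim();
        Map<String, String> row = new LinkedHashMap<>();
        row.put("received_from_host", first(FROM_HOST, v));
        row.put("received_from_addr", first(FROM_ADDR, v));
        row.put("received_by_host", first(BY_HOST, v));
        row.put("via", first(VIA, v));
        row.put("with", first(WITH, v));
        row.put("id", first(ID, v));
        row.put("for_addr_spec", first(FOR, v));
        return row;
    }

    private static String first(Pattern p, String s) {
        Matcher m = p.matcher(s);
        return m.find() ? m.group(1) : null;
    }
}
```

The keys of the returned map follow the column names of the trace table, so a missing sub-field simply becomes a NULL in the corresponding column.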

TABLE 9 - MAILBOX
create table mailbox (
mailbox_display_name varchar2(50),
mailbox_addr_spec varchar2(50),
CONSTRAINT message_count_from_fk FOREIGN KEY(msgcount_id)
REFERENCES message(msgcount),
PRIMARY KEY(mailbox_addr_spec,message-id)
);
Explanation of fields:
Each message is linked to the person/mailbox for which it is meant. This is a convenience table meant for fast sorting of messages for a particular mailbox, e.g. sak2144@columbia.edu received messages with msgcount 1, 2, 4, 10, 14.
The primary key is a combination of mailbox_addr_spec and message-id.
Using this table, we can get all the other details of the messages for a particular mailbox.

The database was populated using a database connectivity class that uses a JDBC connection and is called by the parser every time a new message arrives. The implementation of the database, the database connectivity class and the parser is detailed below.

The data flow diagram (DFD) in Figure 5.1 shows the parser and the DB Connectivity class. Insert_DB is a function inside the DB Connectivity class that calls the appropriate function for inserting the values into the database. The flow of information between modules is as follows:
The parser parses the message and calls the DB Connectivity class. This class uses the getter methods for all the values parsed. For example, the database connectivity class obtains the From field value by using the getFrom() method of the parser. It then calls the insert_from() method to insert the value into the FROMS table of the database. Also, each message has a unique identifier, the Message-ID, through which all the tables and fields of the database can be accessed. The parser passes it to the controller class, which uses it while calling the other modules.
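A minimal sketch of what an insert_from() style method might look like with plain JDBC is given below. It is not the project's code: the class name FromsDao, the lowercase table name froms and the key column name message_id are assumptions, and obtaining the Connection (driver, URL, credentials) is left to the caller.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.sql.Types;

// Hypothetical data access helper for the FROMS table.
final class FromsDao {
    // Table and column names are assumed; adjust them to the actual schema.
    static final String INSERT_FROM =
            "INSERT INTO froms (message_id, from_display_name, from_addr_spec) VALUES (?, ?, ?)";

    /** Inserts one "from" mailbox of a message and returns the number of rows written. */
    static int insertFrom(Connection conn, String messageId, String displayName, String addrSpec)
            throws SQLException {
        try (PreparedStatement ps = conn.prepareStatement(INSERT_FROM)) {
            ps.setString(1, messageId);
            if (displayName == null) {
                ps.setNull(2, Types.VARCHAR); // the display name is optional
            } else {
                ps.setString(2, displayName);
            }
            ps.setString(3, addrSpec);
            return ps.executeUpdate();
        }
    }
}
```

Using a PreparedStatement with placeholders keeps header values, which come from untrusted mail, out of the SQL text itself.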

Dynamic and Reachable Hosts - Design and Implementation: A Server Based Approach

Design

Figure 5.2.1

As shown in Figure 5.2.1, the parser populates the database and calls the Controller class. The Controller class calls the module, which uses the populated data in the database to perform the check. The module talks to a database connectivity class, as database querying is involved.
The design is kept as modular as possible: querying the database is separated from the business logic of performing the check and finding the results. The results are found for each mail, whose message-id, a unique identifier of the mail, is passed to the module by the controller.
The advantage of this approach is that a change in the database schema, or in the database itself, does not affect the actual implementation of the module.
The Controller is used to invoke the modules because the mail server is always up and running. If any of the modules has to be modified, the Controller can be configured not to invoke that module until the change has been completed. Thus, the server need not be stopped and restarted to recompile the code; the module can be recompiled remotely and added back to the list of modules called by the Controller.

Implementation

The Controller class keeps the list of modules to be invoked in an array called moduleArray. It reads the array and invokes each module by calling its run() method.
The run() method of the DhcpDslPing class holds an object of the database connectivity class. When its getBy() and getFrom() methods are called, the database is queried and the "by" and "from" hosts of the trace field for that message are returned. These are used by the parseReceivedforPing() and parseReceivedforDhcp() methods in the same way as described in section 4.1 for the standalone program. getResult() returns the result of the check, which is then stored in a result file.
The JdbcConnection class is the database connectivity and querying class. It has methods such as getBy() and getFrom() that serve as the interface for retrieving the "by" and "from" dynamic hosts.
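The Controller pattern described above can be sketched roughly as follows. Only moduleArray and run() are names taken from the text; the SpamModule interface, the messageId parameter of run(), and the disable()/enable() methods are assumptions added to illustrate how a module could be skipped while it is being changed.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical module contract: each check is run for one message-id.
interface SpamModule {
    void run(String messageId);
}

// Sketch of a controller that invokes every enabled module in order.
final class Controller {
    private final SpamModule[] moduleArray;
    private final Set<SpamModule> disabled = new HashSet<>();

    Controller(SpamModule... modules) {
        this.moduleArray = modules.clone();
    }

    // A module under modification can be switched off without stopping the server.
    void disable(SpamModule module) { disabled.add(module); }

    void enable(SpamModule module) { disabled.remove(module); }

    /** Calls run() on each enabled module, passing the message-id of the new mail. */
    void process(String messageId) {
        for (SpamModule module : moduleArray) {
            if (!disabled.contains(module)) {
                module.run(messageId);
            }
        }
    }
}
```

For example, new Controller(pingModule, dkimModule).process(id) would run both checks for one message, and disable(pingModule) would leave only the second one active.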

The DFD for Dynamic and Reachable hosts is given as follows:

Figure 5.2.2

As shown in figure 5.2.2, the dynamic and reachable hosts module queries the FROMS and TRACE tables to get the host names. It then stores the results of its processing back to the database in a result file. The message id is given to it by the Controller, which also calls the module.

The concept behind the module is the same for the standalone and the server based approach, so the server based approach can be used in future if the honeypot attracts spam and gathers enough data for analysis. In this approach, we are not limited by the population size or type, and data can be stored permanently for further analysis. Thus, some of the drawbacks of the standalone program can be overcome by using the server based approach.

DKIM, SPF and Mailing List - Design and Implementation: A Server Based Approach

Design

The DKIM, SPF and Mailing List checks consist of extracting fields from the mail header, so they could be included in the parser: the parser already splits the message header into its parts and stores them in the database. The module for DKIM, SPF and Mailing List just checks for the DomainKey-Signature, Received-SPF and List-ID fields. The values that are parsed and stored in the database for DKIM, SPF and mailing lists are as follows:
DKIM - The header name is DomainKey-Signature; the corresponding value is the signing domain. Refer to section 5 for a detailed explanation of the working of DKIM.
SPF - The header name is Received-SPF; the corresponding value is the result, i.e. pass, fail, softfail, neutral or error. The domain for which the SPF result was obtained is also stored.
Mailing List - The header name is List-ID; the corresponding value is the domain name of the mailing list to which the user has subscribed.

Implementation

The implementation is based on parsing the appropriate header fields, which are read using the getHeader(header_name) method. This method returns the entire value of the header; to parse that value we define methods such as parseDkim(), parseSpf() and parseList(). The parsing is performed using string manipulation and regular expressions. After the entire message has been parsed, the values are held in local variables of the parser class. The database connectivity class retrieves these values using getter methods and stores them in the database. The parser roughly implements a bean structure for passing data to and from other classes, as it uses getter and setter methods for all the fields and values parsed.
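The kind of string manipulation involved can be illustrated with the sketch below. The method names parseDkim(), parseSpf() and parseList() come from the text, but their bodies, the class name HeaderValueParser and the patterns are assumptions about what such methods might do; each takes the header value as returned by getHeader().

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical implementations of the three header value parsers.
final class HeaderValueParser {
    // The signing domain is carried in the d= tag of the signature header.
    private static final Pattern D_TAG = Pattern.compile("(?:^|;)\\s*d=([^;\\s]+)");
    // The list identifier is enclosed in angle brackets.
    private static final Pattern LIST_ID = Pattern.compile("<([^<>]+)>");

    /** Returns the signing domain, or null when no d= tag is found. */
    static String parseDkim(String headerValue) {
        Matcher m = D_TAG.matcher(headerValue);
        return m.find() ? m.group(1) : null;
    }

    /** Returns the leading result keyword of a Received-SPF value in lower case. */
    static String parseSpf(String headerValue) {
        String v = headerValue.trim();
        int end = 0;
        while (end < v.length() && Character.isLetter(v.charAt(end))) {
            end++;
        }
        return v.substring(0, end).toLowerCase(Locale.ROOT);
    }

    /** Returns the list identifier inside angle brackets, or the trimmed value if there are none. */
    static String parseList(String headerValue) {
        Matcher m = LIST_ID.matcher(headerValue);
        return m.find() ? m.group(1).trim() : headerValue.trim();
    }
}
```

The returned strings correspond to the values listed in the design section: the signing domain, the SPF result keyword and the mailing list domain name.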

The DKIM header is added to the mail by the sender's mail server, but the SPF checking is done at the receiving server. This was implemented at the sarp mail server by using policyd-spf, which is basically a tool that performs SPF checking and adds the Received-SPF header to the email with the result and domain name. The List-ID header will be present only if the mail is sent by a mailing list.

6. Results

Sample Data Set

The sample data set consists of 22 mailboxes on which the tests were performed. The details of these mailboxes are given in the following table:

Mailbox #	Mailbox Name	Ham Mails	Spam Mails	Total Mails
1	aditi_columbia	1818	0	1818
2	aditi_gmail	497	96	593
3	Deepti_columbia	1174	0	1174
4	Deepti_gmail	576	65	641
5	dhrumin_gmail	5002	103	5105
6	pinank_gmail	1418	264	1682
7	Preetinarayan_columbia	1230	0	1230
8	Preetinarayan_gmail	1788	204	1992
9	sneha_gmail	133	227	360
10	spinank_gmail	524	355	879
11	vasa_columbia	168	0	168
12	dms2169_columbia	1301	21	1322
13	nirav_gmail	1360	48	1408
14	nns_2108	934	0	934
15	manish_gmail	414	45	459
16	pragni_gmail	1999	184	2183
17	preetimalik_columbia	527	0	527
18	preetimalik_gmail	380	0	380
19	sak2144	749	0	749
20	shradha_columbia	140	0	140
21	shradha_gmail	1151	371	1522
22	vasa_gmail	2367	890	3257

Table 6.0

6.1 Dynamic and Reachable Hosts

The check for dynamic hosts and reachable hosts is performed for both ham and spam mails of a sample mailbox. There are cases when no spam messages exist for the mailbox and in this case the check is performed for ham mails only.
The results are represented using a scatter graph. The x axis represents the Mailbox number and the y axis represents the Percentage Mails that pass the check among ham and spam mails of all the sample mailboxes.

Dynamic Hosts Check

The table and description of each table column is shown below.

Column 1 - corresponds to the mailbox number

Column 2 - gives the percentage of mails among the ham mails for each mailbox that were sent from a dynamic host

Column 3 - gives the percentage of mails among the spam mails for each mailbox that were sent from a dynamic host. If the column value is No Spam, it means that there were no spam mails.

Mailbox #	% Dynamic Hosts for Ham	% Dynamic Hosts for Spam
1	28	No Spam
2	1	10
3	24	No Spam
4	1	20
5	2	16
6	1	10
7	32	No Spam
8	4	7
9	0	15
10	0	0
11	11	No Spam
12	36	0
13	12	0
14	37	No Spam
15	0	9
16	1	1
17	19	No Spam
18	2	No Spam
19	25	No Spam
20	5	No Spam
21	1	6
22	0	7

Table 6.1

The scatter graph representing the above data is given below:

dynamic host graph

Figure 6.1

The graph in Figure 6.1 indicates that for most of the mailboxes that received spam, the percentage of dynamic hosts is higher for spam mails than for ham mails. With the exception of mailboxes 12 and 13 (and mailboxes 10 and 16, where the two values are equal), the red dots are above the blue dots. This tells us that spam mails are more likely to be sent from dynamic hosts than ham mails. The percentage of dynamic hosts also determines whether this check can be used as an effective filter. All the figures for both ham and spam are below 40%, implying that not many mails are sent from dynamic senders. Thus, the dynamic host check can be used as a filter, but with lesser importance.

Reachable Hosts Check

The host names in both the "From" and "By" fields of the email header were checked for reachability, and the statistics for the two fields were gathered separately.

The table for reachable hosts in the "From" field and its column description are given below:
Column 1 - corresponds to the mailbox number
Column 2 - gives the percentage of mails among the ham mails for each mailbox that responded to a ping request sent to the host of their "From" field and were reachable.
Column 3 - gives the percentage of mails among the spam mails for each mailbox that responded to a ping request sent to the host of their "From" field and were reachable. If the column value is No Spam, it means that there were no spam mails.

Mailbox #	% Reachable Hosts in *"From" field* - Ham	% Reachable Hosts in *"From" field* - Spam
1	28	No Spam
2	21	18
3	30	No Spam
4	12	29
5	27	26
6	33	20
7	22	No Spam
8	22	15
9	7	22
10	30	31
11	18	No Spam
12	39	81
13	12	2
14	34	No Spam
15	7	18
16	19	6
17	0	No Spam
18	5	No Spam
19	24	No Spam
20	6	No Spam
21	6	14
22	30	25

Table 6.2

The scatter graph representing the above data is given below:

reachable hosts from field

Figure 6.2

The graph in Figure 6.2 shows that the blue and red dots are randomly scattered and do not show any consistent pattern. For some mailboxes the red dots are above the blue dots, indicating that a larger percentage of hosts could be reached by ping for spam mails than for ham mails. There can be many reasons for this behavior. The "From" field of the email header indicates the host machine of the sender, which may be a laptop or desktop and is not always reachable. Also, the machine may be situated in a secure network and cannot be pinged. Thus, the host names in the "From" field should not be used as a classifier for spam filtering, as they lead to inconsistent results.

The table for reachable hosts in "By" field and its column description is given below:
Column 1 - corresponds to the mailbox number
Column 2 - gives the percentage of mails among the ham mails for each mailbox that responded to a ping request sent to the host of their "By" field and were reachable.
Column 3 - gives the percentage of mails among the spam mails for each mailbox that responded to a ping request sent to the host of their "By" field and were reachable. If the column value is No Spam, it means that there were no spam mails.

Mailbox #	% Reachable Hosts in *"By" field* - Ham	% Reachable Hosts in *"By" field* - Spam
1	79	No Spam
2	42	11
3	79	No Spam
4	48	20
5	62	10
6	52	24
7	85	No Spam
8	53	33
9	30	25
10	90	30
11	79	No Spam
12	24	5
13	48	79
14	78	No Spam
15	21	9
16	64	32
17	0	No Spam
18	32	No Spam
19	68	No Spam
20	46	No Spam
21	52	22
22	75	28

Table 6.3

The scatter graph representing the above data is given below:

reachable hosts by

Figure 6.3

The graph in Figure 6.3 shows that the hosts in the "By" field are much more often reachable for ham mails than for spam mails. The blue dots, representing the reachable hosts for ham mails, are concentrated in the upper portion of the graph, and there is a distinct gap between the scatter of ham and spam mails that were found to be reachable. With the exception of mailbox 13, where the percentage for spam mails is greater than that for ham mails, the mailboxes display consistent results. Sometimes, even if the mail server used for sending the mail is reachable, it might not respond to ping requests in order to protect itself from ping attacks; this could be the reason for fewer ham mail hosts passing the reachable hosts check in mailbox 13. As the test displays consistent results with just one exception, it can be used as a strong filter to determine spam mails.

6.2 DKIM, Mailing List and SPF Check

DKIM Check

The DKIM test simply checks the ham mails for the presence of the DomainKeys signature and the spam mails for its absence. The reasoning is that if the DomainKeys signature header is present in the mail, the mail has already been authenticated and there is no need for further checks. Thus, DKIM provides a very strong filter and should be used early on for the segregation of ham and spam mails. If an email passes the DKIM test, it is very likely ham, although the table below shows that a small percentage of spam mails (at most 5% in any mailbox) also carry the header. Only if the header is not present should the mail be subjected to further tests to classify it. The table for DKIM test data is given below:

Mailbox #	% Ham Mails with DKIM Header	% Spam Mails with DKIM Header
1	17	No Spam
2	31	0
3	13	No Spam
4	23	0
5	70	1
6	56	5
7	11	No Spam
8	44	2
9	5	0
10	12	0
11	5	No Spam
12	4	1
13	33	0
14	8	No Spam
15	28	0
16	27	0
17	5	No Spam
18	25	No Spam
19	28	No Spam
20	3	No Spam
21	21	2
22	78	4

Table 6.4

The graph of the percentage of ham mails that contain DomainKeys signature and are authenticated is shown below. This graph helps us to determine the popularity of the DKIM method as it depends on the domains that have published their public key.

Ham mails with DKIM signature

Figure 6.4

The graph in Figure 6.4 shows that only 3 mailboxes have more than 50% of ham mails that satisfy the DKIM check. On average, 25% of ham mails have the DomainKeys header. To further increase the scope of DKIM, more domains need to register and publish their public key.

Mailing List Check

The ham and spam mails for all the mailboxes were checked for the mailing list headers. If the mail has a mailing list header, then it is not spam as the user has subscribed to the mailing list and thus, the mailing list domain is a known domain for the user. This can also be used as a strong filter to categorize the mails into ham and spam.
The table containing mailing list data is given below:

Mailbox #	% Ham Mails with Mailing List Header	% Spam Mails with Mailing List Header
1	36	No Spam
2	12	0
3	30	No Spam
4	0	0
5	66	0
6	40	0
7	38	No Spam
8	42	0
9	0	0
10	2	0
11	8	No Spam
12	36	0
13	6	0
14	30	No Spam
15	0	0
16	22	0
17	10	No Spam
18	16	No Spam
19	10	No Spam
20	0	No Spam
21	0	0
22	38	0

Table 6.5

Thus, we can see that none of the spam mails came from a mailing list. The table also indicates the percentage of mails sent from a mailing list for an average user, which helps in determining the scope of the mailing list check. On average, 20% of the ham mails are sent from a mailing list. However, the standard deviation from the mean is about 19, showing that the percentage of mails sent from mailing lists differs widely from one mailbox to the next and depends on the user. Nonetheless, the mailing list check can be implemented as a good filter to classify mails.
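The mean and standard deviation quoted above can be reproduced from the ham column of Table 6.5 with a few lines of code. The sketch below assumes the sample (n - 1) form of the standard deviation, which is the form that yields a value close to 19 for this data; the class and method names are illustrative only.

```java
// Recomputes the summary statistics for the "% Ham Mails with Mailing List Header" column.
final class MailingListStats {
    // Values for mailboxes 1 to 22, copied from Table 6.5.
    static final int[] HAM_PERCENT = {
        36, 12, 30, 0, 66, 40, 38, 42, 0, 2, 8,
        36, 6, 30, 0, 22, 10, 16, 10, 0, 0, 38
    };

    static double mean(int[] values) {
        double sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum / values.length;
    }

    // Sample standard deviation (divides by n - 1); an assumption about the formula used.
    static double sampleStdDev(int[] values) {
        double mean = mean(values);
        double squares = 0;
        for (int v : values) {
            squares += (v - mean) * (v - mean);
        }
        return Math.sqrt(squares / (values.length - 1));
    }
}
```

With the standard deviation almost as large as the mean, a single per-user average says little about any individual mailbox, which is the point made in the paragraph above.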

SPF Check

The SPF test was performed on both ham and spam mails, but the mails sent from the same domain as the recipient were not included in the test, as it is not possible to determine the sender and receiver hosts after the mail has been received. These mails are known as domain internal mails. Also, the chance of a domain internal mail being spam is minimal.
The table for the SPF results and column description is shown below:
Column 1: Mailbox number
Column 2: The total number of ham mails in each mailbox
Column 3: The number of mails on which the SPF check was performed. (excludes domain internal mails)
Column 4: The number of mails that produced the result "Fail" for the SPF check
Column 5: The number of mails that produced the result "Pass" for the SPF check
Column 6: The number of mails that produced the result "Error" for the SPF check
Column 7: The number of mails that produced the result "Neutral" for the SPF check
Column 8: The number of mails that produced the result "Softfail" for the SPF check
Column 9: The number of domain internal mails on which the SPF check was not performed

SPF Results for Ham mails:

Mailbox#	Total ham mails	Mails for SPF check	Fail	Pass	Error	Neutral	Softfail	Internal
1	1818	420	0	393	0	25	2	1398
2	497	288	5	221	0	49	13	209
3	1174	234	0	194	0	38	2	940
4	576	373	6	239	0	126	2	203
5	5002	4759	18	4468	1	189	83	243
6	1418	1314	3	1219	2	65	25	104
7	1230	192	0	179	0	13	0	1038
8	1788	1259	14	1108	0	129	8	529
9	133	51	0	27	1	22	1	82
10	524	521	7	426	10	76	2	3
11	168	48	0	39	0	8	1	120
12	1301	237	0	147	0	69	21	1064
13	1360	1154	13	865	0	258	18	206
14	934	269	0	208	0	57	4	665
15	414	290	80	240	0	31	1	124
16	1999	1527	86	1282	2	153	4	472
17	527	113	0	98	0	13	2	414
18	380	173	8	147	0	18	0	207
19	749	349	2	239	0	104	4	400
20	140	33	0	18	0	14	1	107
21	1151	649	12	296	2	322	17	502
22	2367	2282	56	1129	4	1093	0	85
TOTAL ->	25650	16535	310	13182	22	2872	211	9115

Table 6.6

The pie chart represents the fraction of mails whose SPF result is fail, pass, error, neutral or softfail. The percentages, found using the column totals, are:

Fail - 2%
Pass - 80%
Error - 0%
Neutral - 17%
Softfail - 1%

SPF result fractions for ham mails

Figure 6.5

Thus, from Figure 6.5 we can see that very few ham mails fail the SPF test. The results for SPF checking are consistent and SPF check can be used as a good classifier for ham mails.
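The percentages shown in the pie chart follow directly from the TOTAL row of Table 6.6: each result count is divided by the 16535 mails on which the SPF check was performed and rounded to the nearest whole percent. A small sketch of that arithmetic, with an illustrative class name, is:

```java
// Rounds the share of one SPF result among all checked mails to a whole percent.
final class SpfShare {
    static long percent(int count, int checkedMails) {
        return Math.round(100.0 * count / checkedMails);
    }
}
```

For the ham totals this gives percent(310, 16535) for Fail, percent(13182, 16535) for Pass, and so on. Because each share is rounded separately, the five percentages need not add up to exactly 100.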

The table for SPF results on spam mails is given below:

Mailbox#	Total spam	Mails for SPF Check	Fail	Pass	Error	Neutral	Softfail	Internal
1	0	0	0	0	0	0	0	0
2	96	96	7	0	8	81	0	0
3	0	0	0	0	0	0	0	0
4	65	65	14	0	1	48	2	0
5	103	103	5	7	2	86	3	0
6	264	260	46	9	11	183	11	4
7	0	0	0	0	0	0	0	0
8	204	204	22	8	5	164	5	0
9	227	227	29	1	11	172	14	0
10	355	355	0	312	4	39	0	0
11	0	0	0	0	0	0	0	0
12	21	21	0	1	0	0	20	0
13	48	48	10	0	2	34	2	0
14	0	0	0	0	0	0	0	0
15	45	45	10	0	0	33	2	0
16	184	184	15	45	1	117	6	0
17	0	0	0	0	0	0	0	0
18	0	0	0	0	0	0	0	0
19	0	0	0	0	0	0	0	0
20	0	0	0	0	0	0	0	0
21	371	369	29	69	5	255	11	2
22	890	890	43	491	31	285	40	0
TOTAL ->	2873	2867	230	943	81	1497	116	6

Table 6.7

The pie chart represents the fraction of mails whose SPF result is fail, pass, error, neutral or softfail. The percentages, found using the column totals, are:

Fail - 8%
Pass - 33%
Error - 3%
Neutral - 52%
Softfail - 4%

SPF result fractions for spam mails

Figure 6.6

The chart indicates that SPF check is not sufficient to categorize spam mails. Only 8% of the mails fail the SPF check, while 33% and 52% of the mails give pass and neutral as their result. Thus, we can use SPF check as a filter with medium weight. It should be used in conjunction with other filters for better performance.

Result Analysis Summary

Dynamic Hosts Check: The dynamic hosts check can be used as a weak filter.

Reachable Hosts Check for the "From" field: This cannot be reliably used as a filter due to inconsistent results.

Reachable Hosts Check for the "By" field: This can be used as a good filter.

DKIM Header Check: This is a strong filter with limited scope.

Mailing List Header Check: This is a strong filter subjective to the user.

SPF Check: This can be used as a boost for ham mails and as a medium weight filter in conjunction with other filters.

7. Problems Faced and their Solutions

One of the major problems faced with the server based approach was that the server did not attract any spam mails, so there was no sample data to work with and gather statistics from. Thus, the entire approach shifted to a standalone module which, instead of using a new server or honeypot to attract mails, uses already existing IMAP or POP enabled mail accounts on servers such as Columbia or Gmail.

The standalone program needed IMAP or POP based mail servers that are commonly used by people. The limitation was in our ability to find a lot of spam mails, as Columbia has an effective spam filter in place. Also, the population for which the data was gathered did not vary and represented a particular set of people, namely university students. The solution is straightforward: distribute the standalone program to a larger and more diverse set of people.

Another major problem faced during the final stages of implementation was with MailStats.java, which threw a MailBoxClosed exception for mailboxes containing more than 1000 mails. The solution was to open the mailbox only when the messages were about to be read, and not before. After this bug was fixed, data could be gathered easily, since the mailbox size was no longer a limitation.

There were some problems with the threads used in the main module, which made it seem that the individual modules were not working. This bug was later fixed, and threads could then be used in parallel to update the progress bar as well as within the individual modules.

In the standalone program, the biggest concern was the time taken by the ping module: each message had to be parsed, the host name extracted from it, and the host pinged. This took about 200 ms to 1 s per message. To overcome this problem, a thread pool was used.
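A thread pool lets many hosts be probed at the same time, so the total time is no longer the sum of the individual ping times. The sketch below shows the idea with the standard java.util.concurrent classes. It is not the project's ping module: the class name PingPool, the method names and the use of InetAddress.isReachable() as the probe are assumptions, and the probe is passed in as a parameter so that a different reachability test can be substituted.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Predicate;

// Hypothetical helper that checks many hosts concurrently on a fixed-size thread pool.
final class PingPool {
    /** One possible probe: the JDK's reachability test with a timeout in milliseconds. */
    static boolean isReachable(String host, int timeoutMs) {
        try {
            return InetAddress.getByName(host).isReachable(timeoutMs);
        } catch (IOException e) { // also covers unknown hosts
            return false;
        }
    }

    /** Runs the probe for every host on the pool and returns host -> reachable. */
    static Map<String, Boolean> checkAll(List<String> hosts, int threads, Predicate<String> probe) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            Map<String, Future<Boolean>> pending = new LinkedHashMap<>();
            for (String host : hosts) {
                Callable<Boolean> task = () -> probe.test(host);
                pending.put(host, pool.submit(task));
            }
            Map<String, Boolean> results = new LinkedHashMap<>();
            for (Map.Entry<String, Future<Boolean>> entry : pending.entrySet()) {
                results.put(entry.getKey(), entry.getValue().get()); // waits for that host
            }
            return results;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("host check did not complete", e);
        } finally {
            pool.shutdown();
        }
    }
}
```

A call such as PingPool.checkAll(hosts, 20, h -> PingPool.isReachable(h, 1000)) would probe up to 20 hosts at once; the pool size and the timeout here are arbitrary example values.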

Lastly, the user needed to be given constant feedback. For this, a progress bar was implemented that shows the progress of the entire program as well as of the individual modules.

8. Appendix

The link to the source code is SpamTestLatest.zip. The code for the first module consists of 3 files, namely DhcpDslPing.java, DkimSpfMailList.java and PingLookup.java.

The link to the result set page where all the cumulative results can be accessed is http://wiki.cs.columbia.edu:8080/display/spam/Resultset

9. Tools used

The tools used were as follows:

Navicat MySQL for creating the database
Netbeans IDE for writing code for the modules
OpenOffice for writing the report
Visual Paradigm for UML diagrams
OpenOffice excel for drawing graphs and representing data
Concept Draw Pro for DFD
Convert Excel Spreadsheet to HTML

10. References

RFC 2822 [http://tools.ietf.org/html/rfc2822]

RFC 2821 [http://tools.ietf.org/html/rfc2821]

JavaMail API [http://java.sun.com/products/javamail/]

Spam Analysis and Reputation Project [http://wiki.cs.columbia.edu:8080/display/spam/Home]

SARP Modules [http://wiki.cs.columbia.edu:8080/display/spam/IMAP+analyzer+modules]

Adrian Frei - Spam Analysis and Reputation Project: DNS Blacklists

Preethi Narayan - Spam Analysis and Reputation Project : Received Header Vs Sent and From Header

Tejas Nadkarni – Parser and Standalone Framework

Aditi Rajoriya - Spam Analysis and Reputation Project: IMAP Retrieval and To/Body Module.

Dhrumin Shah - Spam Analysis and Reputation Project: Domain Check and Image Analysis Modules.

Nirav Shah - Spam Analysis and Reputation Project: Email Source, Date/Time and Attachment Analysis

Wikipedia – SPF and DKIM

Professor Henning G. Schulzrinne – Project Advisor and Mentor

Spam Analysis and Reputation Project : Email Encryption Headers and Database Schema
