
Spam Analysis and Reputation Project

2.1 Standalone Approach: Email Source Analysis

Design

This module analyzes emails on the basis of the reputation of the sender. A user's friend list consists of all the addresses to which mail has been sent from that user's account. If a mail is received from an address that is in the friend list, it is highly unlikely to be spam. Two cases are implemented and evaluated. In the first case, the friend list used to test every message consists of all the addresses to which mail has been sent. In the second case, the friend list used to test a given message consists only of those addresses to which mail had been sent before the received date of that message.

2.1.1 Email Source Analysis: 1st Case (Considering all Sent Messages)

Implementation

This module is implemented in Java in the file FriendAnalyze.java.
The main program, MailStats.java, passes javax.mail.Message arrays containing the spam, non-spam and sent messages to this module. The program first populates the friend list of the user whose account is being tested by calling the getAllRecipients() method on every sent message. The friend list is stored in a hash set so that lookups are fast. The senders of the non-spam messages are then obtained with the getFrom() method. Each sender is classified as a friend or a non-friend by looking up the sender's address in the friend list, and the numbers of friends and non-friends among the non-spam messages are counted. The same counts are then computed for the spam messages, using the array containing spam messages.
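The steps above can be sketched as follows. This is a minimal illustration, not the original FriendAnalyze.java: to stay self-contained it takes plain address strings instead of javax.mail.Message objects, and the class name FriendListSketch, its method names, and the trim/lower-case normalization of addresses are all assumptions made for the example.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical sketch of the Case 1 friend test; not the original module.
final class FriendListSketch {

    // Assumption: addresses are compared case-insensitively and without
    // surrounding whitespace. The original comparison rule is not documented.
    static String normalize(String address) {
        return address.trim().toLowerCase(Locale.ROOT);
    }

    // Friend list: every address that appears as a recipient of any sent
    // message. Each inner list holds the recipients of one sent message.
    static Set<String> buildFriendList(List<List<String>> recipientsOfSentMessages) {
        Set<String> friends = new HashSet<>();
        for (List<String> recipients : recipientsOfSentMessages) {
            for (String recipient : recipients) {
                friends.add(normalize(recipient));
            }
        }
        return friends;
    }

    // Takes one sender address per received message and returns
    // {messages from friends, messages from non-friends}.
    static int[] countFriends(Set<String> friends, List<String> senders) {
        int fromFriends = 0;
        for (String sender : senders) {
            if (friends.contains(normalize(sender))) {
                fromFriends++;
            }
        }
        return new int[] { fromFriends, senders.size() - fromFriends };
    }
}
```

Calling countFriends once with the non-spam senders and once with the spam senders would yield the per-mailbox counts reported in the results. The hash set makes each lookup constant time on average, so the cost grows with the number of messages rather than with the size of the friend list.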

Results and Observations

The following results (Table 2.1.1 (a)) show, for each mailbox, the fraction of non-spam and spam messages that are from friends:

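| Mailbox # | # of Mails | # of Non-Spam Mails | # of Non-Spam from Friends | (%) Non-Spam from Friends | # of Spam Mails | # of Spam from Friends | (%) Spam from Friends |
|---|---|---|---|---|---|---|---|
| 1 | 1818 | 1818 | 1105 | 60.78 | 0 | 0 | No Spam |
| 2 | 593 | 497 | 95 | 19.11 | 96 | 0 | 0.00 |
| 3 | 1174 | 1174 | 725 | 61.75 | 0 | 0 | No Spam |
| 4 | 641 | 576 | 450 | 78.13 | 65 | 3 | 4.62 |
| 5 | 5105 | 5002 | 1412 | 28.23 | 103 | 0 | 0.00 |
| 6 | 1682 | 1418 | 153 | 10.79 | 264 | 9 | 3.41 |
| 7 | 1230 | 1230 | 487 | 39.59 | 0 | 0 | No Spam |
| 8 | 1992 | 1788 | 880 | 49.22 | 204 | 8 | 3.92 |
| 9 | 360 | 133 | 84 | 63.16 | 227 | 0 | 0.00 |
| 10 | 879 | 524 | 10 | 1.91 | 355 | 0 | 0.00 |
| 11 | 168 | 168 | 91 | 54.17 | 0 | 0 | No Spam |
| 12 | 1322 | 1301 | 828 | 63.64 | 21 | 1 | 4.76 |
| 13 | 1408 | 1360 | 183 | 13.46 | 48 | 7 | 14.58 |
| 14 | 934 | 934 | 578 | 61.88 | 0 | 0 | No Spam |
| 15 | 459 | 414 | 144 | 34.78 | 45 | 0 | 0.00 |
| 16 | 2183 | 1999 | 1164 | 58.23 | 184 | 3 | 1.63 |
| 17 | 527 | 527 | 339 | 64.33 | 0 | 0 | No Spam |
| 18 | 380 | 380 | 308 | 81.05 | 0 | 0 | No Spam |
| 19 | 749 | 749 | 553 | 73.83 | 0 | 0 | No Spam |
| 20 | 140 | 140 | 25 | 17.86 | 0 | 0 | No Spam |
| 21 | 1522 | 1151 | 891 | 77.41 | 371 | 4 | 1.08 |
| 22 | 3316 | 2370 | 1647 | 69.49 | 946 | 5 | 0.53 |
| Total | 28582 | 25653 | 12152 | 47.37 | 2929 | 40 | 1.37 |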
Table 2.1.1 (a): Email Source Analysis: Case 1

The following scatter plot shows, for each mailbox, the fraction of non-spam and spam messages that are from friends:

Figure 2.1.1: Email Source Analysis: Case 1

From the above results we observe that 47.37% of all the non-spam messages are from friends, while only 1.37% of all the spam messages are from friends. To measure the effectiveness of this test, we now calculate the ratio of the number of non-spam mails from friends to the total number of mails from friends.
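The ratio described above can be sketched as a small helper. The class and method names are hypothetical, and returning NaN when a mailbox has no mail from friends is an assumption of this sketch rather than something the original program is known to do.

```java
// Hypothetical helper for the effectiveness ratio B/A.
final class FriendRatioSketch {

    // A = all mails from friends (non-spam + spam), B = non-spam from friends.
    // Returns B/A, or NaN when A is zero (assumed convention for "not applicable").
    static double nonSpamShare(int nonSpamFromFriends, int spamFromFriends) {
        int allFromFriends = nonSpamFromFriends + spamFromFriends;
        if (allFromFriends == 0) {
            return Double.NaN;
        }
        return (double) nonSpamFromFriends / allFromFriends;
    }
}
```

For example, mailbox 6 has 153 non-spam and 9 spam messages from friends, so A = 162 and nonSpamShare(153, 9) is 153/162, which rounds to 0.94.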

The results obtained are as follows:

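| Mailbox # | # of Mails | A: # of Mails from Friends | B: # of Non-Spam from Friends | B/A |
|---|---|---|---|---|
| 1 | 1818 | 1105 | 1105 | 1.00 |
| 2 | 593 | 95 | 95 | 1.00 |
| 3 | 1174 | 725 | 725 | 1.00 |
| 4 | 641 | 453 | 450 | 0.99 |
| 5 | 5105 | 1412 | 1412 | 1.00 |
| 6 | 1682 | 162 | 153 | 0.94 |
| 7 | 1230 | 487 | 487 | 1.00 |
| 8 | 1992 | 888 | 880 | 0.99 |
| 9 | 360 | 84 | 84 | 1.00 |
| 10 | 879 | 10 | 10 | 1.00 |
| 11 | 168 | 91 | 91 | 1.00 |
| 12 | 1322 | 829 | 828 | 1.00 |
| 13 | 1408 | 190 | 183 | 0.96 |
| 14 | 934 | 578 | 578 | 1.00 |
| 15 | 459 | 144 | 144 | 1.00 |
| 16 | 2183 | 1167 | 1164 | 1.00 |
| 17 | 527 | 339 | 339 | 1.00 |
| 18 | 380 | 308 | 308 | 1.00 |
| 19 | 749 | 553 | 553 | 1.00 |
| 20 | 140 | 25 | 25 | 1.00 |
| 21 | 1522 | 895 | 891 | 1.00 |
| 22 | 3316 | 1652 | 1647 | 1.00 |
| Total | 28582 | 12192 | 12152 | 1.00 |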
Table 2.1.1 (b): Email Source Analysis: Case 1

For a feature to be useful, it must appear either only in spam or almost exclusively in non-spam. The conditions differ because the two kinds of error are not equally costly: it is acceptable if a few spam messages carry a feature that marks mail as non-spam, but non-spam being classified as spam is certainly not acceptable.

We observe that nearly all the mails from friends are non-spam: 12152 of 12192 overall, and at least 94% in every mailbox. Thus, this test is very useful for identifying non-spam messages. Because it takes all the sent messages into consideration, this test can be run on already existing mailboxes to identify the mails from friends.

2.1.2 Email Source Analysis: 2nd Case (Considering Sent Messages till Date)

Implementation

This module is implemented in Java in the file FriendTillDateAnalyze.java.
The main program, MailStats.java, passes javax.mail.Message arrays containing the spam, non-spam and sent messages to this module. The date on which each message was sent is obtained with the getSentDate() method. A hash table stores, for each address, the earliest date on which a mail was sent to it: the address is the key and the earliest date is the value. The sender of each non-spam message is then looked up in the hash table. If there is no entry for that sender, the sender is classified as a non-friend. If there is an entry, the corresponding earliest sent date is compared with the received date of the message being tested. If the earliest sent date is before the received date, the sender is classified as a friend, and otherwise as a non-friend. Let us call them 'friends till date' and 'non-friends till date' respectively. The numbers of friends till date and non-friends till date among the non-spam messages are counted. The same counts are then computed for the spam messages, using the array containing spam messages.
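The lookup described above can be sketched as follows. This is an illustration under stated assumptions, not the original FriendTillDateAnalyze.java: it uses plain address strings and java.time.Instant instead of javax.mail.Message and java.util.Date (a Date can be converted with toInstant()), and the class name, method names, and address normalization are hypothetical.

```java
import java.time.Instant;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch of the Case 2 "friends till date" test.
final class FriendTillDateSketch {

    // Key: recipient address. Value: earliest date a mail was sent to it.
    private final Map<String, Instant> earliestSent = new HashMap<>();

    // Assumption: addresses are compared case-insensitively after trimming.
    private static String normalize(String address) {
        return address.trim().toLowerCase(Locale.ROOT);
    }

    // Record one recipient of one sent message, keeping only the earliest date.
    void recordSent(String recipient, Instant sentDate) {
        earliestSent.merge(normalize(recipient), sentDate,
                (known, candidate) -> candidate.isBefore(known) ? candidate : known);
    }

    // A sender is a friend till date only if mail was sent to that address
    // strictly before the tested message was received.
    boolean isFriendTillDate(String sender, Instant receivedDate) {
        Instant first = earliestSent.get(normalize(sender));
        return first != null && first.isBefore(receivedDate);
    }
}
```

Storing only the earliest sent date per address is enough for this test, because a sender who was a friend at some date remains one for every later message.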

Results and Observations

The following results (Table 2.1.2 (a)) show, for each mailbox, the fraction of non-spam and spam messages that are from friends till date:

| Mailbox # | # of Mails | # of Non-Spam Mails | # of Non-Spam from Friends till date | (%) Non-Spam from Friends till date | # of Spam Mails | # of Spam from Friends till date | (%) Spam from Friends till date |
|---|---|---|---|---|---|---|---|
| 1 | 1818 | 1818 | 915 | 50.33 | 0 | 0 | No Spam |
| 2 | 593 | 497 | 18 | 3.62 | 96 | 0 | 0.00 |
| 3 | 1174 | 1174 | 580 | 49.40 | 0 | 0 | No Spam |
| 4 | 641 | 576 | 387 | 67.19 | 65 | 3 | 4.62 |
| 5 | 5105 | 5002 | 870 | 17.39 | 103 | 0 | 0.00 |
| 6 | 1682 | 1418 | 132 | 9.31 | 264 | 9 | 3.41 |
| 7 | 1230 | 1230 | 446 | 36.26 | 0 | 0 | No Spam |
| 8 | 1992 | 1788 | 715 | 39.99 | 204 | 8 | 3.92 |
| 9 | 360 | 133 | 0 | 0.00 | 227 | 0 | 0.00 |
| 10 | 879 | 524 | 7 | 1.34 | 355 | 0 | 0.00 |
| 11 | 168 | 168 | 81 | 48.21 | 0 | 0 | No Spam |
| 12 | 1322 | 1301 | 688 | 52.88 | 21 | 1 | 4.76 |
| 13 | 1408 | 1360 | 118 | 8.68 | 48 | 7 | 14.58 |
| 14 | 934 | 934 | 489 | 52.36 | 0 | 0 | No Spam |
| 15 | 459 | 414 | 134 | 32.37 | 45 | 0 | 0.00 |
| 16 | 2183 | 1999 | 969 | 48.47 | 184 | 3 | 1.63 |
| 17 | 527 | 527 | 281 | 53.32 | 0 | 0 | No Spam |
| 18 | 380 | 380 | 253 | 66.58 | 0 | 0 | No Spam |
| 19 | 749 | 749 | 450 | 60.08 | 0 | 0 | No Spam |
| 20 | 140 | 140 | 15 | 10.71 | 0 | 0 | No Spam |
| 21 | 1522 | 1151 | 745 | 64.73 | 371 | 4 | 1.08 |
| 22 | 3316 | 2370 | 1425 | 60.13 | 946 | 5 | 0.53 |
| Total | 28582 | 25653 | 9718 | 37.88 | 2929 | 40 | 1.37 |
Table 2.1.2 (a): Email Source Analysis: Case 2

The following scatter plot shows, for each mailbox, the fraction of non-spam and spam messages that are from friends till date:

Figure 2.1.2: Email Source Analysis: Case 2

From the above results we observe that 37.88% of all the non-spam messages are from friends till date, while only 1.37% of all the spam messages are from friends till date. To measure the effectiveness of this test, we now calculate the ratio of the number of non-spam mails from friends till date to the total number of mails from friends till date.

The results obtained are as follows:

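| Mailbox # | # of Mails | A: # of Mails from Friends till date | B: # of Non-Spam from Friends till date | B/A |
|---|---|---|---|---|
| 1 | 1818 | 915 | 915 | 1.00 |
| 2 | 593 | 18 | 18 | 1.00 |
| 3 | 1174 | 580 | 580 | 1.00 |
| 4 | 641 | 390 | 387 | 0.99 |
| 5 | 5105 | 870 | 870 | 1.00 |
| 6 | 1682 | 141 | 132 | 0.94 |
| 7 | 1230 | 446 | 446 | 1.00 |
| 8 | 1992 | 723 | 715 | 0.99 |
| 9 | 360 | 0 | 0 | N/A |
| 10 | 879 | 7 | 7 | 1.00 |
| 11 | 168 | 81 | 81 | 1.00 |
| 12 | 1322 | 689 | 688 | 1.00 |
| 13 | 1408 | 125 | 118 | 0.94 |
| 14 | 934 | 489 | 489 | 1.00 |
| 15 | 459 | 134 | 134 | 1.00 |
| 16 | 2183 | 972 | 969 | 1.00 |
| 17 | 527 | 281 | 281 | 1.00 |
| 18 | 380 | 253 | 253 | 1.00 |
| 19 | 749 | 450 | 450 | 1.00 |
| 20 | 140 | 15 | 15 | 1.00 |
| 21 | 1522 | 749 | 745 | 0.99 |
| 22 | 3316 | 1430 | 1425 | 1.00 |
| Total | 28582 | 9758 | 9718 | 1.00 |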
Table 2.1.2 (b): Email Source Analysis: Case 2

We observe that nearly all the mails from friends till date are non-spam: 9718 of 9758 overall. As discussed for the previous test, this test is very useful for identifying non-spam messages, as the feature is present almost exclusively in non-spam mails. Because it relies only on mail sent before the tested message was received, this test can be used to identify whether an incoming mail is non-spam.



Last updated: 2008-08-19 by Nirav Shah