Spam Analysis and Reputation Project: Domain Check and Image Analysis

By:

Dhrumin Shah

Columbia University

Department of Computer Science

New York, NY 10027

USA

dms2169@columbia.edu


Abstract
    

The project aims to gather statistical data about the header and body fields present in emails and to use that data to differentiate between two large collections of messages: spam and non-spam (or ham). Based on the data gathered for spam and ham mails, we decide whether a particular field is good enough to be used for this classification. The two modules covered in this report are the Domain Check module and the Image Test module. If a particular field in the header or body of an email is a good indicator, its values (e.g. a spam score) will differ between spam and ham mails. To gather the statistics, we run our program on folders such as Inbox and Spam, and on others such as Sent, that contain a substantial number of messages, and collect data about the different header and body fields. After examining this data, we conclude whether the particular field is a good metric for the classification. The statistics can be generated from parameters such as the number of mails, comparisons between header or body fields, or the individual results of sources such as a Blacklist or Friendlist. When derived for a sufficient number of mailboxes, and hence for a sufficient variety of mails, these statistics can be used to classify a particular message as spam or ham.

For example, if a mail contains a domain name to which the receiver has never sent a mail, the probability that the message is spam is high; if the receiver has sent a mail to that domain before, the message is probably not spam. We therefore count such mails in the set of mailboxes used for testing and, based on these counts for spam and ham mails, infer whether the given field can be used for classification.

Table of Contents

i. Abstract

1. Project Overview    

The project is divided into modules for which statistics are to be gathered. These modules correspond to the header fields that seem important and could potentially be used for classification. The modules are:

These are the main parts of the header and body on which data has to be gathered. They have been implemented as a joint effort by the team members.
The design of the project is briefly outlined below:

2. Introduction

Of the modules listed in the previous section, this report concentrates on two: the Domain Check module and the Image Test module.

The above modules were chosen because they are used by existing spam filters such as SpamAssassin and hence can act as good classifiers of incoming mail. The results of the tests indicate whether the chosen parameters are good enough for such a classification. To gather statistics, the mailboxes of a small number of people (both Columbia and non-Columbia students) were used to provide the three message arrays, namely sent, mail and spam, to all the modules. These statistics are then used to form the results of the test. They can be further improved by increasing the accuracy of the tests conducted.
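As a rough illustration of how a Domain Check statistic could be counted from the three message arrays, the following Python sketch collects the domains the user has written to from the sent array and then counts, for the mail and spam arrays, how many messages come from a known domain. It is not the project's implementation: the function names, the choice of the From, To and Cc headers, and the sample messages are all assumptions made for this example.

```python
from email import message_from_string
from email.utils import getaddresses, parseaddr


def known_domains(sent_messages):
    """Collect the domains of all To/Cc recipients in the sent array.

    Assumption: "domains the receiver has sent mail to" is approximated
    by the recipient addresses of the user's own sent messages.
    """
    domains = set()
    for msg in sent_messages:
        recipients = msg.get_all("To", []) + msg.get_all("Cc", [])
        for _name, addr in getaddresses(recipients):
            if "@" in addr:
                domains.add(addr.rsplit("@", 1)[1].lower())
    return domains


def sender_domain(msg):
    """Return the lowercased domain of the From header, or None if absent."""
    _name, addr = parseaddr(msg.get("From", ""))
    if "@" not in addr:
        return None
    return addr.rsplit("@", 1)[1].lower()


def domain_check_counts(messages, domains):
    """Return (known, unknown): messages whose sender domain is / is not
    among the domains the user has previously written to."""
    known = unknown = 0
    for msg in messages:
        if sender_domain(msg) in domains:
            known += 1
        else:
            unknown += 1
    return known, unknown


if __name__ == "__main__":
    # Hypothetical sample data standing in for the sent, mail and spam arrays.
    sent = [message_from_string("To: friend@columbia.edu\n\nhello")]
    mail = [message_from_string("From: friend@columbia.edu\n\nhi back")]
    spam = [message_from_string("From: promo@deals.example\n\nbuy now")]

    domains = known_domains(sent)
    print("mail (known, unknown):", domain_check_counts(mail, domains))
    print("spam (known, unknown):", domain_check_counts(spam, domains))
```

Comparing the two pairs of counts across many mailboxes is one way to judge whether the field separates spam from ham well enough to be useful.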

To gather such information for classifying the mails, two approaches were proposed: