Spring 2003 - IRT Project

Technical Analysis of E-mail Spam

In this research I conducted a statistical analysis of spam, also known as unsolicited commercial email. The analysis is based mostly on spam characteristics, that is, features extracted from the header fields and body contents. The goal of the research is not merely to find ways of identifying spam but rather to gather statistical information about it.



Collecting Spam

  • Collecting spam from January 1997 to April 2003 from Google Groups
  • The spam used in this research came from the Google Groups website, through the "Advanced Groups Search" section at google.com. It can be retrieved manually by following the link: "http://groups.google.com/advanced_group_search?q=group:*&hl=en&lr=&ie=UTF-8&oe=UTF-8&group=*"

    To automate this, I used a small program that connects to Google's site and downloads about 100 spam messages for each month from January 1997 to December 2002 and 1,000 for each month of 2003. I grouped the messages by year and used them throughout the research.

    Spam Origin

  • Statistical report about spam origin
  • Information about spam origin comes mostly from the Received header, the only part of an email that provides reasonably reliable information. Assuming a reasonably standard and recent sendmail setup, a Received header line normally looks like:

    Received: from host1 (host2 [ww.xx.yy.zz]) by host3 (8.7.5/8.7.3) 
              with SMTP id MAA04298; Thu, 18 Jul 1996 12:18:06 -0600
    or
    Received: from ww.xx.yy.zz (HELO host1) (ww.xx.yy.zz) by host3
              with SMTP; 22 Apr 2003 06:35:19 -0700 (PDT)
    
    In either case, the Received lines show four pieces of useful information (reading from back to front, in order of decreasing reliability):
     - The host that added the Received line (host3)
     - The IP address of the incoming SMTP connection (ww.xx.yy.zz)
     - The reverse-DNS lookup of that IP address (host2)
     - The name the sender used in the SMTP HELO command when they
       connected (host1).
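
    The four fields can be pulled out of either layout with regular expressions. The following is a minimal sketch, assuming only the two layouts shown above; the class name ReceivedLine and its field names are illustrative and are not taken from Received.java.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical holder for the four fields of one Received line. */
final class ReceivedLine {
    // First layout: from HELO-name (lookup-name [ip]) by host
    private static final Pattern PAREN_BRACKET = Pattern.compile(
        "from\\s+(\\S+)\\s+\\((\\S+)\\s+\\[([0-9.]+)\\]\\)\\s+by\\s+(\\S+)");
    // Second layout: from name-or-ip (HELO HELO-name) (ip) by host
    private static final Pattern HELO_KEYWORD = Pattern.compile(
        "from\\s+(\\S+)\\s+\\(HELO\\s+(\\S+)\\)\\s+\\(([0-9.]+)\\)\\s+by\\s+(\\S+)");

    final String helo;        // name given in the SMTP HELO command (least reliable)
    final String lookupName;  // name or address the receiving host recorded for the peer
    final String ip;          // address of the incoming SMTP connection
    final String byHost;      // host that added this Received line (most reliable)

    private ReceivedLine(String helo, String lookupName, String ip, String byHost) {
        this.helo = helo;
        this.lookupName = lookupName;
        this.ip = ip;
        this.byHost = byHost;
    }

    /** Returns the parsed fields, or null if the line fits neither layout. */
    static ReceivedLine parse(String header) {
        // Unfold continuation lines so each pattern sees a single line.
        String flat = header.replaceAll("\\s+", " ").trim();
        Matcher m = PAREN_BRACKET.matcher(flat);
        if (m.find()) {
            return new ReceivedLine(m.group(1), m.group(2), m.group(3), m.group(4));
        }
        m = HELO_KEYWORD.matcher(flat);
        if (m.find()) {
            return new ReceivedLine(m.group(2), m.group(1), m.group(3), m.group(4));
        }
        return null;
    }

    public static void main(String[] args) {
        ReceivedLine r = parse(
            "Received: from host1 (host2 [192.0.2.7]) by host3 (8.7.5/8.7.3)\n"
            + "          with SMTP id MAA04298; Thu, 18 Jul 1996 12:18:06 -0600");
        // prints: host1 host2 192.0.2.7 host3
        System.out.println(r.helo + " " + r.lookupName + " " + r.ip + " " + r.byHost);
    }
}
```

    Real mail servers write many more variants of this header, so a line that fits neither pattern returns null here instead of a guess.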
    
    An important property of the Received field is that Received lines are like links in a chain. The message is passed from one computer to the next with no breaks in the chain (i.e. the "by" host on any line should match the "from" host on the line above it). A break between Received header lines means that the spammer has inserted faked headers. Moreover, when an email is sent directly from one host to another, at most two source hosts appear in the header lines, so more than two source hosts indicate that an open relay was used.

    Based on that idea, I used a program to track down spam origin from the Received header. The program parsed each Received line, selected the host name or IP address, invoked the Unix "host" command, and checked the consistency of hosts from line to line.
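
    The consistency check itself is small. The sketch below assumes the "from" and "by" hosts have already been extracted and compares the names directly; it leaves out the "host" lookup that the actual program performed. ReceivedChain, Hop, and firstBreak are illustrative names.

```java
import java.util.Arrays;
import java.util.List;

final class ReceivedChain {
    /** One Received line: who handed the message over, and who accepted it. */
    static final class Hop {
        final String from;
        final String by;

        Hop(String from, String by) {
            this.from = from;
            this.by = by;
        }
    }

    /**
     * Hops are listed top to bottom, as they appear in the header (newest first).
     * Returns the index of the first line whose "by" host differs from the
     * "from" host of the line above it, or -1 if the chain is unbroken.
     */
    static int firstBreak(List<Hop> hops) {
        for (int i = 1; i < hops.size(); i++) {
            if (!hops.get(i).by.equalsIgnoreCase(hops.get(i - 1).from)) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        List<Hop> intact = Arrays.asList(
            new Hop("relay.example.net", "mx.example.org"),
            new Hop("sender.example.com", "relay.example.net"));
        List<Hop> forged = Arrays.asList(
            new Hop("relay.example.net", "mx.example.org"),
            new Hop("sender.example.com", "mail.forged.example"));
        System.out.println(firstBreak(intact)); // -1
        System.out.println(firstBreak(forged)); // 1
    }
}
```

    A plain name comparison will report a break when one machine is known under two names, which is presumably why the original program resolved each host before comparing.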

    Figure 1 illustrates the fractions of spam sent from a direct source and through an open relay. The fraction of spam from open relays seems to have decreased in recent years. In fact, only 9% of the spam that arrived at my own email account last month came through an open relay. The spam used to plot this figure is from the Google site: about 1,000 messages for each year from 1997 to 2002 and 4,000 for 2003.
    Figure 1. Spam from direct source or open relay
    Note that "untraceable spam" is spam that was perhaps sent directly from the spam host with a completely faked Received header, which made it impossible to trace the source host.

  • Spam from the US or from other countries
  • After classifying spam as direct source or open relay, I divided each category into spam from the US and spam from foreign countries. Table 1 shows the fraction of spam sent directly from the US or from other countries; Table 2 shows the fraction of open relays in the US and in other countries.

                              1997            1998            1999            2000            2001            2002            2003
    From the US               85.3%           76.1%           74.7%           75.0%           69.9%           66.4%           77.6%
    From foreign countries    14.7%           23.9%           25.3%           25.0%           30.1%           33.6%           22.1%
    Popular spam hosts        mindspring.com  mindspring.com  att.net         itctel.com      megabaud.fi     yahoo.com       blarg.net
    Popular foreign spam      China           China           Hong Kong       Argentina       Finland         Brazil          Brazil

    Table 1. Spam from the US or from foreign countries

                                  1997            1998            1999            2000            2001            2002            2003
    From the US                   85.3%           82.3%           61.8%           65.7%           73.8%           79.0%           76.5%
    From foreign countries        14.7%           17.7%           38.2%           34.3%           26.2%           21.0%           23.5%
    Popular open relay            usit.net        interlog.net    std.com         earthlink.net   well.com        bigfoot.com     demon.net
    Popular foreign open relay    Australia       Germany         Italy           Japan           China           Denmark         England

    Table 2. Open relay from the US or from foreign countries
    Note that the fractions of US and foreign spam in this research are computed from spam at Google (i.e. spam distributed from many different places). In reality, these fractions could differ for each email account, depending on where its owner lives.

    Spam with Faked Information at Header Fields

  • Spam with faked or real address at From header
  • Figure 2 illustrates the fraction of spam using a faked or apparently real sender email address. It confirms what we all know: the sender address given by a spammer is unreliable.

    Figure 2. Spam with faked or real address at From header

    To decide whether a spammer's address is real, I checked it against the source host that originated the spam (from the Received header). If the address matches the host, it is likely a real address; otherwise, I counted it as faked.
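
    One way to express this check is sketched below. It assumes that "matches" means the source host equals the address domain or is a subdomain of it; the exact rule used in Address.java may differ, and the names SenderCheck, Verdict, and classify are illustrative.

```java
import java.util.Locale;

final class SenderCheck {
    enum Verdict { EMPTY, APPEARS_REAL, FAKED }

    /**
     * Compares the domain of a From (or Reply-To) address with the host that
     * originated the message according to the Received header.
     */
    static Verdict classify(String fromAddress, String sourceHost) {
        if (fromAddress == null || fromAddress.trim().isEmpty()) {
            return Verdict.EMPTY;
        }
        int at = fromAddress.lastIndexOf('@');
        if (at < 0) {
            return Verdict.FAKED; // no domain part to compare
        }
        // Keep only the domain characters after '@', dropping a trailing '>' or comment.
        String domain = fromAddress.substring(at + 1)
            .toLowerCase(Locale.ROOT)
            .replaceAll("[^a-z0-9.-].*$", "");
        String host = sourceHost.toLowerCase(Locale.ROOT);
        // Assumed matching rule: same name, or the host is inside the address domain.
        boolean matches = !domain.isEmpty()
            && (host.equals(domain) || host.endsWith("." + domain));
        return matches ? Verdict.APPEARS_REAL : Verdict.FAKED;
    }

    public static void main(String[] args) {
        // APPEARS_REAL
        System.out.println(classify("Bob <bob@mindspring.com>", "user-38.mindspring.com"));
        // FAKED
        System.out.println(classify("bob@yahoo.com", "dial1.blarg.net"));
        // EMPTY
        System.out.println(classify("", "dial1.blarg.net"));
    }
}
```

    The result says "appears real" rather than "real" because a matching domain does not prove that the mailbox exists.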

    As Figure 2 shows, the fractions of spam whose address appears real or is empty have increased slightly over the years, while the percentage of faked addresses dropped from 94% to 80%. This may be due to tighter federal legal constraints on spam: spammers are probably more willing to provide accurate information in order to avoid lawsuits.

    Also, many email providers now offer blocking or reporting utilities in the client's email account (i.e. users can block a specific email address or report spam to a distributed center). Those utilities are often based on the From header, so leaving the From header empty is one of the tactics spammers use to avoid being blocked or reported by users.

  • Spam with faked or real address at Reply-To header
  • As with the From header, spammers often use a faked address in the Reply-To header. The statistics in Table 3 also show that spam with a real reply address has increased slightly in recent years, likely for reasons similar to those mentioned above.

                      1997     1998     1999     2000     2001     2002     2003
    Faked             96.5%    92.9%    84.0%    82.4%    86.8%    67.7%    78.2%
    Empty             0.0%     2.1%     7.1%     13.2%    8.3%     17.5%    7.8%
    Appeared real     3.5%     5.0%     8.9%     4.4%     4.9%     14.8%    14.0%

    Table 3. Spam with faked or real address at Reply-To header
    Note that email without a Reply-To header will use Return-Path as the route for replies. Return-Path is generated by the mail transport service at the time of final delivery. Usually, Return-Path leads back to the sender through the same address as the From header.

  • Destination address in format of individual vs. group vs. undisclosed-recipients
  • Because the spam at Google cannot reliably tell me about destination addresses (i.e. I cannot tell who actually owns a destination email address), I used solely the spam received at my own account to survey the To header. Most of the time, the destination address is my email address. At other times, the To header contains a list of email addresses that are similar, but not identical, to mine. It can also contain a faked, invalid, or hidden address (e.g. "undisclosed-recipients" or "@subscribers") or someone else's address. Finally, in some spam my email address does not appear in the To header but in the CC header. Figure 3 shows how destination addresses were used in the spam at my email account.
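
    A rough sketch of how a To header could be sorted into these groups follows. The category names, the comma-based address count, and the class name Destination are assumptions made for illustration; they are not the rules behind Figure 3.

```java
import java.util.Locale;

final class Destination {
    enum Kind { UNDISCLOSED, INDIVIDUAL, GROUP, OTHER }

    /** Classifies a To header value relative to the account owner's address. */
    static Kind classify(String toHeader, String myAddress) {
        String to = toHeader == null ? "" : toHeader.toLowerCase(Locale.ROOT).trim();
        if (to.isEmpty() || to.contains("undisclosed-recipients")) {
            return Kind.UNDISCLOSED;
        }
        // Count comma-separated entries that look like addresses.
        // This ignores commas inside quoted display names.
        int addresses = 0;
        for (String part : to.split(",")) {
            if (part.contains("@")) {
                addresses++;
            }
        }
        if (addresses > 1) {
            return Kind.GROUP;
        }
        if (addresses == 1 && to.contains(myAddress.toLowerCase(Locale.ROOT))) {
            return Kind.INDIVIDUAL;
        }
        // Someone else's address, or a placeholder such as "@subscribers".
        return Kind.OTHER;
    }

    public static void main(String[] args) {
        String me = "me@example.edu";
        System.out.println(classify("undisclosed-recipients:;", me));           // UNDISCLOSED
        System.out.println(classify("Me <me@example.edu>", me));                // INDIVIDUAL
        System.out.println(classify("ma@example.edu, me@example.edu", me));     // GROUP
        System.out.println(classify("@subscribers", me));                       // OTHER
    }
}
```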

    Figure 3. Destination address formats

    Content-based Analysis

    There is a lot of content-based spam filtering software available. The products are all similar in that they use signature algorithms to identify individual spam features and then use those features to determine whether a message is spam. Although they can stop some portion of spam, they all suffer from one drawback: the signature-based technique is effective against known spam but unable to detect and prevent new spam.

    In this milestone I take a statistical approach to content-based analysis of the spam collection from Google. I want to show that spam has changed a lot in recent years and that a "spam score" system like the one in SpamAssassin is not the ideal way to fight spam.

    I started by scanning the entire content, including header and body, of each spam message in each year. I currently consider only English-alphabet characters, ignoring case. Any character not in the English alphabet is treated as a token separator. Of course, alternative approaches could be applied in the future.

    Then I computed the spam frequency of each word in each year by dividing the count of the word (i.e. the number of spam messages that contain the word, regardless of how many times it appears in each message) by the total number of spam messages used for that year.
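
    The tokenizing and counting rules described above can be written compactly. This is a sketch of the stated rules rather than the code in SpamWord.java or Words.java; the class and method names are illustrative.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

final class WordFrequency {
    /** Lower-cased runs of A-Z letters; every other character separates tokens. */
    static Set<String> distinctWords(String message) {
        Set<String> words = new HashSet<>();
        for (String token : message.split("[^A-Za-z]+")) {
            if (!token.isEmpty()) {
                words.add(token.toLowerCase(Locale.ROOT));
            }
        }
        return words;
    }

    /** Fraction of messages that contain each word at least once. */
    static Map<String, Double> frequencies(List<String> messages) {
        Map<String, Integer> counts = new HashMap<>();
        for (String message : messages) {
            // A word counts once per message, however often it repeats.
            for (String word : distinctWords(message)) {
                counts.merge(word, 1, Integer::sum);
            }
        }
        Map<String, Double> result = new HashMap<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            result.put(e.getKey(), e.getValue() / (double) messages.size());
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> year = Arrays.asList(
            "Make MONEY fast, money now!",
            "Order today",
            "money-back order");
        Map<String, Double> f = frequencies(year);
        System.out.println(f.get("money")); // 2 of the 3 messages
        System.out.println(f.get("fast"));  // 1 of the 3 messages
    }
}
```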

  • Most interesting spam words
  • SpamAssassin uses spam words as one of the key features in its scoring rules. However, spam contains fewer and fewer spam words than it did five years ago. Instead, spammers are changing spam to look like legitimate email. In some cases the spam content is mixed into a regular document format. In other cases the spam is a short message with links to a commercial site. Table 4 gives the frequency of some of the popular spam words that we meet every day in our bulk mail. The results show that the use of these spam words has generally declined over the past few years.

    Spam words / Year     1997     1998     1999     2000     2001     2002     2003
    Financial             0.060    0.108    0.062    0.097    0.093    0.083    0.063
    Money                 0.366    0.295    0.250    0.307    0.252    0.203    0.179
    Dollars               0.250    0.112    0.098    0.115    0.102    0.116    0.077
    Business              0.231    0.290    0.250    0.268    0.252    0.206    0.137
    Order                 0.140    0.227    0.216    0.236    0.187    0.154    0.145
    Credit                0.052    0.149    0.119    0.145    0.179    0.150    0.076
    Payment               0.023    0.063    0.039    0.066    0.067    0.051    0.039
    Legal                 0.267    0.098    0.074    0.101    0.079    0.058    0.038
    Fees                  0.022    0.048    0.032    0.039    0.037    0.022    0.017
    Hardcore              0.054    0.045    0.015    0.021    0.027    0.025    0.014

    Table 4. Some most popular spam words

    In support of the claim that spam words are now used less often than in the past, Figure 4 plots the frequency of the word "Money".

    Figure 4. The use of the spam word "Money" has declined over the years

  • Subject header format
  • For most of us, spam is easily recognizable. Normally, we never have to open an email to know that it is spam; by looking at the subject, we have little trouble recognizing it. The subject is therefore a feature that most signature-based spam filters try to use. Spammers, on the other hand, use many different formats for the Subject header. One tactic they often use to deliver spam is to send the same message many times, changing the subject slightly each time. For example, we might get spam with the subject "GOOD NEWS" on the first day, then "GOOD NEWS YOU CAN USE" on the second day, and finally "GOOD NEWS, The Good News Electronic Journal" on the third day.

    Table 5 shows some of the other formats of spam subject header.

                                             1997    1998    1999    2000    2001    2002    2003
    Randomizing with extra letters/digits    0%      2%      6%      5%      10%     6%      7%
    Using all capitals                       30%     20%     17%     15%     12%     9%      6%
    Using many !, $, # or other symbols      26%     20%     16%     15%     14%     13%     8%
    Not using alphabet                       0%      1.4%    1.6%    1.2%    1.9%    1.8%    4%

    Table 5. Spam Subject header format
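
    Three of the rows in Table 5 lend themselves to simple character tests. The sketch below is one possible reading of those rows; the symbol set and the threshold passed to manySymbols are assumptions, not the values used in Subject.java, and the randomized-suffix row is left out because the text does not define it precisely.

```java
final class SubjectFeatures {
    private static boolean isAsciiLetter(char c) {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
    }

    /** True if the subject has at least one letter and no lower-case letters. */
    static boolean allCapitals(String subject) {
        boolean sawLetter = false;
        for (char c : subject.toCharArray()) {
            if (c >= 'a' && c <= 'z') {
                return false;
            }
            if (c >= 'A' && c <= 'Z') {
                sawLetter = true;
            }
        }
        return sawLetter;
    }

    /** True if at least `threshold` characters come from an assumed symbol set. */
    static boolean manySymbols(String subject, int threshold) {
        int count = 0;
        for (char c : subject.toCharArray()) {
            if ("!$#*%".indexOf(c) >= 0) {
                count++;
            }
        }
        return count >= threshold;
    }

    /** True if a non-blank subject contains no English-alphabet letter at all. */
    static boolean noAlphabet(String subject) {
        if (subject.trim().isEmpty()) {
            return false;
        }
        for (char c : subject.toCharArray()) {
            if (isAsciiLetter(c)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(allCapitals("GOOD NEWS YOU CAN USE"));     // true
        System.out.println(manySymbols("$$$ Make Money Fast!!!", 3)); // true
        System.out.println(noAlphabet("2003 !!! 100%"));              // true
    }
}
```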

  • Using HTML in spam
  • As I mentioned earlier, spammers have increasingly been using HTML for spam messages. Content-based spam filtering software considers HTML an important feature for recognizing spam. Figure 5 illustrates the rapid shift from plain text to HTML in spam over the past 7 years.

    Figure 5. Spam with plain text or HTML

    Table 6 gives more detailed statistics on spam features related to HTML.

                                   1997    1998    1999    2000    2001    2002    2003
    Spam using HTML format         14%     19%     21%     28%     42%     75%     66%
    Using font color               2%      4%      4%      7%      18%     35%     28%
    Containing images              0.2%    0.7%    2%      3%      10%     26%     30%
    Having URL links               3%      8%      9%      12%     27%     53%     58%
    Having "click" on something    15%     19%     22%     27%     33%     50%     45%
    Using HTML table               2%      2.2%    3%      6%      16%     35%     30%
    Using input form               3%      10%     11%     12%     16%     14%     9%

    Table 6. Spam with HTML format
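
    Features like those in Table 6 can be detected with case-insensitive patterns over the message body. The patterns below are guesses at reasonable definitions for each row and are not the rules used to produce the table; HtmlFeatures and the feature keys are illustrative names.

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;
import java.util.regex.Pattern;

final class HtmlFeatures {
    // Feature name -> pattern, kept in the row order of Table 6.
    private static final Map<String, Pattern> RULES = new LinkedHashMap<>();
    static {
        RULES.put("html", ci("<\\s*(html|body|font|table|a|img|br|p)\\b"));
        RULES.put("fontColor", ci("<\\s*font\\b[^>]*\\bcolor\\s*="));
        RULES.put("image", ci("<\\s*img\\b"));
        RULES.put("urlLink", ci("https?://"));
        RULES.put("click", ci("\\bclick\\b"));
        RULES.put("table", ci("<\\s*table\\b"));
        RULES.put("inputForm", ci("<\\s*(form|input)\\b"));
    }

    private static Pattern ci(String regex) {
        return Pattern.compile(regex, Pattern.CASE_INSENSITIVE);
    }

    /** Returns the names of all features whose pattern occurs in the body. */
    static Set<String> detect(String body) {
        Set<String> found = new LinkedHashSet<>();
        for (Map.Entry<String, Pattern> rule : RULES.entrySet()) {
            if (rule.getValue().matcher(body).find()) {
                found.add(rule.getKey());
            }
        }
        return found;
    }

    public static void main(String[] args) {
        String body = "<html><body><font color=\"red\">Click "
            + "<a href=\"http://example.com\">here</a></font></body></html>";
        // prints: [html, fontColor, urlLink, click]
        System.out.println(detect(body));
    }
}
```

    Counting, for each year, the messages whose result contains a given key and dividing by the number of messages in that year gives percentages of the kind shown in the table.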

  • Spam concerned with privacy policy
  • Finally, I end this page with statistics on something we often see at the bottom of spam: unsubscribe or removal instructions. Giving users removal or unsubscribe information is a trick that spammers often use to confirm that a user's email address is real. Many people have been naive enough to submit their email address to spammers through these real-looking removal systems, and they surely receive more spam once the spammers get their feedback. Table 7 presents statistics on how spammers have used unsubscribe or removal instructions in spam. The fraction of spam with removal information rose from 30% in 1997 to 73% in 2001 before falling to 46% in 2003, while unsubscribe information grew from 0.2% to 15%.

                                      1997    1998    1999    2000    2001    2002    2003
    Having removal information        30%     48%     54%     60%     73%     65%     46%
    Having unsubscribe information    0.2%    1%      2%      4%      7%      17%     15%
    Talking about privacy             1%      3%      5%      6%      5%      5%      5%
    Claim you were on the list        32%     29%     24%     24%     30%     29%     22%

    Table 7. Spam with removal instruction

    Conclusions and Thoughts

    The statistics presented in this research are perhaps some of the most important information about spam characteristics. Having these statistics is important for understanding the nature of spam, so that we can find the best way to deal with it. I think producing a statistical report should be the first step before actually writing any spam filtering software. With this research, I also want to show that spam has changed a lot in the past few years in order to get around spam-blocking systems. Along the way, I had the chance to investigate several kinds of spam filtering software: content-based techniques (SpamAssassin), distributed notification systems (SpamNet at Cloudmark), blacklist providers (MAPS), and the Bayesian algorithm (suggested by Paul Graham). They are fine techniques and are widely used in the market. However, all of them share the same weakness in fighting unknown spam. I believe that a good anti-spam system will not only find yesterday's spam but will also evolve and help find tomorrow's spam. We still get spam because we don't have an effective algorithm for recognizing spammers' new tactics. In other words, we cannot get updated information about spam fast enough, and thus we are always behind the spammers.

    Source codes

  • GetSpam.java : Collect spam from Google's group website
  • Parser.java : Extract spam header fields and body contents
  • Received.java: Get spam origin from Received header
  • Origin.java : Compute statistic value of spam origin
  • Address.java : Check if sender or reply address is real or faked
  • SpamWord.java: Compute the most common spam words in each year
  • Words.java : Compute the used frequency of input word in each year
  • Subject.java : Subject header analysis

    References

  • RFC 821, RFC 822, RFC 2045, RFC 2046
  • SpamAssassin.org
  • Cloudmark SpamNet
  • Figuring out fake E-Mail & Posts
  • Paul Graham, A Plan for Spam
  • David Mertz, Six approaches to eliminating unwanted e-mail
  • Brandon M. Browning, Getting Rid of Spam

    Linh Bui, May 17, 2003