Spamfighters

One sad side effect of online life is that the longer you stick around, the more spam you get. This is particularly true if your email address appears online in any clear, unobscured form [Why Am I Getting All This Spam? (3/2003 report by CDT)].

After I had been online at more or less the same address for about five years, the amount of junk mail I was receiving finally reached ridiculous proportions earlier this summer. I was getting hundreds of spam messages per day, and just deleting these messages, easy as it was, took too much time. I needed to put a stop to it. Over the past few months, I tried four spam-fighting strategies, with varying success.

One: Spam filtering with Opera

The web browser Opera 7 [Opera Software] includes an email client, called M2. It’s got built-in spam filtering features, so I decided to test that first, since I was already using the browser and had been impressed by its speed.

The spam filtering in Opera is extremely easy to set up: All you need to do is specify, on a drop-down box, whether you want the internal filter set to strong, medium, or off. Opera takes it from there, redirecting any messages that it deems suspicious to a “spam” folder. Since I was getting hundreds of junk messages per day, I set this filter to “strong” and smiled smugly as I watched the folder fill up with dozens of messages.

My happiness was short-lived, however – turns out that strong filtering in Opera really is strong, and it had shunted nearly all of my mail, including a good number of legitimate messages, into the spam category. I reset the filter to “medium” and got somewhat better results, but I still had to comb through the spam folder every day, looking for messages that were inadvertently tagged as suspicious. Also, whether set to strong or medium, the filter let many spam messages through, so I was still deleting junk from my inbox – though not nearly so much as before.

This points out the biggest problem with spam avoidance schemes that are based on filtering. No matter how good the filter, you’re going to have some false positives and some false negatives. That means you still have to dirty your hands dealing with spam, deleting stray junkmail from your inbox and rescuing mis-categorized messages from your “spam” folder. This is not much of an improvement.

To top it off, Opera’s mail-filtering software isn’t very smart, and it doesn’t learn from its mistakes. If you find a legitimate message in the spam folder, you can push a button to tag it as “not spam” but all this does is add the sender to your list of contacts. It does nothing to increase the efficiency of the filter itself.

Two: Training Popfile

I decided to take another approach, by installing filtering software that you can “train” so its efficiency improves over time. I downloaded Popfile, a free, open-source email filtering program that does just this. [Popfile]

Popfile acts as a proxy mail server, running in the background on your own computer. You need to change the settings within your mail client so that it points to the Popfile program. From then on, whenever you check for mail, Popfile retrieves your messages from your ISP and then hands them off to your mail client. In the process, Popfile examines the content of each message, guesses how to categorize it, and adds a flag to the message’s headers indicating the appropriate categorization. You then set up filters within your mail client to file each incoming message according to the headers that have been added by Popfile.
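To make the header-based filing step concrete, here is a minimal sketch in Python of what a mail client's filter rule effectively does: read the classification header and pick a folder. The header name X-Text-Classification and the function name choose_folder are assumptions for illustration; check the headers of a message that has passed through your own Popfile installation before relying on either.

```python
from email import message_from_string

# Assumed header name, for illustration only; confirm it against a real
# message that has passed through your own Popfile setup.
CLASSIFICATION_HEADER = "X-Text-Classification"


def choose_folder(raw_message: str, default: str = "inbox") -> str:
    """Return the folder named by the classification header, if present."""
    msg = message_from_string(raw_message)
    bucket = msg.get(CLASSIFICATION_HEADER)
    if bucket is None:
        # No flag was added, so leave the message where it would normally go.
        return default
    return bucket.strip().lower()


raw = (
    "From: someone@example.com\n"
    "Subject: Cheap pills\n"
    "X-Text-Classification: spam\n"
    "\n"
    "Buy now!\n"
)
print(choose_folder(raw))
```

In a real mail client this logic lives in a filter rule ("if header contains spam, move to the spam folder") rather than in code, but the decision is the same.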

As installed, Popfile has zero filtering intelligence – you need to train it by first downloading a few messages, then telling it how those messages should be categorized. Popfile analyzes the words contained in each message and figures out the statistical correlation between each word and its likely categorization – for instance, if you put a few messages containing the word “viagra” into the spam category, Popfile figures that any message containing this word is more likely to be spam. This calculation extends to almost every word in almost every message, resulting in fairly sophisticated automated classification.
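The word-counting idea can be sketched in a few lines of Python. This is a generic naive Bayes text classifier with add-one smoothing, written for illustration; it is not Popfile's actual code, and the class name WordClassifier, its methods, and the sample messages are all hypothetical.

```python
import math
import re
from collections import Counter, defaultdict


class WordClassifier:
    """Toy naive Bayes classifier: learns word counts per bucket."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # bucket -> word -> count
        self.message_counts = Counter()          # bucket -> messages trained

    @staticmethod
    def tokenize(text):
        return re.findall(r"[a-z0-9']+", text.lower())

    def train(self, text, bucket):
        self.message_counts[bucket] += 1
        self.word_counts[bucket].update(self.tokenize(text))

    def classify(self, text, default="inbox"):
        # With no training at all, everything lands in the default bucket.
        if not self.message_counts:
            return default
        vocab = set()
        for counts in self.word_counts.values():
            vocab.update(counts)
        total_messages = sum(self.message_counts.values())
        words = self.tokenize(text)
        best_bucket, best_score = default, -math.inf
        for bucket, n in self.message_counts.items():
            counts = self.word_counts[bucket]
            total_words = sum(counts.values())
            # Start from how common the bucket is, then add evidence per word.
            score = math.log(n / total_messages)
            for word in words:
                # Add-one smoothing so unseen words do not zero out a bucket.
                score += math.log((counts[word] + 1) / (total_words + len(vocab)))
            if score > best_score:
                best_bucket, best_score = bucket, score
        return best_bucket


clf = WordClassifier()
clf.train("cheap viagra buy now", "spam")
clf.train("viagra discount offer", "spam")
clf.train("lunch meeting tomorrow at noon", "inbox")
clf.train("notes from the project meeting", "inbox")
print(clf.classify("viagra offer"))
print(clf.classify("project meeting"))
```

Each training message nudges the word counts for its bucket, which is why accuracy climbs as more messages are categorized by hand.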

You can set up any number of “buckets” into which Popfile should put messages, but I used just two – “inbox” and “spam.” At first, it threw everything into the inbox. I went into the program’s Web-based interface and told it how to categorize each of several dozen messages, then downloaded the next batch of incoming mail from my server. This time, Popfile’s accuracy rose to more than 90%. After two days, it was categorizing spam messages with a pretty remarkable 98% accuracy.

However, like Opera’s filter, Popfile’s filtering doesn’t eliminate the need for me to look at spam. I’ve still got to pick legitimate messages out of my spam box, and vice versa. Worse, I need to take the additional step of going into the Popfile interface, finding the miscategorized messages, and telling Popfile what categories they really belong to. In the end, Popfile is only a marginal time-saver – though it certainly does cut down significantly on the stress of seeing a couple hundred spam messages in my inbox.

Three: Learning to love Oddpost

Oddpost is a web-based email service that costs $30 per year [Oddpost]. It’s got a much more Windows-like interface than most Web applications I’ve seen, and works quite well for basic email. Because it’s web-based, it’s useful to anyone who needs to read and send email from more than one location (such as work and home). It has a built-in RSS newsfeed reader, making it a convenient tool for keeping up with news and weblogs.

Oddpost also has built-in spam protection. No configuration required here – it automatically files suspicious messages right into the spam folder. If a junk message sneaks into your inbox, you can “nuke” it by clicking on a button labeled with a little mushroom cloud. To rescue legitimate messages from the spam folder, you click on a peace symbol button; this moves them back into your inbox.

The spam filter here is remarkably effective, and best of all, it has very few false positives – that is, legitimate messages almost never get miscategorized as spam. That’s especially nice, because you can ignore the spam folder for days at a time without fear of missing something important – something you cannot do when using Popfile or Opera’s spam filter.

In addition to being a webmail service, Oddpost also provides IMAP access to its servers. If you have an IMAP-compatible email client, you can use it to connect to your Oddpost account – and the spam filtering continues to work. Very nice.

Four: Changing my address

Based on my informal experiments, I have a few conclusions.

First, for content filtering systems, false positives (mistakenly tagging messages as spam) are much worse than false negatives (mistakenly letting spam through the filter). If a spam filter catches 98% of the spam messages without mistakenly trashing a single legitimate message, it has saved the user a lot of time. On the other hand, if it mistakenly trashes just 1% of the legitimate messages, the user might as well have no filtering at all, because they have to comb through all the spam that was filtered out, looking for legitimate mail embedded among the trash.
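A rough back-of-the-envelope calculation shows why. The sketch below counts how many messages a user has to look at each day, on the assumption that the spam folder can be ignored only when the filter never misfiles legitimate mail. The volumes (200 spam and 50 legitimate messages per day) and the function name daily_review_load are made up for illustration.

```python
def daily_review_load(spam_per_day, legit_per_day, catch_rate, false_positive_rate):
    """Estimate how many messages must be looked at per day."""
    spam_caught = spam_per_day * catch_rate
    spam_missed = spam_per_day - spam_caught
    legit_misfiled = legit_per_day * false_positive_rate
    inbox = spam_missed + (legit_per_day - legit_misfiled)
    spam_folder = spam_caught + legit_misfiled
    # The spam folder only needs checking if legitimate mail can end up in it.
    to_review = inbox + (spam_folder if false_positive_rate > 0 else 0)
    return {"inbox": inbox, "spam_folder": spam_folder, "to_review": to_review}


# Hypothetical volumes: 200 spam and 50 legitimate messages per day.
no_false_positives = daily_review_load(200, 50, 0.98, 0.0)
some_false_positives = daily_review_load(200, 50, 0.98, 0.01)
print(round(no_false_positives["to_review"]))
print(round(some_false_positives["to_review"]))
```

Under these assumed numbers, a filter with no false positives leaves about 54 messages to look at, while a 1% false positive rate puts all 250 back on the table, the same as having no filter.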

Second, spam filtering has to be tightly integrated into the email client so that users can trash spam messages – and train the filtering algorithms, if possible – with a single click. Anything else is just too much of a hassle.

Finally, spam control based on content filtering is ultimately a losing battle. Spammers are getting more and more clever about making the content of their messages look like legitimate mail. Besides, even baldfaced advertising will slip past if it contains content that is new to the filter, before the filter has been trained or adjusted to catch it. Over the long term, this means that antispam content filtering will always remain an arms race with the spammers, with no decisive victories to be won by either side.

Ultimately, I suspect, the only way to get rid of spam entirely is to use a “white list” system, in which the only people who can send messages directly to you are those you’ve already identified as legitimate senders. Senders who are not on your white list will have to validate themselves before their messages can go through, either by replying to an automated message or by doing something only humans can do, such as reading partially obscured text in a GIF image. Spam tools such as Matador use exactly this approach, and although it’s annoying – and doesn’t work perfectly [Spamfighters
