August 05, 2002

Rick's Spam Filters

Rick's Spam Filters: very interesting. Simple filters that one can use in Eudora to eliminate a lot of spam.

It's a late night, and I just can't stop thinking about possible techniques to eliminate spam. I'm a believer. I believe that spam can be filtered effectively. Maybe it will take some sophisticated artificial intelligence, but I believe it can be done.

Consider the random tags that spammers put at the end of the subject line. It's a pretty clear giveaway that the email is spam. Why do they put those tags there? My guess is to randomize the subject line. Probably there are server-operated filters that decide a message is spam if they see the same subject line more than N times, where N is some very large number. Perhaps the filter computes a checksum and the random string alters the checksum. So, is it possible to detect these random tags? Absolutely! Apply basic information theory. Compute the frequency of all two-letter combinations in English words. Then use these frequencies to compute the "information" in the last word of the subject line. Because the two-letter combinations in a random string are unusually rare in English, the "information" will be very great. In plain English, this just means that there is not sufficient redundancy in the random string to make it look like an English word.
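Here is a rough sketch of that idea in Python. It is only an illustration under some assumptions of mine: the tiny word list stands in for a real English dictionary, the function names are made up, and I average the information per letter pair (with add-one smoothing) so that long words are not penalized just for being long.

```python
import math
from collections import Counter

# Tiny stand-in corpus (an assumption); a real filter would train on a
# large English word list.
SAMPLE_WORDS = (
    "free money offer online order today limited time only earn extra "
    "income from home the and for you your this that with have there "
    "their one more mortgage phone honey"
).split()


def train_bigrams(words):
    """Count every two-letter combination that occurs inside the words."""
    counts = Counter()
    for word in words:
        word = word.lower()
        for a, b in zip(word, word[1:]):
            counts[a + b] += 1
    return counts


def bits_per_bigram(word, counts, alphabet_size=26):
    """Average information, in bits, of the letter pairs in `word`.

    Non-letters are dropped, so a tag made only of digits scores 0.0
    and would need separate handling.
    """
    letters = [ch for ch in word.lower() if "a" <= ch <= "z"]
    pairs = [a + b for a, b in zip(letters, letters[1:])]
    if not pairs:
        return 0.0
    total = sum(counts.values())
    possible = alphabet_size ** 2
    bits = 0.0
    for pair in pairs:
        # Add-one smoothing so unseen pairs get a small nonzero probability.
        probability = (counts[pair] + 1) / (total + possible)
        bits += -math.log2(probability)
    return bits / len(pairs)


def last_word_score(subject, counts):
    """Score the last word of a subject line; higher means less English-like."""
    words = subject.split()
    return bits_per_bigram(words[-1], counts) if words else 0.0


counts = train_bigrams(SAMPLE_WORDS)
print(last_word_score("Earn extra money", counts))
print(last_word_score("Earn extra money xqzjkvw", counts))
```

The second subject line should score noticeably higher than the first. Where to put the cutoff between "English" and "random" is something that would have to be tuned on real mail.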

But subject lines won't do. How about those spammers that just put no subject line at all? Or else their subject line is "hi"? To really fight spam, we must look at the text of the message.

Surely, there are lots of tricky things we could do to detect spam. For example, if the message is malformed, it's spam. If there are a lot of recipients in the TO or CC line, then it's spam. If it contains an HTML table, then it's spam. The problem with all these "tricks" is that once spammers catch on to them, they just change their messages to defeat the spam detectors. It's spy vs. spy.

So, we have to be smarter. We want to find spam detectors that are very difficult to defeat, because they do not rely on any "tricks". To do that, we have to look at the text of the message.

But first, let's consider that every spam filter should have a whitelist. Email that comes from your co-workers or relatives should always get through. But spammers who have amassed huge lists of email addresses could forge the sender, making it an address from the same domain as the recipient. So, maybe we also need a super whitelist that contains senders who are authenticated via digital IDs. The whitelist will always take precedence over other decisions.

Next, let's consider that spammers often use fake sender addresses. So, a good spam filter should try to verify the sender's email address. If it's not a valid email address, let's declare it to be spam. We can verify the email address by connecting to the SMTP server for the sender's domain and proceeding to the point where we would send the DATA command. At that point, we send an RSET and QUIT instead. If the SMTP server will not accept the RCPT command, then the email address is invalid. An alternative is to use the VRFY command. Some ISPs block port 25, and some ISPs redirect it to their own SMTP server, so it may not be possible to verify the email address directly. Perhaps an email address verifier would make a good web service.
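The RCPT probe might look something like this with Python's standard smtplib. Treat it as a sketch, not a finished verifier: the function name is mine, the caller has to find the mail host for the sender's domain some other way (the standard library has no MX lookup), and a server that accepts every recipient will make any address look valid.

```python
import smtplib


def address_accepted(smtp, probe_from, address_to_check):
    """Walk an SMTP session up to the point where DATA would be sent.

    `smtp` is an already-connected smtplib.SMTP object (or anything with
    the same helo/mail/rcpt/rset/quit methods). Returns True if the
    server accepted the RCPT command for `address_to_check`.
    """
    try:
        smtp.helo()
        smtp.mail(probe_from)
        code, _reply = smtp.rcpt(address_to_check)
        smtp.rset()  # abandon the transaction instead of sending DATA
        # 250 and 251 are the "recipient OK" replies; anything else is
        # treated as a rejection here.
        return code in (250, 251)
    finally:
        smtp.quit()


# Example use (hypothetical host and addresses, needs outbound port 25):
#
#   server = smtplib.SMTP("mail.example.com", 25, timeout=10)
#   if not address_accepted(server, "postmaster@my-domain.example",
#                           "sender@example.com"):
#       print("sender address rejected; treat the message as spam")
```

Passing in the connection rather than opening it inside the function keeps the probe easy to try against a fake server, and leaves the question of which host to contact to the caller.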

Finally, we get down to looking at the text of the message. In order for a spammer to succeed, he has to ask you to do something in response to the message. Before he asks you, he probably tries to convince you. And, of course, many spammers also excuse themselves. So, there are two required sections: a convincing section (he tries to convince you of the value of whatever he is offering) and an action section (he asks you to do something in response). And there is one optional section: an excuse section (he tells you this is not spam because you put your name on an opt-in list, he tells you how to get off the list, he explains he is in compliance with the law, he asks your forgiveness for the intrusion, etc.).

The easiest section to deal with must certainly be the action section. Most commonly, the action is to click on a hyperlink. I think this could be handled in two different ways. One way is to get everyone to comply. The spam filter software should make a request for the URL. With a little luck, the volume of requests would overwhelm the server and take it down. It certainly would cost the spammer more in fees to the hosting provider. A big problem with this approach, however, is that clever senders could use it to get information that we may not want them to get. They might send just a few messages, knowing that a request will automatically be made to that URL. (What a great way to know that your message was received!)

A different way to handle the action hyperlink would be to check the hostname of the URL against a list of hostnames. This could be done with a centralized server. Or it could be done via a P2P dissemination of the list. There could be a whitelist of hostnames and a blacklist. The whitelist means that if I mail a friend a URL for an interesting article on the web, the message won't be classified as spam. The blacklist means that if a message contains a URL with a blacklisted hostname, it is classified as spam.
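A minimal sketch of the hostname check, again in Python. The hostnames, the function names, and the three-way result are all made up for illustration, and the lists are passed in as plain sets rather than fetched from a server or a P2P network. I also had to pick what happens when a message has both a whitelisted and a blacklisted link; here the blacklist wins, which is my assumption.

```python
import re
from urllib.parse import urlparse

# Rough pattern for http/https links in plain text or HTML source.
URL_RE = re.compile(r"https?://[^\s<>\"']+", re.IGNORECASE)


def hostnames_in(text):
    """Return the set of lowercase hostnames of all URLs found in `text`."""
    hosts = set()
    for url in URL_RE.findall(text):
        try:
            host = urlparse(url).hostname
        except ValueError:
            continue  # malformed URL, e.g. a broken IPv6 literal
        if host:
            hosts.add(host.lower())
    return hosts


def classify_by_hostname(text, whitelist, blacklist):
    """Return "spam", "ok", or "unknown" based only on linked hostnames."""
    hosts = hostnames_in(text)
    if hosts & blacklist:
        return "spam"      # any blacklisted hostname condemns the message
    if hosts and hosts <= whitelist:
        return "ok"        # every linked hostname is whitelisted
    return "unknown"       # no links, or hostnames on neither list


# Hypothetical lists, for illustration only.
whitelist = {"news.example.org"}
blacklist = {"pills.example"}

print(classify_by_hostname("Read http://news.example.org/story", whitelist, blacklist))
print(classify_by_hostname("Buy now: http://pills.example/order", whitelist, blacklist))
```

A message that comes back "unknown" would be handed on to whatever other checks the filter has.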

I know there are efforts underway to create checksums of messages and compare them to a list. I think that if we compare just the hostnames of any URLs against a list, that should be sufficient to filter spam.

Certainly, there are many techniques we could use to filter spam. Ultimately, though, the best filters will try to make sense of the actual meaning of the message. In order to make progress toward this goal, I think breaking the message down into the three sections that I mentioned above is the starting point. There are only so many words that are used to try to convince someone of a point. There are only so many words that are used to tell someone how to take an action. Can we find a way to analyze the words in the message to discern a convincing section and an action section?

Posted by Doug Sauder at August 5, 2002 04:40 AM