More on my ideas to build effective spam filters...
What we want is to identify the text of the convincing section, the action section, and the excuse section. How can we do this?
Here's my idea.
From a selection of normal English language text -- however that is defined -- create a database that records the frequencies of one-, two-, and three-word combinations. Create a similar database from spam text. We use these databases to compute the "surprise" value of the one-, two-, and three-word combinations in the text we are testing. Common word combinations have a low surprise value. The word "the", for example, has almost no surprise value. The word "viagra" has high surprise value in normal text, but low surprise value in spam text. (This is basic information theory: the surprise of something that occurs with probability p is -log2 p, so the rarer it is, the higher the value.) When we test the text of a message, we look for those word combinations that have a high surprise value against normal English text, but that have a low surprise value against spam text.
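Here is a rough sketch of the two databases and the surprise calculation, in Python. The function names, the add-one smoothing, and the one-bit margin are my own assumptions for illustration, not a tested design. For simplicity it pools the one-, two-, and three-word counts into a single table, which a real filter would probably keep separate.

```python
import math
from collections import Counter

def ngrams(text, max_n=3):
    """Yield every one-, two-, and three-word combination in the text."""
    words = text.lower().split()
    for n in range(1, max_n + 1):
        for i in range(len(words) - n + 1):
            yield tuple(words[i:i + n])

def build_database(corpus):
    """Count how often each word combination occurs in a list of texts."""
    counts = Counter()
    for text in corpus:
        counts.update(ngrams(text))
    return counts

def surprise(combo, counts):
    """Surprise in bits: -log2 of the smoothed relative frequency.

    Add-one smoothing (an assumption of this sketch) keeps unseen
    combinations from producing an infinite value.
    """
    total = sum(counts.values())
    vocab = len(counts) + 1  # +1 reserves room for unseen combinations
    p = (counts[combo] + 1) / (total + vocab)
    return -math.log2(p)

def suspicious_combos(message, normal_db, spam_db, margin=1.0):
    """Combinations at least `margin` bits more surprising in normal text than in spam."""
    found = {}
    for combo in set(ngrams(message)):
        gap = surprise(combo, normal_db) - surprise(combo, spam_db)
        if gap >= margin:
            found[combo] = gap
    return found

normal_db = build_database(["the cat sat on the mat", "the dog ate the food"])
spam_db = build_database(["buy viagra now", "cheap viagra buy now"])
hits = suspicious_combos("get the viagra today", normal_db, spam_db)
```

With these toy databases, "viagra" lands in hits and "the" does not. Real databases would of course need far more text than this.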
As protection against false positives, we could also maintain a third database, built from the word combinations in mail received from known non-spam sources. There would be many word combinations that have a high surprise value against normal English text -- first names of your friends, relatives, or co-workers, for example -- and a low surprise value against the text in legitimate mail. If the spam detector finds enough of these word combinations, it would classify the mail as legitimate. Of course, this part of the spam filter works best if it is used for individuals, rather than the general public. However, it may also be effective for certain groups, such as a family group or a work group. The more that group has in common, the more effective this phase of the detector will be.
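The false-positive check could look something like the sketch below. It assumes the surprise values (in bits) have already been computed and stored in plain dictionaries. The name looks_legitimate and all of the thresholds are hypothetical placeholders that would need tuning.

```python
def looks_legitimate(combos, normal_bits, legit_bits,
                     high=10.0, low=6.0, min_hits=2, unseen=20.0):
    """Return (verdict, hits) for the word combinations found in a message.

    normal_bits / legit_bits map a word combination to its surprise in bits
    against normal English text and against known legitimate mail.
    A combination missing from a table is treated as very surprising.
    """
    hits = [c for c in combos
            if normal_bits.get(c, unseen) >= high    # rare in normal English
            and legit_bits.get(c, unseen) <= low]    # common in my own mail
    return len(hits) >= min_hits, hits

normal_bits = {"the": 1.0, "meeting": 8.0}
legit_bits = {"the": 1.0, "aunt marge": 4.0, "bob": 3.5, "meeting": 5.0}
verdict, hits = looks_legitimate(["the", "aunt marge", "bob", "meeting"],
                                 normal_bits, legit_bits)
```

Here the two personal names count as hits, so the message passes as legitimate. A message whose only unusual word is absent from the legitimate-mail table would not.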
We could refine this basic technique. If we manually "teach" the detector, we could be careful to separate the spam text into convincing text, action text, and excuse text, then create a separate database for each section. Then, we would have separate computed numbers for matching the convincing text, the action text, and the excuse text. We could weight these numbers differently when computing the overall score.
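Combining the three numbers might be as simple as a weighted average. In this sketch the section scores are assumed to be already scaled to the range 0 to 1, and the weights are made-up values, not ones I have tuned.

```python
def overall_score(section_scores, weights):
    """Weighted average of the per-section match scores."""
    total_weight = sum(weights.values())
    weighted_sum = sum(weights[name] * section_scores[name] for name in weights)
    return weighted_sum / total_weight

# Hypothetical numbers: the action text is weighted twice as heavily.
scores = {"convincing": 0.2, "action": 0.9, "excuse": 0.5}
weights = {"convincing": 1.0, "action": 2.0, "excuse": 1.0}
score = overall_score(scores, weights)
```

The message would then be flagged as spam when the score crosses some cutoff, which is one more parameter to tune.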
This is all theory. In practice, a lot of work must be done to tune the parameters. My guess is that this kind of analysis in a spam detector would work pretty well, provided the text used in the learning phase is of high quality.
Posted by Doug Sauder at August 5, 2002 02:14 PM