SpamAssassin Rules

SpamAssassin
SpamAssassin is a filter that blocks spam based on a combination of rules.

Bayes Training
SpamAssassin uses Bayesian filtering of email to stop spam. The first step is to teach SpamAssassin the difference between spam and non-spam ("ham") mail. You do so by giving it a large sample set (at least 100 mails), which it examines for word usage. It may learn that spam often contains words like "viagra", while ham contains your full name. When a new mail arrives, it looks at the word usage and, based on that, gives a probability score for whether the mail is spam or not.
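
To make the idea concrete, here is a toy sketch of the per-word calculation. It is not SpamAssassin's actual algorithm, which tokenizes mail and combines the evidence of many tokens. The function name word_spam_probability and the sample files (one message per line) are assumptions made purely for illustration.

```shell
# Toy estimate of P(spam | word), assuming spam and ham are equally likely.
# Usage: word_spam_probability WORD SPAM_FILE HAM_FILE
# Each file is assumed to hold one message per line (hypothetical format).
word_spam_probability() {
    word=$1
    spam_total=$(wc -l < "$2")
    ham_total=$(wc -l < "$3")
    # Count the messages that contain the word (whole word, case-insensitive).
    spam_hits=$(grep -ciw -- "$word" "$2")
    ham_hits=$(grep -ciw -- "$word" "$3")
    awk -v s_hit="$spam_hits" -v s_all="$spam_total" \
        -v h_hit="$ham_hits" -v h_all="$ham_total" 'BEGIN {
        # Add-one smoothing keeps unseen words away from 0 and 1.
        p_spam = (s_hit + 1) / (s_all + 2)
        p_ham  = (h_hit + 1) / (h_all + 2)
        printf "%.2f\n", p_spam / (p_spam + p_ham)
    }'
}
```

A word that appears in most sample spam and in no sample ham ends up well above 0.5, and a word seen only in ham ends up well below it.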

Mail boxes
I have a few mail boxes devoted to spam:

 * Spam/incoming: SpamAssassin stores mail here that it suspects is spam but is not 100% sure about. I have to check these manually (if it is sure, it simply deletes the mail).
 * Spam/confirmed: After confirming that something is spam, I put it in this box.
 * Spam/negatives: The confirmed mail is often stored as an attachment to another mail that contains the SpamAssassin report. I run a cron job to extract the confirmed spam from the report and store it here.
 * Spam/falsenegatives: I put spam that slipped through the filter here.
 * Spam/falsepositives: I put non-spam mails that were accidentally marked as spam here.
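
Since training needs a reasonably large sample, it can help to see how many messages each training folder holds. The sketch below assumes the Maildir layout used elsewhere on this page (for example Spam/negatives stored as $HOME/Maildir/.Spam.negatives); the function names are made up for this example.

```shell
# Count the messages in the cur/ directory of one Maildir folder.
maildir_count() {
    find "$1/cur" -type f 2>/dev/null | wc -l | tr -d ' '
}

# Print a message count for each folder that is used for training.
# Usage: spam_folder_report [MAILDIR]   (defaults to $HOME/Maildir)
spam_folder_report() {
    maildir=${1:-$HOME/Maildir}
    for box in negatives falsenegatives falsepositives; do
        printf '%s: %s\n' "Spam/$box" "$(maildir_count "$maildir/.Spam.$box")"
    done
}
```

A folder that does not exist is simply reported with a count of 0.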

Training set
The cron job I run calls extract-spam.php and then trains SpamAssassin with sa-learn:

#!/bin/bash
HOME=/home/freek
PATH=/bin:/sbin:/usr/bin:/usr/sbin:/usr/local/bin:/home/freek/bin
LC_ALL=C
extract-spam.php >> $HOME/log/spam.log
sa-learn --spam $HOME/Maildir/.Spam.negatives/cur/      >> $HOME/log/spam.log
sa-learn --spam $HOME/Maildir/.Spam.falsenegatives/cur/ >> $HOME/log/spam.log
sa-learn --ham  $HOME/Maildir/.Spam.falsepositives/cur/ >> $HOME/log/spam.log
# Run by hand only, see the note below:
# sa-learn --ham  $HOME/Maildir/cur/                      >> $HOME/log/spam.log

The last line assumes that all mail in the default inbox is non-spam. This is a big assumption, so it is commented out in the cron job and I have to remember to run it manually every once in a while.
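
One way to keep that manual step deliberate is a small wrapper that asks for confirmation first. This is only a sketch: the function name learn_inbox_ham and the SA_LEARN override variable are assumptions for this example, not part of SpamAssassin.

```shell
# Ask before learning the whole inbox as ham.
# Usage: learn_inbox_ham [MAILDIR]   (defaults to $HOME/Maildir)
# SA_LEARN may name an alternative command; it defaults to sa-learn.
learn_inbox_ham() {
    maildir=${1:-$HOME/Maildir}
    printf 'Learn every message in %s/cur as ham? [y/N] ' "$maildir" >&2
    read -r answer || answer=n
    case $answer in
        [yY]*) ${SA_LEARN:-sa-learn} --ham "$maildir/cur/" ;;
        *)     echo "Skipped: inbox not learned as ham." ;;
    esac
}
```

Anything other than an answer starting with "y" leaves the Bayes database untouched, so the default is the safe choice.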

Custom rules
In addition to the Bayes rules, SpamAssassin has many other rules. For example, HTML mail has a higher chance of being spam, so I mark it as such. A lot of people have written their own custom rules for SpamAssassin.

Included rules
The rules included with SpamAssassin should in most cases be enough to block all spam. You should regularly update these rules. Since version 3.1 this can be done with:

sa-update
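
When run from cron, the exit code tells whether anything changed. As far as I know sa-update exits with 0 when new rules were installed, 1 when there was nothing new, and a higher value on errors, but check the sa-update manual for your version. The wrapper below is a sketch; the function name and the SA_UPDATE override variable are assumptions for this example.

```shell
# Run sa-update and report what happened based on its exit code.
# SA_UPDATE may name an alternative command; it defaults to sa-update.
update_rules() {
    ${SA_UPDATE:-sa-update}
    status=$?
    case $status in
        0) echo "New rules installed; restart spamd if you run it." ;;
        1) echo "Rules already up to date." ;;
        *) echo "sa-update failed with exit code $status" >&2
           return "$status" ;;
    esac
}
```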

rule-get
I manage some custom rules with the "rule-get" script.

I do not recommend using it anymore. For one thing, badly scored custom rules may result in a lot of false positives. You may get some inspiration from them, but in general it is better to create custom rules yourself based on your specific mail behaviour (e.g. based on language or mailing list topic) rather than using these untested generic rules.

Also, the rule-get project seems mostly dead. For one thing, the URL of the rule specifications moved from http://airmex.nerim.net/rule-get/rules.ini to http://maxime.ritter.eu.org/Spam/rules.ini, while the first URL is still hard-coded in the rule-get script.

An alternative is to look at the SpamAssassin wiki: http://wiki.apache.org/spamassassin/CustomRulesets

Custom rules
Since my mail profile is different from most, I added a few manual rules to SpamAssassin, mostly to cope with language-specific and mailing-list-specific characteristics.

It is important to verify the efficiency of these rules to avoid false positives.

An example of a custom rule is:

header   DEBIAN_LIST  List-Id =~ /lists\.debian\.org/
describe DEBIAN_LIST  Debian Mailinglist ID
score    DEBIAN_LIST  -3.0
tflags   DEBIAN_LIST  nice

Note that tflags is not used in regular operation, but it must be set to "nice" for negative-scoring rules in case you want to verify their efficiency later on.
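
Before relying on a new rule, it is worth checking that the rule files parse and that the rule fires on a known message. The sketch below uses spamassassin --lint for the syntax check and spamassassin -t (test mode, which appends a report listing the matched rules) for the match check. The function name check_rule and the SPAMASSASSIN override variable are assumptions for this example.

```shell
# Check the rule syntax, then test whether RULE fires on MESSAGE_FILE.
# Usage: check_rule RULE MESSAGE_FILE
# SPAMASSASSIN may name an alternative command; it defaults to spamassassin.
check_rule() {
    rule=$1
    message=$2
    if ! ${SPAMASSASSIN:-spamassassin} --lint; then
        echo "lint failed: fix the rule files first" >&2
        return 2
    fi
    # In test mode the report names every rule that matched the message.
    if ${SPAMASSASSIN:-spamassassin} -t < "$message" | grep -q -w -- "$rule"; then
        echo "$rule matched"
    else
        echo "$rule did not match"
        return 1
    fi
}
```

Calling it with DEBIAN_LIST and a message saved from a Debian mailing list should report a match if the rule is loaded.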

Tools
Download SpamAssassin. In the masses subdirectory, you will find miscellaneous tools to test the efficiency of all rules.

First of all, you should run mass-check, which takes your messages, and counts which rules are matched:

Important: You must first verify that the mails in these folders are indeed spam or ham. Incorrectly filed mails may skew the results significantly.

Note: --lint was only introduced in mass-check 3.1, and a ~ in the -p option is not recognized.

./mass-check -c=/usr/share/spamassassin -p=/home/freek/.spamassassin --progress --lint \
  spam:dir:~/Maildir/.Spam.negatives/cur \
  spam:dir:~/Maildir/.Spam.falsenegatives/cur \
  ham:dir:~/Maildir/.Spam.falsepositives/cur \
  ham:dir:~/Maildir/cur \
  ham:dir:~/Maildir/.Computers.Apple/cur \
  ham:dir:~/Maildir/.Computers.Debian/cur \
  ham:dir:~/Maildir/.Computers.bugreports/cur \
  ham:dir:~/Maildir/.Computers.macports/cur \
  ham:dir:~/Maildir/.Computers.zeroconf/cur \
  ham:dir:~/Maildir/.Hobbies.scatterlings/cur \
  ham:dir:~/Maildir/.News.jobs/cur \
  ham:dir:~/Maildir/.News.maillist/cur \
  ham:dir:~/Maildir/.News.netmags/cur \
  ham:dir:~/Maildir/.News.security/cur \
  ham:dir:~/Maildir/.News.Store/cur \
  ham:dir:~/Maildir/.Organisation.ABOU/cur \
  ham:dir:~/Maildir/.Organisation.Hostel/cur \
  ham:dir:~/Maildir/.Personal.financial/cur \
  ham:dir:~/Maildir/.Personal.trouwen/cur \
  ham:dir:~/Maildir/.Reference.Shops/cur \
  ham:dir:~/Maildir/.Reference.Sites/cur \
  ham:dir:~/Maildir/.Reference.Software/cur \
  ham:dir:~/Maildir/.Websites.security/cur \
  ham:dir:~/Maildir/.Websites.boek/cur \
  ham:dir:~/Maildir/.Websites.omroephumor/cur \
  ham:dir:~/Maildir/.Websites.internlnet/cur

This generates the files ham.log and spam.log.
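
Typing a long target list by hand is error-prone. A possible sketch that builds it from the folders present in a Maildir is shown below. It assumes that every folder outside Spam.* holds only ham, which you must verify yourself (a Trash or Drafts folder may not qualify); the function name mass_check_targets is made up for this example.

```shell
# Print one mass-check target per line for the given Maildir.
# Usage: mass_check_targets MAILDIR
mass_check_targets() {
    maildir=$1
    # Confirmed spam folders.
    for box in negatives falsenegatives; do
        if [ -d "$maildir/.Spam.$box/cur" ]; then
            printf 'spam:dir:%s/.Spam.%s/cur\n' "$maildir" "$box"
        fi
    done
    # Ham that was wrongly flagged, plus the inbox itself.
    if [ -d "$maildir/.Spam.falsepositives/cur" ]; then
        printf 'ham:dir:%s/.Spam.falsepositives/cur\n' "$maildir"
    fi
    printf 'ham:dir:%s/cur\n' "$maildir"
    # Every other folder is treated as ham; the remaining Spam.* folders
    # (such as Spam/incoming) are skipped because they are unverified.
    for dir in "$maildir"/.[!.]*; do
        case ${dir##*/} in
            .Spam|.Spam.*) continue ;;
        esac
        if [ -d "$dir/cur" ]; then
            printf 'ham:dir:%s/cur\n' "$dir"
        fi
    done
}
```

The output can be reviewed first and then passed to mass-check, for example through command substitution, as long as the folder names contain no spaces.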

Hit-frequencies
Hit-frequencies is a simple script that counts the number of hits of each rule in both spam.log and ham.log.

For example:

./hit-frequencies
OVERALL     SPAM      HAM  NAME
  10528     8847     1681  (all messages)
    345      345        0  DRUGS_PAIN
    250      250        0  RATWARE_ZERO_TZ
    229      229        0  DRUGS_ANXIETY
    214      214        0  SUBJECT_DRUG_GAP_C
    357      356        1  DRUGS_ERECTILE_OBFU
    ...      ...      ...  ...

A good rule should match either most spam and never ham, or most ham and never spam. If a rule matches most spam but sometimes ham, it is prone to false positives, which is unwanted.
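
To find the rules that hit both kinds of mail, the hit-frequencies output can be filtered. The sketch below assumes the four-column layout shown in the example above (OVERALL, SPAM, HAM, NAME); the function name mixed_rules is made up for this example.

```shell
# Read hit-frequencies output on stdin and list the rules that matched
# at least one spam message and at least one ham message.
mixed_rules() {
    awk '
        # Keep only data rows: numeric first column, and not the
        # "(all messages)" summary row.
        $1 ~ /^[0-9]+$/ && $4 !~ /^\(/ && $2 > 0 && $3 > 0 {
            print $4, "spam=" $2, "ham=" $3
        }
    '
}
```

For the sample output above, `./hit-frequencies | mixed_rules` would print only DRUGS_ERECTILE_OBFU, which is exactly the rule worth a closer look.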

Verify false positives and false negatives
Important: This may be a good time to verify that there are no false positives and false negatives. For example, the single ham hit for DRUGS_ERECTILE_OBFU might in fact be a spam message. grep DRUGS_ERECTILE_OBFU ham.log will list the message in question:

grep DRUGS_ERECTILE ham.log | cut -b 6- | cut -d ' ' -f1 | xargs head -50
grep BAYES_99 ham.log | cut -b 6- | cut -d ' ' -f1 | xargs grep -E "^(Subject|From|Date)"

Another method, as listed on the SpamAssassin wiki, is to make a list of the 100 lowest-scoring spams and 100 highest-scoring non-spam mails:

cd masses
sort -n -k2 spam.log | head -100 > id.low-spam
./mboxget < id.low-spam > low-scoring-spam.mbox
sort -rn -k2 ham.log | head -100 > id.high-ham
./mboxget < id.high-ham > high-scoring-ham.mbox

You can then check these mailboxes.

Bayes
You can check the total figures for the Bayes rules:

grep BAYES_ ham.log | sed -e 's/.*\(BAYES_[0-9][0-9]\).*/\1/' | sort | uniq -c
grep BAYES_ spam.log | sed -e 's/.*\(BAYES_[0-9][0-9]\).*/\1/' | sort | uniq -c
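
The two listings are easier to compare side by side. A possible sketch that prints one line per Bayes bucket with both counts follows; the function name bayes_table is made up for this example, and it only assumes that each log line mentions at most one BAYES_nn rule.

```shell
# Print the ham and spam hit counts per BAYES_nn rule, one rule per line.
# Usage: bayes_table ham.log spam.log
bayes_table() {
    awk '
        match($0, /BAYES_[0-9][0-9]/) {
            bucket = substr($0, RSTART, RLENGTH)
            # The first file named on the command line is the ham log.
            if (FILENAME == ARGV[1]) ham[bucket]++
            else spam[bucket]++
            seen[bucket] = 1
        }
        END {
            for (b in seen) printf "%s ham=%d spam=%d\n", b, ham[b], spam[b]
        }
    ' "$1" "$2" | sort
}
```

A healthy Bayes database shows ham concentrated in the low buckets and spam in the high ones.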

perceptron
Perceptron is a fast score optimizer: it trains a perceptron (a simple neural network) on the mass-check logs to determine the best scores, reducing false positives and false negatives.

There are many intermediate steps to get the result, perceptron.scores. In SpamAssassin 3.1 and up it is relatively easy:

Edit the Makefile and change "RULES= ../rules" to the location of your custom rules, for example "RULES= /home/freek/.spamassassin".

make perceptron
. config
./perceptron -p $HAM_PREFERENCE -t $THRESHOLD -e $EPOCHS

Running perceptron results in a file perceptron.scores, which contains suggestions for scores.

make perceptron does the following steps:

mkdir tmp
parse-rules-for-masses -d /home/freek/.spamassassin   # creates tmp/rules.pl
hit-frequencies -c /home/freek/.spamassassin -x -p > freqs
score-ranges-from-freqs /home/freek/.spamassassin < freqs   # creates tmp/ranges.data
logs-to-c --cffile=/home/freek/.spamassassin   # creates tmp/tests.h and tmp/scores.h
gcc -g -O2 -Wall -c -o perceptron.o perceptron.c
gcc -o perceptron perceptron.o -lm

After you run perceptron, you can compare freqs (the current scores) with perceptron.scores (the suggested scores).
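
As an alternative to reading freqs by eye, the suggestions can be compared with the score lines in your own rule file. The sketch below assumes that both files contain lines of the form "score RULE_NAME value" and looks only at the first value on each line; I have not verified that perceptron.scores always uses this format, so treat it as an assumption. The function name compare_scores is made up for this example.

```shell
# List the rules whose suggested score differs from the current one.
# Usage: compare_scores CURRENT_RULES_FILE perceptron.scores
compare_scores() {
    awk '
        $1 != "score" { next }
        # First file: remember the current score of every rule.
        FILENAME == ARGV[1] { current[$2] = $3; next }
        # Second file: report rules whose suggested score is different.
        ($2 in current) && (current[$2] + 0 != $3 + 0) {
            printf "%s %s -> %s\n", $2, current[$2], $3
        }
    ' "$1" "$2"
}
```

Each output line shows the rule name, the current score and the suggested score.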

Note: If perceptron suggests a score of 0, while you would expect it to be negative, you may want to check if "tflags SCORE_NAME nice" was set in the rule definition file.
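
A quick way to find such rules is to scan the rule files for negative scores that lack the flag. This is a sketch that assumes one "score NAME value" and one "tflags NAME flags" line per rule; the function name missing_nice is made up for this example.

```shell
# List rules that have a negative score but no "tflags NAME nice" line.
# Usage: missing_nice RULE_FILE...
missing_nice() {
    awk '
        $1 == "score" && $3 + 0 < 0 { negative[$2] = 1 }
        $1 == "tflags" {
            # A tflags line may carry several flags; look for "nice".
            for (i = 3; i <= NF; i++) if ($i == "nice") nice[$2] = 1
        }
        END {
            for (rule in negative) if (!(rule in nice)) print rule
        }
    ' "$@" | sort
}
```

Running it over the files in /home/freek/.spamassassin prints nothing when every negative-scoring rule is flagged correctly.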