SpamAssassin Rules

From Exterior Memory

SpamAssassin

SpamAssassin is a tool to block spam, based on multiple rules.

Bayes Training

SpamAssassin uses Bayesian filtering of email to stop spam. The first step is to teach SpamAssassin the difference between spam and non-spam ("ham") mails. You do so by giving it a large sample set (at least 100 mails), which it examines for word usage. It may learn that spam mails often contain words like viagra, while ham mails contain your full name. When a new mail arrives, it looks at the words used in it, and based on those gives a probability score for whether the mail is spam or not.
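To make the idea concrete, here is a minimal sketch of how the "spamminess" of a single word could be estimated from two directories of sample mails. It is not SpamAssassin's actual classifier, which combines the evidence of many tokens per mail. The function name word_spam_prob and the layout of one file per mail are assumptions made for this illustration.

```shell
#!/bin/bash
# word_spam_prob WORD SPAM_DIR HAM_DIR   (hypothetical helper, illustration only)
# Prints a naive estimate of P(spam | mail contains WORD), assuming that spam
# and ham are equally likely beforehand. Every file in a directory is one mail.
word_spam_prob() {
    local word=$1 spam_dir=$2 ham_dir=$3
    local spam_total ham_total spam_hits ham_hits

    spam_total=$(find "$spam_dir" -type f | wc -l)
    ham_total=$(find "$ham_dir" -type f | wc -l)

    # grep -l prints each matching file once; -i ignores case,
    # -w matches whole words only, -F treats WORD as a fixed string.
    spam_hits=$(grep -rliwF -- "$word" "$spam_dir" | wc -l)
    ham_hits=$(grep -rliwF -- "$word" "$ham_dir" | wc -l)

    awk -v s_hits="$spam_hits" -v s_total="$spam_total" \
        -v h_hits="$ham_hits" -v h_total="$ham_total" 'BEGIN {
        if (s_total == 0 || h_total == 0) {
            print "need at least one spam and one ham mail" > "/dev/stderr"
            exit 1
        }
        p_spam = s_hits / s_total    # fraction of spam mails containing the word
        p_ham  = h_hits / h_total    # fraction of ham mails containing the word
        if (p_spam + p_ham == 0) { print "0.50"; exit }   # unseen word: no evidence
        printf "%.2f\n", p_spam / (p_spam + p_ham)
    }'
}
```

For example, word_spam_prob viagra ~/Maildir/.Spam.negatives/cur ~/Maildir/cur prints a value close to 1.00 if the word appears almost exclusively in spam, and a value close to 0.00 if it appears almost exclusively in ham.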

Mail boxes

I have a few mail boxes devoted to spam:

Spam/incoming
SpamAssassin stores mail here which it suspects is spam, but is not 100% sure about. I have to check these manually (if it is sure, it simply deletes the mail).
Spam/confirmed
After confirming something is spam, I put it in this box.
Spam/negatives
The confirmed mail is often stored as an attachment to another mail with the spamassassin report. I run a cron job to extract the confirmed spam from the report and store it in here.
Spam/falsenegatives
Spam mail that slipped through the filter goes here.
Spam/falsepositives
Non-spam mail that was accidentally marked as spam goes here.

Training set

The cron job I run calls extract-spam.php and then feeds the mail boxes to sa-learn:

#!/bin/bash

HOME=/home/freek
PATH=/bin:/sbin:/usr/bin:/usr/sbin:/usr/local/bin:/home/freek/bin
LC_ALL=C

extract-spam.php >> $HOME/log/spam.log
sa-learn --spam $HOME/Maildir/.Spam.negatives/cur/      >> $HOME/log/spam.log
sa-learn --spam $HOME/Maildir/.Spam.falsenegatives/cur/ >> $HOME/log/spam.log
sa-learn --ham  $HOME/Maildir/.Spam.falsepositives/cur/ >> $HOME/log/spam.log
sa-learn --ham  $HOME/Maildir/cur/                      >> $HOME/log/spam.log

The last line assumes that all mail in the default inbox is non-spam. This is a big assumption, so I do not run it automatically, but have to remember to run it manually every once in a while.

Custom rules

In addition to the Bayes rules, SpamAssassin has many other rules. For example, HTML mail has a higher chance of being spam, so a rule marks it as such. A lot of people have written their own custom rules for SpamAssassin.

Included rules

The rules included with SpamAssassin should in most cases be enough to block all spam. You should regularly update these rules. Since version 3.1 this can be done with:

sa-update

rule-get

I manage some custom rules with the "rule-get" script.

I do not recommend using it anymore. For one thing, badly scored custom rules may result in a lot of false positives. You may get some inspiration from them, but in general it is better to create some custom rules yourself based on your specific mail behaviour (e.g. based on language or mailing list topic) rather than using these untested generic rules.

Also, the rule-get project seems mostly dead. For one thing, the URL with the rule specifications moved from http://airmex.nerim.net/rule-get/rules.ini to http://maxime.ritter.eu.org/Spam/rules.ini, while the first URL is still hard-coded in the rule-get script.

An alternative source of custom rules is the SpamAssassin wiki: http://wiki.apache.org/spamassassin/CustomRulesets

Custom rules

Since my mail profile is different, I added a few manual rules to SpamAssassin, mostly to cope with language-specific characteristics, as well as mailing-list-specific characteristics.

It is important to verify the effectiveness of these rules to avoid false positives.

An example of a custom rule is:

header DEBIAN_LIST              List-Id =~ /lists\.debian\.org/
describe DEBIAN_LIST            Debian Mailinglist ID
score DEBIAN_LIST               -3.0
tflags DEBIAN_LIST              nice

Note that tflags is not used in regular operation, but it must be set to "nice" for negative-scoring rules in case you want to verify their effectiveness later on.
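Since it is easy to forget the tflags line, a small check can help. The sketch below lists rules that have a negative score but no "tflags NAME nice" line. The function name rules_missing_nice is hypothetical, and the check is deliberately simple: it only looks at the first value on each score line and only at the files given on the command line.

```shell
#!/bin/bash
# rules_missing_nice FILE...   (hypothetical helper, illustration only)
# Prints the names of rules that have a negative score but no
# "tflags NAME nice" line in the given SpamAssassin rule files.
rules_missing_nice() {
    awk '
        # "score NAME VALUE ...": remember rules whose first score is negative
        $1 == "score" && $3 + 0 < 0 { negative[$2] = 1 }

        # "tflags NAME flag flag ...": remember rules flagged as nice
        $1 == "tflags" {
            for (i = 3; i <= NF; i++) {
                if ($i == "nice") { nice[$2] = 1 }
            }
        }

        END {
            for (name in negative) {
                if (!(name in nice)) { print name }
            }
        }
    ' "$@" | sort
}
```

Running rules_missing_nice on the files that hold your custom rules prints nothing when every negative-scoring rule is marked as nice.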

Verifying effectiveness of rules

Tools

Download SpamAssassin. In the masses subdirectory, you will find miscellaneous tools to test the efficiency of all rules.

First of all, you should run mass-check, which takes your messages and records which rules each of them matches.

Important: You must first verify that the mails in these folders are indeed spam or ham. Incorrectly filed mails may skew the results significantly.

Note: --lint was only introduced in mass-check 3.1, and a ~ in the -p option is not recognized.

The full command for my mail boxes is:

./mass-check -c=/usr/share/spamassassin -p=/home/freek/.spamassassin --progress --lint \
    spam:dir:~/Maildir/.Spam.negatives/cur \
    spam:dir:~/Maildir/.Spam.falsenegatives/cur \
    ham:dir:~/Maildir/.Spam.falsepositives/cur \
    ham:dir:~/Maildir/cur \
    ham:dir:~/Maildir/.Computers.Apple/cur \
    ham:dir:~/Maildir/.Computers.Debian/cur \
    ham:dir:~/Maildir/.Computers.bugreports/cur \
    ham:dir:~/Maildir/.Computers.macports/cur \
    ham:dir:~/Maildir/.Computers.zeroconf/cur \
    ham:dir:~/Maildir/.Hobbies.scatterlings/cur \
    ham:dir:~/Maildir/.News.jobs/cur \
    ham:dir:~/Maildir/.News.maillist/cur \
    ham:dir:~/Maildir/.News.netmags/cur \
    ham:dir:~/Maildir/.News.security/cur \
    ham:dir:~/Maildir/.News.Store/cur \
    ham:dir:~/Maildir/.Organisation.ABOU/cur \
    ham:dir:~/Maildir/.Organisation.Hostel/cur \
    ham:dir:~/Maildir/.Personal.financial/cur \
    ham:dir:~/Maildir/.Personal.trouwen/cur \
    ham:dir:~/Maildir/.Reference.Shops/cur \
    ham:dir:~/Maildir/.Reference.Sites/cur \
    ham:dir:~/Maildir/.Reference.Software/cur \
    ham:dir:~/Maildir/.Websites.security/cur \
    ham:dir:~/Maildir/.Websites.boek/cur \
    ham:dir:~/Maildir/.Websites.omroephumor/cur \
    ham:dir:~/Maildir/.Websites.internlnet/cur

This generates the files ham.log and spam.log.

Hit-frequencies

Hit-frequencies is a simple script that counts the number of hits of each rule in both spam.log and ham.log.

For example:

./hit-frequencies
  OVERALL        SPAM         HAM  NAME
    10528        8847        1681  (all messages)
      345         345           0  DRUGS_PAIN
      250         250           0  RATWARE_ZERO_TZ
      229         229           0  DRUGS_ANXIETY
      214         214           0  SUBJECT_DRUG_GAP_C
      357         356           1  DRUGS_ERECTILE_OBFU
      ...         ...         ...  ...

A good rule should match either most spam and never ham, or most ham and never spam. If a rule matches most spam, but sometimes also ham, it is prone to false positives, which is unwanted.
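The counting itself is simple enough to sketch with awk. The function below is not the hit-frequencies script from the masses directory, only an illustration of what it counts. It assumes that the fourth whitespace-separated field of every log line is the comma-separated list of rules that matched, which is consistent with the cut commands used on this page. The function name rule_hits is made up for this example.

```shell
#!/bin/bash
# rule_hits SPAM_LOG HAM_LOG   (hypothetical helper, illustration only)
# Prints "OVERALL SPAM HAM NAME" for every rule found in the two logs,
# most frequently hit rules first.
# Assumption: field 4 of each log line is a comma-separated list of rules.
rule_hits() {
    awk '
        /^#/ || NF < 4 { next }              # skip comments and incomplete lines
        {
            n = split($4, rules, ",")
            for (i = 1; i <= n; i++) {
                if (FILENAME == ARGV[1]) { spam[rules[i]]++ } else { ham[rules[i]]++ }
                seen[rules[i]] = 1
            }
        }
        END {
            for (r in seen) {
                printf "%d %d %d %s\n", spam[r] + ham[r], spam[r], ham[r], r
            }
        }
    ' "$1" "$2" | sort -k1,1nr -k4,4
}
```

Piping the result through awk '$2 > 0 && $3 > 0' leaves only the rules that hit both spam and ham, which are the ones worth a closer look.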

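Verify false positives and false negatives

Important: This is a good time to verify that the mail boxes contain no false positives or false negatives. For example, the single ham hit for DRUGS_ERECTILE_OBFU might in fact be a spam message. grep DRUGS_ERECTILE_OBFU ham.log lists the message in question. The commands below go one step further: the first prints the first 50 lines of each matching message, the second prints the Subject, From and Date headers of all ham that hit BAYES_99: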

grep DRUGS_ERECTILE ham.log | cut -b 6- | cut -d ' ' -f1 | xargs head -50
grep BAYES_99 ham.log | cut -b 6- | cut -d ' ' -f1 | xargs grep -E "^(Subject|From|Date)"

Another method, as listed on the SpamAssassin wiki, is to make a list of the 100 lowest-scoring spams and 100 highest-scoring non-spam mails:

cd masses
sort -n -k 2 spam.log | head -100 > id.low-spam
./mboxget < id.low-spam > low-scoring-spam.mbox
sort -rn -k 2 ham.log | head -100 > id.high-ham
./mboxget < id.high-ham > high-scoring-ham.mbox

You can then check these mailboxes.

Bayes

You can check the total figures for the Bayes rules:

grep BAYES_ ham.log  | sed -e 's/.*\(BAYES_[0-9][0-9]\).*/\1/' | sort | uniq -c
grep BAYES_ spam.log | sed -e 's/.*\(BAYES_[0-9][0-9]\).*/\1/' | sort | uniq -c

perceptron

Perceptron is a fast program that trains a single-layer perceptron on the mass-check results to determine the best score for each rule, reducing false positives and false negatives.
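To give a feeling for what such a program does, the sketch below applies the textbook perceptron update rule to the two mass-check logs. It is not the perceptron program from the masses directory: it has no ham preference, no score ranges and no randomisation, and the function name learn_scores and its default values are made up for this example. It assumes that the fourth whitespace-separated field of every log line is the comma-separated list of rules that matched.

```shell
#!/bin/bash
# learn_scores SPAM_LOG HAM_LOG [THRESHOLD] [EPOCHS] [STEP]
# (hypothetical helper, illustration only)
# Every rule starts at score 0. A spam mail that totals less than THRESHOLD
# raises the scores of the rules it hit by STEP; a ham mail that reaches
# THRESHOLD lowers them by STEP. Stops early once no mail is misclassified.
learn_scores() {
    awk -v threshold="${3:-5}" -v epochs="${4:-50}" -v step="${5:-0.5}" '
        /^#/ || NF < 4 { next }              # skip comments and incomplete lines
        {
            n++
            is_spam[n] = (FILENAME == ARGV[1])
            hits[n] = $4                     # assumption: field 4 lists the rules
        }
        END {
            for (e = 1; e <= epochs; e++) {
                mistakes = 0
                for (m = 1; m <= n; m++) {
                    k = split(hits[m], rules, ",")
                    total = 0
                    for (i = 1; i <= k; i++) { total += score[rules[i]] }
                    if (is_spam[m] && total < threshold) {
                        delta = step         # missed spam: raise scores
                    } else if (!is_spam[m] && total >= threshold) {
                        delta = -step        # ham marked as spam: lower scores
                    } else {
                        continue             # correctly classified
                    }
                    mistakes++
                    for (i = 1; i <= k; i++) { score[rules[i]] += delta }
                }
                if (mistakes == 0) { break }
            }
            for (r in score) { printf "score %s %.1f\n", r, score[r] }
        }
    ' "$1" "$2" | sort
}
```

Calling learn_scores spam.log ham.log prints one "score NAME VALUE" line per rule. Rules that only ever appear in correctly classified mail keep a score of 0.0, which shows why the real tool needs more than this simple update rule.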

There are many intermediate steps to get the result, perceptron.scores. In SpamAssassin 3.1 and up it is relatively easy:

Edit Makefile. Change

RULES=          ../rules

to the location of your custom rules. For example:

RULES=          /home/freek/.spamassassin

Then build and run perceptron:

make perceptron
. config
./perceptron -p $HAM_PREFERENCE -t $THRESHOLD -e $EPOCHS

Running perceptron results in a file perceptron.scores, which contains suggestions for scores.

make perceptron performs the following steps:

mkdir tmp
parse-rules-for-masses -d /home/freek/.spamassassin  # creates tmp/rules.pl
hit-frequencies -c /home/freek/.spamassassin -x -p > freqs
score-ranges-from-freqs /home/freek/.spamassassin < freqs  # creates tmp/ranges.data
logs-to-c --cffile=/home/freek/.spamassassin  # creates tmp/tests.h and tmp/scores.h
gcc -g -O2 -Wall -c -o perceptron.o perceptron.c
gcc -o perceptron perceptron.o -lm

After you run perceptron, you can compare freqs (the current scores) with perceptron.scores (the suggested scores).

Note: If perceptron suggests a score of 0, while you would expect it to be negative, you may want to check if "tflags SCORE_NAME nice" was set in the rule definition file.