SpamAssassin is a tool for blocking spam, based on multiple rules.
SpamAssassin uses Bayesian filtering of email to stop spam. The first step is to teach SpamAssassin the difference between spam and non-spam ("ham") mails. You do so by giving it a large sample set (at least 100 mails), which it examines for word usage. It may learn that spam mail often contains words like viagra, while ham mail contains your full name. When a new mail arrives, it looks at the word usage and, based on that, gives a probability score for whether the mail is spam or not.
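To make the idea concrete, here is a toy sketch of how a single word can be turned into a spam probability. It is not SpamAssassin's actual algorithm, which combines many tokens and is statistically more careful. The function name token_spam_probability and the sample files (one message per line) are assumptions made up for this illustration.

```shell
#!/bin/bash
# Toy illustration only: estimate how "spammy" one word is.
# Assumption: each sample file holds one message per line.
# Usage: token_spam_probability WORD SPAM_FILE HAM_FILE
token_spam_probability() {
    local token=$1 spam_file=$2 ham_file=$3
    local spam_total ham_total spam_hits ham_hits

    # Number of sample messages of each kind.
    spam_total=$(wc -l < "$spam_file" | tr -d ' ')
    ham_total=$(wc -l < "$ham_file" | tr -d ' ')

    # Number of messages containing the word (case-insensitive, whole word).
    # grep exits non-zero when nothing matches, but still prints 0.
    spam_hits=$(grep -ciw -- "$token" "$spam_file" || true)
    ham_hits=$(grep -ciw -- "$token" "$ham_file" || true)

    awk -v sh="$spam_hits" -v st="$spam_total" \
        -v hh="$ham_hits" -v ht="$ham_total" 'BEGIN {
        p_spam = (st > 0) ? sh / st : 0   # share of spam containing the word
        p_ham  = (ht > 0) ? hh / ht : 0   # share of ham containing the word
        if (p_spam + p_ham == 0) { print "0.50"; exit }   # word never seen
        printf "%.2f\n", p_spam / (p_spam + p_ham)
    }'
}
```

A word that only ever shows up in the spam samples gets a value close to 1, a word that only shows up in ham gets a value close to 0, and a word that was never seen gets the neutral 0.50.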
I have a few mail boxes devoted to spam:
- SpamAssassin will store mail which it suspects is spam, but is not 100% sure about. I will have to check these manually (if it is sure, it simply deletes them).
- After confirming something is spam, I put it in this box.
- The confirmed mail is often stored as an attachment to another mail with the SpamAssassin report. I run a cron job to extract the confirmed spam from the report and store it in here.
- I put here spam mail that slipped through.
- I put here non-spam mails that were accidentally marked as spam.
The cron job I run is shown here; it calls extract-spam.php and then feeds the mailboxes to sa-learn:
#!/bin/bash
HOME=/home/freek
PATH=/bin:/sbin:/usr/bin:/usr/sbin:/usr/local/bin:/home/freek/bin
LC_ALL=C
extract-spam.php >> $HOME/log/spam.log
sa-learn --spam $HOME/Maildir/.Spam.negatives/cur/ >> $HOME/log/spam.log
sa-learn --spam $HOME/Maildir/.Spam.falsenegatives/cur/ >> $HOME/log/spam.log
sa-learn --ham $HOME/Maildir/.Spam.falsepositives/cur/ >> $HOME/log/spam.log
sa-learn --ham $HOME/Maildir/cur/ >> $HOME/log/spam.log
The last line assumes that all mail in the default inbox is non-spam. This is a big assumption, so I do not run it automatically, but have to remember to run it manually every once in a while.
In addition to the Bayes rules, SpamAssassin has many other rules. For example, HTML mail has a higher chance of being spam, so I mark it as such. A lot of people have written their own custom rules for SpamAssassin.
The rules included with SpamAssassin should in most cases be enough to block all spam. You should regularly update these rules. Since version 3.1 this can be done with:
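As a sketch, assuming the standard sa-update tool (shipped with SpamAssassin since 3.1) is what is meant here, the update can be wrapped so that a cron log shows what happened. The helper names describe_update_status and update_rules are made up for this example. The exit status meanings used (0 for new rules installed, 1 for nothing new) follow the sa-update manual as far as I know, and anything else is simply reported as a problem.

```shell
#!/bin/bash
# Hypothetical helper: turn an sa-update exit status into a log message.
# 0 = new rules installed, 1 = nothing new; anything else is treated as a problem.
describe_update_status() {
    case "$1" in
        0) echo "rules updated" ;;
        1) echo "no new rules available" ;;
        *) echo "update incomplete or failed (exit status $1)" ;;
    esac
}

# Hypothetical wrapper: run sa-update and report the outcome on one line.
update_rules() {
    local status=0
    sa-update || status=$?
    describe_update_status "$status"
}
```

Calling update_rules from cron would then log one line per run. If you run spamd, it has to be restarted before it picks up updated rules.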
I manage some custom rules with the "rule-get" script.
I do not recommend using it anymore. For one thing, badly scored custom rules may result in a lot of false positives. You may get some inspiration, but in general it is better to create some custom rules yourself based on your specific mail behaviour (e.g. based on language or mailing list topic) rather than using these untested generic rules.
Also, the rule-get project seems mostly dead. For example, the URL with the rule specifications moved from http://airmex.nerim.net/rule-get/rules.ini to http://maxime.ritter.eu.org/Spam/rules.ini, while the first URL is still hard-coded in the rule-get script.
An alternative source of custom rules is the SpamAssassin wiki: http://wiki.apache.org/spamassassin/CustomRulesets
Since my mail profile is different, I added a few manual rules to SpamAssassin, mostly to cope with language-specific characteristics, as well as mailing-list-specific characteristics.
It is important to verify the effectiveness of these rules to avoid false positives.
An example of a custom rule is:
header DEBIAN_LIST List-Id =~ /lists.debian.org/
describe DEBIAN_LIST Debian Mailinglist ID
score DEBIAN_LIST -3.0
tflags DEBIAN_LIST nice
Note that the tflags setting is not used in regular operation, but it must be set to "nice" for rules with a negative score, in case you want to verify their effectiveness later on.
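Since a forgotten tflags line is easy to miss, a small check like the following can list rules that have a negative score but no "nice" flag. This is only a sketch: rules_missing_nice is a made-up name, it assumes one directive per line in the rule file, and it only looks at the first score value of each rule.

```shell
#!/bin/bash
# Hypothetical check: print the names of rules that have a negative score
# but no "tflags NAME nice" line in the given rule file(s).
# Assumptions: one directive per line; only the first score value is checked.
rules_missing_nice() {
    awk '
        # "score NAME VALUE ...": remember rules whose first score is negative
        $1 == "score" && $3 + 0 < 0 { negative[$2] = 1 }

        # "tflags NAME FLAG ...": remember rules that carry the nice flag
        $1 == "tflags" {
            for (i = 3; i <= NF; i++) if ($i == "nice") nice[$2] = 1
        }

        END {
            for (name in negative) if (!(name in nice)) print name
        }
    ' "$@" | sort
}
```

Running it on your custom rule file (for example rules_missing_nice ~/.spamassassin/user_prefs, if that is where your rules live) should print nothing when every negative rule is flagged.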
Verifying the effectiveness of rules
Download SpamAssassin. In the masses subdirectory, you will find miscellaneous tools to test the efficiency of all rules.
First of all, you should run mass-check, which takes your messages, and counts which rules are matched:
Important: You must first verify that the mails in these folders are indeed spam or ham. Incorrect mails may skew the results significantly.
Note: --lint was only introduced in mass-check 3.1, and a ~ in the -p option is not recognized.
./mass-check -c=/usr/share/spamassassin -p=/home/freek/.spamassassin --progress --lint \
    spam:dir:~/Maildir/.Spam.negatives/cur \
    spam:dir:~/Maildir/.Spam.falsenegatives/cur \
    ham:dir:~/Maildir/.Spam.falsepositives/cur \
    ham:dir:~/Maildir/cur \
    ham:dir:~/Maildir/.Computers.Apple/cur \
    ham:dir:~/Maildir/.Computers.Debian/cur \
    ham:dir:~/Maildir/.Computers.bugreports/cur \
    ham:dir:~/Maildir/.Computers.macports/cur \
    ham:dir:~/Maildir/.Computers.zeroconf/cur \
    ham:dir:~/Maildir/.Hobbies.scatterlings/cur \
    ham:dir:~/Maildir/.News.jobs/cur \
    ham:dir:~/Maildir/.News.maillist/cur \
    ham:dir:~/Maildir/.News.netmags/cur \
    ham:dir:~/Maildir/.News.security/cur \
    ham:dir:~/Maildir/.News.Store/cur \
    ham:dir:~/Maildir/.Organisation.ABOU/cur \
    ham:dir:~/Maildir/.Organisation.Hostel/cur \
    ham:dir:~/Maildir/.Personal.financial/cur \
    ham:dir:~/Maildir/.Personal.trouwen/cur \
    ham:dir:~/Maildir/.Reference.Shops/cur \
    ham:dir:~/Maildir/.Reference.Sites/cur \
    ham:dir:~/Maildir/.Reference.Software/cur \
    ham:dir:~/Maildir/.Websites.security/cur \
    ham:dir:~/Maildir/.Websites.boek/cur \
    ham:dir:~/Maildir/.Websites.omroephumor/cur \
    ham:dir:~/Maildir/.Websites.internlnet/cur
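A hand-maintained folder list like this gets out of date easily. As a rough sketch, the targets could also be generated from the Maildir itself. The function name mass_check_targets is made up, and the big assumption is that every folder outside .Spam contains verified ham (which may not hold for folders such as a trash or drafts box). Only the three .Spam folders that appear in the command above are classified; any other .Spam folder is skipped because it has not been verified.

```shell
#!/bin/bash
# Hypothetical helper: print one mass-check target per Maildir folder.
# Assumption: every folder outside .Spam.* holds verified ham.
# Usage: mass_check_targets /home/freek/Maildir
mass_check_targets() {
    local maildir=${1%/} folder name

    # The default inbox itself.
    printf 'ham:dir:%s/cur\n' "$maildir"

    # Maildir++ subfolders are dot-directories with their own cur/ directory.
    for folder in "$maildir"/.[!.]*/; do
        [ -d "${folder}cur" ] || continue
        name=$(basename "$folder")
        case "$name" in
            .Spam.negatives|.Spam.falsenegatives)
                printf 'spam:dir:%s/%s/cur\n' "$maildir" "$name" ;;
            .Spam.falsepositives)
                printf 'ham:dir:%s/%s/cur\n' "$maildir" "$name" ;;
            .Spam|.Spam.*)
                ;;  # unverified spam folders: leave them out
            *)
                printf 'ham:dir:%s/%s/cur\n' "$maildir" "$name" ;;
        esac
    done
}
```

The output can be passed to mass-check with $(mass_check_targets /home/freek/Maildir) in place of the folder list, as long as no folder name contains whitespace. Review the generated list before trusting the results.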
This generates the files ham.log and spam.log.
Hit-frequencies is a simple script that counts the number of hits of each rule for both spam.log and ham.log.
./hit-frequencies
OVERALL   SPAM    HAM  NAME
  10528   8847   1681  (all messages)
    345    345      0  DRUGS_PAIN
    250    250      0  RATWARE_ZERO_TZ
    229    229      0  DRUGS_ANXIETY
    214    214      0  SUBJECT_DRUG_GAP_C
    357    356      1  DRUGS_ERECTILE_OBFU
    ...    ...    ...  ...
A good rule should match either most spam and never ham, or most ham and never spam. If a rule matches most spam, but sometimes ham, it is prone to false positives, which is unwanted.
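To spot such rules quickly, the hit-frequencies output can be filtered for rules that hit both spam and ham. This is a sketch that assumes the four-column layout shown above (OVERALL, SPAM, HAM, NAME). Other hit-frequencies options print different columns, in which case the field numbers need adjusting. The function name rules_hitting_both is made up.

```shell
#!/bin/bash
# Hypothetical filter: list rules that hit both spam and ham messages.
# Assumption: input has the columns OVERALL SPAM HAM NAME, as printed by
# a plain ./hit-frequencies run.
# Usage: ./hit-frequencies | rules_hitting_both
rules_hitting_both() {
    awk '
        # Skip the header line and the "(all messages)" summary row,
        # then keep rules with at least one spam hit and one ham hit.
        $1 ~ /^[0-9]+$/ && $4 !~ /^\(/ && $2 > 0 && $3 > 0 {
            printf "%s spam=%d ham=%d\n", $4, $2, $3
        }
    ' "$@"
}
```

For the sample output above this would report only DRUGS_ERECTILE_OBFU, which is exactly the rule whose single ham hit deserves a closer look.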
Verify false positives and false negatives
Important: This may be a good time to verify that there are no false positives and false negatives. For example, the single ham hit for DRUGS_ERECTILE_OBFU might in fact be a spam message. grep DRUGS_ERECTILE_OBFU ham.log will list the message in question:
grep DRUGS_ERECTILE ham.log | cut -b 6- | cut -d ' ' -f1 | xargs head -50
grep BAYES_99 ham.log | cut -b 6- | cut -d ' ' -f1 | xargs grep -E "^(Subject|From|Date)"
Another method, as listed on the SpamAssassin wiki, is to make a list of the 100 lowest-scoring spams and 100 highest-scoring non-spam mails:
cd masses
sort -n -k 2 spam.log | head -100 > id.low-spam
./mboxget < id.low-spam > low-scoring-spam.mbox
sort -rn -k 2 ham.log | head -100 > id.high-ham
./mboxget < id.high-ham > high-scoring-ham.mbox
You can then check these mailboxes.
You can check the total figures for bayes rules:
grep BAYES_ ham.log | sed -e 's/.*\(BAYES_[0-9][0-9]\).*/\1/' | sort | uniq -c
grep BAYES_ spam.log | sed -e 's/.*\(BAYES_[0-9][0-9]\).*/\1/' | sort | uniq -c
Perceptron is a fast tool in the masses directory that computes rule scores from the mass-check logs, with the aim of reducing both false positives and false negatives.
There are many intermediate steps to get the result, perceptron.scores. In SpamAssassin 3.1 and up it is relatively easy:
Edit Makefile. Change
to the location of your custom rules. For example
make perceptron
. config
./perceptron -p $HAM_PREFERENCE -t $THRESHOLD -e $EPOCHS
Running perceptron results in a file perceptron.scores, which contains suggestions for scores.
make perceptron does the following steps:
mkdir tmp
parse-rules-for-masses -d /home/freek/.spamassassin  # creates tmp/rules.pl
hit-frequencies -c /home/freek/.spamassassin -x -p > freqs
score-ranges-from-freqs /home/freek/.spamassassin < freqs  # creates tmp/ranges.data
logs-to-c --cffile=/home/freek/.spamassassin  # creates tmp/tests.h and tmp/scores.h
gcc -g -O2 -Wall -c -o perceptron.o perceptron.c
gcc -o perceptron perceptron.o -lm
After you run perceptron, you can compare freqs (the current scores) with perceptron.scores (the suggested scores).
Note: If perceptron suggests a score of 0, while you would expect it to be negative, you may want to check if "tflags SCORE_NAME nice" was set in the rule definition file.