28 jun 2010
i’ve had lots of trouble with mail over the last two years, with lots of spam still passing spamassassin. that wasn’t so bad since we already had greylisting [1] running, but it still meant 8 spam mails per day.
however, that changed after i started to collect the spam to train the bayesian filter [2] used by spamassassin. this posting is about how i’ve done it!
in contrast: after using bayes i receive one spam mail every 4th day.
i still have to monitor the spam, which is filed in one of my imap folders called ‘spam’. since nearly everything in there really is spam, it’s usually a simple copy’n’paste of all mails from ‘spam’ to ‘spam_train’. the ‘spam_train’ folder is the one used to train spamassassin. of course there are false positives and false negatives as well, but handling those is very easy.
no, i don’t think so at all. however, most sites encourage admins not to use bayes filters per vpopmail account, and because of that and other factors there is not much documentation on how to do it. since i had some issues with how to call spamassassin from procmail, it did not work for a very long time. recently i bought a nokia n900 mobile phone, and using mail on this device is very nice. but receiving a spam mail per hour means a notification on the n900 each time, which IS annoying if the notifications mostly contain spam slogans.
# qmail Lazydog procmailrc file
SHELL="/bin/bash"
VHOME="/var/vpopmail/domains/lastlog.de/js"
VERBOSE="no"
LOGFILE="/var/vpopmail/domains/lastlog.de/js/procmail.log"
:0fw
| spamassassin --siteconfigpath=/var/vpopmail/domains/lastlog.de/js/.spamassassin/ -p /var/vpopmail/domains/lastlog.de/js/.spamassassin/local.cf
:0:
* ^X-Spam-Flag: YES
/var/vpopmail/domains/lastlog.de/js/.maildir/.spam/new
:0:
* ^(From|Cc|To).*news@aktuell.conrad.de
/dev/null
# (other rules below)
# cat local.cf
required_score 3.2
rewrite_header subject _SCORE(0)_ |
use_bayes 1
use_dcc 1
use_pyzor 1
use_razor2 1
bayes_auto_learn 1
bayes_path /var/vpopmail/domains/lastlog.de/js/.spamassassin/db/bayes
bayes_file_mode 0777
report_safe 1
#add_header all Flag _YESNOCAPS_
ok_languages en de
ok_locales en de
# Google SafeBrowsing Plugin
loadplugin Mail::SpamAssassin::Plugin::GoogleSafeBrowsing
#body GOOGLE_SAFEBROWSING eval:check_google_safebrowsing_blocklists()
google_safebrowsing_apikey ABQIAA#VJ#J#VAQFbnTQ4uqKBRgArBR3gWufhSOf_c-vzV4UEN0steDDKD
google_safebrowsing_dir /var/cache/spamassassin
# scores for each url hit in message body
google_safebrowsing_blocklist goog-black-hash 0.3
google_safebrowsing_blocklist goog-malware-hash 0.5
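before relying on a hand-edited local.cf it can be worth letting spamassassin parse it once. the following is only a minimal sketch: ‘lint_sa_config’ is a name made up for this example, the directory argument is assumed to be the per-account config directory used above, and ‘--lint’ is spamassassin’s built-in config check.

```shell
#!/bin/sh
# lint_sa_config is a hypothetical helper, not part of spamassassin.
# It asks spamassassin to syntax-check the per-account configuration.
lint_sa_config() {
    conf_dir=$1
    # --lint checks rules and config files; the exit status is 0
    # when no errors are found and greater than 0 otherwise.
    spamassassin --lint \
        --siteconfigpath="$conf_dir" \
        -p "$conf_dir/local.cf"
}

# usage (path taken from the procmailrc above):
# lint_sa_config /var/vpopmail/domains/lastlog.de/js/.spamassassin && echo "config ok"
```

the same paths as in the procmail recipe are passed on purpose, so the check sees the configuration that the filter will actually use.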
no blog or documentation talks about this, but i think it’s very important: say you have a spam mail and you want to test your spamassassin configuration with it. first list the mails:
ls -la /var/vpopmail/domains/lastlog.de/js/.maildir/cur
-rw-rwx---+ 1 vpopmail vpopmail 2630 26. Dez 2009 msg.zQJ68:2,RS*
-rw-rwx---+ 1 vpopmail vpopmail 21905 16. Dez 2009 msg._zr78:2,S*
-rw-rwx---+ 1 vpopmail vpopmail 7433 9. Mär 2009 msg.zssm8:2,S*
-rw-rwx---+ 1 vpopmail vpopmail 2332 6. Mai 17:27 msg.ZuFm8:2,S*
-rw-rwx---+ 1 vpopmail vpopmail 3418 21. Nov 2008 msg.ZvFh8:2,S*
-rw-rwx---+ 1 vpopmail vpopmail 3237 23. Feb 13:07 msg.zZcg9:2,S*
let’s pick one of these messages: ‘msg.zf8y8:2,S’
now, let’s see what spamassassin thinks about this message with:
cat msg.zf8y8:2,S | spamassassin -t -D -p /var/vpopmail/domains/lastlog.de/js/.spamassassin/local.cf >mail 2>debug
-t (test mode) and -D (debug output) are only needed for debugging
also have a look at:
tail -n 200 /var/vpopmail/domains/lastlog.de/js/procmail.log
the message is processed by spamassassin and written to a file called ‘mail’, and the debug output going to stderr is written to a file called ‘debug’. this helped me a lot to verify that spamassassin was reading the right config files. the same test with --siteconfigpath added:
cat msg.zf8y8:2,S | spamassassin -t -D --siteconfigpath=/var/vpopmail/domains/lastlog.de/js/.spamassassin/ -p /var/vpopmail/domains/lastlog.de/js/.spamassassin/local.cf >mail 2>debug
watch for things like this:
[7582] dbg: conf: finish parsing
[7582] dbg: plugin: Mail::SpamAssassin::Plugin::GoogleSafeBrowsing=HASH(0x14e3a10) implements 'finish_parsing_end', priority 0
[7582] dbg: bayes: tie-ing to DB file R/O /var/vpopmail/domains/lastlog.de/js/.spamassassin/db/bayes_toks
[7582] dbg: bayes: tie-ing to DB file R/O /var/vpopmail/domains/lastlog.de/js/.spamassassin/db/bayes_seen
[7582] dbg: bayes: found bayes db version 3
[7582] dbg: bayes: DB journal sync: last sync: 0
[7582] dbg: config: score set 3 chosen.
[7582] dbg: message: main message type: multipart/mixed
check: no loaded plugin implements 'check_main': cannot scan! at /usr/lib64/perl5/vendor_perl/5.8.8/Mail/SpamAssassin/PerMsgStatus.pm line 164.
it seems that if i use --siteconfigpath, spamassassin expects all config files to be in ‘/var/vpopmail/domains/lastlog.de/js/.spamassassin/’, but usually they are in ‘/etc/spamassassin/’ (the *.pre files there load the plugins, which is why ‘check_main’ was missing above). so when using --siteconfigpath make sure you copy all files from ‘/etc/spamassassin/*’ to your own configuration directory ‘/var/vpopmail/domains/lastlog.de/js/.spamassassin/’.
WARNING: do not overwrite your customized files as there might be a local.cf in both paths!
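one way to copy the system files without touching customized ones could look like this. it is only a sketch: ‘copy_missing_conf’ is a name invented for this example, and it assumes that keeping every file that already exists in the target directory is what you want.

```shell
#!/bin/sh
# copy_missing_conf is a hypothetical helper: it copies every regular file
# from the system config dir into the per-account dir, but skips any file
# that already exists there, so a customized local.cf is left untouched.
copy_missing_conf() {
    src=$1
    dst=$2
    for f in "$src"/*; do
        [ -f "$f" ] || continue          # skip subdirectories and empty globs
        name=$(basename "$f")
        if [ -e "$dst/$name" ]; then
            echo "kept existing: $name"
        else
            cp -p "$f" "$dst/$name" && echo "copied: $name"
        fi
    done
}

# usage (both paths as named in the text above):
# copy_missing_conf /etc/spamassassin /var/vpopmail/domains/lastlog.de/js/.spamassassin
```

the printed list makes it easy to see afterwards which files came from ‘/etc/spamassassin/’ and which ones were already there.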
to check if the right bayes path is used, look at the debug file we created:
cat debug | grep bayes
[20792] dbg: bayes: tie-ing to DB file R/O /var/vpopmail/domains/lastlog.de/js/.spamassassin/db/bayes_toks
[20792] dbg: bayes: tie-ing to DB file R/O /var/vpopmail/domains/lastlog.de/js/.spamassassin/db/bayes_seen
[20792] dbg: bayes: found bayes db version 3
[20792] dbg: bayes: DB journal sync: last sync: 0
in this case everything is right.
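if you run this test often, the check can be scripted. the sketch below is not part of spamassassin; ‘bayes_db_files’ and ‘uses_bayes_dir’ are made-up names, and it assumes the debug lines look like the ones shown above and that the paths contain no spaces.

```shell
#!/bin/sh
# bayes_db_files is a hypothetical helper: it prints the distinct Bayes DB
# files that spamassassin reported opening in a -D debug log.
bayes_db_files() {
    debug_file=$1
    # debug lines look like:
    #   [20792] dbg: bayes: tie-ing to DB file R/O /path/to/bayes_toks
    # the path is the last whitespace-separated field.
    grep 'bayes: tie-ing to DB file' "$debug_file" | awk '{ print $NF }' | sort -u
}

# uses_bayes_dir succeeds only if at least one DB file was opened and
# all of them live directly under the expected directory prefix.
uses_bayes_dir() {
    debug_file=$1
    expected_dir=$2
    files=$(bayes_db_files "$debug_file")
    [ -n "$files" ] || return 1
    for f in $files; do
        case $f in
            "$expected_dir"/*) ;;
            *) return 1 ;;
        esac
    done
    return 0
}

# usage:
# uses_bayes_dir debug /var/vpopmail/domains/lastlog.de/js/.spamassassin/db && echo "bayes path ok"
```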
also check whether the mail is classified as spam; just have a look at the ‘mail’ file:
Subject: 16.6 | VIAGRA ® Official Site -95%
X-Spam-Flag: YES
X-Spam-Checker-Version: SpamAssassin 3.2.1 (2007-05-02) on
bonker.serverkommune.de
X-Spam-Level: ****************
X-Spam-Status: Yes, score=16.6 required=5.0 tests=AWL,BAYES_99,
HTML_IMAGE_ONLY_08,HTML_MESSAGE,HTML_SHORT_LINK_IMG_1,MISSING_DATE,
MISSING_MID,MISSING_SUBJECT,NO_RELAYS,URIBL_AB_SURBL,URIBL_BLACK,
URIBL_JP_SURBL,URIBL_OB_SURBL,URIBL_WS_SURBL shortcircuit=no
autolearn=unavailable version=3.2.1
MIME-Version: 1.0
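instead of reading the headers by eye, the verdict can be pulled out of the ‘mail’ file. this is a minimal sketch under the assumption that the X-Spam-Status header has the layout shown above; ‘spam_status_summary’ is a hypothetical name.

```shell
#!/bin/sh
# spam_status_summary is a hypothetical helper: it prints the verdict, score
# and threshold from the first X-Spam-Status header of a filtered message.
spam_status_summary() {
    mail_file=$1
    awk '
        /^X-Spam-Status:/ {
            verdict = $2; sub(/,$/, "", verdict)     # "Yes," -> "Yes"
            score = ""; required = ""
            for (i = 3; i <= NF; i++) {
                if ($i ~ /^score=/)    { score = substr($i, 7) }
                if ($i ~ /^required=/) { required = substr($i, 10) }
            }
            print "flagged=" verdict, "score=" score, "required=" required
            exit
        }
    ' "$mail_file"
}

# usage:
# spam_status_summary mail
```

comparing the printed ‘required=’ value with the required_score in your local.cf is a quick hint whether the per-account configuration was actually applied.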
since i had been training the db wrongly for some time, i decided to relearn everything with:
cd /var/vpopmail/domains/lastlog.de/js/.spamassassin/db/
ls -la
-rw-rw----+ 1 joachim users 7824 21. Jun 15:17 bayes_journal
-rw-rwx---+ 1 joachim users 643072 21. Jun 15:09 bayes_seen*
-rw-rwx---+ 1 joachim users 5177344 21. Jun 15:09 bayes_toks*
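if you are in doubt about throwing the old database away, the three files can be moved aside first. a small sketch, assuming the db directory from the listing above; ‘backup_bayes_db’ and the ‘backup-<timestamp>’ directory name are inventions of this example.

```shell
#!/bin/sh
# backup_bayes_db is a hypothetical helper: it moves the bayes_* files out of
# the DB directory into a dated backup subdirectory, so sa-learn starts with
# an empty database while the old one can still be restored.
backup_bayes_db() {
    db_dir=$1
    backup_dir="$db_dir/backup-$(date +%Y%m%d-%H%M%S)"
    mkdir -p "$backup_dir" || return 1
    for f in "$db_dir"/bayes_*; do
        [ -f "$f" ] || continue      # nothing to do if no DB files exist
        mv "$f" "$backup_dir"/ || return 1
    done
    echo "$backup_dir"               # print where the old files went
}

# usage:
# backup_bayes_db /var/vpopmail/domains/lastlog.de/js/.spamassassin/db
```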
just remove all 3 files (or create backups if in doubt) and then use sa-learn:
cat spamcommand
echo "=== <SPAM Train> ===="
echo "/var/vpopmail/domains/lastlog.de/js/.maildir/.spam_train/cur"
nice sa-learn -D 5 -u vpopmail --dbpath /var/vpopmail/domains/lastlog.de/js/.spamassassin/db --siteconfigpath /var/vpopmail/domains/lastlog.de/js/.spamassassin/ -p /var/vpopmail/domains/lastlog.de/js/.spamassassin/user_prefs -C /var/vpopmail/domains/lastlog.de/js/.spamassassin --spam --dir /var/vpopmail/domains/lastlog.de/js/.maildir/.spam_train/cur
echo "=== </SPAM Train> ==="
echo ""
echo "=== <HAM Train> ==="
for i in cur .js@dune2.de/cur .notice/cur; do
  echo "learning from: /var/vpopmail/domains/lastlog.de/js/.maildir/${i}"
  nice sa-learn -D 5 -u vpopmail --dbpath /var/vpopmail/domains/lastlog.de/js/.spamassassin/db --siteconfigpath /var/vpopmail/domains/lastlog.de/js/.spamassassin/ -p /var/vpopmail/domains/lastlog.de/js/.spamassassin/user_prefs -C /var/vpopmail/domains/lastlog.de/js/.spamassassin --ham --dir "/var/vpopmail/domains/lastlog.de/js/.maildir/${i}"
done
echo "=== </HAM Train> ==="
i execute this script every night using a cronjob, and so far it is working great! handling false positives and false negatives is easy as well, since you only have to copy the misclassified mails into the right folder and the next cronjob run should learn them correctly. i’m not sure about this, but so far it is working.
./spamcommand
==== <SPAM Train> ===
/var/vpopmail/domains/lastlog.de/js/.maildir/.spam_train/cur
[7913] info: archive-iterator: skipping large message
...
Learned tokens from 2515 message(s) (2531 message(s) examined)
=== </SPAM Train> ===
=== <HAM Train> ===
[22465] info: archive-iterator: skipping large message
...
Learned tokens from 1100 message(s) (1171 message(s) examined)
[19782] info: archive-iterator: skipping large message
....
Learned tokens from 191 message(s) (199 message(s) examined)
=== </HAM Train> ===
./spamcommand 291,01s user 12,00s system 12% cpu 40:05,37 total
the first run of this script took about 40 minutes.
since there might be multiple users accessing the bayes dbs you have to check permissions. most likely these users are: vpopmail and your login name.
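a quick way to spot permission problems could be the following sketch. it assumes that vpopmail and your login user share a group and that plain unix permission bits are what matters (the ‘+’ in the listings above suggests acls are in use on my system, which this does not look at); ‘list_group_unwritable’ is a made-up name.

```shell
#!/bin/sh
# list_group_unwritable is a hypothetical helper: it prints every bayes_*
# file in the DB directory that the group cannot both read and write.
list_group_unwritable() {
    db_dir=$1
    # -perm -060 matches files with group read AND group write set;
    # the leading "!" inverts that, so only problem files are listed.
    find "$db_dir" -maxdepth 1 -type f -name 'bayes_*' ! -perm -060
}

# usage:
# list_group_unwritable /var/vpopmail/domains/lastlog.de/js/.spamassassin/db
```

any file printed here can then be fixed with chmod g+rw, provided the group itself is the right one.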
remove -t and -D once you have finished debugging, as the debug output will flood your logs. also have a look at [3], [4] and [5].