[[!summary chatGPT: The author discusses their experience with using Bayesian filtering with SpamAssassin to reduce spam emails, including their setup, configuration, and training of the filter. They provide details on their procmail configuration, local.cf file, and how they experiment with spam. They also mention the importance of using the correct configuration files and permissions, and provide links to additional resources.]]
[[!meta date="2010-06-28 09:30"]]
[[!tag linux mail usability]]

[[!img media/spamassassin-logo-320x136.png alt="copied from http://spamassassin.apache.org/logo" style="float: right"]]

# motivation

i've had lots of trouble with mail over the last two years, with lots of spam still passing spamassassin. that wasn't too bad since we already had greylisting [1] running, but it still meant **8 spam mails per day** with subjects such as:

* Hot news for js! 70% off all June!
* Totaler Ausverkauf von Top Zeitmesser
* Yes, js, today -80% to all prices. Office accessible free
* This crysis never ends
* VIAGRA ® Official Site -95%

however, that changed after i started to collect the spam to train the bayesian filter [2] used by spamassassin. this posting is about how i've done it!

**in contrast: after using bayes i receive one spam mail every 4th day**.

i still have to monitor the spam, which is filed in a folder called 'spam' in one of my 'imap' folders. but since nearly everything in there really is spam, it's simply a copy'n'paste operation of all files in 'spam' to 'spam_train'. the 'spam_train' folder is used to train spamassassin.

of course there are false positives and false negatives as well, but handling those is very easy.

# is my setup special in any way?

no, i don't think so at all. **however: most sites encourage admins not to use bayes filters per vpopmail account.** because of that and other factors there is not much documentation on how to do so.
since i had some issues with how to call spamassassin from procmail, it did not work for a very long time.

recently i bought a nokia n900 mobile phone and using mail on this device is very nice. but receiving a spam mail per hour means a notification on the n900, which IS annoying if the notifications mostly contain spam slogans.

# my .procmailrc

    # qmail Lazydog procmailrc file
    SHELL="/bin/bash"
    VHOME="/var/vpopmail/domains/lastlog.de/js"
    VERBOSE="no"
    LOGFILE="/var/vpopmail/domains/lastlog.de/js/procmail.log"

    :0fw
    | spamassassin --siteconfigpath=/var/vpopmail/domains/lastlog.de/js/.spamassassin/ -p /var/vpopmail/domains/lastlog.de/js/.spamassassin/local.cf

    :0:
    * ^X-Spam-Flag: YES
    /var/vpopmail/domains/lastlog.de/js/.maildir/.spam/new

    :0:
    * ^(From|Cc|To).*news@aktuell.conrad.de
    /dev/null

    # (other rules below)

the important part is the spamassassin call with --siteconfigpath and -p.

# my local.cf file, the only config file i changed

    # cat local.cf
    required_score 3.2
    rewrite_header subject _SCORE(0)_ |
    use_bayes 1
    use_dcc 1
    use_pyzor 1
    use_razor2 1
    bayes_auto_learn 1
    bayes_path /var/vpopmail/domains/lastlog.de/js/.spamassassin/db/bayes
    bayes_file_mode 0777
    report_safe 1
    #add_header all Flag _YESNOCAPS_
    ok_languages en de
    ok_locales en de

    # Google SafeBrowsing Plugin
    loadplugin Mail::SpamAssassin::Plugin::GoogleSafeBrowsing
    #body GOOGLE_SAFEBROWSING eval:check_google_safebrowsing_blocklists()
    google_safebrowsing_apikey ABQIAA#VJ#J#VAQFbnTQ4uqKBRgArBR3gWufhSOf_c-vzV4UEN0steDDKD
    google_safebrowsing_dir /var/cache/spamassassin
    # scores for each url hit in message body
    google_safebrowsing_blocklist goog-black-hash 0.3
    google_safebrowsing_blocklist goog-malware-hash 0.5

# how to experiment with spam

no blog or documentation i found talks about this, but i think it's very important. say you have a spam mail and you want to test your spamassassin configuration. do this:

    ls -la /var/vpopmail/domains/lastlog.de/js/.maildir/cur
    -rw-rwx---+ 1 vpopmail vpopmail  2630 26. Dez  2009 msg.zQJ68:2,RS*
    -rw-rwx---+ 1 vpopmail vpopmail 21905 16. Dez  2009 msg._zr78:2,S*
    -rw-rwx---+ 1 vpopmail vpopmail  7433  9. Mär  2009 msg.zssm8:2,S*
    -rw-rwx---+ 1 vpopmail vpopmail  2332  6. Mai 17:27 msg.ZuFm8:2,S*
    -rw-rwx---+ 1 vpopmail vpopmail  3418 21. Nov  2008 msg.ZvFh8:2,S*
    -rw-rwx---+ 1 vpopmail vpopmail  3237 23. Feb 13:07 msg.zZcg9:2,S*

let's pick one of these messages: '**msg.zf8y8:2,S**'

now, let's see what spamassassin thinks about this message with:

    cat msg.zf8y8:2,S | spamassassin -t -D -p /var/vpopmail/domains/lastlog.de/js/.spamassassin/local.cf >mail 2>debug

**-t -D** are only needed for debugging. also have a look at:

    tail -n 200 /var/vpopmail/domains/lastlog.de/js/procmail.log

# creating the configuration for vpopmail usage of spamassassin

in the command above the message is processed by spamassassin and written to a file called 'mail', and the debug output going to stderr is written to a file called 'debug'. this helped me a lot to verify that spamassassin was reading the right config files.

    cat msg.zf8y8:2,S | spamassassin -t -D --siteconfigpath=/var/vpopmail/domains/lastlog.de/js/.spamassassin/ -p /var/vpopmail/domains/lastlog.de/js/.spamassassin/local.cf >mail 2>debug

watch for things like this:

    [7582] dbg: conf: finish parsing
    [7582] dbg: plugin: Mail::SpamAssassin::Plugin::GoogleSafeBrowsing=HASH(0x14e3a10) implements 'finish_parsing_end', priority 0
    [7582] dbg: bayes: tie-ing to DB file R/O /var/vpopmail/domains/lastlog.de/js/.spamassassin/db/bayes_toks
    [7582] dbg: bayes: tie-ing to DB file R/O /var/vpopmail/domains/lastlog.de/js/.spamassassin/db/bayes_seen
    [7582] dbg: bayes: found bayes db version 3
    [7582] dbg: bayes: DB journal sync: last sync: 0
    [7582] dbg: config: score set 3 chosen.
    [7582] dbg: message: main message type: multipart/mixed
    check: no loaded plugin implements 'check_main': cannot scan! at /usr/lib64/perl5/vendor_perl/5.8.8/Mail/SpamAssassin/PerMsgStatus.pm line 164.

the last line is the problem. it seems that if i use --siteconfigpath, spamassassin expects all config files to be in '/var/vpopmail/domains/lastlog.de/js/.spamassassin/', but usually they are in '/etc/spamassassin/'. so when using --siteconfigpath make sure you copy all files from '/etc/spamassassin/*' to your own configuration directory '/var/vpopmail/domains/lastlog.de/js/.spamassassin/'.

**WARNING: do not overwrite your customized files as there might be a local.cf in both paths!**

to check if the right bayes path is used, look at the debug file we created:

    cat debug | grep bayes
    [20792] dbg: bayes: tie-ing to DB file R/O /var/vpopmail/domains/lastlog.de/js/.spamassassin/db/bayes_toks
    [20792] dbg: bayes: tie-ing to DB file R/O /var/vpopmail/domains/lastlog.de/js/.spamassassin/db/bayes_seen
    [20792] dbg: bayes: found bayes db version 3
    [20792] dbg: bayes: DB journal sync: last sync: 0

in this case everything is right. also check whether the mail is classified as spam; just have a look at the 'mail' file:

    Subject: 16.6 | VIAGRA ® Official Site -95%
    X-Spam-Flag: YES
    X-Spam-Checker-Version: SpamAssassin 3.2.1 (2007-05-02) on bonker.serverkommune.de
    X-Spam-Level: ****************
    X-Spam-Status: Yes, score=16.6 required=5.0 tests=AWL,BAYES_99,
        HTML_IMAGE_ONLY_08,HTML_MESSAGE,HTML_SHORT_LINK_IMG_1,MISSING_DATE,
        MISSING_MID,MISSING_SUBJECT,NO_RELAYS,URIBL_AB_SURBL,URIBL_BLACK,
        URIBL_JP_SURBL,URIBL_OB_SURBL,URIBL_WS_SURBL shortcircuit=no
        autolearn=unavailable version=3.2.1
    MIME-Version: 1.0

# initial bayes, sa-learn

since i had been training the db wrongly for some time, i decided to relearn everything:

    cd /var/vpopmail/domains/lastlog.de/js/.spamassassin/db/
    ls -la
    -rw-rw----+ 1 joachim users    7824 21. Jun 15:17 bayes_journal
    -rw-rwx---+ 1 joachim users  643072 21. Jun 15:09 bayes_seen*
    -rw-rwx---+ 1 joachim users 5177344 21. Jun 15:09 bayes_toks*

just remove all 3 files (or create backups if in doubt) and then use sa-learn:

    cat spamcommand
    # learn spam from the spam_train folder
    echo "=== ===="
    echo "/var/vpopmail/domains/lastlog.de/js/.maildir/.spam_train/cur"
    nice sa-learn -D 5 -u vpopmail --dbpath /var/vpopmail/domains/lastlog.de/js/.spamassassin/db --siteconfigpath /var/vpopmail/domains/lastlog.de/js/.spamassassin/ -p /var/vpopmail/domains/lastlog.de/js/.spamassassin/user_prefs -C /var/vpopmail/domains/lastlog.de/js/.spamassassin --spam --dir /var/vpopmail/domains/lastlog.de/js/.maildir/.spam_train/cur
    echo "=== ==="
    echo ""
    echo "=== ==="
    # learn ham from the regular folders
    for i in cur .js@dune2.de/cur .notice/cur; do
      echo "learning from: /var/vpopmail/domains/lastlog.de/js/.maildir/${i}"
      nice sa-learn -D 5 -u vpopmail --dbpath /var/vpopmail/domains/lastlog.de/js/.spamassassin/db --siteconfigpath /var/vpopmail/domains/lastlog.de/js/.spamassassin/ -p /var/vpopmail/domains/lastlog.de/js/.spamassassin/user_prefs -C /var/vpopmail/domains/lastlog.de/js/.spamassassin --ham --dir "/var/vpopmail/domains/lastlog.de/js/.maildir/${i}"
    done
    echo "=== ==="

i execute this script every night using a cronjob, and so far it is working great! handling false positives and false negatives is easy as well, since you only have to copy the misclassified mails into the right folder and the next cron run should relearn them correctly. i'm not sure about this, but so far it is working.

    ./spamcommand
    ==== ===
    /var/vpopmail/domains/lastlog.de/js/.maildir/.spam_train/cur
    [7913] info: archive-iterator: skipping large message
    ...
    Learned tokens from 2515 message(s) (2531 message(s) examined)
    === ===

    === ===
    [22465] info: archive-iterator: skipping large message
    ...
    Learned tokens from 1100 message(s) (1171 message(s) examined)
    [19782] info: archive-iterator: skipping large message
    ....
    Learned tokens from 191 message(s) (199 message(s) examined)
    === ===
    ./spamcommand  291,01s user 12,00s system 12% cpu 40:05,37 total

the first run of this script took about **40 minutes**.

# rights management

since there might be multiple users accessing the bayes dbs, you have to check the permissions. most likely these users are: vpopmail and your login name.

# final words

remove **-t** and **-D** once you have finished debugging, as they will flood your logs.

also have a look at [3], [4] and [5].

# links

* [1]
* [2]
* [3]
* [4]
* [5]
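
# addendum: checking the bayes db permissions

as a follow-up to the rights management section: a minimal sketch of how one could list bayes db files that are not readable and writable for both owner and group. the function name 'check_bayes_perms' and the 660 threshold are my own assumptions for illustration, not something spamassassin prescribes. it only looks at the plain mode bits, so it does not see ACLs (the '+' in the ls output above).

```shell
#!/bin/sh
# hypothetical helper: print every bayes_* file in the given directory
# that lacks read+write permission for owner or group.
# usage: check_bayes_perms /path/to/.spamassassin/db
check_bayes_perms() {
    db_dir="$1"
    # '-perm -660' matches files that have at least rw for owner and group;
    # the '!' inverts the match, so only the problematic files are printed.
    find "$db_dir" -type f -name 'bayes_*' ! -perm -660 -print
}
```

calling it with the db directory from bayes_path (here that would be '/var/vpopmail/domains/lastlog.de/js/.spamassassin/db') prints nothing if all files pass this check.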