[[!summary chatGPT: The author discusses their experience with using Bayesian filtering with SpamAssassin to reduce spam emails, including their setup, configuration, and training of the filter. They provide details on their procmail configuration, local.cf file, and how they experiment with spam. They also mention the importance of using the correct configuration files and permissions, and provide links to additional resources.]]
[[!meta date="2010-06-28 09:30"]]
[[!tag linux mail usability]]

[[!img media/spamassassin-logo-320x136.png alt="copied from http://spamassassin.apache.org/logo" style="float: right"]]


# motivation
i've had lots of trouble with mail over the last two years, with lots of spam still passing spamassassin. that wasn't so bad since we already had greylisting [1] running, but it still meant **8 spam mails per day** with subjects such as:


  * Hot news for js! 70% off all June!
  * Totaler Ausverkauf von Top Zeitmesser
  * Yes, js, today -80% to all prices. Office accessible free
  * This crysis never ends
  * VIAGRA ® Official Site -95%

however, that changed after i started to collect the spam to train the bayesian filter [2] used by spamassassin. this posting is about how i've done it!

**in contrast: after using bayes i receive one spam mail every 4th day**.

i still have to monitor the spam, which is filed in a folder called 'spam' in one of my imap folders. since almost everything in there really is spam, this is simply a copy'n'paste operation of all files in 'spam' to 'spam_train'. the 'spam_train' folder is then used to train spamassassin. of course there are false positives and false negatives as well, but handling those is very easy.
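
the copy step can also be done on the shell. this is only a minimal sketch: `move_reviewed_spam` is a made-up helper name, and the two directories are passed in as arguments because the exact folder layout depends on your maildir setup.

```shell
# Sketch: move reviewed spam into the training folder.
# move_reviewed_spam is a hypothetical helper, not part of spamassassin.
move_reviewed_spam() {
    local src="$1"   # e.g. .../.maildir/.spam/cur
    local dst="$2"   # e.g. .../.maildir/.spam_train/cur
    local moved=0
    local f
    for f in "$src"/*; do
        [ -f "$f" ] || continue   # also skips the unexpanded glob if src is empty
        mv -- "$f" "$dst"/ && moved=$((moved + 1))
    done
    echo "moved $moved message(s)"
}
```

only run something like this after you have looked through the 'spam' folder, otherwise false positives end up in the training data.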


# is my setup special in any way?


no, i don't think so at all. **however: most sites encourage admins not to use bayes filters per vpopmail account.** because of that and other factors there is not much documentation on how to do so. since i had some issues with how to call spamassassin from procmail, it did not work for a very long time. recently i bought a nokia n900 mobile phone and using mail on this device is very nice. but receiving a spam mail per hour means a notification on the n900, which IS annoying if the notifications mostly contain spam slogans.


# my .procmailrc

    # qmail Lazydog procmailrc file
    SHELL="/bin/bash"
    VHOME="/var/vpopmail/domains/lastlog.de/js"
    VERBOSE="no"
    LOGFILE="/var/vpopmail/domains/lastlog.de/js/procmail.log"
    # pipe every message through spamassassin as a filter (f) and wait for it to finish (w)
    :0fw
    | spamassassin --siteconfigpath=/var/vpopmail/domains/lastlog.de/js/.spamassassin/ -p /var/vpopmail/domains/lastlog.de/js/.spamassassin/local.cf
    :0:
    * ^X-Spam-Flag: YES
    /var/vpopmail/domains/lastlog.de/js/.maildir/.spam/new
    :0:
    * ^(From|Cc|To).*news@aktuell.conrad.de
    /dev/null
    # (other rules below)

# my local.cf file, the only config file i changed:

    # cat local.cf
    required_score          3.2
    rewrite_header subject _SCORE(0)_ |
    use_bayes 1
    use_dcc 1
    use_pyzor 1
    use_razor2 1
    bayes_auto_learn 1
    bayes_path /var/vpopmail/domains/lastlog.de/js/.spamassassin/db/bayes
    bayes_file_mode 0777
    report_safe 1
    #add_header all Flag _YESNOCAPS_
    ok_languages            en de
    ok_locales              en de
    # Google SafeBrowsing Plugin
    loadplugin Mail::SpamAssassin::Plugin::GoogleSafeBrowsing
    #body GOOGLE_SAFEBROWSING eval:check_google_safebrowsing_blocklists()
    google_safebrowsing_apikey ABQIAA#VJ#J#VAQFbnTQ4uqKBRgArBR3gWufhSOf_c-vzV4UEN0steDDKD
    google_safebrowsing_dir /var/cache/spamassassin
    # scores for each url hit in message body
    google_safebrowsing_blocklist goog-black-hash 0.3
    google_safebrowsing_blocklist goog-malware-hash 0.5
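
before relying on a changed local.cf it may be worth letting spamassassin check the syntax with its `--lint` option. a minimal sketch, assuming the same per-account directory as in my procmailrc; `lint_sa_config` is just a name i made up for the wrapper:

```shell
# Sketch: syntax-check a per-account configuration.
# lint_sa_config is a hypothetical helper name; it assumes local.cf lives
# inside the directory that is also passed as --siteconfigpath.
lint_sa_config() {
    local confdir="$1"
    spamassassin --lint \
        --siteconfigpath="$confdir" \
        -p "$confdir/local.cf"
}
```

called as `lint_sa_config /var/vpopmail/domains/lastlog.de/js/.spamassassin` it should stay quiet if the files parse cleanly and complain otherwise. note that the directory has to contain the stock config files as well, as described further down.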

# how to experiment with spam

no blog or documentation i found talks about this, but i think it's very important. say you have a spam mail and you want to test your spamassassin configuration against it. first list the messages in your maildir:


    ls -la /var/vpopmail/domains/lastlog.de/js/.maildir/cur
    -rw-rwx---+ 1 vpopmail vpopmail     2630 26. Dez 2009  msg.zQJ68:2,RS*
    -rw-rwx---+ 1 vpopmail vpopmail    21905 16. Dez 2009  msg._zr78:2,S*
    -rw-rwx---+ 1 vpopmail vpopmail     7433  9. Mär 2009  msg.zssm8:2,S*
    -rw-rwx---+ 1 vpopmail vpopmail     2332  6. Mai 17:27 msg.ZuFm8:2,S*
    -rw-rwx---+ 1 vpopmail vpopmail     3418 21. Nov 2008  msg.ZvFh8:2,S*
    -rw-rwx---+ 1 vpopmail vpopmail     3237 23. Feb 13:07 msg.zZcg9:2,S*


let's pick one of these messages: '**msg.zf8y8:2,S**'

now, let's see what spamassassin thinks about this message with:

    cat msg.zf8y8:2,S | spamassassin -t -D -p /var/vpopmail/domains/lastlog.de/js/.spamassassin/local.cf >mail 2>debug

**-t** (test mode, which appends the report to the message) and **-D** (debug output on stderr) are only needed for debugging.

also have a look at:

    tail -n 200 /var/vpopmail/domains/lastlog.de/js/procmail.log

# creating the configuration for vpopmail usage of spamassassin


in the command above, the message is processed by spamassassin and written to a file called 'mail', while the debug output going to stderr is written to a file called 'debug'. this helped me a lot to verify that spamassassin was reading the right config files. for the vpopmail setup i also pass --siteconfigpath:

    cat msg.zf8y8:2,S | spamassassin -t -D --siteconfigpath=/var/vpopmail/domains/lastlog.de/js/.spamassassin/ -p /var/vpopmail/domains/lastlog.de/js/.spamassassin/local.cf >mail 2>debug

watch for things like this:

    [7582] dbg: conf: finish parsing
    [7582] dbg: plugin: Mail::SpamAssassin::Plugin::GoogleSafeBrowsing=HASH(0x14e3a10) implements 'finish_parsing_end', priority 0
    [7582] dbg: bayes: tie-ing to DB file R/O /var/vpopmail/domains/lastlog.de/js/.spamassassin/db/bayes_toks
    [7582] dbg: bayes: tie-ing to DB file R/O /var/vpopmail/domains/lastlog.de/js/.spamassassin/db/bayes_seen
    [7582] dbg: bayes: found bayes db version 3
    [7582] dbg: bayes: DB journal sync: last sync: 0
    [7582] dbg: config: score set 3 chosen.
    [7582] dbg: message: main message type: multipart/mixed

**check: no loaded plugin implements 'check_main': cannot scan! at /usr/lib64/perl5/vendor_perl/5.8.8/Mail/SpamAssassin/PerMsgStatus.pm line 164.**

it seems that if i use --siteconfigpath, spamassassin expects all config files to be in '/var/vpopmail/domains/lastlog.de/js/.spamassassin/', whereas they usually live in '/etc/spamassassin/'. if they are missing, the plugin that does the actual scanning is apparently never loaded, which leads to the error above. so when using --siteconfigpath, make sure you copy all files from '/etc/spamassassin/*' to your own configuration directory '/var/vpopmail/domains/lastlog.de/js/.spamassassin/'.

**WARNING: do not overwrite your customized files as there might be a local.cf in both paths!**
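
one way to copy the stock files without touching anything you already customized is a small loop that skips existing files. this is a sketch only; `copy_missing_config` is a hypothetical helper name and both directories are arguments:

```shell
# Sketch: copy stock config files into the account directory, never
# overwriting a file that already exists there (e.g. a customized local.cf).
# copy_missing_config is a hypothetical helper name.
copy_missing_config() {
    local src="$1"   # e.g. /etc/spamassassin
    local dst="$2"   # e.g. .../js/.spamassassin
    local f name
    for f in "$src"/*; do
        [ -f "$f" ] || continue      # regular files only, subdirectories are skipped
        name=${f##*/}                # strip the directory part
        if [ -e "$dst/$name" ]; then
            echo "kept existing $name"
        else
            cp -p -- "$f" "$dst/$name" && echo "copied $name"
        fi
    done
}
```

the output lists which files were copied and which were left alone, so you can see at a glance whether your own local.cf survived.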

to check if the right bayes path is used, look at the debug file we created:

    cat debug | grep bayes
    [20792] dbg: bayes: tie-ing to DB file R/O /var/vpopmail/domains/lastlog.de/js/.spamassassin/db/bayes_toks
    [20792] dbg: bayes: tie-ing to DB file R/O /var/vpopmail/domains/lastlog.de/js/.spamassassin/db/bayes_seen
    [20792] dbg: bayes: found bayes db version 3
    [20792] dbg: bayes: DB journal sync: last sync: 0

in this case everything is right. 
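
if the debug file is long, the db paths can be pulled out directly. a small sketch, assuming the "tie-ing to DB file" wording shown in the debug output above; `bayes_db_files` is a made-up name:

```shell
# Sketch: list the bayes DB files named in a 'spamassassin -D' debug log.
# bayes_db_files is a hypothetical helper; it relies on the
# "tie-ing to DB file R/O <path>" lines seen in the debug output.
bayes_db_files() {
    sed -n 's/.*bayes: tie-ing to DB file R\/[OW] //p' "$1" | sort -u
}
```

`bayes_db_files debug` then prints one path per line, which makes a wrong bayes_path easy to spot.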

also check whether the mail is classified as spam. just have a look at the 'mail' file:

    Subject: 16.6 | VIAGRA ® Official Site -95%
    X-Spam-Flag: YES
    X-Spam-Checker-Version: SpamAssassin 3.2.1 (2007-05-02) on
    bonker.serverkommune.de
    X-Spam-Level: ****************
    X-Spam-Status: Yes, score=16.6 required=5.0 tests=AWL,BAYES_99,
    HTML_IMAGE_ONLY_08,HTML_MESSAGE,HTML_SHORT_LINK_IMG_1,MISSING_DATE,
    MISSING_MID,MISSING_SUBJECT,NO_RELAYS,URIBL_AB_SURBL,URIBL_BLACK,
    URIBL_JP_SURBL,URIBL_OB_SURBL,URIBL_WS_SURBL shortcircuit=no
    autolearn=unavailable version=3.2.1
    MIME-Version: 1.0
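
the same check can be scripted, for example to run it over several test messages. a minimal sketch; `is_flagged_spam` is a hypothetical helper that only looks at the header block of the file:

```shell
# Sketch: succeed if a message's headers contain "X-Spam-Flag: YES".
# is_flagged_spam is a hypothetical helper name.
is_flagged_spam() {
    # sed stops at the first empty line (the end of the headers), so a
    # body that merely quotes the header does not count.
    sed '/^$/q' "$1" | grep -q '^X-Spam-Flag: YES'
}
```

usage would be something like `is_flagged_spam mail && echo "classified as spam"`.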

# initial bayes, sa-learn

since i had been training the db wrongly for some time, i decided to relearn everything from scratch. these are the db files:


    cd /var/vpopmail/domains/lastlog.de/js/.spamassassin/db/
    
    ls -la
    -rw-rw----+ 1 joachim users    7824 21. Jun 15:17 bayes_journal
    -rw-rwx---+ 1 joachim users  643072 21. Jun 15:09 bayes_seen*
    -rw-rwx---+ 1 joachim users 5177344 21. Jun 15:09 bayes_toks*
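
if you would rather keep the old databases around than delete them, they can be moved aside first. a minimal sketch, assuming nothing else is writing to the db at that moment; `backup_bayes_db` and the `backup-<timestamp>` directory name are my own invention:

```shell
# Sketch: move the old bayes files into a dated backup directory
# instead of deleting them. backup_bayes_db is a hypothetical helper name.
backup_bayes_db() {
    local dbdir="$1"
    local backup="$dbdir/backup-$(date +%Y%m%d-%H%M%S)"
    local f
    mkdir -p "$backup" || return 1
    for f in "$dbdir"/bayes_*; do
        [ -f "$f" ] || continue      # nothing to do if no db files exist
        mv -- "$f" "$backup"/
    done
    echo "$backup"                   # print where the files went
}
```

afterwards the db directory contains no bayes files anymore, which is the same starting point as removing them.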


just remove all 3 files (or create backups if in doubt) and then use sa-learn:

    cat spamcommand

    # learn spam from the 'spam_train' folder
    echo "=== <SPAM Train> ===="
    echo "/var/vpopmail/domains/lastlog.de/js/.maildir/.spam_train/cur"
    nice sa-learn -D 5 -u vpopmail --dbpath /var/vpopmail/domains/lastlog.de/js/.spamassassin/db --siteconfigpath /var/vpopmail/domains/lastlog.de/js/.spamassassin/ -p /var/vpopmail/domains/lastlog.de/js/.spamassassin/user_prefs -C /var/vpopmail/domains/lastlog.de/js/.spamassassin --spam --dir /var/vpopmail/domains/lastlog.de/js/.maildir/.spam_train/cur
    echo "=== </SPAM Train> ==="
    echo ""

    # learn ham from a few folders that only contain wanted mail
    echo "=== <HAM Train> ==="
    for i in cur .js@dune2.de/cur .notice/cur; do
        echo "learning from: /var/vpopmail/domains/lastlog.de/js/.maildir/${i}"
        nice sa-learn -D 5 -u vpopmail --dbpath /var/vpopmail/domains/lastlog.de/js/.spamassassin/db --siteconfigpath /var/vpopmail/domains/lastlog.de/js/.spamassassin/ -p /var/vpopmail/domains/lastlog.de/js/.spamassassin/user_prefs -C /var/vpopmail/domains/lastlog.de/js/.spamassassin --ham --dir "/var/vpopmail/domains/lastlog.de/js/.maildir/${i}"
    done
    echo "=== </HAM Train> ==="

    
i execute this script every night using a cronjob, and so far it is working great! handling false positives and false negatives is easy as well: you only have to move the misclassified mails into the right folder and the next cronjob run should relearn them correctly. i'm not sure about this, but so far it is working.

    ./spamcommand

    ==== <SPAM Train> ===
    
    /var/vpopmail/domains/lastlog.de/js/.maildir/.spam_train/cur
    
    [7913] info: archive-iterator: skipping large message
    
    ...
    
    Learned tokens from 2515 message(s) (2531 message(s) examined)
    
    === </SPAM Train> ===
    
    === <HAM Train> ===
    
    [22465] info: archive-iterator: skipping large message
    
    ...
    
    Learned tokens from 1100 message(s) (1171 message(s) examined)
    
    [19782] info: archive-iterator: skipping large message
    
    ....
    
    Learned tokens from 191 message(s) (199 message(s) examined)
    
    === </HAM Train> ===
    
    ./spamcommand  291,01s user 12,00s system 12% cpu 40:05,37 total

the first run of this script took about **40 minutes**.


# rights management

since there might be multiple users accessing the bayes dbs, you have to check the file permissions. most likely these users are vpopmail and your own login name.
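
a quick way to spot db files that only their owner can use is to look for missing group bits. this is just a sketch: `bayes_perm_problems` is a hypothetical helper, it assumes both users share a group, and it ignores ACLs (the `+` in the listings above).

```shell
# Sketch: list bayes files that the group cannot both read and write.
# bayes_perm_problems is a hypothetical helper name.
bayes_perm_problems() {
    # -perm -060 matches files with group read AND group write set;
    # the leading '!' inverts that, so only problem files are printed.
    find "$1" -maxdepth 1 -type f -name 'bayes_*' ! -perm -060
}
```

if `bayes_perm_problems /var/vpopmail/domains/lastlog.de/js/.spamassassin/db` prints nothing, the group bits are fine; whether the group itself is the right one still has to be checked by hand.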

# final words

remove **-t** and **-D** once you have finished debugging, as the debug output will flood your logs. also have a look at [3], [4] and [5].

# links

* [1] <http://de.wikipedia.org/wiki/Greylisting>
* [2] <http://wiki.apache.org/spamassassin/BayesFaq>
* [3] <http://wiki.apache.org/spamassassin/BayesInSpamAssassin>
* [4] <http://standbytux.blogspot.com/2005/07/using-spamassassin-and-procmail-to.html>
* [5] <http://faisal.com/docs/salearn>