pydspam
Bayesian Message Filtering for Python
or
Integrating Python with DSPAM
and including
RPMs for DSPAM
Downloads, Bugs, Header Triage
This project provides Python
support for fast, sophisticated Bayesian
message filtering. It is based on the excellent
DSPAM project
provided by
Jonathan A. Zdziarski. It also provides easy-to-use RPMs for
dspam and dspam-python.
Neither BMS nor Stuart Gathman is affiliated with Jonathan Zdziarski
or Network Dweebs, except as
enthusiastic users of their free product. Dspam was chosen because
it provides a library with a C API in addition to a complete MDA-based
spam filtering application. Python applications use the C API through
an extension module.
What is DSPAM? Here is an excerpt from
the DSPAM project README:
DSPAM is an
open-source, freely available anti-spam solution designed to combat
unsolicited commercial email using Baye's theorem of combined probabilities.
The result is an administratively maintenance free system capable of learning
each user's email behaviors with very few false positives.
DSPAM can be implemented in one of two ways:
- The DSPAM mailer-agent provides server-side spam filtering, quarantine
box, and a mechanism for forwarding spams into the system to be automatically
analyzed.
- Developers may link their projects to the dspam core engine (libdspam) in
accordance with the GPL license agreement. This enables developers to
incorporate libdspam as a "drop-in" for instant spam filtering within their
applications - such as mail clients, other anti-spam tools, and so on.
Many of the ideas incorporated into this agent were contributed by Paul
Graham's excellent
white paper on combatting SPAM.
Many new approaches have also been implemented by DSPAM.
Dspam RPM
To make using pydspam as convenient as possible, I provide
an RPM for dspam, which uses the source code from Network Dweebs largely
unchanged. RPM by its nature uses pristine sources from the vendor,
and applies patches for any necessary local changes.
I found it necessary to add an entry point for tokenizing
a message. The patches included in the RPM have this change and
fixes for some bugs not yet fixed in the official source. In addition,
there are some C unit tests to make sure bugs stay fixed.
The C unit tests use the check project. The RPM build
procedure does not attempt to build or run the unit tests, so the check
framework is not needed to build the RPM. If you wish to verify
dspam, you need to install the source RPM and build from the spec
file. Then go to the build directory and run make -f maketest.
Configuring DSPAM after installing the RPM
The RPM automatically installs cron entries for dspam_purge and dspam_clean
in the /etc/cron.weekly and /etc/cron.daily directories.
Activating DSPAM to work with sendmail
The RPM installs a 'dspam' local mailer macro for sendmail-cf. To activate
dspam for the version of sendmail included with RedHat, simply replace
MAILER(local) with MAILER(dspam) in /etc/mail/sendmail.mc, then
regenerate sendmail.cf (instructions are in the comments at the
top of sendmail.mc).
Activating the DSPAM CGI script
The RPM installs the CGI interface in the /var/www/cgi-bin/dspam
directory. A wrapper script is installed as /var/www/cgi-bin/dspam.cgi.
The wrapper script runs the DSPAM CGI interface as the dspam user,
which is also a member of the mail group.
To enable the CGI interface, you need to add an authorization entry
to /etc/httpd/conf/httpd.conf. For example,
ScriptAlias /cgi-bin/ "/var/www/cgi-bin/"
#
# "/var/www/cgi-bin" should be changed to whatever your ScriptAliased
# CGI directory exists, if you have that configured.
#
<Directory "/var/www/cgi-bin">
AuthName Dspam
AuthType Basic
AuthUserFile /etc/httpd/conf/passwd
AuthGroupFile /etc/httpd/conf/group
Require group dspam
AllowOverride None
Options None FollowSymLinks
Order allow,deny
Allow from all
</Directory>
DSPAM RPM support for Python
The pydspam project is included as the dspam-python
sub-package, which is built from the pydspam source. If you don't wish to build
the Python package, set the build_python macro to 0
at the top of the RPM spec file in the source RPM. The dspam-python binary RPM
provides a Python module which wraps the dspam
core engine (libdspam). Some of the dspam command line tools are reimplemented
in Python to illustrate use of the library. (Installed as documentation by the
RPM.)
A new tool, pydspam_anal.py, shows the contribution each token of a
message makes to the total DSPAM score.
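To give a feel for what a per-token breakdown means, here is a minimal sketch of the naive Bayesian combination described in Paul Graham's paper. It is an illustration only: the token probabilities are made up, the function names are hypothetical, and the actual computation inside libdspam and pydspam_anal.py may well differ.

```python
from math import prod


def clamp(p, low=0.01, high=0.99):
    """Keep a probability away from 0 and 1 so neither product collapses to zero."""
    return min(high, max(low, p))


def combined_probability(token_probs):
    """Combine per-token spam probabilities into one score (Graham style)."""
    probs = [clamp(p) for p in token_probs]
    spam = prod(probs)
    innocent = prod(1.0 - p for p in probs)
    return spam / (spam + innocent)


def token_contributions(token_probs):
    """Map each token to the change in score caused by including it.

    token_probs is a dict {token: spam probability}. A positive value
    means the token pushes the message toward spam.
    """
    total = combined_probability(token_probs.values())
    contributions = {}
    for token in token_probs:
        rest = [p for t, p in token_probs.items() if t != token]
        contributions[token] = total - combined_probability(rest)
    return contributions


# Made-up probabilities for illustration.
example = {"viagra": 0.99, "meeting": 0.10}
for token, delta in token_contributions(example).items():
    print(f"{token:10s} {delta:+.4f}")
```

With an empty token list the score is 0.5, which is why the contribution of a lone token is measured against a neutral baseline.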
For a really powerful mail filtering system, combine the DSPAM Python
module with sendmail and
Python Milter. For instance, here is
a simple change to milter-0.5.5 I am testing:
Patch to bms.py from milter-0.5.5.
The dictionary is the one maintained by the dspam delivery agent installed
with the dspam package. Scanning the headers in the milter allows us
to REJECT common spams without a lot of processing.
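The idea of header triage can be sketched as follows. This is not the patch to bms.py. It is a standalone illustration in which score_headers is a hypothetical stand-in for a call into the dspam module, and the 0.95 threshold is an assumed value. Only the header block is scored, so the body never has to be decoded.

```python
from email import message_from_string

REJECT_THRESHOLD = 0.95  # assumed cutoff, not a value taken from DSPAM


def header_text(raw_message):
    """Return only the header block of an RFC 822 message as text."""
    msg = message_from_string(raw_message)
    return "\n".join(f"{name}: {value}" for name, value in msg.items())


def triage(raw_message, score_headers, threshold=REJECT_THRESHOLD):
    """Return 'REJECT' or 'CONTINUE' based on the header score alone.

    score_headers is any callable that maps header text to a spam
    probability between 0 and 1 (hypothetical; in a real milter it
    would consult the dspam dictionary).
    """
    score = score_headers(header_text(raw_message))
    return "REJECT" if score >= threshold else "CONTINUE"


def toy_scorer(text):
    """Toy scorer for demonstration only."""
    return 0.99 if "viagra" in text.lower() else 0.10


spam = "From: x@example.com\nSubject: cheap viagra\n\nbody text\n"
ham = "From: y@example.com\nSubject: patch review\n\nviagra only in body\n"
print(triage(spam, toy_scorer))
print(triage(ham, toy_scorer))
```

A message that passes triage would still go on to the full dspam delivery agent, which sees the body as well.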
To show just how bad the spam problem is, here are statistics for our
domain with just 6 users. Two users (including me) are published on
the web with HTML encoding. I also use my real email when posting
to newsgroups. Because my email is accessible, I receive welcome email
from fellow techies all over the world.
Statistics for Jul 15
1139 | Messages from known spamming domains refused by sendmail.
160 | Messages REJECTED by milter because of banned keywords like 'viagra'.
169 | Messages REJECTED by milter because of high Dspam scores for headers.
261 | Messages quarantined by Dspam mail delivery agent.
40 | Actual email received for 6 users.
We do not use a black hole list for known spamming IPs / domains. This
is because some of our customers use blacklisted ISPs because they
are the only broadband available in their area. Black hole lists like
to blacklist entire ISPs, including innocent customers who have no
other choice (other than dialup) for connectivity.
With a little Python programming to collect data, DSPAM will allow us to
automate building the list of banned IPs / domains.
The header triage feature will be in milter-0.5.6. I envision a complete
milter-based implementation of dspam which appoints selected
email destinations as 'moderators'. The MDA approach currently used
by dspam requires all users to diligently classify their email to train
the filter. In the new approach, moderators will do this work, and
the resulting dspam dictionary will be used to filter mail for other users
in their group.
Learning Decay
Here I address a problem encountered with the Dspam approach.
There needs to be some sort of decay of learned messages. Otherwise,
adaptation gets less and less with each message until we're effectively not
learning any more. One approach would be to periodically divide all hit counts
by 2. For instance, when total messages (Spam + Innocent) reaches 4000 (or
some other number substantially bigger than 1000), then divide all hits and
totals in the dictionary by 2. This will give the next 2000 messages double
the weight of the previous 4000. And messages 6001-8000 will have four times
the weight of 1-4000, and twice the weight of 4001-6000.
Dspam_purge would be a good place to implement the decay algorithm.
We might then want to add a new totals record, e.g. '_GTOT'. This
would keep the real (not scaled) totals that humans are interested in.
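The decay proposal can be sketched in a few lines of Python. The in-memory layout here (plain dicts, a nested '_GTOT' entry) is an assumption made for illustration and is not the on-disk dspam dictionary format. Integer halving is also an assumption: it means a token with a single hit drops to zero.

```python
DECAY_LIMIT = 4000  # threshold suggested in the text above


def learn(totals, is_spam):
    """Count one learned message in both the scaled and the real totals."""
    key = "spam" if is_spam else "innocent"
    totals[key] += 1
    totals["_GTOT"][key] += 1  # real totals, never scaled


def decay(dictionary, totals, limit=DECAY_LIMIT):
    """Halve all hit counts and scaled totals once the limit is reached.

    dictionary: {token: {"spam": hits, "innocent": hits}}
    totals:     {"spam": n, "innocent": n, "_GTOT": {"spam": n, "innocent": n}}
    Returns True if a decay pass was performed.
    """
    if totals["spam"] + totals["innocent"] < limit:
        return False
    for counts in dictionary.values():
        counts["spam"] //= 2
        counts["innocent"] //= 2
    totals["spam"] //= 2
    totals["innocent"] //= 2
    return True


dictionary = {"viagra": {"spam": 400, "innocent": 2}}
totals = {"spam": 3000, "innocent": 1000,
          "_GTOT": {"spam": 3000, "innocent": 1000}}
print(decay(dictionary, totals), totals)
```

After one pass the scaled totals sum to 2000, so the next 2000 learned messages bring the sum back to 4000 and trigger the next pass, which is what produces the doubling of weight described above.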
Database Scrubbing
I have had dspam_purge stuck in an infinite loop because of loops (corruption)
in the dictionary. I created a Python version of dspam_purge that checks for
encountering the same record again. This effectively cleaned the
dictionary. Both purge and clean need to check for encountering
the same record again while reading the old database. This is easily
done by checking for dups while writing the new database. Dspam already
rebuilds each dictionary and signature database by copying all records
to a new file during each dspam_purge and dspam_clean cycle.
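The duplicate check amounts to something like the sketch below. It is a simplified model, not my Python dspam_purge: the old database is represented as any iterable of (key, value) pairs and the new one as a dict, and the sketch assumes that a repeated key means the reader has looped back, so copying stops there.

```python
from itertools import cycle


def scrub(records):
    """Copy records into a new mapping, stopping at the first repeated key.

    records is an iterable of (key, value) pairs read from the old
    database. A corrupted database may loop forever, so seeing a key
    twice is treated as the end of the useful data.
    """
    clean = {}
    for key, value in records:
        if key in clean:
            break  # looped back to a record already written
        clean[key] = value
    return clean


# Simulate a corrupted database whose record chain loops endlessly.
corrupted = cycle([("free", 12), ("offer", 7), ("meeting", 3)])
print(scrub(corrupted))
```

The check costs one lookup per record written, which is small compared with the copy itself.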
Extended Signature State
A user can get confused when changing their mind about whether a
message is spam. It is hard to remember whether you've already
done an ADDSPAM or FALSEPOSITIVE and which one you did last.
In my Python milter based on libdspam, I plan to add a flag to the
signature database to record the last
action for a signature. The states will be NEW, SPAM, and INNOCENT.
The milter would set the state to SPAM or INNOCENT. Then
doing the equivalent of "dspam -d user --addspam" would do nothing if the
message was already in the spam state, and the equivalent of
"--falsepositive" would do nothing if the message was already in the INNOCENT
state. It would be nice for the user to query the current state given a
signature id.
I am considering having a NEW state for signatures that have not
yet been added to the statistics either way. This would be useful
for users that are not diligent in classifying all email.
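A minimal sketch of the planned state handling follows. The names (SigState, add_spam, false_positive, query) are hypothetical, the signature database is modelled as a plain dict, and treating an unknown signature id as NEW is an assumption of the sketch.

```python
from enum import Enum


class SigState(Enum):
    NEW = "NEW"
    SPAM = "SPAM"
    INNOCENT = "INNOCENT"


def query(states, sig_id):
    """Return the current state for a signature id."""
    return states.get(sig_id, SigState.NEW)


def add_spam(states, sig_id):
    """Mark a signature as spam.

    Returns True if the statistics need updating, False if the
    signature was already in the SPAM state (nothing to do).
    """
    if query(states, sig_id) is SigState.SPAM:
        return False
    states[sig_id] = SigState.SPAM
    return True


def false_positive(states, sig_id):
    """Mark a signature as innocent, with the same no-op rule."""
    if query(states, sig_id) is SigState.INNOCENT:
        return False
    states[sig_id] = SigState.INNOCENT
    return True


states = {}
print(add_spam(states, "sig1"))        # first report changes the state
print(add_spam(states, "sig1"))        # repeated report is a no-op
print(false_positive(states, "sig1"))  # user changed their mind
print(query(states, "sig1"))
```

Because each call reports whether anything changed, the caller can update the dictionary counts only on a real transition.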
Mozilla/Netscape Bundles Forwards
It is natural for users to select all their spam, then forward it
to the spam alias. Unfortunately, Mozilla combines all the messages
into a single message for forwarding. The dspam MDA finds only the first
signature tag in the combined message.
My suggestion is that the Dspam MDA should look for multiple DSPAM tags in
the email. Or perhaps, recursively scan rfc822 attachments.
In the meantime, users should use pine, or forward each spam individually
to the spam alias.
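The suggestion can be illustrated in Python with the standard email package. This is only a sketch of the idea, not a change to the dspam MDA. The tag pattern !DSPAM:hex! is an assumption about the signature format, so adjust SIGNATURE_RE to whatever your dspam version actually inserts.

```python
import re
from email import message_from_string

# Assumed tag format; check what your dspam version really writes.
SIGNATURE_RE = re.compile(r"!DSPAM:([0-9A-Fa-f]+)!")


def find_signatures(msg):
    """Return every distinct signature id found in a message.

    msg.walk() descends into multipart containers and message/rfc822
    attachments, so bundled forwards are scanned recursively.
    """
    found = []
    for part in msg.walk():
        if part.is_multipart():
            continue  # containers have no text of their own
        payload = part.get_payload(decode=True)
        if payload is None:
            continue
        text = payload.decode("latin-1")
        for sig in SIGNATURE_RE.findall(text):
            if sig not in found:
                found.append(sig)
    return found


# Two spams bundled as rfc822 attachments in a single forward.
bundle = (
    "From: user@example.com\n"
    "Subject: Fwd: spam\n"
    "MIME-Version: 1.0\n"
    'Content-Type: multipart/mixed; boundary="XX"\n'
    "\n"
    "--XX\n"
    "Content-Type: message/rfc822\n"
    "\n"
    "Subject: one\n"
    "\n"
    "buy now !DSPAM:3f1a2b!\n"
    "--XX\n"
    "Content-Type: message/rfc822\n"
    "\n"
    "Subject: two\n"
    "\n"
    "cheap pills !DSPAM:9c0d4e!\n"
    "--XX--\n"
)
print(find_signatures(message_from_string(bundle)))
```

The same function also handles forwards where the messages are concatenated inline, since it collects every tag in a part rather than stopping at the first.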
Pick one of the following. The binary RPM is the easiest, and will run
on Red Hat 7.2 or 7.3 (and probably later versions). The source RPM
contains all the required source and patches, and can be recompiled to match
your distribution. And finally, you can grab the original sources and my
patches and do it yourself.
Release 2.6.2-3 splits Python support into a sub-package, and adds a unit test
and a fix for the CORPUS bug.
Release 2.6.2.02-1 tracks dspam-2.6.2.02 from Network Dweebs. The python
support is moved to a separate source tarball (pydspam-1.0). Network Dweebs
does not want it added to the Dspam source tree. The binary package
is still called dspam-python (and is not required to use only
the C Dspam programs).
Release 2.6.2.02-2 fixes space printing loop in dspam_stats. No unit test yet.
Release 2.6.3-2 now installs CGI script for access to quarantined mail.
The RPM creates a 'dspam' user which is in the 'mail' group, and the
CGI script runs as the dspam user.
Release 2.6.4 includes an optional smart-alias feature for reporting spam. To
use, add an alias like the following:
spam: "|/usr/local/bin/addspam"
And edit /usr/local/bin/addspam to check for your local domain.
Release 2.6.4.01 includes a patch for empty input, and a fix for
boundaries with space chars.
Release 2.6.5 adds better decoding of multi-part messages.
Binary RPMs
RedHat 7.2
RedHat 7.3
AIX 4.x
Source RPMs contain the sources, patches, and spec file to build
a release of dspam from source. They can be recompiled to match your
distribution. To disable building the python package,
install the source RPM and edit the spec file.
Sources
Check RPMs
The check project provides
a simple unit testing framework for C programs. You need this to build
the DSPAM unit tests provided with the patches.