
Bayesian Message Filtering for Python

by Stuart D. Gathman
This web page is written by Stuart D. Gathman and sponsored by Business Management Systems, Inc.
Last updated Jul 18, 2003

Update: Configure patch available separately from SRPM.
Downloads, Bugs, Header Triage

This project adds Python support, bug fixes, and additional utilities to the excellent DSPAM project provided by Jonathan A. Zdziarski. Neither BMS nor Stuart Gathman is affiliated with Jonathan Zdziarski or Network Dweebs, except as enthusiastic users of their free product.

What is DSPAM? Here is an excerpt from the DSPAM project README:

DSPAM is an open-source, freely available anti-spam solution designed to combat unsolicited commercial email using Bayes' theorem of combined probabilities. The result is an administratively maintenance free system capable of learning each user's email behaviors with very few false positives.

DSPAM can be implemented in one of two ways:

  1. The DSPAM mailer-agent provides server-side spam filtering, quarantine box, and a mechanism for forwarding spams into the system to be automatically analyzed.
  2. Developers may link their projects to the dspam core engine (libdspam) in accordance with the GPL license agreement. This enables developers to incorporate libdspam as a "drop-in" for instant spam filtering within their applications - such as mail clients, other anti-spam tools, and so on.
Many of the ideas incorporated into this agent were contributed by Paul Graham's excellent white paper on combatting SPAM. Many new approaches have also been implemented by DSPAM.

Dspam RPMs

The RPMs for dspam-2.6.2 available from this web page include a Python module which wraps the dspam core engine (libdspam). Many of the dspam command line tools are reimplemented in Python to illustrate use of the library. (The Python versions are not installed by the RPM.) The RPMs also include patches to add a new libdspam entry point, _ds_tokenize(), and a new tool, dspam_anal.py, implemented in Python to illustrate its use.

The RPMs also include patches for many bug fixes to the original dspam code, and both C and Python unit tests to make sure they stay fixed. Python includes a built-in unit testing framework. The C tests use the check project. The RPM build procedure does not attempt to build or run the unit tests, so the check framework is not needed to build the RPM.

Cron entries for dspam_purge and dspam_clean are provided, as is a 'dspam' local mailer. To activate dspam with these RPMs for the version of sendmail included with RedHat, simply replace MAILER(local) with MAILER(dspam) in /etc/mail/sendmail.mc, then regenerate sendmail.cf (instructions are in the comments at the top of sendmail.mc).

Header Triage with Dspam and Python Milter

For a really powerful mail filtering system, combine the DSPAM Python module with sendmail and Python Milter. For instance, here is a simple change to milter-0.5.5 I am testing: Patch to bms.py from milter-0.5.5.

The dictionary is the one maintained by the dspam delivery agent installed with the dspam package. Scanning the headers in the milter allows us to REJECT spam connections before they've wasted all our bandwidth.
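
The triage decision itself is simple. Here is a minimal sketch of it, not the actual patch to bms.py. The function name triage_headers, the 0.9 threshold, and the string results "REJECT" and "CONTINUE" are assumptions made for illustration, and the score argument is a hypothetical stand-in for whatever call into the dspam Python module returns a spam probability for the header text.

```python
def triage_headers(headers, score, threshold=0.9, banned=("viagra",)):
    """Decide from the headers alone whether a message should be rejected.

    headers   -- list of (name, value) pairs collected so far
    score     -- hypothetical callable: header text -> spam probability (0..1)
    threshold -- assumed cutoff; tune it against your own mail
    banned    -- keywords that cause an immediate reject
    """
    text = "".join("%s: %s\n" % (name, value) for name, value in headers)
    lowered = text.lower()
    # Cheap keyword check first, so the scorer is only run when needed.
    if any(word in lowered for word in banned):
        return "REJECT"
    if score(text) >= threshold:
        return "REJECT"
    return "CONTINUE"
```

A real milter would make this decision once the headers are complete, before the message body is transferred, which is where the bandwidth saving comes from.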

To show just how bad the spam problem is, here are statistics for our domain with just 6 users. Two users (including me) are published on the web with HTML encoding. I also use my real email when posting to newsgroups. Because my email is accessible, I receive welcome email from fellow techies all over the world.

Statistics for Jul 15
1139 Messages from known spamming domains refused by sendmail.
160 Messages REJECTED by milter because of banned keywords like 'viagra'.
169 Messages REJECTED by milter because of high Dspam scores for headers.
261 Messages quarantined by Dspam mail delivery agent.
40 Actual email received for 6 users.

We do not use a black hole list for known spamming IPs / domains, because some of our customers use blacklisted ISPs that are the only broadband available in their area. Black hole lists like to blacklist entire ISPs, including innocent customers who have no other choice (other than dialup) for connectivity. With a little Python programming to collect data, DSPAM will allow us to automate building the list of banned IPs / domains.

The header triage feature will be in milter-0.5.6. I envision a complete milter-based implementation of dspam which appoints selected email destinations as 'moderators'. The MDA approach currently used by dspam requires all users to diligently classify their email to train the filter. In the new approach, moderators will do this work, and the resulting dspam dictionary will be used to filter mail for other users in their group.

Bugs

Learning Decay

Here I address a problem encountered with the Dspam approach. There needs to be some sort of decay of learned messages. Otherwise, adaptation gets less and less with each message until we're effectively not learning any more. One approach would be to periodically divide all hit counts by 2. For instance, when total messages (Spam + Innocent) reaches 4000 (or some other number substantially bigger than 1000), then divide all hits and totals in the dictionary by 2. This will give the next 2000 messages double the weight of the previous 4000. And messages 6001-8000 will have four times the weight of 1-4000, and twice the weight of 4001-6000.

Dspam_purge would be a good place to implement the decay algorithm. We might then want to add a new totals record, e.g. '_GTOT'. This would keep the real (not scaled) totals that humans are interested in.
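
A minimal sketch of the halving scheme follows. It uses plain Python dicts in place of the real dspam dictionary, so the data layout, the function names, and the use of integer division are all assumptions. Only the 4000 threshold, the divide-by-2 step, and the '_GTOT' record come from the description above.

```python
DECAY_LIMIT = 4000  # threshold suggested in the text


def record_message(totals, is_spam):
    """Count one learned message in both the scaled and the real totals."""
    key = "spam" if is_spam else "innocent"
    totals[key] += 1
    totals["_GTOT"][key] += 1  # unscaled totals, kept for humans


def decay_if_needed(tokens, totals, limit=DECAY_LIMIT):
    """Halve all hit counts and scaled totals once the total reaches limit.

    tokens maps token -> (spam_hits, innocent_hits).
    Returns True if a decay pass was performed.
    """
    if totals["spam"] + totals["innocent"] < limit:
        return False
    for token, (spam_hits, innocent_hits) in list(tokens.items()):
        # Integer division is an assumption; counts below 2 drop to 0.
        tokens[token] = (spam_hits // 2, innocent_hits // 2)
    totals["spam"] //= 2
    totals["innocent"] //= 2
    return True


tokens = {"viagra": (10, 1), "python": (0, 7)}
totals = {"spam": 2500, "innocent": 1500,
          "_GTOT": {"spam": 2500, "innocent": 1500}}
decayed = decay_if_needed(tokens, totals)
```

After the pass the scaled total is back at 2000, so another 2000 messages are learned at full weight before the next halving, while '_GTOT' still reports the real 4000.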

Database Scrubbing

I have had dspam_purge in an infinite loop because of loops (corruption) in the dictionary. I created a Python version of dspam_purge that checks for encountering the same record again. This effectively cleaned the dictionary.

The dspam_clean utility should work like dspam_purge - copy records to be retained to a new database, then delete and rename. This will clean any glitches from bugs in libdb, or abnormal terminations of the dspam MDA. Also, both purge and clean need to check for encountering the same record again while reading the old database. This is easily done by checking for dups while writing the new database.

I have had the dspam MDA in an infinite loop while trying to delete a signature because the sig database was corrupted, probably because of the empty body crasher bug in libdspam (now fixed in my version). Again, a quick Python script to copy the records to a new DB did the trick. I will create a full Python replacement for dspam_clean after my vacation.
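
The copy-and-check idea can be sketched as follows. This is not the script mentioned above. The function name scrub and the keep callback are hypothetical, and the old database is represented by any iterable of (key, value) pairs rather than a real libdb cursor.

```python
import itertools


def scrub(records, keep=lambda key, value: True):
    """Copy records to a fresh dict, stopping at the first repeated key.

    records -- iterable of (key, value) pairs read in order from the old
               database; a corrupted database may never end
    keep    -- hypothetical predicate that decides which records to retain
    """
    clean = {}
    seen = set()  # every key read, including records that are not kept
    for key, value in records:
        if key in seen:
            break  # the cursor has looped back: stop reading
        seen.add(key)
        if keep(key, value):
            clean[key] = value
    return clean


# Simulate a corrupted database whose cursor cycles forever.
looping = itertools.cycle([(b"a", 1), (b"b", 2), (b"c", 3)])
cleaned = scrub(looping)
```

The seen set also covers records that are being purged, so a loop that passes only through discarded records is still detected. Writing the result to a new file and renaming it over the old one is left to the caller.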

Extended Signature State

A user can get confused when changing their mind about whether a message is spam. It is hard to remember whether you've already done an ADDSPAM or FALSEPOSITIVE and which one you did last. I will add a flag to the signature database to record the last action for a signature. The states will be NEW, SPAM, and INNOCENT. The dspam MDA would always set the state to SPAM or INNOCENT. Then dspam --addspam would do nothing if the message was already in the SPAM state, and --falsepositive would do nothing if the message was already in the INNOCENT state. It would be nice for the user to be able to query the current state given a signature id.

I am considering having a NEW state for signatures that have not yet been added to the statistics either way. This would be useful for users that are not diligent in classifying all email.
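
The proposed state handling amounts to a small state machine. The sketch below is one possible reading of it, with hypothetical names (SigState, apply_action). It is not code from dspam, and it assumes that an action on a NEW signature always triggers training.

```python
from enum import Enum


class SigState(Enum):
    NEW = "NEW"            # not yet counted in the statistics either way
    SPAM = "SPAM"
    INNOCENT = "INNOCENT"


# User action -> state the signature should end up in.
TARGET = {"addspam": SigState.SPAM, "falsepositive": SigState.INNOCENT}


def apply_action(state, action):
    """Return (new_state, retrain) for a user action on a signature.

    retrain is False when the signature is already in the requested
    state, so repeating the same action does nothing.
    """
    target = TARGET[action]
    if state is target:
        return state, False
    return target, True
```

For example, apply_action(SigState.SPAM, "addspam") leaves the signature alone, while apply_action(SigState.SPAM, "falsepositive") moves it to INNOCENT and asks for retraining.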

Mozilla/Netscape Bundles Forwards

It is natural for users to select all their spam, then forward it to the spam alias. Unfortunately, Mozilla combines all the messages into a single message for forwarding. The dspam MDA finds only the first signature tag in the combined message.

My suggestion is that the Dspam MDA should look for multiple DSPAM tags in the email. Or perhaps, recursively scan rfc822 attachments.

In the meantime, users should use pine, or forward each spam individually to the spam alias.
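
Both suggestions (multiple tags, and recursion into rfc822 attachments) can be sketched with Python's standard email package. The tag pattern below is a placeholder, not the confirmed format used by this dspam version, and find_signatures is a hypothetical helper name.

```python
import email
import re
from email.mime.message import MIMEMessage
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText

# Placeholder pattern: substitute the tag format your dspam version emits.
TAG_RE = re.compile(r"!DSPAM:([0-9a-f]+)!")


def find_signatures(raw_message, tag_re=TAG_RE):
    """Return every signature id found anywhere in the message, in order."""
    msg = email.message_from_string(raw_message)
    found = []
    # walk() also descends into message/rfc822 attachments.
    for part in msg.walk():
        if part.is_multipart():
            continue
        payload = part.get_payload(decode=True)
        if payload is None:
            continue
        text = payload.decode("latin-1")  # latin-1 never fails to decode
        for sig in tag_re.findall(text):
            if sig not in found:
                found.append(sig)
    return found


# Two spams forwarded together as attachments of one message.
bundle = MIMEMultipart()
for sig in ("3f1a", "9bc2"):
    bundle.attach(MIMEMessage(MIMEText("buy now\n!DSPAM:%s!\n" % sig)))
signatures = find_signatures(bundle.as_string())
```

The same function handles the case where the messages are merged into one body, since it collects every match rather than stopping at the first.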

Downloads

Pick one of the following. The binary RPM is the easiest, and will run on Red Hat 7.2 or 7.3 (and probably later versions). The source RPM contains all the required source and patches, and can be recompiled to match your distribution. And finally, you can grab the original sources and my patches and do it yourself.

Release 3 splits Python support into a subpackage, and adds a unit test and fix for the CORPUS bug.

Binary RPMs

RedHat 7.2

  • dspam-2.6.2-3 RedHat 7.2 binary RPM
  • dspam-devel-2.6.2-3.i386.rpm Development headers and static library
  • dspam-python-2.6.2-3.i386.rpm Python module and utilities

RedHat 7.3

  • dspam-2.6.2-3 RedHat 7.3 binary RPM
  • dspam-devel-2.6.2-3.i386.rpm Development headers and static library
  • dspam-python-2.6.2-3.i386.rpm Python module and utilities

AIX 4.x

  • dspam-2.6.2-3.ppc.rpm AIX 4.x binary RPM
  • dspam-devel-2.6.2-3.ppc.rpm Development headers and static library
  • dspam-python-2.6.2-3.ppc.rpm Python module and utilities

Source and Source RPMs

  • dspam-2.6.2-3 Source RPM (tested on RedHat and AIX 4.1.5).
  • Patches against the original dspam-2.6.2 source Fixes bugs and adds python directory. (Will move python directory to separate pydspam project shortly.)
  • Patches to configure to compile with db3 This is in the Source RPM, but those downloading the raw source might need it also.
  • dspam-2.6.2 source (dspam site does not have archives)

Check RPMs

  • check-0.8.4 RedHat 7.x RPM
  • check-0.8.4 AIX 4.x RPM
  • check-0.8.4 source RPM
