
libdspam

Bayesian Message Filtering
or
RPMs for DSPAM
with
support for libdspam

by Stuart D. Gathman
This web page is written by Stuart D. Gathman and sponsored by Business Management Systems, Inc.
Last updated Dec 04, 2003


This project maintains RPM packages for the excellent DSPAM project provided by Jonathan A. Zdziarski, and attempts to support the libdspam API. It has been split off from a project to wrap libdspam for Python. Neither BMS nor Stuart Gathman is affiliated with Jonathan Zdziarski or Network Dweebs, except as enthusiastic users of their free product. Dspam was chosen because it provides a library with a C API in addition to a complete LDA-based spam filtering application. Python applications use the C API through an extension module.

What is DSPAM? Here is an excerpt from the DSPAM project README:

DSPAM is an open-source, freely available anti-spam solution designed to combat unsolicited commercial email using Bayes' theorem of combined probabilities. The result is an administratively maintenance free system capable of learning each user's email behaviors with very few false positives.

DSPAM can be implemented in one of two ways:

  1. The DSPAM mailer-agent provides server-side spam filtering, quarantine box, and a mechanism for forwarding spams into the system to be automatically analyzed.
  2. Developers may link their projects to the dspam core engine (libdspam) in accordance with the GPL license agreement. This enables developers to incorporate libdspam as a "drop-in" for instant spam filtering within their applications - such as mail clients, other anti-spam tools, and so on.
Many of the ideas incorporated into this agent were contributed by Paul Graham's excellent white paper on combatting SPAM. Many new approaches have also been implemented by DSPAM.

Dspam RPM

To make using dspam as convenient as possible, I provide an RPM for dspam, which uses the source code from Network Dweebs largely unchanged. RPM by its nature uses pristine sources from the vendor, and applies patches for any necessary local changes. In dspam-2.6, I added an entry point for tokenizing a message. The patches included in the RPM contain this change (not yet added to 2.8) and fixes for some bugs not yet fixed in the official source. In addition, there are some C unit tests to make sure bugs stay fixed. The C unit tests use the check project. The RPM build procedure does not attempt to build or run the unit tests, so the check framework is not needed to build the RPM. If you wish to verify dspam, you need to install the source RPM and build from the spec file. Then go to the build directory and run make -f maketest.

Configuring DSPAM after installing the RPM

The RPM automatically installs cron entries for dspam_purge and dspam_clean in the /etc/cron.weekly and /etc/cron.daily directories. There are two versions of dspam installed. The name dspam is symlinked to dspam.optout by default. Dspam processing is disabled for user 'bob' when there is a file named bob.nodspam in /var/lib/dspam. If dspam is symlinked to dspam.optin instead, then dspam always delivers mail without despamming unless a file named bob.dspam exists.
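
The opt-out and opt-in rules can be summarized in a few lines of Python. This is only an illustrative sketch, not code from dspam: the function name `dspam_enabled` and its `mode` argument are hypothetical, and it assumes that bob.dspam is looked up in the same directory as bob.nodspam.

```python
import os


def dspam_enabled(user, mode, userdir="/var/lib/dspam"):
    """Return True if mail for `user` should be filtered.

    mode "optout": filter unless <user>.nodspam exists in userdir.
    mode "optin":  filter only if <user>.dspam exists in userdir
                   (assumed to live in the same directory).
    """
    if mode == "optout":
        return not os.path.exists(os.path.join(userdir, user + ".nodspam"))
    if mode == "optin":
        return os.path.exists(os.path.join(userdir, user + ".dspam"))
    raise ValueError("mode must be 'optout' or 'optin'")
```

With the default symlink (opt-out), creating an empty /var/lib/dspam/bob.nodspam is enough to turn filtering off for bob.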

Activating DSPAM to work with sendmail

The RPM installs a 'dspam' local mailer macro for sendmail-cf. To activate dspam for the version of sendmail included with RedHat, simply replace MAILER(local) with MAILER(dspam) in /etc/mail/sendmail.mc, then regenerate sendmail.cf (instructions are in the comments at the top of sendmail.mc).

Dspam users report missed spams and false positives to a mail alias. For sendmail, aliases are typically in /etc/aliases or /etc/mail/aliases. The RPM installs two scripts which can be used for generic aliases. Add two lines like the following to sendmail aliases and run newaliases:

spam: "|/usr/local/bin/addspam"
ham: "|/usr/local/bin/falsepositive"

Using DSPAM with procmail

Dspam can be used as a filter by passing it the '--stdout' option. This can be used in .procmailrc as an alternate form of "optin".

Activating the DSPAM CGI script

The RPM installs the CGI interface in the /var/www/cgi-bin/dspam directory. A wrapper script is installed as /var/www/cgi-bin/dspam.cgi. The wrapper script runs the DSPAM CGI interface as the dspam user - which is also a member of the mail group.

To enable the CGI interface, you need to add an authorization entry to /etc/httpd/conf/httpd.conf. For example,

    ScriptAlias /cgi-bin/ "/var/www/cgi-bin/"

    #
    # "/var/www/cgi-bin" should be changed to whatever your ScriptAliased
    # CGI directory exists, if you have that configured.
    #
    <Directory "/var/www/cgi-bin">
        AuthName Dspam
        AuthType Basic
        AuthUserFile /etc/httpd/conf/passwd
        AuthGroupFile /etc/httpd/conf/group
        Require group dspam
        AllowOverride None
        Options None FollowSymLinks
        Order allow,deny
        Allow from all
    </Directory>

If you wish to use the alternate Python based CGI script from pydspam, edit the wrapper script to run dspamcgi.py.

DSPAM RPM support for Python

The dspam-python sub-package has been moved to its own pydspam RPM.

Bugs

Jonathan is focused on the dspam LDA application, and so is unwilling to consider bug reports against libdspam unless they affect the operation of the LDA application, or he is in a really good mood. If you only use the dspam LDA, then report bugs to Jonathan. However, if you use the libdspam library, you should send test cases to me also so that I can add them to the unit tests for libdspam, and include a fix in the RPMs.

Bugs in libdspam for dspam-2.6.5.2

All known bugs are fixed in the RPM, except for the media skip bug. This bug causes dspam-2.6 to attempt to tokenize large binary attachments (despite code purporting to prevent this). As a result, dspam spends an inordinate amount of time processing hundreds of thousands of tokens, and mail grinds to a halt. This makes dspam-2.6.5.2 unusable unless binary attachments are blocked by other means.

Current bugs in libdspam for dspam-2.8

The media skip bug is fixed in dspam-2.8, but dspam-2.8 is still too buggy to use in applications other than the supplied LDA (the multiple contexts bug is a showstopper for my milter application using dspam). The current list of known bugs in dspam-2.8 and their status is as follows:

| Description | Testcase? | Status |
|---|---|---|
| Memory leak when dspam_init fails | N | Fixed in 2.8.beta.2-1 and 2.8.rc.1 |
| CLASSIFY modifies memory totals | Y | Fixed in 2.8.rc.1 |
| CLASSIFY returns garbage for signature | Y | Fixed in 2.8.beta.2-1 and 2.8.rc.1 |
| Signature not initialized in dspam_init | N | Fixed in 2.8.rc.1-1 |
| Opening multiple contexts for the same user core dumps in dspam_destroy() | Y | Unresolved. Workaround: preliminary debugging shows that the problem is in libdb3_drv. Try another database driver. |
| Attempting CLASSIFY for a first time user corrupts memory | N | Workaround: call dspam_init and dspam_destroy with PROCESS to create the user before using CLASSIFY. |
| No quarantine_lock in libdspam | N | Workaround: copy the function from dspam.c into the application. Since libdspam doesn't do anything with implementing quarantine, it probably shouldn't have this function. |
| _ds_tokenize() not implemented | Y | Will reimplement |
| FEATURE: USERDIR hook for testing | Y | Added _ds_setuserdir() to simplify testing |
| BROKEN: adding a signature corpus returns an error | Y | Broken in dspam-2.8.rc.1 and dspam-2.8 stable. |

Ideas

Learning Decay

Here I address a problem encountered with the Dspam approach. There needs to be some sort of decay of learned messages. Otherwise, adaptation gets less and less with each message until we're effectively not learning any more. One approach would be to periodically divide all hit counts by 2. For instance, when total messages (Spam + Innocent) reaches 4000 (or some other number substantially bigger than 1000), then divide all hits and totals in the dictionary by 2. This will give the next 2000 messages double the weight of the previous 4000. And messages 6001-8000 will have four times the weight of 1-4000, and twice the weight of 4001-6000.

Dspam_purge would be a good place to implement the decay algorithm. We might then want to add a new totals record, e.g. '_GTOT'. This would keep the real (not scaled) totals that humans are interested in.
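
A minimal Python sketch of this decay idea follows. It is not dspam code: the in-memory dictionaries stand in for the real dictionary database, the names `learn` and `decay` are hypothetical, and the unscaled totals that a '_GTOT' record would hold are kept here under the made-up keys `gtot_spam` and `gtot_innocent`.

```python
def learn(tokens, is_spam, counts, totals):
    """Add one message to the statistics.

    counts: dict mapping token -> [spam_hits, innocent_hits]
    totals: dict with scaled totals ("spam", "innocent") and real,
            never-scaled totals ("gtot_spam", "gtot_innocent").
    """
    idx = 0 if is_spam else 1
    for token in set(tokens):
        counts.setdefault(token, [0, 0])[idx] += 1
    key = "spam" if is_spam else "innocent"
    totals[key] += 1
    totals["gtot_" + key] += 1


def decay(counts, totals, threshold=4000):
    """Halve all hits and scaled totals once spam + innocent reaches threshold.

    Returns True if a decay pass was performed. Integer division is used,
    so a hit count of 1 decays to 0.
    """
    if totals["spam"] + totals["innocent"] < threshold:
        return False
    for hits in counts.values():
        hits[0] //= 2
        hits[1] //= 2
    totals["spam"] //= 2
    totals["innocent"] //= 2
    return True
```

Calling `decay` after each `learn` (or once per purge run) keeps the scaled totals between 2000 and 4000 with the default threshold, while the `gtot_` values continue to report how many messages were really seen.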

Database Scrubbing

I have had dspam_purge in an infinite loop because of loops (corruption) in the dictionary. I created a python version of dspam_purge that checks for encountering the same record again. This effectively cleaned the dictionary. Both purge and clean need to check for encountering the same record again while reading the old database. This is easily done by checking for dups while writing the new database. Dspam already rebuilds each dictionary and signature database by copying all records to a new file during each dspam_purge and dspam_clean cycle.
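
The duplicate check can be sketched in Python as follows. This is an illustration of the idea, not the actual script: the function name `scrub` is hypothetical, and the old database is modeled as any iterable of (key, value) pairs read in the order a purge would see them.

```python
def scrub(records):
    """Copy (key, value) records into a new dict, stopping at the first repeat.

    A repeated key means the reader has looped back on itself because of
    corruption, so copying stops there instead of running forever.
    Returns (clean_records, loop_detected).
    """
    clean = {}
    for key, value in records:
        if key in clean:
            return clean, True
        clean[key] = value
    return clean, False
```

Stopping at the first repeat, rather than skipping it, is a deliberate choice in this sketch: if the repeat comes from a loop in the file, skipping would never terminate.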

Extended Signature State

A user can get confused when changing their mind about whether a message is spam. It is hard to remember whether you've already done an ADDSPAM or FALSEPOSITIVE and which one you did last. In my python milter based on libdspam, I plan to add a flag to the signature database to record the last action for a signature. The states will be NEW, SPAM, and INNOCENT. The milter would set the state to SPAM or INNOCENT. Then doing the equivalent of "dspam -d user --addspam" would do nothing if the message was already in the SPAM state, and the equivalent of "--falsepositive" would do nothing if the message was already in the INNOCENT state. It would be nice for the user to be able to query the current state given a signature id.

I am considering having a NEW state for signatures that have not yet been added to the statistics either way. This would be useful for users that are not diligent in classifying all email.
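
The intended behaviour could look roughly like the Python sketch below. It is a plan illustrated in code, not the milter itself: the class `SignatureStates` and its method names are hypothetical, and a plain dictionary stands in for the flag in the signature database.

```python
NEW, SPAM, INNOCENT = "NEW", "SPAM", "INNOCENT"


class SignatureStates:
    """Remembers the last action recorded for each signature id."""

    def __init__(self):
        self._state = {}

    def query(self, sig_id):
        """Return the current state; unknown signatures are NEW."""
        return self._state.get(sig_id, NEW)

    def record(self, sig_id, state):
        """Set the state, e.g. when the milter first classifies a message."""
        if state not in (NEW, SPAM, INNOCENT):
            raise ValueError("unknown state: %r" % (state,))
        self._state[sig_id] = state

    def addspam(self, sig_id):
        """Return True if the statistics need updating, False if already SPAM."""
        if self.query(sig_id) == SPAM:
            return False
        self._state[sig_id] = SPAM
        return True

    def falsepositive(self, sig_id):
        """Return True if the statistics need updating, False if already INNOCENT."""
        if self.query(sig_id) == INNOCENT:
            return False
        self._state[sig_id] = INNOCENT
        return True
```

The caller would only invoke the real retraining step when `addspam` or `falsepositive` returns True, so repeating the same report is harmless.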

Mozilla/Netscape Bundles Forwards

It is natural for users to select all their spam, then forward it to the spam alias. Unfortunately, Mozilla combines all the messages into a single message for forwarding. The dspam MDA finds only the first signature tag in the combined message.

My suggestion is that the Dspam MDA should look for multiple DSPAM tags in the email, or perhaps recursively scan rfc822 attachments.

In the meantime, users should use pine, or forward each spam individually to the spam alias.
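
As a rough illustration of that suggestion, the Python sketch below collects every signature tag in a message, including tags inside rfc822 attachments. It is not how the dspam MDA works today. The tag pattern `!DSPAM:...!` is an assumption about the tag format and should be adjusted to whatever your dspam version actually emits, and the function name `find_signatures` is hypothetical.

```python
import email
import re

# Assumed tag format; change this to match the tags your dspam inserts.
TAG_RE = re.compile(r"!DSPAM:([0-9A-Za-z]+)!")


def find_signatures(raw_message):
    """Return all signature ids in a message, in order, without duplicates."""
    msg = email.message_from_string(raw_message)
    found = []
    # walk() descends into multipart containers and message/rfc822 parts.
    for part in msg.walk():
        if part.is_multipart():
            continue  # container only; its children are visited separately
        payload = part.get_payload(decode=True)
        if payload is None:
            continue
        text = payload.decode("latin-1")  # latin-1 never fails to decode
        for sig in TAG_RE.findall(text):
            if sig not in found:
                found.append(sig)
    return found
```

Each id returned could then be processed as if it had been forwarded on its own.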

Downloads

Pick one of the following. The binary RPM is the easiest, and will run on Red Hat 7.2 or 7.3 (and probably later versions). The source RPM contains all the required source and patches, and can be recompiled to match your distribution. And finally, you can grab the original sources and my patches and do it yourself.

Release 2.8.beta.2-1 is the first release of 2.8 that passes unit testing (except for the bugs listed above, but they should not affect the dspam LDA).

Release 2.6.5.2-4 includes pydspam-1.1.4, and increments the missed count when adding a spam corpus via signature. It has the media skip bug, which may be a showstopper.

Binary RPMs

RedHat 7.2

  • dspam-2.8-1.i386.rpm RedHat 7.2 binary RPM
  • dspam-devel-2.8-1.i386.rpm Development headers and static library
  • dspam-2.8.rc.1-1.i386.rpm RedHat 7.2 binary RPM
  • dspam-devel-2.8.rc.1-1.i386.rpm Development headers and static library
  • dspam-2.8.beta.2-1.i386.rpm RedHat 7.2 binary RPM
  • dspam-devel-2.8.beta.2-1.i386.rpm Development headers and static library
  • dspam-2.6.5.2-4.i386.rpm RedHat 7.2 binary RPM
  • dspam-devel-2.6.5.2-4.i386.rpm Development headers and static library
  • dspam-python-2.6.5.2-4.i386.rpm Python module and utilities for pydspam-1.1.4

RedHat 7.3

  • dspam-2.8.beta.2-1.i386.rpm RedHat 7.3 binary RPM
  • dspam-devel-2.8.beta.2-1.i386.rpm Development headers and static library
  • dspam-2.6.5.2-2.i386.rpm RedHat 7.3 binary RPM
  • dspam-devel-2.6.5.2-2.i386.rpm Development headers and static library
  • dspam-python-2.6.5.2-2.i386.rpm Python module and utilities

AIX 4.x

  • dspam-2.6.5.2-2.ppc.rpm AIX 4.x binary RPM
  • dspam-devel-2.6.5.2-2.ppc.rpm Development headers and static library
  • dspam-python-2.6.5.2-2.ppc.rpm Python module and utilities

Source RPMs

Source RPMs contain the sources, patches, and spec file to build a release of dspam from source. They can be recompiled to match your distribution.

  • dspam-2.8-1.src.rpm Source RPM (tested on RedHat 7.x)
  • dspam-2.8.rc.1-1.src.rpm Source RPM (tested on RedHat 7.x)
  • dspam-2.8.beta.2-1.src.rpm Source RPM (tested on RedHat 7.x)
  • dspam-2.6.5.2-4.src.rpm Source RPM (tested on RedHat 7.x and AIX 4.1.5) with pydspam-1.1.4

Patches

  • Patches against the original dspam-2.8 source.
  • Patches against the original dspam-2.8.rc.1 source.
  • Patches against the original dspam-2.8.beta.2 source, including a CVS snapshot from the DSPAM page to fix some CLASSIFY bugs.
  • Patches against the original dspam-2.6.5.2 source
  • Patches to configure to compile with any version of db >= 3, beginning with dspam-2.6.5. This is in the Source RPM, but those downloading the raw source might need it also.

Check RPMs

The check project provides a simple unit testing framework for C programs. You need this to build the DSPAM unit tests provided with the patches.

  • check-0.8.4 RedHat 7.x RPM
  • check-0.8.4 AIX 4.x RPM
  • check-0.8.4 source RPM
