This project provides Python support for fast, sophisticated Bayesian message filtering. It is based on the excellent DSPAM project provided by Jonathan A. Zdziarski. I have moved the RPMs for dspam to a separate project. Neither BMS nor Stuart Gathman is affiliated with Jonathan Zdziarski or Network Dweebs, except as enthusiastic users of their free product. Dspam was chosen because it provides a library with a C API in addition to a complete MDA-based spam filtering application. Python applications use the C API through an extension module. Using a C library is faster than a pure Python Bayesian filter.
What is DSPAM? Here is an excerpt from the DSPAM project README:
DSPAM is an open-source, freely available anti-spam solution designed to combat unsolicited commercial email using Bayes' theorem of combined probabilities. The result is an administratively maintenance free system capable of learning each user's email behaviors with very few false positives.
DSPAM can be implemented in one of two ways:
- The DSPAM mailer-agent provides server-side spam filtering, quarantine box, and a mechanism for forwarding spams into the system to be automatically analyzed.
- Developers may link their projects to the dspam core engine (libdspam) in accordance with the GPL license agreement. This enables developers to incorporate libdspam as a "drop-in" for instant spam filtering within their applications - such as mail clients, other anti-spam tools, and so on.
Many of the ideas incorporated into this agent were contributed by Paul Graham's excellent white paper on combatting SPAM. Many new approaches have also been implemented by DSPAM.
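As a rough illustration of the "combined probabilities" idea in the excerpt, the sketch below combines per-token spam probabilities using the naive Bayes rule from Paul Graham's "A Plan for Spam". It is not DSPAM's actual code: the function name and the sample token probabilities are invented for illustration, and it assumes every probability lies strictly between 0 and 1.

```python
from math import prod  # Python 3.8+


def combined_spam_probability(token_probs):
    """Combine per-token spam probabilities into one message score.

    Uses the rule from Paul Graham's "A Plan for Spam":
        P = prod(p) / (prod(p) + prod(1 - p))
    Assumes each p is strictly between 0 and 1.
    """
    token_probs = list(token_probs)
    spam = prod(token_probs)
    ham = prod(1.0 - p for p in token_probs)
    return spam / (spam + ham)


# Hypothetical per-token probabilities for a single message.
probs = {"viagra": 0.99, "free": 0.90, "python": 0.05}
score = combined_spam_probability(probs.values())
print(f"combined spam probability: {score:.3f}")
```

A token with probability 0.5 leaves the score unchanged, so only tokens that lean clearly toward spam or toward legitimate mail move the result.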
build_python macro to 0 at the top of the RPM spec file in the source RPM.
Beginning with pydspam-1.1.5, pydspam is its own RPM which obsoletes dspam-python. The dspam-python or pydspam binary RPM provides a Python module which wraps the dspam core engine (libdspam). Some of the dspam command line tools are reimplemented in Python to illustrate use of the library. (Installed as documentation by the RPM.) A new tool, pydspam_anal.py, shows the contribution each token of a message makes to the total DSPAM score.
In dspam-2.8, pydspam has its own RPM.
The dictionary is the one maintained by the dspam delivery agent installed with the dspam package. Scanning the headers in the milter allows us to REJECT common spams without a lot of processing.
To show just how bad the spam problem is, here are statistics for our domain with just 6 users. Two users (including me) are published on the web with HTML encoding. I also use my real email when posting to newsgroups. Because my email is accessible, I receive welcome email from fellow techies all over the world.
Statistics for Jul 15 | |
---|---|
1139 | Messages from known spamming domains refused by sendmail. |
160 | Messages REJECTED by milter because of banned keywords like 'viagra'. |
169 | Messages REJECTED by milter because of high Dspam scores for headers. |
261 | Messages quarantined by Dspam mail delivery agent. |
40 | Actual email received for 6 users. |
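The two milter REJECT stages counted in the table can be sketched roughly as follows. This is only an illustration using the standard library: it does not use the milter or pydspam APIs, the `score_headers` callable stands in for whatever produces the Dspam header score, and the keyword list and threshold are hypothetical values.

```python
from email import message_from_string

BANNED_KEYWORDS = {"viagra"}  # hypothetical list; 'viagra' is the example from the table
REJECT_THRESHOLD = 0.9        # hypothetical cutoff for the header score


def triage_headers(raw_message, score_headers, threshold=REJECT_THRESHOLD):
    """Decide from the headers alone whether to reject a message.

    score_headers is a stand-in callable that takes the header text and
    returns a spam probability between 0 and 1.
    """
    msg = message_from_string(raw_message)

    # Stage 1: cheap keyword check on the Subject header.
    subject = (msg.get("Subject") or "").lower()
    if any(word in subject for word in BANNED_KEYWORDS):
        return "REJECT (banned keyword)"

    # Stage 2: score the headers only, without touching the body.
    header_text = "\n".join(f"{name}: {value}" for name, value in msg.items())
    if score_headers(header_text) >= threshold:
        return "REJECT (header score)"

    return "ACCEPT"


raw = "From: someone@example.com\nSubject: Cheap VIAGRA now\n\nbody text\n"
print(triage_headers(raw, score_headers=lambda headers: 0.0))
```

Running the keyword check first means the scorer is never called for the most obvious spam.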
We do not use a black hole list for known spamming IPs / domains. This is because some of our customers use blacklisted ISPs because they are the only broadband available in their area. Black hole lists like to blacklist entire ISPs, including innocent customers who have no other choice (other than dialup) for connectivity. With a little Python programming to collect data, DSPAM will allow us to automate building the list of banned IPs / domains.
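One possible shape for that data collection is sketched below. It is purely illustrative: the record format (sender domain plus a spam/not-spam flag), the function name, and the thresholds are assumptions, not part of pydspam or the milter.

```python
from collections import Counter


def build_ban_list(records, min_spam=20, max_ham=0):
    """Return the sender domains that look safe to ban.

    records is an iterable of (sender_domain, is_spam) pairs, for example
    collected from filter results. A domain is listed only if it sent at
    least min_spam spams and no more than max_ham legitimate messages.
    """
    spam = Counter()
    ham = Counter()
    for domain, is_spam in records:
        (spam if is_spam else ham)[domain.lower()] += 1
    return sorted(
        domain
        for domain, count in spam.items()
        if count >= min_spam and ham[domain] <= max_ham
    )


# Hypothetical sample data: one pure spam source, one ISP with a real customer.
records = (
    [("spammer.example", True)] * 25
    + [("isp.example", True)] * 30
    + [("isp.example", False)] * 2
)
print(build_ban_list(records))
```

Requiring zero legitimate messages by default keeps an ISP off the list as long as at least one real correspondent uses it, which is the failure of black hole lists described above.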
The header triage feature will be in milter-0.5.6. I envision a complete milter-based implementation of dspam which appoints selected email destinations as 'moderators'. The MDA approach currently used by dspam requires all users to diligently classify their email to train the filter. In the new approach, moderators will do this work, and the resulting dspam dictionary will be used to filter mail for the other users in their group.
/var/www/cgi-bin/dspam directory. A wrapper script is installed as /var/www/cgi-bin/pydspam.cgi. The wrapper script runs the DSPAM CGI interface as the dspam user - which is also a member of the mail group.
To enable the CGI interface, you need to add an authorization entry to /etc/httpd/conf/httpd.conf. For example,

    ScriptAlias /cgi-bin/ "/var/www/cgi-bin/"

    #
    # "/var/www/cgi-bin" should be changed to whatever your ScriptAliased
    # CGI directory exists, if you have that configured.
    #
    <Directory "/var/www/cgi-bin">
        AuthName Dspam
        AuthType Basic
        AuthUserFile /etc/httpd/conf/passwd
        AuthGroupFile /etc/httpd/conf/group
        Require group dspam
        AllowOverride None
        Options None FollowSymLinks
        Order allow,deny
        Allow from all
    </Directory>

You must also modify the script at /var/www/cgi-bin/dspam/dspamcgi.py to change the DOMAIN configuration to your domain at a minimum.