Spam detection using Naive Bayes

Saturday, November 24 2007, 4:27
For over a year now, my website has been running rather effective anti-spam software written for the most part by yours truly. The amount of data it had accumulated over that time, however, was filling up my database quota, and something needed to be done about it.
I patched up the way the system works (it is based on Naive Bayes statistical analysis) to make it distinguish between HTML markup and normal text. I had noticed that the classifier was mixing up completely unrelated things because it treated a word inside a tag and the same word in ordinary prose as one and the same token.
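To illustrate the idea, here is a minimal sketch in Python. It is not the code running on this site; the `tokenize` function, the `NaiveBayes` class, the `html:`/`text:` prefixes, and the add-one smoothing are all assumptions made for the example. The point is only that a word found inside a tag and the same word found in prose become two separate features.

```python
import math
import re
from collections import Counter

TAG_RE = re.compile(r"<[^>]*>")      # anything that looks like an HTML tag
WORD_RE = re.compile(r"[a-z0-9]+")   # simple lowercase word tokens


def tokenize(message):
    """Return tokens prefixed with the context they were found in (hypothetical scheme)."""
    lowered = message.lower()
    tokens = []
    # Words inside tags (tag names, attributes, URL parts) count as markup.
    for tag in TAG_RE.findall(lowered):
        tokens += ["html:" + word for word in WORD_RE.findall(tag)]
    # Everything left after removing the tags counts as normal text.
    for word in WORD_RE.findall(TAG_RE.sub(" ", lowered)):
        tokens.append("text:" + word)
    return tokens


class NaiveBayes:
    """Tiny two-class Naive Bayes classifier with add-one smoothing (illustrative only)."""

    def __init__(self):
        self.counts = {"spam": Counter(), "ham": Counter()}
        self.docs = {"spam": 0, "ham": 0}

    def train(self, message, label):
        self.counts[label].update(tokenize(message))
        self.docs[label] += 1

    def spam_probability(self, message):
        vocab = set(self.counts["spam"]) | set(self.counts["ham"])
        total_docs = sum(self.docs.values())
        tokens = tokenize(message)
        log_scores = {}
        for label in ("spam", "ham"):
            total = sum(self.counts[label].values())
            # Smoothed class prior, then one smoothed likelihood per token.
            score = math.log((self.docs[label] + 1) / (total_docs + 2))
            for token in tokens:
                score += math.log(
                    (self.counts[label][token] + 1) / (total + len(vocab) + 1)
                )
            log_scores[label] = score
        # Normalise in log space to avoid underflow on long messages.
        top = max(log_scores.values())
        weights = {k: math.exp(v - top) for k, v in log_scores.items()}
        return weights["spam"] / (weights["spam"] + weights["ham"])
```

With this split, a comment that merely talks about the `href` attribute no longer shares a feature with a comment that actually contains a link, so training on link spam does not drag legitimate discussion of HTML toward the spam class.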
Spam statistics
To keep a better eye on current spam trends, I set up a nifty little page where the tracker now shows the status of the unceasing battle against spam here at
