Spam Attack? Bayesian Filters to the Rescue!
Spam today represents as much as 85% of all e-mail sent and received, according to the U.N.'s International Telecommunication Union, up from approximately 35% just one year ago. Spam is a truly global phenomenon, not limited to the U.S. If an e-mail address exists, sooner or later the spammers will find it.
To help fight spam, the industry has come up with a long list of solutions, some very good, others less so (see Sidebar 1 below). An excellent -- and free -- solution has emerged from the work of an English minister who died in 1761, the Reverend Thomas Bayes. I have spent months testing commercial products and services based on what is today known as Bayesian Filtering. I have found these free solutions to be not only less expensive, but also more accurate, more reliable, and -- at least in some cases -- less taxing on system resources than other types of spam-fighting products.
Bayesian Filtering 101
The man behind Bayesian Filtering, Thomas Bayes, was born in London in 1702. He received an education in logic and theology, became a Presbyterian minister, and died in 1761. Two years after his death, the Philosophical Transactions of the Royal Society of London published an essay that Bayes had written many years earlier. The subject was probability inference, a way to calculate the probability that an event will occur in the future, based on the number of times that event has or has not occurred in the past.
Bayes' work remains controversial to this very day, but one thing is certain: Although we have no idea why Thomas Bayes was trying to determine rules of probability, nearly 250 years later, his theory has become the cheapest and most effective means available to determine whether an e-mail message is spam or a legitimate message.
Other spam filters must be shown a sample, or signature, of each spam message by a human being before they can recognize it -- this is the same way that anti-virus software works. By contrast, Bayesian spam filters "learn" based on human feedback. The more often you correct a Bayesian filter today, the less often you'll need to correct it in the future.
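To make the "learning" idea concrete, here is a minimal sketch of a word-counting Bayesian filter. It is a toy written for illustration, not the algorithm used by any product named in this article; the class name TinyBayesFilter, the "spam" and "ham" labels, and the smoothing choices are all my own assumptions.

```python
import math
from collections import Counter


class TinyBayesFilter:
    """Toy word-count Bayesian filter (illustrative only)."""

    def __init__(self):
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        self.message_counts = {"spam": 0, "ham": 0}

    def train(self, text, label):
        # Each human correction adds the message's words to that label's tallies.
        self.message_counts[label] += 1
        self.word_counts[label].update(text.lower().split())

    def spam_probability(self, text):
        vocab = set(self.word_counts["spam"]) | set(self.word_counts["ham"])
        total_msgs = sum(self.message_counts.values())
        log_score = {}
        for label in ("spam", "ham"):
            # Prior: smoothed share of training messages carrying this label.
            score = math.log((self.message_counts[label] + 1) / (total_msgs + 2))
            total_words = sum(self.word_counts[label].values())
            for word in text.lower().split():
                # Add-one smoothing so an unseen word never zeroes the result.
                count = self.word_counts[label][word] + 1
                score += math.log(count / (total_words + len(vocab) + 1))
            log_score[label] = score
        # Turn the two log scores into P(spam | words); clamp to avoid overflow.
        diff = max(min(log_score["ham"] - log_score["spam"], 700.0), -700.0)
        return 1.0 / (1.0 + math.exp(diff))


spam_filter = TinyBayesFilter()
spam_filter.train("cheap pills buy now", "spam")
spam_filter.train("meeting agenda for monday", "ham")
print(spam_filter.spam_probability("buy cheap pills"))
```

Every call to train() plays the role of one human correction. The more messages of each kind the filter has seen, the more lopsided the word tallies become, and the less often its verdict needs fixing.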
Because the products I recommend here are completely free, you probably won't find them reviewed in trade journals, and you certainly won't see them advertised. But if you're looking for a free, effective way to protect your customers' systems against spam attacks, read on.
The following software packages all use Bayesian filters to help block spam. The products below are completely free, with no spyware, adware or other hidden agendas. They offer equivalent levels of protection and have similar features. The main difference between them is the platform they are designed to work with and the interface. For example, SpamBayes integrates with Microsoft Outlook better than the others do, while PopFile integrates with any POP3-based e-mail software. Mozilla offers an alternative to Microsoft's virus-prone e-mail software and also includes a Bayesian filter.
Because Bayesian filters "learn," you can teach them how to handle all of your e-mail. For example, you could train your Bayesian filter to treat messages that your anti-virus software has identified as infected the same way it treats spam, automatically removing them from your inbox.
While they all work from the same guiding principle, they differ in various ways. Let's look at each separately.
SpamBayes
SpamBayes, an open-source project inspired by Paul Graham's essay "A Plan for Spam," is a free Bayesian e-mail filter designed to work with Mac OS, Linux/Unix, and Microsoft Outlook. While it's possible to get SpamBayes to work with Outlook Express, some technical skill will be required to install it correctly. For that reason, I recommend that Outlook Express users instead use PopFile or switch to the Mozilla e-mail program, mentioned below.
The SpamBayes interface and installation for Microsoft Outlook are the most intuitive and easiest to use of the free Outlook-based filters I've tested.
The SpamBayes developers say they have refined Graham's Bayesian approach to improve its accuracy. But I have found all Bayesian filtering techniques in the products mentioned in this article to work equally well when using the latest versions offered of each product.
SpamBayes is also the main ingredient in many other products, including commercial e-mail filters. One example is the $28 Inboxer software, which can be used to determine whether a message is spam.
You can find the fully-functional and completely free SpamBayes software here.
PopFile
For people who check their e-mail using Outlook Express, Outlook, Netscape, Incredimail, Eudora, or any other POP3-based software, PopFile is a great free piece of software that utilizes Bayesian filtering techniques. PopFile, written by my new hero, John Graham-Cumming, contains no advertising or spyware. It is completely and totally free.
PopFile runs in the background and sits "virtually" between your e-mail program and your e-mail server. Modifying PopFile settings requires opening a shortcut that launches your Internet browser. The PopFile interface may seem awkward at first, but the learning curve is short for both the user and the filter.
PopFile also offers statistical information through its HTML-based interface. In the example below, you can see that PopFile has mistakenly classified only 275 messages out of a total 26,809 messages it checked. That's an accuracy rate of just under 99 percent.
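For anyone who wants to check that figure, the arithmetic can be sketched in a few lines. The variable names are mine; the two counts are the ones quoted above.

```python
misclassified = 275      # messages PopFile classified incorrectly
total_checked = 26_809   # all messages PopFile checked

accuracy = 1 - misclassified / total_checked
accuracy_percent = round(accuracy * 100, 2)
print(f"{accuracy_percent}%")  # prints 98.97%
```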
PopFile, along with instructions on how to install it, can be found in the POPfile store. For even more detailed installation instructions visit this Mezz.com POPfile page.
Outclass
Because many people find the interface to PopFile to be awkward, another gentleman, A. Gandhi, wrote an add-on program called Outclass, which integrates PopFile into Outlook's toolbar so the user no longer has to exit or minimize Outlook to train PopFile.
If you add the Outclass add-in to Outlook, the default HTML-based PopFile interface is disabled, which also disables the statistical information. The Outclass interface replaces it and includes all of the PopFile controls.
Outclass can be finicky to install and configure. So I recommend learning how to use PopFile before attempting to install this Outlook add-in.
Outclass, a great, free add-on to PopFile, can be found here.
Mozilla
Many of my customers want to stay away from Microsoft's Outlook and Outlook Express. They've seen, heard of, and experienced many e-mail viruses that target those ultra-popular programs. As a result, many have chosen to use Mozilla to check their e-mail. The good news is that since version 1.3, Mozilla has included a built-in Bayesian e-mail filter. As of this writing, Mozilla is at version 1.7.
Mozilla is an Internet Explorer/Outlook Express replacement that is -- you guessed it -- completely free of charge. Mozilla can be found here.
Sidebar 1: Other Ideas -- Both Good and Bad -- for Spam Prevention
I. Some Bad Ideas
Many companies have proposed ways to prevent spam, including a "caller ID for e-mail" system, which would require the entire world to change their servers to be compliant. But what are the odds of that happening?
Other proposals include charging a small fee per e-mail sent, say 1/4 of a penny, or $0.0025 per e-mail. The idea is that this fee would mean spammers would no longer find it cost-effective to send out millions of messages. The fee would create a bill larger than any profits resulting from sales related to the spam messages. At the same time, the logic goes, ordinary users won't mind paying a dime or a quarter for a month's worth of messages if it means no more spam. But the first problem with this model is that it assumes spammers are honest, which many are not. Many spam messages are sent using stolen accounts or using false accounts created with stolen credit cards. Not only will spam not stop, but you'll get a bill for it! The second problem with this proposal is the question of who receives the money.
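To put rough numbers on the fee idea, here is a small sketch of the arithmetic. The function names are hypothetical, and the one-million-message run is only an illustrative volume, not a measured figure.

```python
from decimal import Decimal

FEE_PER_MESSAGE = Decimal("0.0025")  # the 1/4-cent fee quoted above


def sending_cost(message_count: int) -> Decimal:
    """Total fee, in dollars, for sending message_count e-mails."""
    return FEE_PER_MESSAGE * message_count


def messages_for(budget: str) -> int:
    """How many e-mails a dollar budget (given as a string) would cover."""
    return int(Decimal(budget) / FEE_PER_MESSAGE)


print(sending_cost(1_000_000))  # fee for an illustrative million-message run
print(messages_for("0.10"))     # messages covered by a dime
print(messages_for("0.25"))     # messages covered by a quarter
```

At this rate a million-message run costs $2,500, while a dime covers 40 messages and a quarter covers 100. Of course, the bill only deters a spammer who is actually the one paying it.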
Also, I am a firm believer that a person should not have to pay in order to not receive something. Typically in life, you pay only when you want something in return. I realize the governments of the United States and Europe pay farmers to not grow certain kinds of crops. I've told the government I promise to not grow any grain, but I've yet to receive a check! So, accordingly, I don't believe anyone should have to pay to not receive spam messages they don't want and never asked for in the first place!
Another proposal is to have a PC solve a complex equation each time a message is sent. When the PC sends one or two e-mails, the equation would simply add a few seconds to the transmission time. But when the PC is used to send, say, a million messages, the computer would be so busy solving these equations that it would take days, weeks, or even months before all the messages were sent. Of course, as computers get faster, this proposal simply won't work. If the equations are made harder to keep pace with new technology, people with older hardware trying to send legitimate e-mail will experience severe delays as their old Pentium II PC struggles to solve an equation that would take seconds on a future Pentium 5 or Pentium 6 platform. If the equations are not made harder, the scheme defeats its own purpose, and spam messages will once again start to fly.
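The proposal doesn't say what kind of equation would be used. One common way to build such a puzzle is a hash search, so the sketch below assumes that approach: the sender must find a number (a "nonce") that makes a SHA-256 hash of the message fall below a target. The function names and the difficulty_bits setting are my own illustration, not part of any actual proposal.

```python
import hashlib
from itertools import count


def _hash_value(message: str, nonce: int) -> int:
    # Hash the message together with the candidate nonce, as one big integer.
    digest = hashlib.sha256(f"{message}:{nonce}".encode()).digest()
    return int.from_bytes(digest, "big")


def solve_puzzle(message: str, difficulty_bits: int) -> int:
    """Sender's job: try nonces until the hash has difficulty_bits leading zero bits."""
    target = 1 << (256 - difficulty_bits)
    for nonce in count():
        if _hash_value(message, nonce) < target:
            return nonce


def check_puzzle(message: str, nonce: int, difficulty_bits: int) -> bool:
    """Receiver's job: a single hash is enough to verify the sender's work."""
    return _hash_value(message, nonce) < (1 << (256 - difficulty_bits))


nonce = solve_puzzle("Hello from a legitimate sender", 12)
print(check_puzzle("Hello from a legitimate sender", nonce, 12))  # prints True
```

Each extra difficulty bit doubles the average number of guesses the sender needs. That is exactly the dial that would have to keep turning as PCs get faster, and exactly what would punish the owner of an older machine.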
II. The Best Proposal Yet?
The only reasonable way to stop spam before it even attempts to arrive in your inbox is also the simplest, the cheapest, and the easiest to implement. It is completely transparent to the end user, and it does not need to be constantly updated to keep up with new technology.
Here's the idea: Place a limit on the number of e-mails any one person can send in a 24-hour period of time. Hotmail implemented this procedure last year and set a limit of 100 e-mails per person, per day. I personally feel that 100 outgoing messages may be too low for some people who send a lot of e-mail throughout the day. So I would recommend an outgoing message limit of 1,000. Either way, the underlying point is the same.
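Here is a minimal sketch of how a mail provider might enforce such a cap. It is not Hotmail's implementation; the DailySendLimiter class is hypothetical, and for simplicity it counts per calendar day rather than over a rolling 24-hour window.

```python
from collections import defaultdict
from datetime import date


class DailySendLimiter:
    """Hypothetical per-account cap on outgoing messages per calendar day."""

    def __init__(self, daily_limit=1000):
        self.daily_limit = daily_limit
        self._sent = defaultdict(int)  # (account, day) -> messages sent so far

    def try_send(self, account, today=None):
        """Return True and count the message if the account is under its cap."""
        day = today or date.today()
        key = (account, day)
        if self._sent[key] >= self.daily_limit:
            return False  # cap reached; refuse until the next day
        self._sent[key] += 1
        return True


limiter = DailySendLimiter(daily_limit=1000)
print(limiter.try_send("user@example.com"))  # prints True
```

The ordinary user never notices the check. Only an account trying to push past the limit gets turned away.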
When a spammer sends 1 million messages and gets 33 sales in return, they do so with the push of a single button. That one button inconveniences 999,967 people, but the profits on 33 sales make it worth the spammer's while.
If the per-day cap were in effect, that same spammer, in order to send that same 1 million messages, would need to steal or create 1,000 accounts, assuming each account could send a maximum of 1,000 messages per day. That's a lot of work for 33 sales. It would also require much more time, skill, and effort. The likelihood of your e-mail address being included in a million spam messages is much greater than the likelihood of it being included in a thousand.
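The account math is easy to check. The sketch below assumes the spammer wants the whole run sent within a single day; the function name is mine.

```python
def accounts_needed(total_messages: int, daily_limit: int) -> int:
    """Accounts required to send total_messages in one day under a per-account cap."""
    # Ceiling division: a partly used account still counts as a whole account.
    return -(-total_messages // daily_limit)


print(accounts_needed(1_000_000, 1_000))  # prints 1000
print(accounts_needed(1_000_000, 100))    # prints 10000
```

At the 100-message cap, the same run would take 10,000 accounts.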
Sidebar 2: Why It's Called 'Spam'
The term spam was taken from the 1970s British TV-comedy troupe, Monty Python. One Python sketch featured a husband and wife visiting a restaurant in which everything on the menu contains Spam, the processed meat product. As the waitress explains all the Spam-enhanced meals to the husband and wife, a group of men, absurdly dressed as Vikings, begin to sing about Spam. The wife says she doesn't like Spam and attempts to ask the waitress if the restaurant offers any dishes that don't contain Spam. But the Vikings start singing louder and louder, drowning out the voices of everyone around them. (This may not sound funny unless you understand British humor.)
The first encounters with spam messages occurred in online newsgroups. Large amounts of unwanted advertisements -- in many cases, the same ads repeated over and over -- interrupted the normal flow of conversation. These ads made further discussion difficult, even impossible. Users inconvenienced by these ads remembered Monty Python's Vikings singing about Spam so often and so loudly that other people in the restaurant could not carry on a conversation. Thus, everyone started referring to these junk messages as spam.
Early on, some people wanted to give junk e-mail an official name, such as UCE (unsolicited commercial e-mail) or UBE (unsolicited bulk e-mail). The folks at Hormel, the company that makes Spam meat products, aren't pleased that their brand name was borrowed to describe an e-mail nuisance. But the term spam has stuck.
CAREY HOLZMAN is president of Discount Computer Repair in Glendale, Ariz., and the author of The Healthy PC: Preventative Care And Home Remedies For Your Computer (McGraw-Hill Osborne, 2003).
