
Bayesian Filtering

Implementing Bayesian Classifier with GEE Whiz

Bayesian Filtering is one of the most effective heuristic techniques available for detecting spam. It estimates the "probability" that an email is spam by comparing the email against a known corpus of SPAM (bad email) and HAM (good email). Much of its value comes from the fact that Bayesian Classifier Filtering can be customised to use your own email libraries rather than a "general" library. There is a wealth of information about Bayesian Filtering on the Internet. For more information, do a search on Google or go to: http://www.paulgraham.com/
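To make the idea of a per-email "probability" concrete, here is a minimal Python sketch in the style of the approach Paul Graham describes in his essays. It is an illustration only and is not taken from GEE Whiz: the function names, the 0.4 value for unseen tokens, the doubling of HAM counts, and the 0.01/0.99 clamps follow Graham's published description in simplified form, not anything the product documents.

```python
from math import prod


def token_spam_probability(spam_count, ham_count, n_spam, n_ham):
    """Estimate P(spam | token) from how often a token occurs in each corpus.

    spam_count / ham_count: messages in each corpus that contain the token.
    n_spam / n_ham: total number of messages in each corpus.
    """
    if spam_count == 0 and ham_count == 0:
        return 0.4  # assumed neutral-ish value for a token never seen before
    spam_freq = min(1.0, spam_count / n_spam)
    # HAM counts are doubled to bias the filter against false positives.
    ham_freq = min(1.0, 2 * ham_count / n_ham)
    p = spam_freq / (spam_freq + ham_freq)
    # Clamp so that a single token can never be decisive on its own.
    return max(0.01, min(0.99, p))


def combine(probabilities):
    """Combine per-token probabilities into one score for the whole email."""
    spam = prod(probabilities)
    ham = prod(1.0 - p for p in probabilities)
    return spam / (spam + ham)


if __name__ == "__main__":
    # A token seen in 90 of 1,000 SPAM messages and 5 of 1,000 HAM messages.
    p = token_spam_probability(90, 5, 1000, 1000)
    print(round(p, 3))
    print(round(combine([p, p, 0.2]), 3))
```

A token that appears mostly in SPAM pushes the combined score toward 1, a token that appears mostly in HAM pushes it toward 0, and the quality of both corpora therefore determines how well the filter performs.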

Using the Default Set of Bayesian Tokens

GEE Whiz ships with a default set of Bayesian tokens based on a library of known SPAM and HAM used by SpamAssassin. By enabling the "Use Default Bayes Tokens" option, you can immediately take advantage of enhanced spam detection. When you enable the default Bayesian option, GEE Whiz applies the information in the DBAYES.DAT file in the SYS:GEE\TMPLTS directory. To read the DBAYES.DAT file, open it with Internet Explorer. You can revert to the default tokens at any time by selecting the "Use Default Bayes Tokens" link.

  1. Enable Bayesian Classifier in the GEE Whiz administration interface.
  2. Click the "Use Default Bayes Tokens" link.
  3. Submit.
  4. You can see the number of tokens (approx. 44,000) and the default SpamAssassin scores that are assigned to Bayesian Filtering. These default scores might not be appropriate for your environment. To change them, go to the "Ruleset" link, search (Ctrl+F) for the word "Bayes" and make the necessary changes to each Bayes value that you find. We have found that raising the positive values (closer to your spam detection threshold) and bringing the negative values closer to zero is quite useful in fine-tuning GEE Whiz's ability to catch spam.
  5. Do NOT use the "Teach Bayesian Classifier" option unless you have read the information below about how to customise Bayesian Classifier Filtering.  
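
The score adjustment described in step 4 is simple arithmetic, and the following Python sketch shows one way to reason about it before editing the ruleset by hand. Everything here is hypothetical: the rule names, the starting scores, the threshold of 5.0, and the `pull` factor are made-up illustrations, not values or names taken from the GEE Whiz ruleset.

```python
def adjust_bayes_scores(scores, threshold, pull=0.5):
    """Move positive scores toward the threshold and negative scores toward zero.

    pull is the fraction of the remaining distance to close (0 = no change,
    1 = positive scores reach the threshold and negative scores reach zero).
    """
    adjusted = {}
    for rule, score in scores.items():
        if score > 0:
            # Never push a score past the detection threshold.
            gap = max(0.0, threshold - score)
            new_score = score + pull * gap
        else:
            new_score = score * (1 - pull)
        adjusted[rule] = round(new_score, 2)
    return adjusted


if __name__ == "__main__":
    # Hypothetical rule names and scores, for illustration only.
    current = {"BAYES_LOW": -2.0, "BAYES_MID": 1.0, "BAYES_HIGH": 3.0}
    print(adjust_bayes_scores(current, threshold=5.0))
```

With these made-up inputs, the negative score is halved and each positive score closes half the distance to the threshold, so a strong Bayes hit contributes more toward a spam verdict while a HAM-like result subtracts less.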

Customising Advanced Textual Classifier

To customise Bayesian Classifier Filtering, you will need a library of known good mail (HAM) and known bad mail (SPAM). This mail must be in raw MIME format. The best way to gather these emails is to create shared folders called SPAM and HAM and ask a limited number of "trusted" users to move emails into the appropriate folders. The users need to be "trusted" because you are relying on them to put the right types of emails into the right shared folders. You don't want them putting SPAM in the HAM folder or vice versa. Another way to gather the emails is to give your users one of the freeware utilities that export emails from GroupWise and have them export their emails directly to shared SPAM and HAM directories on your server.

Following are links to two GroupWise email export programmes. ExportSpam was developed by Michael Bell, the developer of Guinevere. The second programme, GWMime822, was sent to us by email with no indication of who developed it.

After gathering a sufficient number of SPAM and HAM emails (500 each):

  1. Create the following directories in the GEE Whiz installation directory (by default SYS:GEE):
     
    SYS:GEE\BAYES
    SYS:GEE\BAYES\HAM
    SYS:GEE\BAYES\SPAM
  2. Export/copy the SPAM emails to the SPAM directory and the HAM emails to the HAM directory. Our testing has shown that you require a minimum of 2,500 SPAM and HAM emails for the Bayesian Classifier to be effective. The more emails you have in the corpus (especially HAM emails), the more accurate the Bayesian Classifier filtering will be.
  3. Select the "Teach Bayesian Classifier" link in the GEE Whiz administration interface. GEE Whiz scans the emails in the SPAM and HAM directories and creates a new set of tokens, and the GEE Whiz Bayesian Learner replaces the default Bayes tokens with these custom tokens. Depending on the speed of your server, the number of emails, and how busy your server is, this may take from 30 seconds to five minutes. Email continues to be processed while the Bayesian Classifier is learning. In our testing, it took five minutes to read 10,000 SPAM and 2,500 HAM emails on a 667 MHz PIII with 512 megabytes of RAM.
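
Steps 1 and 2 can be checked with a short script before you select the teach link. The Python sketch below creates the BAYES, HAM and SPAM directories under a root path and counts the files in each. It assumes the GEE Whiz installation directory is reachable from the machine running the script (for example through a mapped drive); the function names and the `gee_root` argument are hypothetical, and the 2,500 figure is read here as a per-directory minimum, which is an interpretation of the text above.

```python
from pathlib import Path

MIN_MESSAGES = 2500  # assumed per-directory reading of the minimum cited above


def prepare_corpus_dirs(gee_root):
    """Create BAYES\\HAM and BAYES\\SPAM under the installation directory."""
    bayes = Path(gee_root) / "BAYES"
    dirs = {"HAM": bayes / "HAM", "SPAM": bayes / "SPAM"}
    for directory in dirs.values():
        directory.mkdir(parents=True, exist_ok=True)
    return dirs


def corpus_counts(dirs):
    """Count the message files currently in each corpus directory."""
    return {
        name: sum(1 for entry in directory.iterdir() if entry.is_file())
        for name, directory in dirs.items()
    }


def ready_to_teach(counts, minimum=MIN_MESSAGES):
    """Return True only if every corpus holds at least `minimum` messages."""
    return all(count >= minimum for count in counts.values())


if __name__ == "__main__":
    import sys

    # Pass the path at which the installation directory is visible locally.
    dirs = prepare_corpus_dirs(sys.argv[1])
    counts = corpus_counts(dirs)
    print(counts, "ready" if ready_to_teach(counts) else "not enough mail yet")
```

Running the check first avoids teaching the classifier from an empty or lopsided corpus, which would replace a working token set with a poor one.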

Each time you select the "Teach Bayesian Classifier" link, GEE Whiz replaces the previous set of tokens, reads the SPAM and HAM emails and creates a new BAYES.DAT file that contains the token information (SYS:GEE\TMPLTS\BAYES.DAT). To read the file, open it with Internet Explorer. You can update the tokens by going through steps 2 and 3 above.

Note: When you "Teach Bayesian Classifier", you replace the previous Bayesian tokens (either the default tokens or the previously "taught" tokens). Do not "Teach Bayesian Classifier" if you do not have emails in the SYS:GEE\BAYES\SPAM and SYS:GEE\BAYES\HAM directories. You can always go back to the default set of Bayesian tokens by selecting the "Use Default Bayes Tokens" link. This replaces your custom Bayes tokens with the default set that ships with GEE Whiz.
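
The replace-not-append behaviour in the note above can be pictured with a small Python sketch of a learner. This is not the GEE Whiz implementation and says nothing about the format of BAYES.DAT; the tokeniser, the function names, and the refusal to run on an empty corpus are assumptions made for illustration.

```python
import re
from collections import Counter

# Assumed tokeniser: runs of letters, digits, dollar signs, apostrophes, hyphens.
TOKEN_RE = re.compile(r"[A-Za-z0-9$'-]+")


def tokens_in(message):
    """Return the distinct lower-cased tokens found in one raw message."""
    return {token.lower() for token in TOKEN_RE.findall(message)}


def teach(spam_messages, ham_messages):
    """Build a fresh token table; nothing from an earlier run is carried over.

    Returns a dict mapping token -> (messages in SPAM, messages in HAM).
    """
    if not spam_messages or not ham_messages:
        raise ValueError("both corpora must contain messages before teaching")
    spam_counts, ham_counts = Counter(), Counter()
    for message in spam_messages:
        spam_counts.update(tokens_in(message))
    for message in ham_messages:
        ham_counts.update(tokens_in(message))
    all_tokens = spam_counts.keys() | ham_counts.keys()
    return {t: (spam_counts[t], ham_counts[t]) for t in all_tokens}


if __name__ == "__main__":
    table = teach(
        ["Buy cheap pills", "cheap pills now"],
        ["Meeting at noon"],
    )
    print(table["cheap"], table["meeting"])
```

Because the table is rebuilt from whatever is in the two corpora at the time of the run, removing emails from the directories and teaching again also removes their influence on the tokens.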

Enable Auto-Bayes

The Auto-Bayes option has been removed from GEE Whiz 2.0. In versions that still include it, do NOT use Auto-Bayes if you are using the Default Bayes Tokens or have implemented the Teach Bayesian Classifier process. Although Auto-Bayes might be a good way to "prime" your Bayesian filtering, be aware that applying it replaces the default tokens or the "taught" tokens. Auto-Bayes automatically schedules "Teach Bayesian Classifier" to run once a night and add the new emails to the tokens.


© 2010 Omni Technology Solutions, Inc. All Rights Reserved. All trademarks are property of their respective owners.