| Technical description of how SpamDefy works
SpamDefy is a spam filtering system that collects email from a user's POP3 mailbox, filters it, and stores the filtered email in its own POP3 mailbox for the user to retrieve. The user therefore simply changes their email client's settings to download email from SpamDefy instead of from their original mailbox, tells SpamDefy the details of their original mailbox, and filtering is then transparent. Users can also specify up to 3 additional POP3 mailboxes for SpamDefy to download email from, allowing advanced users to have greater flexibility.
The actual filtering that SpamDefy performs is broken up into three parts: virus filtering, whitelisting, and spam filtering.
Virus filtering is performed on all incoming email. Any email that is flagged as containing a virus is simply discarded. This is because the vast majority of virus-laden emails are sent without the sender's knowledge or consent, and most of them in fact have forged or invalid sender addresses, so attempting to notify the sender causes more problems than it solves.
Whitelisting is the next step. If the sender address is listed in the SpamDefy user's "whitelist" of acceptable email addresses, the email is delivered with no further checking. This whitelist is maintained automatically by SpamDefy's spam filter, but can also be manually modified by the user through the web interface.
The final step is the spam filter, which consists of several layers. All incoming email is first checked against a reliable blacklist to make sure it has not been sent from a machine that is either operated by a known spammer or known to have been compromised by a trojan or other malware. If the email fails this check, it is immediately discarded. Next, the message is checked against the user's "blocked words" list; if it contains any words or word pairs found in this list, it is treated as possible spam.
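The stages described so far, up to but not including the statistical filter, can be sketched roughly as follows. This is only an illustration of the ordering given in the text, not SpamDefy's actual code: the names `Message`, `UserSettings`, `Verdict` and `pre_filter` are hypothetical, the virus scan is reduced to a boolean flag, and the blacklist is assumed to be a simple set of IP addresses.

```python
from dataclasses import dataclass, field
from enum import Enum
import re


class Verdict(Enum):
    DISCARD = "discard"            # virus found, or sending host is blacklisted
    DELIVER = "deliver"            # whitelisted sender, no further checking
    POSSIBLE_SPAM = "possible"     # matched the user's blocked-words list
    CONTINUE = "continue"          # pass on to the statistical filter


@dataclass
class Message:
    sender: str
    source_ip: str
    body: str
    has_virus: bool = False        # stand-in for a real virus scanner's result


@dataclass
class UserSettings:
    # Both sets are assumed to hold lower-case entries.
    whitelist: set = field(default_factory=set)
    blocked_words: set = field(default_factory=set)  # words or "word pair" entries


def words_and_pairs(text):
    """Return the set of lower-cased words and adjacent word pairs in text."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    pairs = {f"{a} {b}" for a, b in zip(words, words[1:])}
    return set(words) | pairs


def pre_filter(msg, user, blacklisted_ips):
    """Apply the checks in the order the text describes them."""
    if msg.has_virus:
        return Verdict.DISCARD
    if msg.sender.lower() in user.whitelist:
        return Verdict.DELIVER
    if msg.source_ip in blacklisted_ips:
        return Verdict.DISCARD
    if words_and_pairs(msg.body) & user.blocked_words:
        return Verdict.POSSIBLE_SPAM
    return Verdict.CONTINUE
```

The sketch leaves open what happens to a message marked as possible spam, since the text does not say; it only reports that verdict to the caller.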
Finally, the email is passed through an advanced Bayesian statistical filter which makes an estimate of whether the message is spam or not. The Bayesian filter understands MIME, and intelligently deals with non-textual parts of messages. It also understands HTML, and identifies several tell-tale ways in which spammers use HTML to try to fool normal Bayesian filters. Once the filter has broken a message down into plain text, stripped the HTML, and converted the binary parts into plaintext checksums, it tokenises the remainder into words, word pairs, and "special" tokens (which identify tricks typical of spammers, such as nonsense words, references to external images, HTML comments within words, etc.). These tokens are then looked up in both the system-wide spam database and the user's own individual spam database, and statistical analysis is performed to determine the "spamminess" of the message.
If the filter says that the message looks like spam, then it is put into the "discarded" queue instead of being delivered. If a message is delivered from the "discarded" queue because of user intervention, the Bayesian filter is retrained automatically so that it learns that this sort of message is not spam. Conversely, messages which expire or are discarded from the "discarded" queue automatically train the Bayesian filter in the other direction, so it knows that this sort of message is definitely spam.
At any point the SpamDefy user can view the contents of the "discarded" queue using the web interface, and manually mark messages as being spam or not spam. They may also review their recently delivered email (even after downloading and removing it from their POP3 mailbox) and mark any of those messages as spam, which retrains the filter. In this way the user can tell SpamDefy about any spam that "slipped through".
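To make the tokenising and scoring steps more concrete, here is a minimal sketch. The text does not say which statistical method SpamDefy uses, so the code assumes a common naive-Bayes combination of per-token probabilities with add-one smoothing. The names `tokenise`, `TokenDB` and `spamminess`, the special-token labels and the regular expressions are all hypothetical, and MIME decoding and checksumming of binary parts are left out.

```python
import math
import re
from collections import Counter

# Hypothetical detectors for two of the tricks mentioned in the text.
HTML_COMMENT_IN_WORD = re.compile(r"\w<!--.*?-->\w", re.DOTALL)
EXTERNAL_IMAGE = re.compile(r"<img[^>]+src=[\"']?https?://", re.IGNORECASE)


def tokenise(html_or_text):
    """Return words, word pairs and 'special' tokens for one message body."""
    tokens = []
    if HTML_COMMENT_IN_WORD.search(html_or_text):
        tokens.append("*html-comment-in-word*")
    if EXTERNAL_IMAGE.search(html_or_text):
        tokens.append("*external-image*")
    # Remove comments entirely so split words rejoin, then blank out tags.
    plain = re.sub(r"<!--.*?-->", "", html_or_text, flags=re.DOTALL)
    plain = re.sub(r"<[^>]+>", " ", plain)
    words = re.findall(r"[a-z0-9']+", plain.lower())
    tokens.extend(words)
    tokens.extend(f"{a} {b}" for a, b in zip(words, words[1:]))
    return tokens


class TokenDB:
    """Counts how many spam and non-spam messages each token appeared in."""

    def __init__(self):
        self.spam = Counter()
        self.ham = Counter()
        self.spam_msgs = 0
        self.ham_msgs = 0

    def train(self, tokens, is_spam):
        (self.spam if is_spam else self.ham).update(set(tokens))
        if is_spam:
            self.spam_msgs += 1
        else:
            self.ham_msgs += 1

    def token_spam_prob(self, token):
        # Add-one smoothing keeps the result strictly between 0 and 1.
        p_spam = (self.spam[token] + 1) / (self.spam_msgs + 2)
        p_ham = (self.ham[token] + 1) / (self.ham_msgs + 2)
        return p_spam / (p_spam + p_ham)


def spamminess(tokens, db):
    """Combine per-token probabilities into one score between 0 and 1."""
    log_odds = 0.0
    for token in set(tokens):
        p = db.token_spam_prob(token)
        log_odds += math.log(p) - math.log(1 - p)
    log_odds = max(-50.0, min(50.0, log_odds))  # avoid overflow in exp()
    return 1 / (1 + math.exp(-log_odds))
```

Under these assumptions, retraining amounts to another call to `train`: a message the user rescues from the "discarded" queue would be trained with `is_spam=False`, and one that expires there with `is_spam=True`.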
The system-wide spam database is automatically updated from time to time by analysing the activity of all users, so even a user who never retrains their own personal spam filter will still receive automatic updates over time "behind the scenes". However, a user's own database always takes precedence over the system-wide one, so the user always has the final say over what is and is not spam when retraining their filter. |
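One way to read "takes precedence" is as a per-token override: if the user's database has an entry for a token, that entry is used, and the system-wide database is consulted only as a fallback. The text does not describe the actual mechanism, so the sketch below is an assumption. The function name `token_probability`, the dictionary layout and the neutral default of 0.5 are all hypothetical.

```python
def token_probability(token, user_db, system_db, unknown=0.5):
    """Look up a token's spam probability, preferring the user's own data.

    Both databases are assumed to map tokens to a spam probability
    between 0 and 1. Tokens seen in neither get a neutral value.
    """
    if token in user_db:
        return user_db[token]
    return system_db.get(token, unknown)


# Example: the user has taught the filter that "newsletter" is fine,
# even though the system-wide data considers it spammy.
system_db = {"newsletter": 0.9, "invoice": 0.2}
user_db = {"newsletter": 0.1}
user_view = token_probability("newsletter", user_db, system_db)
fallback_view = token_probability("invoice", user_db, system_db)
```

With this reading, updates to the system-wide database change results only for tokens the user has never trained on.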