Antispam filtering is a "must-have" on a mail server nowadays because of the amount of spam and viruses that flood our mailboxes. There are several techniques used to filter spam, each with its advantages and disadvantages.
As with any Antispam system, there is a risk of false positives (rejecting good messages). Our implementation doesn't discard any messages without notifying the sender or the recipient. Messages are filtered during the SMTP session instead of post-SMTP to avoid bouncing messages to innocent bystanders.
Although our implementation is sendmail-specific, most of the techniques mentioned here are applicable to mail servers running Exim, Postfix or Qmail. A multi-stage approach was adopted to reduce the processing overhead.
At the first stage, we enforce some RFC compliance to catch misbehaving clients. Some spam engines do not wait for the SMTP banner before sending commands. A pre-greeting check catches them.
This adds a three-second delay before the mail server displays the banner.
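In sendmail, a pre-greeting delay of this kind is normally configured with the stock greet_pause feature. The fragment below is a minimal sketch for sendmail.mc rather than the exact listing from our configuration; the 3000 ms value is inferred from the three-second delay described above.

```m4
dnl Wait 3000 ms before sending the 220 banner; a client that
dnl sends SMTP commands during the pause is rejected.
FEATURE(`greet_pause', `3000')dnl
```

If the access database is enabled, per-host overrides can be set there with GreetPause: entries, for example to exempt trusted relays from the delay.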
Setting the needmailhelo flag blocks SMTP clients which do not send an EHLO or HELO.
The SMTP session fails if the SMTP client sends an unqualified client name or our server name as the EHLO/HELO argument.
The sender's email address must resolve in DNS. Furthermore, the connection is rejected if the sender's email address resolves to a "bad" MX record.
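The three checks above map onto stock sendmail settings. The following sendmail.mc fragment is a sketch, assuming sendmail 8.14 or later, where the block_bad_helo and badmx features are available. It is not a copy of our production file, and the feature defaults should be checked against the cf/README shipped with your version.

```m4
dnl Refuse MAIL until the client has sent EHLO or HELO.
define(`confPRIVACY_FLAGS', `needmailhelo')dnl
dnl block_bad_helo relies on delay_checks, so enable that first.
FEATURE(`delay_checks')dnl
dnl Reject an unqualified EHLO/HELO argument or one of our own names.
FEATURE(`block_bad_helo')dnl
dnl Reject senders whose domain resolves to a "bad" MX record.
FEATURE(`badmx')dnl
```

If confPRIVACY_FLAGS is already defined elsewhere in the file, needmailhelo should be added to the existing comma-separated list instead of defining it a second time. Sendmail already rejects sender domains that do not resolve unless the accept_unresolvable_domains feature is enabled, so that check needs no extra line.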
It is a waste of resources to process messages if the recipient email address is invalid. This also causes backscatter when mail filtering is performed on a mail gateway or backup MX. We wrote a milter called Scam-backscatter to verify mailboxes on the remote server where the messages will be delivered.
Most mail servers use DNSBLs (Domain Name System Block Lists) to reject SMTP connections from spammers. The choice of DNSBL is important, as we are letting a third party decide which messages our mail server should accept. If the DNSBL is not working correctly, this can cause mail delays or, in the worst case, loss of mail.
We chose ZEN, which is a combination of all Spamhaus blocklists (SBL, XBL, PBL).
FEATURE(`dnsbl', `zen.spamhaus.org', `"Rejected - See http://www.spamhaus.org/ZEN"')
The DNSBL may also block mail from our own customers. We can prevent that by requiring all customers to use SMTP authentication (SMTP AUTH). Mail from our customers is submitted through the MSA port (TCP 587).
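Requiring authentication on the submission port can be expressed in sendmail.mc roughly as follows. This is a sketch under the assumption that SASL support is already compiled in and configured; the daemon name MSA is only a label.

```m4
dnl Replace the default MSA so its options can be set explicitly.
FEATURE(`no_default_msa')dnl
dnl Listen on TCP 587; M=E disallows ETRN, M=a requires SMTP AUTH.
DAEMON_OPTIONS(`Port=587, Name=MSA, M=Ea')dnl
```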
Most Greylisting implementations use a tuple of the sender address, the recipient address and the IP address from which the SMTP connection originated. The first delivery attempt for an unknown tuple is temporarily rejected, which may delay mail delivery. Some spammers have adapted to Greylisting by retrying delivery, which makes it less effective than before. However, it is still useful as it reduces the load caused by message filtering.
We developed a Greylisting milter called Scam-grey aimed at detecting botnets (mostly compromised Windows-based computers). Our implementation detects the operating system of the sending SMTP client and creates a tuple of the EHLO/HELO argument and the IP address from which the connection came. It can also catch connections from hosts with dynamic IP addresses and hosts without reverse DNS. A whitelist is used to prevent valid Windows-based mail servers from being blocked.
The Greylisting milter also passes Operating System and ASN (Autonomous System Number) information to the filtering stage. This milter has also helped in rejecting connections from Virus-infected computers.
Antivirus filtering was previously a costly addition to a mail server. With the advent of ClamAV, a free, open source antivirus package, mail servers have a cost-effective alternative and can prevent the spread of viruses by scanning all messages before sending them out.
We scan all incoming and outgoing mail for viruses using ClamAV. The messages are passed to ClamAV using a milter.
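A milter is registered in sendmail.mc with INPUT_MAIL_FILTER. The sketch below assumes the clamav-milter daemon is listening on a local socket; the filter name clamav, the socket path and the timeouts are illustrative values, not taken from our configuration.

```m4
dnl S= is the milter socket; F=T tempfails mail if the filter is
dnl unavailable; T= sets the send (S) and read (R) timeouts.
INPUT_MAIL_FILTER(`clamav', `S=local:/var/run/clamav/clamav-milter.sock, F=T, T=S:4m;R:4m')dnl
```

With F=T, mail is deferred rather than delivered unscanned when the scanner is down. Leaving the F= flag empty would let mail through in that situation instead.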
Content-filtering is done at the final stage as it involves more processing overhead. We use SpamAssassin, a free, open source antispam filter, in this implementation. Our milter passes each message to SpamAssassin, which assigns it a score. The message is tagged or rejected based on the score. The scoring is done on a per-user basis as preferences vary from one user to another.
SpamAssassin performs rule-based tests using regular expressions on message headers and bodies, runs DNS blocklist tests and checks URLs found in the message against SURBL. Any spam message which has made it through the previous stages should be caught at this stage.
SpamAssassin requires more CPU and memory than the earlier stages and can cause significant load issues if it is not configured correctly. We sometimes have to add custom rules as spammers devise ways to get through Antispam filters. This comes at a cost to the server as it means more processing overhead.
Nowadays, there is a rash of image-based spam as spammers try to circumvent traditional text-filtering techniques. We have not implemented optical character recognition (OCR) because of the computational cost involved. We can catch image spam using the Bayesian filter that is included in SpamAssassin. Bayesian filtering can be very effective if trained correctly.
Avoiding False Positives
The specifics of a mail site vary depending on the user-base. It is better to adopt a conservative approach to mail filtering until you understand how your users are affected. The SpamAssassin scores can be fine-tuned based on whether they generate false positives at your site.
We use DomainKeys to assign a negative score to messages sent by sites with which our users regularly communicate. We also sign all our outgoing mail so that it can be verified at the receiving mail server.
Up to now, we have focused on rejecting mail. Whitelisting can help in avoiding False Positives. It should be used in a targeted way, i.e. only whitelist addresses which you trust.
Messages which are rejected are also saved in a quarantine area which the recipient may access. The mail logs are reviewed daily and rejections are checked for False Positives.
Four in five messages were determined to be spam. The estimated rates are 0.05% for False Negatives and 0.008% for False Positives. These summary figures may vary from site to site as they depend on the user-base and the domain profile.