Combating Email Harvester Robots — Email Obfuscation

Enigma website development

Combat Email Harvester Robots with Our ISO, Hex or Mixed Code Email Obfuscator


Email Obfuscator — Complete MailTo Address

[Interactive form: enter an email address to generate a complete obfuscated MailTo link; copy to clipboard (IE only)]

In the same way that search engine spiders crawl the Web grabbing site pages, examining them and filing them in huge databases ready for retrieval by surfers using Google, Yahoo! or any other search engine, email harvester spiders traverse the Web looking for email signatures in web pages — often for the sole purpose of building large databases to sell on as spam email address lists. Here are a few methods of email obfuscation.

If you consider these obfuscators useful, throwing a link to this page would be much appreciated.

Using the robots.txt File

Good, legitimate search engines observe the rules their spiders find in a file called 'robots.txt', placed in the root directory of a site. It contains instructions about where a spider may and may not go within the site structure, and which pages (and directories of pages and files) it should and should not retrieve for indexing.

Email Obfuscator — Address only

[Interactive form: enter an email address and select ISO, Hex or Mixed output; copy to clipboard (IE only)]

For instance —

User-agent: *
Disallow: /cgi-bin/
Disallow: /jscript/
Disallow: /new_layout/
Disallow: /other_images/
Disallow: /template/
Disallow: email_ty.htm

says that all search engine spiders are welcome [ User-agent: * ] but that they need not visit certain directories [ Disallow: /cgi-bin/ ] (or the pages therein), and that they should not retrieve the contents of one page in the root [ Disallow: email_ty.htm ].
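A polite crawler's handling of those Disallow rules can be sketched roughly as follows. This is a deliberately simplified, hypothetical illustration (simple prefix matching, no wildcards, a single User-agent: * group), not how any particular engine implements it:

```javascript
// Simplified sketch: return false if the given path falls under any
// Disallow rule in the robots.txt text, true otherwise.
function isAllowed(robotsTxt, path) {
  var lines = robotsTxt.split("\n");
  for (var i = 0; i < lines.length; i++) {
    var m = lines[i].match(/^Disallow:\s*(\S+)/i);
    // Prefix match: "/cgi-bin/" disallows everything beneath it.
    if (m && path.indexOf(m[1]) === 0) return false;
  }
  return true;
}

var rules = "User-agent: *\nDisallow: /cgi-bin/\nDisallow: /template/";
// isAllowed(rules, "/cgi-bin/form.cgi") is false; "/index.html" is true.
```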

Robot Meta Statements

You can achieve a similar effect using meta headers like —

<meta name="robots" content="index,follow">
<meta name="robots" content="noindex,follow">
<meta name="robots" content="index,nofollow">
<meta name="robots" content="noindex,nofollow">

illustrating permutations of the theme, but nowadays it's better practice to adopt the robots.txt standard (there's some debate over how long the meta directives will be honoured, or indeed whether many legitimate spiders honour them now).

But...

Dealing with Rogue Email Harvesters

Rogue bots simply ignore the in-/exclusion directives because they want to examine all site pages for email addresses.

Webmasters have developed various techniques for combating theft of email addresses, some more effective than others. For instance, JavaScript can be used —

<script type="text/javascript">
//<![CDATA[
var email = "enquiries";
var domain = "seowebsitepromotion.com";
document.write('<a href="' + 'mail' + 'to:' + email + '@' + domain + '?subject=General%20Enquiry' + '">' + email + '@' + domain + '</a>');
//]]>
</script>

The code uses variables to hold the parts of the email address, then writes them out to the page as a coherent address visible to the surfer but (hopefully) jargon to the often simplistic retrieval mechanisms of a spider looking for a recognisable structure like name@mydomain.com. Great — provided your visitor has JavaScript enabled...
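A variant on the script above is to build the link after the page loads rather than via document.write. This is a hypothetical sketch, not the page's own code: the element id "contact" and the guard around document are my assumptions.

```javascript
// Assemble the mailto URL from parts so no complete address appears in the markup.
function mailtoHref(user, domain, subject) {
  return "mailto:" + user + "@" + domain + "?subject=" + encodeURIComponent(subject);
}

// Insert the link once the page has loaded (guarded so the snippet
// also parses outside a browser).
if (typeof document !== "undefined") {
  document.addEventListener("DOMContentLoaded", function () {
    var holder = document.getElementById("contact"); // assumes <span id="contact"></span> in the page
    if (!holder) return;
    var user = "enquiries", domain = "seowebsitepromotion.com";
    var link = document.createElement("a");
    link.href = mailtoHref(user, domain, "General Enquiry");
    link.appendChild(document.createTextNode(user + "@" + domain));
    holder.appendChild(link);
  });
}
```

The same caveat applies: a visitor without JavaScript sees no address at all, so consider a fallback such as a contact form.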

Email Obfuscation Methods

Another method is to obfuscate the address using a combination of hexadecimal and/or decimal (ISO) HTML character entities in place of the letters which make up an email address.

This is what the second form on the page does. Drop in your email address, then select ISO, Hex or Mixed output to produce an encoded address which will remain invisible to the majority of email harvester robots. The top form is similar but generates a complete MailTo link, with the option to change the on-screen link text, mouse-over title and email subject line.
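The entity-encoding idea behind those forms can be sketched like this (my own minimal illustration, not the site's actual obfuscator code): each character becomes a decimal or hexadecimal HTML character reference, which browsers render normally but a naive harvester fails to recognise as an address.

```javascript
// Encode every character of an address as an HTML character reference.
// mode: "iso" (decimal), "hex" (hexadecimal) or "mixed" (alternating).
function obfuscateEmail(address, mode) {
  var out = "";
  for (var i = 0; i < address.length; i++) {
    var code = address.charCodeAt(i);
    var useHex = mode === "hex" || (mode === "mixed" && i % 2 === 1);
    out += useHex ? "&#x" + code.toString(16) + ";" : "&#" + code + ";";
  }
  return out;
}

// obfuscateEmail("a@b", "iso") yields "&#97;&#64;&#98;"
```

Paste the output into a mailto: href or the page body; the browser decodes the entities, the visitor sees a normal address.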

A six-month study found that email addresses encoded in this way and posted on the Web received no junk email.

Note: Email addresses are not stored, they're simply processed for you. We don't spam.

Spoofing Email Addresses to Clobber Harvesters

It is of course possible to take the battle to the enemy. There are a number of PERL, JavaScript and, as in the following instance, ASP scripts available which serve up bait pages containing contrived junk MailTo addresses. strap.asp (be warned, the page takes a while to generate) is an interpretation of a script found at Bureau of the Bizarre. It uses a number of routines to simulate the rhythm of copy interspersed with anywhere between 300 and 1000 unique junk email addresses and an appropriate number of random paragraphs, rehashed and delivered afresh each time the page is visited. Simply drop such a page in your site root directory and link to it.
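The core of such a bait page can be sketched in a few lines of JavaScript. The word lists and counts here are illustrative assumptions of mine, not the strap.asp originals:

```javascript
// Hypothetical junk-address generator for a harvester bait page.
var NAMES = ["info", "sales", "admin", "webmaster", "contact"];
var HOSTS = ["example", "nowhere", "devnull", "blackhole"];
var TLDS  = ["com", "net", "org"];

function pick(list) {
  return list[Math.floor(Math.random() * list.length)];
}

// Build one plausible-looking but worthless address, e.g. "admin4821@nowhere.org".
function junkAddress() {
  return pick(NAMES) + Math.floor(Math.random() * 10000) +
         "@" + pick(HOSTS) + "." + pick(TLDS);
}

// Produce a batch of junk mailto links to intersperse with filler copy.
function junkMailtoLinks(count) {
  var links = [];
  for (var i = 0; i < count; i++) {
    var addr = junkAddress();
    links.push('<a href="mailto:' + addr + '">' + addr + "</a>");
  }
  return links;
}
```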

As with all such pages, it's wise to include a robots exclusion meta header - <meta name="robots" content="noindex,nofollow"> - and a similar exclusion line in the robots.txt file, Disallow: strap.asp, to avoid the page being indexed by legitimate engines which follow exclusion directives. We don't want to populate legitimate search engine databases with junk.

Of course, the page could be modified to reference itself under a fresh URL each time, trapping email harvesters in a tar pit ad infinitum...

Thwarting Form Submission Bots with Hidden Form Fields

For some time rogue submission bots have trawled the Web seeking out unprotected forms. This has escalated enormously with the blogging phenomenon, with personal and professional sites offering (often unmonitored and automated) user feedback dialogue. Submission bots simply populate form fields with their masters' junk and trigger the submit button.

A method which endeavours to identify automated spam submission bots is the use of a hidden field on a form. <input type="hidden" name="gotcha" /> might be such a field, which your parsing script (JavaScript or, preferably, server-side validation) can interrogate upon submission. If you detect a populated field value, you know a valid visitor cannot have completed the field and it's safe to kill the submission (and send the offending user agent to a bot trap). I've not tested the efficacy of this method but have heard reports that it works well against the less intelligent submission spam bots.
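The server-side check is a one-liner in any language. A sketch in JavaScript, where the plain object stands in for whatever submitted form data your handler actually receives, and the field name matches the hidden input above:

```javascript
// Honeypot check: a genuine visitor never sees the hidden "gotcha" field,
// so any non-empty value means a bot filled in the form.
function isHoneypotTriggered(fields) {
  return typeof fields.gotcha === "string" && fields.gotcha.trim() !== "";
}
```

On a positive result, discard the submission (or redirect the user agent to your bot trap page).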

Of course, it's possible to program a spam bot to ignore hidden fields and the submission would get through, so a potential fail-safe might be employed in your detection script. Bots often populate fields with the target's domain name in an effort to legitimize the post. Enigma's domain is www.seowebsitepromotion.com, and a bot would strip the 'www.' or 'http://www.' URL prefix, add a junk name and '@' to simulate a valid email address then use that spoofed address to populate some of the form's fields.

You can no doubt see where I'm going: parse all non-email fields for your bare domain name, like 'seowebsitepromotion', ignoring the URL prefix and the TLD (Top Level Domain) suffix, '.com', and away you go...
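That fail-safe can be sketched as below. The field names and the domain are illustrative; the list of email fields to skip is an assumption, since a genuine visitor's own address field may legitimately contain your domain if they have an account there:

```javascript
// Flag submissions whose non-email fields contain the site's bare domain name.
function containsOwnDomain(fields, bareDomain, emailFieldNames) {
  var needle = bareDomain.toLowerCase();
  for (var name in fields) {
    if (emailFieldNames.indexOf(name) !== -1) continue; // skip genuine email fields
    if (String(fields[name]).toLowerCase().indexOf(needle) !== -1) return true;
  }
  return false;
}
```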

CAPTCHAs

Arguably the most resilient defence against automated submission bots, Completely Automated Public Turing tests to tell Computers and Humans Apart - CAPTCHAs - demand active participation by human site visitors as part of form validation.

[Image: a CAPTCHA illustrating distorted words.]

In its simplest form a CAPTCHA is a distorted graphical image of a word or jumbled sequence of text and numbers, generated programmatically. As illustrated, it may also contain obscuring lines or an overlaid grid to further disguise the content. Simply rendering text as a graphic will not do, since that is easily resolved by software, as evidenced by the number of online font readers which can readily interpret graphically-embedded text and identify the typeface.

The visitor is then invited to enter the word or character sequence into a concluding form field as proof (s)he can interpret the image and is therefore human.

All fine and dandy—provided the visitor is not visually impaired; non-sighted users have no chance. An alternative should therefore be offered, such as audio output or a text-based equivalent built around an intrinsic element of a statement, or the answer to a simple question like "How many wheels does a car have?" While text-based CAPTCHAs do not conform to the spirit of CAPTCHA, in that they are not programmatically generated (and can be cracked using artificial intelligence), they nevertheless represent another relatively effective piece of armour against submission bots.
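A question-style text CAPTCHA needs little more than a question list and a forgiving comparison. A minimal sketch, with an illustrative question list of my own:

```javascript
// Hypothetical question pool; in practice keep it large and rotate it.
var QUESTIONS = [
  { q: "How many wheels does a car have?", a: "4" },
  { q: "What colour is grass?", a: "green" }
];

// Serve one question at random with the form.
function randomQuestion() {
  return QUESTIONS[Math.floor(Math.random() * QUESTIONS.length)];
}

// Compare the visitor's reply leniently (trimmed, case-insensitive).
function checkAnswer(question, reply) {
  return reply.trim().toLowerCase() === question.a;
}
```

The comparison must run server-side, of course; embedding the answer in the page hands it straight to the bot.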

For further information, practical examples and code, check out www.captcha.net

Combating Email Spam

Once you're on a spam database there's little you can do about it other than change the exposed email addresses. That is more easily said than done: some addresses may be long-standing and obscurely originated, and tracking down not just the likely points of exposure but the accompanying authorisation passwords for newsletter or subscription accounts, which will all need updating to the new addresses, takes considerable time.

And, unless you maintain a form-only level of email dialogue, there's every likelihood you'll be compromised again in the not-too-distant future, especially if you run a popular web presence or feature well in the search engines.

Many people make the mistake of using a catch-all address for their emails, something like mail@domainname.com, and while this may be handy for receiving all mail addressed to the domain it also relays all spam, since any addressee name is passed on. Make addresses specific and limited in number.

Far more insidious are the techniques spammers use to verify a valid email address. Once an email has downloaded into your Outlook, Mozilla or other local inbox, especially mail with graphical or embedded elements, it is child's play for the spammer to obtain your IP address, interrogate a database with matching software to identify your likely name, geographical location and other demographic details, and then encourage you to lower your defences with words which appear pertinent, valid or attractive to you.

The trick is not to let them get that far. Turn away spam at the host server before your IP address is compromised. Many hosting companies offer this as an inbuilt feature, either killing known spam or flagging potential spam for you. But this can be dangerous, possibly generating false positives and deleting mail from unverified, new addresses which may originate from prospective clients. I disable this feature on my POP3 accounts - I want new business! - which leaves the onus on me to filter my remote inbox.

There are a number of professional software applications which intercept mail at the server and offer a variety of features to thwart spam, including bouncing spam back to the originator to suggest an invalid address. I have used MailWasher Pro for over a year and find it effective and efficient at identifying and managing spam. There are some free spam washers out there - I once naively used one which hooked into my local client - but, being local email-client filters, they don't obviate the IP-address-grabbing problem mentioned above.

Spamming Deterrent

What with spam email, harvester and submission bots, phishing, etc., the level of online time-wasting and threat to your money is ever increasing. Even using obfuscation, honey traps, mail washers and other tricks, traps and deterrents, I still receive thousands of spam mails every week and waste at least ten hours each week eyeballing my front-guard server inbox; I might be fastidious about hiding my online addresses but others who contact me are either not so vigilant or unwittingly harbour trojans on their systems.

Perhaps one way to deal with this and offer an effective deterrent to this criminal activity might be as follows:

  1. Find a wall
  2. Find a cigarette
  3. Find a blindfold
  4. Wait for a rainy day
  5. Light cigarette, place in spammer's mouth, position spammer against wall...

Have fun, and let's kill spam(mers)!