TIP #3 – It’s important to hide parts of your site from search engines

Date June 17, 2008

Depending on how your website is structured, there are probably areas that you don’t want search engines to index. Areas that don’t represent your site well, or that might confuse a visitor, are good candidates to hide. For instance, there is a portion of this site that you guys can’t see that I use for testing, and I want to keep it out of search results. Another good reason to hide a page is that it contains email addresses you don’t want spammers to find. Search engines and spammers both use programs called robots (or bots) to crawl websites automatically. I’ll show you how to control how much of your site they can see.

First, create a file called robots.txt in the root of your website. One may already exist for you. If you open it up, it should look similar to this:

User-agent: *
Disallow: /administrator/
Disallow: /cache/
Disallow: /components/
Disallow: /editor/

The first line, “User-agent:”, names the robot that you want to control. The asterisk means all robots. If you want, you can control an individual search engine’s robot by changing the asterisk to that robot’s name. Example: User-agent: googlebot

After you have specified which robot you are going to control, you can use Disallow: to specify the folders you want to block. You can even use file names.

For example, this robots.txt will block the folder /cgi-bin/ from being indexed:

User-agent: *
Disallow: /cgi-bin/

Whereas this one blocks only a specific file:

User-agent: *
Disallow: /data/emails.txt
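
If you want to check a rule before you upload it, here is a rough sketch in Python using the standard library's urllib.robotparser module. The helper name can_crawl and the robot name "examplebot" are made up for illustration; the two rule sets are the ones shown above.

```python
from urllib.robotparser import RobotFileParser

def can_crawl(robots_txt: str, agent: str, path: str) -> bool:
    """Return True if the given robots.txt text lets `agent` fetch `path`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, path)

# The folder example from above.
FOLDER_RULES = """\
User-agent: *
Disallow: /cgi-bin/
"""

# The single-file example from above.
FILE_RULES = """\
User-agent: *
Disallow: /data/emails.txt
"""

# "examplebot" is a hypothetical robot name, used only for this check.
print(can_crawl(FOLDER_RULES, "examplebot", "/cgi-bin/form.pl"))
print(can_crawl(FILE_RULES, "examplebot", "/data/emails.txt"))
print(can_crawl(FILE_RULES, "examplebot", "/data/other.txt"))
```

A well-behaved robot does roughly the same comparison: it checks whether the path it wants starts with one of the Disallow values.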

To block your entire site use:

Disallow: /

To allow your entire site, leave the field empty like this:

Disallow:
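
Keep in mind that a Disallow line only counts when it follows a User-agent line. The short sketch below, again using Python's urllib.robotparser, pairs each of the two values above with "User-agent: *" and checks the result. The function name allowed, the robot name "examplebot", and the path /index.html are placeholders of my own.

```python
from urllib.robotparser import RobotFileParser

def allowed(disallow_value: str, path: str = "/index.html") -> bool:
    """Build a one-rule robots.txt for all robots and test `path` against it."""
    robots_txt = f"User-agent: *\nDisallow: {disallow_value}\n"
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # "examplebot" is a hypothetical robot name.
    return parser.can_fetch("examplebot", path)

print(allowed("/"))  # "Disallow: /" blocks the entire site
print(allowed(""))   # an empty "Disallow:" allows the entire site
```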

You can even control multiple robots independently. The following example gives Google complete access to the site, while banning every other robot from indexing it:

User-agent: googlebot
Disallow:

User-agent: *
Disallow: /
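
To see that the two sections really are applied independently, you can run the example through Python's urllib.robotparser. This is only a sketch: the page /index.html is a placeholder, and msnbot simply stands in for "any other robot".

```python
from urllib.robotparser import RobotFileParser

# The two-section example from above.
RULES = """\
User-agent: googlebot
Disallow:

User-agent: *
Disallow: /
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())

# googlebot matches its own section; every other robot falls through to "*".
google_ok = parser.can_fetch("googlebot", "/index.html")
other_ok = parser.can_fetch("msnbot", "/index.html")
print(google_ok, other_ok)
```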

Other Notes
Your robots.txt is not intended to be used as a security tool. Well-behaved robots follow it voluntarily, and nothing forces a badly behaved one to obey it. It can actually call attention to parts of your site that you want to keep secret. It should only be used for optimizing search results and for restricting the number of bots that hit your server, since these bots can fill your logs as well as use up your bandwidth. Robots.txt is a public file on your webserver. You can see how it can highlight “hidden” areas of a site simply by adding “/robots.txt” to the end of the address of any website you visit. For example, www.sin8.com/robots.txt will show you the one for this site.
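
To show how little effort it takes to read those “hidden” areas out of a robots.txt, here is a small Python sketch that lists every Disallow path in a file's text. The function name disallowed_paths is hypothetical, and the sample text is the example from the top of this tip.

```python
def disallowed_paths(robots_txt: str) -> list:
    """Collect every non-empty Disallow value from robots.txt text."""
    paths = []
    for line in robots_txt.splitlines():
        # Drop comments, then split "Field: value" at the first colon.
        line = line.split("#", 1)[0].strip()
        field, _, value = line.partition(":")
        if field.strip().lower() == "disallow" and value.strip():
            paths.append(value.strip())
    return paths

SAMPLE = """\
User-agent: *
Disallow: /administrator/
Disallow: /cache/
Disallow: /components/
Disallow: /editor/
"""

for path in disallowed_paths(SAMPLE):
    print(path)
```

Anyone, including a spammer, can do the same thing by hand, which is why the file should never be your only protection.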

Here are a few of the major user-agents:
Alexa/Wayback: ia_archiver
Ask/Teoma: teoma
Google: googlebot
MSN Search: msnbot
Yahoo: slurp

Or if you’re feeling lazy, this tool will build a robots.txt for you:
www.clickability.co.uk/robotstxt.html
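
If you would rather generate the file yourself, a few lines of Python will do it. This is a minimal sketch and not how the tool above works; the function name build_robots_txt and its input format (a dict mapping each robot name to a list of paths) are my own assumptions.

```python
def build_robots_txt(rules: dict) -> str:
    """Turn {robot name: [paths to block]} into robots.txt text."""
    blocks = []
    for agent, paths in rules.items():
        lines = [f"User-agent: {agent}"]
        # An empty list becomes an empty "Disallow:", which allows everything.
        lines += [f"Disallow: {p}" for p in paths] or ["Disallow:"]
        blocks.append("\n".join(lines))
    # Sections are separated by a blank line.
    return "\n\n".join(blocks) + "\n"

print(build_robots_txt({"googlebot": [], "*": ["/"]}))
```

Save the output as robots.txt in the root of your website.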

Carl


One Response to “TIP #3 – It’s important to hide parts of your site from search engines”

  1. F. Andy Seidl said:

    As a webmaster, you definitely should use user-agent headers to manage server traffic. But understand that this is purely a pragmatic tactic and not a serious security measure.

    I wrote more about this here:

    Webmaster Tips: Blocking Selected User-Agents
    http://faseidl.com/public/item/213126
