How to Check/Validate a Robots.txt File
Your robots.txt file is a very simple document. It consists of
nothing more than a list of URL paths, partial paths and wildcard
patterns (using asterisks, *), grouped under User-agent lines that
specify which robot, or crawler, each group of rules targets.
Your robots.txt file must live in the root directory of your
website. For example, if your domain is www.yourdomain.com, your
robots.txt file needs to be located at www.yourdomain.com/robots.txt
for search engines to find it. To see if your robots.txt file is online,
simply enter the above URL into your favorite browser - the text file
should display in the browser.
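The same check can be scripted. What follows is a minimal sketch, assuming Python 3 and only its standard library. The helper names robots_url and robots_is_online are hypothetical, and www.yourdomain.com is the placeholder domain used above.

```python
from urllib.error import URLError
from urllib.parse import urlsplit
from urllib.request import urlopen


def robots_url(site):
    """Build the root-level robots.txt URL for a domain or page URL."""
    # Accept bare domains such as "www.yourdomain.com" by assuming http://.
    if "://" not in site:
        site = "http://" + site
    parts = urlsplit(site)
    # robots.txt lives at the root, whatever path was passed in.
    return f"{parts.scheme}://{parts.netloc}/robots.txt"


def robots_is_online(site, timeout=10.0):
    """Return True if the robots.txt URL answers with HTTP 200."""
    try:
        with urlopen(robots_url(site), timeout=timeout) as response:
            return response.status == 200
    except (URLError, OSError):
        # HTTP errors (404 and so on), DNS failures and timeouts land here.
        return False
```

Calling robots_is_online("www.yourdomain.com") performs a real network request, so run it against a site you control.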
Validating your robots.txt is a little less simple, but we all need to
do it. Essentially, you need to confirm that it actually keeps
crawlers away from the content on your site that you don't want them
to crawl. One of the best ways to do this is to run your own crawling
program that will crawl your site the same way a search engine would.
One of our favorite such programs is GSiteCrawler.
Most, if not all, crawling programs like this will respect your
robots.txt file - this means you can see exactly which URLs the search
engines will attempt to index on your site. Running a crawling program
like GSiteCrawler lets you confirm that Googlebot or another search
engine crawler will avoid the content you specify.
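For a quick check without a full crawl, individual URLs can be tested against a set of rules. The sketch below assumes Python 3 and uses the standard library's urllib.robotparser module. The rules, the paths and the blocked_urls helper are all hypothetical. Note that this parser matches plain path prefixes and does not interpret wildcard patterns the way some search engines do.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; the paths are illustrative only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /calendar/
Disallow: /admin/
"""


def blocked_urls(robots_txt, urls, agent="Googlebot"):
    """Return the URLs that the given rules forbid `agent` to fetch."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if not parser.can_fetch(agent, url)]


if __name__ == "__main__":
    candidates = [
        "http://www.yourdomain.com/about.html",
        "http://www.yourdomain.com/calendar/2031/05",
        "http://www.yourdomain.com/admin/login",
    ]
    for url in blocked_urls(ROBOTS_TXT, candidates):
        print("blocked:", url)
```

Any URL you expected to be blocked but that is missing from the result points to a rule that needs fixing.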
You might be wondering: "well, I want them to index ALL of my content,
right?" Perhaps not. Consider this: sites that run on Content
Management Systems, or that rely heavily on JavaScript or other
scripting languages, often include dynamic content that populates
their web pages on the fly.
The issue here is that crawlers will generally follow every link on
your pages unless you tell them not to. There are many cases in which
you won't want them to do this. Consider, for example, a calendar
script that displays a schedule of events for your site. Most calendars
calculate where the dates and days of the week will fall - so users
can, conceivably, click infinitely into the past or future. Now imagine
this calendar in the hands of a search engine crawler. The crawler
doesn't exercise judgment the way a user does. A crawler can end up
following your calendar into the infinite future.
Of course the crawler will stop at some point - it will determine that
it has fallen into an infinite loop and cease crawling your site. So
what's the problem? Infinite loops can cause crawlers to leave your
site, and they can keep crawlers from ever reaching your important
content. If a crawler falls into an infinite loop before it indexes
your main content, guess what - your content doesn't get indexed.
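The calendar trap can be illustrated with a toy simulation. This is only a sketch under stated assumptions: the site, its paths, the "ExampleBot" name and the page budget are all invented, and real crawlers schedule pages in far more sophisticated ways than this simple follow-the-first-link loop.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical site: a few content pages plus a calendar in which
# every month links to the following month, without end.
STATIC_LINKS = {
    "/": ["/calendar/2024/1", "/about", "/products"],
    "/about": [],
    "/products": ["/contact"],
    "/contact": [],
}


def links_on(path):
    """Return the outgoing links of a page on the simulated site."""
    if path.startswith("/calendar/"):
        year, month = map(int, path.split("/")[2:4])
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)
        return [f"/calendar/{year}/{month}"]
    return STATIC_LINKS.get(path, [])


def crawl(robots_txt, budget=5, agent="ExampleBot"):
    """Follow links depth-first, giving up after `budget` pages."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    visited, stack = [], ["/"]
    while stack and len(visited) < budget:
        path = stack.pop()
        if path in visited or not parser.can_fetch(agent, path):
            continue
        visited.append(path)
        # Push in reverse so the first link on the page is followed first.
        stack.extend(reversed(links_on(path)))
    return visited


if __name__ == "__main__":
    print(crawl(""))  # no rules: the budget is spent on calendar pages
    print(crawl("User-agent: *\nDisallow: /calendar/\n"))
```

Without any rules, the simulated crawler spends its whole budget on calendar pages and never reaches /about, /products or /contact. With the calendar path disallowed, it covers all of the content pages.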
The proper use of a robots.txt file is crucial for your site's Search Engine Optimization.
About the Author
Mike Tekula handles SEO, SEM, usability and standards-compliance for NewSunGraphics, a Long Island, New York firm offering Search Engine Optimization, Search Engine Marketing, W3C-Compliant web design using full CSS layouts and all things web design/development.