Website owners and webmasters are delighted when search engines visit their sites frequently and their content is indexed by search engine spiders. That is what SEO is all about: earning a high rank with the search engines. However, there are parts of a website that you would not want spiders to include in their indexing.
The best way to tell search engines which areas of a website are off-limits to their spiders is a robots.txt file.
Robots.txt is a plain text file, not HTML, placed on your site to tell search engine robots which pages you would prefer they not crawl or index. It is not a way of preventing search engines from reaching your site altogether; it is a notice asking them to stay out of certain areas for specific reasons. The file is placed in the root directory of the website, where spiders can easily find it.
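The syntax is just a short list of directives. As a minimal sketch (the /private/ path here is only a hypothetical example), this asks all spiders to stay out of one directory:

```
User-agent: *
Disallow: /private/
```

The file must sit at the site root, e.g. http://www.example.com/robots.txt; spiders do not look for it in subdirectories.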
A closer look at the different uses of the robots.txt file helps in appreciating its importance to both the website and the search engines.
- Saving Your Bandwidth. Checking your website’s statistics will show many requests for the robots.txt file from the different search engine spiders. Search engines retrieve the robots.txt file before indexing the site to check for any special instructions. Without those instructions, a spider may use up more of the site’s bandwidth by repeatedly retrieving, say, a large 404 error page it should not have been requesting in the first place. With a robots.txt file in place, the site’s bandwidth is reserved for the pages searchers actually need. You can also use the file to keep spiders from indexing graphic files.
- Keeping Sensitive Information Out of the Index. This is one of the more important reasons for using a robots.txt file. Websites often hold confidential information collected through interactive features such as a client registration form, and robots.txt gives spiders instructions not to crawl or index the pages behind those features. Without such instructions, pages exposing delicate information could end up accessible to anyone searching the web. Keep in mind, though, that robots.txt is only a request: the file itself is publicly readable, and badly behaved robots can ignore it, so genuinely confidential data such as credit card details must be protected by proper access controls, not by robots.txt alone.
- Avoiding Canonicalization Problems. This refers to duplicate content problems, which occur when multiple pages on a website carry the same content. An example would be a product page that has both a regular browsing version and a “print” version. Spiders will find it difficult to identify the canonical version, so with robots.txt the secondary “print” version can be excluded from crawling.
- Avoiding Waste of Server Resources. Most websites have scripts or programs built into them, such as contact forms or an internal site search. These are intended for human visitors, not spiders, so it is better to block spiders from the directories containing the scripts. This reduces the load on the web server every time a spider would otherwise invoke them.
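The four uses above can be combined in a single file. As a sketch (all the directory names below are hypothetical), the rules can be checked with Python’s standard `urllib.robotparser`, which applies the same matching logic a well-behaved spider uses:

```python
from urllib import robotparser

# Hypothetical robots.txt covering the four uses discussed above.
rules = """
User-agent: *
Disallow: /images/        # save bandwidth: skip graphic files
Disallow: /registration/  # keep form-driven pages out of the index
Disallow: /print/         # avoid duplicate "print" versions
Disallow: /cgi-bin/       # don't let spiders trigger server scripts
"""

rp = robotparser.RobotFileParser()
rp.parse(rules.splitlines())

# A compliant spider skips the blocked directories...
print(rp.can_fetch("*", "http://www.example.com/images/logo.png"))    # False
# ...but remains free to crawl everything else.
print(rp.can_fetch("*", "http://www.example.com/products/page.html")) # True
```

Because `Disallow` rules match by path prefix, one line per directory is enough to cover every file inside it.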
Robots.txt is very useful for keeping content that is not meant for public viewing out of search results, especially on e-commerce websites where some areas should be accessible only to site administrators. It also helps spiders avoid crawling intentionally duplicated pages as duplicate content, which could lower the website’s ranking.
When building a website, and especially when hiring an outsourced web design company, a clear understanding of what the robots.txt file should cover must be reached before the site goes live, to avoid spider or robot confusion over which data will be shown to internet users. It must be clear which pages are not to be shown publicly for client protection, and security must be tight enough that spammers do not gain easy access to confidential data.