The robots.txt file, part of the Robots Exclusion Protocol, is an essential part of your website. It gives instructions to the search engine robots that crawl your web pages. If you configure it badly, it can harm your ranking or, even worse, make your site totally invisible to search engines.
In this post, we explain everything about robots.txt to avoid errors and improve the SEO of your website.
What is robots.txt?
It is a list of instructions for search engines such as Google, Bing, and Yahoo. It indicates which areas of your website can be indexed and which cannot. From this definition you can understand the importance of this file for the correct positioning of the pages and elements (categories, products, images, etc.) of your website.
How does robots.txt work?
When robots crawl your website, the first thing they do is look for your robots.txt file to know which pages to visit and index.
To find the robots.txt on your website or blog, you only have to add /robots.txt to your domain.
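For example, using the placeholder domain www.my-domain from later in this post, the file would be reachable at:

www.my-domain/robots.txt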
When do we have to use robots.txt?
Robots.txt is useful in the following circumstances:
Ignore duplicate pages.
Do not index internal search results (pages generated by your website's own search engine).
Do not index some areas of your website.
Do not index some files (images, PDF documents, etc.) of your website.
Indicate to the search engines where your sitemap is located.
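The cases above can be combined in a single file. A minimal sketch, in which every path is a placeholder rather than a real location (and note that the * and $ wildcards in the PDF rule are an extension supported by major crawlers like Googlebot, not part of the original standard):

User-agent: *          # applies to all crawlers
Disallow: /search/     # internal search results
Disallow: /private/    # a protected area of the site
Disallow: /*.pdf$      # PDF files (wildcard syntax supported by major crawlers)
Sitemap: http://www.my-domain/sitemap.xml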
There are several reasons why a robots.txt file would be a beneficial addition to your website; these are:
Duplicate content
You can have duplicate content on your website. Duplicate content is penalized by search engines and should be avoided whenever possible. The robots.txt file allows you to keep crawlers away from the duplicate content of your website by giving them instructions.
For duplicate content, canonical tags can also be used (we will talk about them in a future post).
Internal search results
If your website has an internal search function, you can choose to exclude the result pages generated by this type of query.
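A minimal sketch, assuming your search result pages use URLs like /?s=query (the pattern WordPress uses; adjust it to match your own site's search URLs):

User-agent: *
Disallow: /*?s=      # block internal search result pages (URL pattern is an assumption)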
Ignoring the protected areas of your website
You can tell web crawlers to ignore some files or areas, such as an employees’ intranet. There may be legal reasons to do so, such as protecting employees’ personal data, or simply because these pages are not relevant to the visitors of your website. Bear in mind, however, that robots.txt is publicly readable, so it should never be your only protection for sensitive areas.
Locate your sitemap.xml file
Another tool used by robots is the sitemap, or site map, an XML file that details the locations of the pages of your website.
Inserting the URL of your sitemap in robots.txt makes it easier for robots to reach the most important content of your website.
Create a robots.txt file
Create a new text file using TextEdit (Mac) or Notepad (PC) and save it with the name “robots.txt”.
Upload it to the root directory of your website, normally called “htdocs” or “www”, so that it appears directly after your domain name.
If you use subdomains, you can create a robots.txt file for each of them.
Common robots.txt instructions
The robots.txt file depends on the requirements of your website, so every robots.txt is different from one website to another. However, there are some general instructions to configure a good tracking of your site.
First you have to address the robots that will crawl your website, using the “User-agent:” directive. Example: User-agent: Googlebot (Google’s main crawler) means: “Google: follow the instructions below.”
If you want to address all crawlers at once, you just have to put the following: User-agent: *
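Put together, each group of rules starts with a User-agent line followed by its directives. A short sketch, where the path is a placeholder:

User-agent: Googlebot   # rules for Google's crawler only
Disallow: /drafts/      # placeholder path

User-agent: *           # rules for every other crawler
Disallow:               # an empty Disallow allows everything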
In this link you will find all the search engine crawlers.
No-index specific pages
The next step, after the User-agent line, is to use the Allow: and Disallow: directives to indicate to the robots what to crawl and what not to. For example, to block a terms and conditions page:
Disallow: /terms-conditions
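Allow: can be combined with Disallow: to re-open a subpath inside a blocked area. Note that Allow: is understood by the major crawlers even though it was not part of the original standard; all paths here are placeholders:

User-agent: *
Disallow: /terms-conditions    # block this page
Disallow: /archive/            # block the whole archive...
Allow: /archive/press/         # ...except the press section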
Locate the sitemap
As we said before, telling web crawlers where your XML sitemap is located is a good SEO practice for your website. You can indicate it in robots.txt in this way:
Sitemap: http://www.my-domain/sitemap.xml
If you want to know more about this topic, we recommend the complete guide in Google Webmaster Tools, where you can learn other directives and test your robots.txt file.