“The face is the index of the mind,” as the saying goes. Similarly, a website is the index of its organization: it carries the responsibility of showcasing the organization’s competency to the world. So, digital marketers, gear up; it’s high time you managed your online presence diligently. In the process, you need to make smart decisions about what to flaunt on your website and what not to. This is exactly what robots.txt helps marketers do: the file lets them allow and disallow what they want search engine bots to consider while crawling and indexing.
This text file is part of the robots exclusion protocol (REP), a group of standards that regulate how robots crawl the web and index content for users.
The basic format:
User-agent: [user-agent name]
Disallow: [URL string not to be crawled]
These two lines together form a minimal robots.txt file; a real file may contain several such blocks.
Consider a website www.sampleweb.com
- To block web crawlers from the entire site:
- To allow crawlers to access all content:
- To block a folder from access:
- To block a web page from access:
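The directives for the four scenarios above use standard robots.txt syntax; the folder and page paths shown here are hypothetical placeholders:

```
# Block all crawlers from the entire site
User-agent: *
Disallow: /

# Allow all crawlers to access all content
User-agent: *
Disallow:

# Block all crawlers from a specific folder (placeholder path)
User-agent: *
Disallow: /sample-folder/

# Block all crawlers from a specific page (placeholder path)
User-agent: *
Disallow: /sample-folder/sample-page.html
```

Note the difference between `Disallow: /` (block everything) and an empty `Disallow:` (block nothing) — a single character changes the meaning entirely.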
How does it work?
Search engines perform two main functions:
- Crawling the web to discover and access content
- Indexing that content so it can be served to searchers
Crawling a website involves bots accessing the content on its pages. If a bot finds a robots.txt file, it reads that first before proceeding, as this file contains instructions on how to crawl the site. If the file tells it to “disallow” a page, it will skip crawling that page and move on to the next one. If the bot does not find a robots.txt file, it will crawl the pages as usual.
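This check is easy to reproduce yourself. Here is a minimal sketch using Python’s standard `urllib.robotparser` module, with a hypothetical rule set and URLs, to test whether a given page may be fetched:

```python
# Sketch: checking robots.txt rules the way a well-behaved crawler does.
# The rules string and the sampleweb.com URLs are hypothetical examples.
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# A page under the disallowed folder is blocked; anything else is allowed.
print(parser.can_fetch("*", "https://www.sampleweb.com/private/page.html"))  # False
print(parser.can_fetch("*", "https://www.sampleweb.com/public/page.html"))   # True
```

In practice a crawler would load the live file with `parser.set_url(...)` and `parser.read()` instead of parsing an inline string, but the allow/disallow decision works the same way.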
Why do you need a robots.txt file?
Robots.txt gives you control over crawler access to your website. You can let bots crawl and index only the areas you want them to consider. Here are a few scenarios where robots.txt comes in handy:
- You could use robots.txt to block access to duplicate content on your website, protecting your search rankings from duplicate-content penalties.
- You could block access to entire sections of a website, such as a staging site.
- You could keep internal search results pages from being displayed on a public SERP.
- You could use robots.txt to specify the location of your sitemaps.
- You could control access to specific resources on the website, such as images.
- Where crawlers might request many pieces of content at once, you could specify a crawl delay to prevent your servers from being overloaded.
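The sitemap and crawl-delay scenarios above are expressed directly in the file; the sitemap URL here is a hypothetical placeholder:

```
User-agent: *
Crawl-delay: 10
Sitemap: https://www.sampleweb.com/sitemap.xml
```

Be aware that `Crawl-delay` is not part of the original standard and some crawlers, including Googlebot, ignore it; the `Sitemap` directive, by contrast, is widely supported.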
Things you should know about robots.txt:
- A robots.txt file must be placed in a website’s top-level (root) directory
- The file is case sensitive: it must be named robots.txt, and the paths in its rules are matched case-sensitively
- Malware robots and email address scrapers may ignore the robots.txt file
- This file is publicly available so do not use it to hide private information
- Every subdomain of a root domain has its own robots.txt file.
Robots.txt and SEO best practices
- Ensure that you do not block areas of the website you want search engines to showcase.
- Links on blocked pages will not be crawled, and no link equity is passed through them. However, if the linked pages are also reachable from other accessible pages, they can still be crawled and indexed via those pages.
- Do not use robots.txt to hide sensitive data, as blocked pages may still be crawled and indexed through links on other accessible pages. If you want to keep such pages out of search results, use an alternative like the noindex meta directive.
- Search engines operate multiple user agents (crawlers). In most cases, though, there is no need to specify rules for each user agent separately; a single set of rules under User-agent: * covers them all.
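For reference, the noindex meta directive mentioned above is placed in the `<head>` of the page you want kept out of search results (the equivalent can also be sent as an `X-Robots-Tag` HTTP header):

```
<meta name="robots" content="noindex">
```

Note that for this tag to work, the page must not be blocked in robots.txt: the crawler has to be able to fetch the page to see the directive.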
Robots.txt is a powerful aspect of website management, as it lets marketers decide on accessibility without reworking the website structure. A detailed assessment of your website will show you how best to use this particular file.