What is Robots.txt? Why do I need it?

Robots.txt is a file that tells web crawlers which pages to crawl and which to skip. This article is a detailed overview of robots.txt.

When it comes to SEO, there are several essential components you need to know, such as canonical links, robots.txt, and sitemaps. Among them, robots.txt is a crucial part of SEO. In this article, we are going to explain what robots.txt is and why you need it.

What is robots.txt?

Robots.txt is a set of instructions for search engine crawlers. Some people refer to these crawlers as bots, or more precisely, good bots. In other words, robots.txt files are mainly used to control the actions of good bots like web crawlers, because bad bots are unlikely to follow instructions. A robots.txt file is similar to a “Code of Conduct” notice displayed on the wall of a bar, community center, or any other public place: “good” people will follow the rules, while “bad” ones are more likely to disobey them and get themselves banned. Here, the good people correspond to good bots such as web crawlers, and the bad people to bad bots.

FYI: A bot is a piece of computer software that interacts with websites and applications automatically. There are good bots and bad bots, and a web crawler is an example of a good bot. These "crawler" bots index content on websites so that it can appear in search engine results. A robots.txt file controls the actions of these web crawlers to prevent them from overloading the server that hosts the website or indexing pages that are not intended for public viewing.

The robots.txt file is part of the robots exclusion protocol (REP), a group of web standards that governs how robots browse the internet, access content, and index it before making it available to users. The REP also covers instructions on how search engines should treat links (such as “dofollow” or “nofollow”), as well as directives like meta robots.

How does robots.txt work?

It is a plain text file with a .txt extension that is hosted on a web server like any other file. A search engine has two main jobs:

  1. Crawling the web to discover content

  2. Indexing that content so that it can be served up to searchers who are looking for information

Search engines crawl by following links from one site to another. In this process, crawlers travel across billions of links and websites, which is known as “spidering”. After arriving at a website and before spidering it, a crawler first looks for the robots.txt file. If there is one, the crawler reads it before moving on to the rest of the site. Because the robots.txt file contains instructions on how the search engine should crawl, the information found there directs the crawler’s further activity on that site. If the robots.txt file does not contain any directives that disallow the user agent’s activity (or if the site has no robots.txt file), the crawler proceeds to crawl the rest of the site.
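The check that a well-behaved crawler performs before fetching a page can be sketched with Python's standard urllib.robotparser module. This is only an illustration, not how any particular search engine is implemented: the crawler name ExampleBot and the page URLs are made up, and the rules are parsed from a string instead of being downloaded so that the snippet runs offline.

```python
from urllib.robotparser import RobotFileParser

# Rules a site might publish at /robots.txt (kept in a string here so
# the example needs no network access).
ROBOTS_TXT = """\
User-agent: *
Disallow: /cgi-bin/
"""


def may_crawl(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given rules let user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# "ExampleBot" is a hypothetical crawler name.
print(may_crawl(ROBOTS_TXT, "ExampleBot", "https://www.abc.com/about.html"))    # True
print(may_crawl(ROBOTS_TXT, "ExampleBot", "https://www.abc.com/cgi-bin/form"))  # False
```

An empty rule set allows everything, which mirrors the case where a site has no robots.txt file at all.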

Moreover, you can see the robots.txt file of any website by typing the full URL of its homepage and adding /robots.txt to the end, for example https://www.abc.com/robots.txt. This file is not linked anywhere on the site, so users are unlikely to stumble upon it. Web crawlers, on the other hand, look for this file before crawling.

What is more interesting is that although the robots.txt file provides instructions for web crawler bots, it cannot enforce them. Only good bots follow the instructions in the file, while bad bots simply ignore them and crawl forbidden pages anyway. Moreover, each subdomain of a website needs its own robots.txt file. For example, while hyvor.com has its own robots.txt file, subdomains like blogs.hyvor.com and talk.hyvor.com have their own robots.txt files.

If you do not have a robots.txt file, you can create one on your own. This article (https://developers.google.com/search/docs/advanced/robots/create-robots-txt) from Google explains how to create one. There are also online tools that generate robots.txt files. The syntax is described below.
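Deriving the robots.txt address from any page URL is a small string operation. The helper below is a sketch; the function name robots_txt_url and the sample page URL are our own and not part of any standard.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url: str) -> str:
    """Build the robots.txt URL for the site that hosts page_url."""
    parts = urlsplit(page_url)
    # Keep the scheme and host (including a port, if any), replace the
    # path with /robots.txt, and drop the query string and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_txt_url("https://www.abc.com/blog/post?id=1"))
# https://www.abc.com/robots.txt
```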

What does it look like?

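Here are some examples.

# robots.txt generated by atozseotools.com
User-agent: *
Disallow: /cgi-bin/
Sitemap: https://www.abc.com/sitemap.xml

This is a simple robots.txt file generated for www.abc.com. It applies to all crawlers and blocks only the /cgi-bin/ directory. Note that the Sitemap line must contain the full URL of the sitemap; the sitemap.xml path shown here is only an example.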

To block all web crawlers from all content, you can use the following syntax:

User-agent: *
Disallow: /

For the example website abc.com, using this syntax in a robots.txt file tells all web crawlers not to crawl any pages on www.abc.com, including the homepage.

To allow all web crawlers access to all content:

User-agent: *
Disallow:

Using this syntax in a robots.txt file tells all web crawlers that they may crawl all pages on www.abc.com, including the homepage.
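The difference between these two rule sets can be checked with Python's standard urllib.robotparser module. This is a minimal sketch: ExampleBot is a made-up crawler name, and Python's parser may differ from a given search engine's parser in edge cases.

```python
from urllib.robotparser import RobotFileParser

BLOCK_ALL = "User-agent: *\nDisallow: /"
ALLOW_ALL = "User-agent: *\nDisallow:"


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Parse the rules and ask whether user_agent may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


homepage = "https://www.abc.com/"
print(is_allowed(BLOCK_ALL, "ExampleBot", homepage))  # False
print(is_allowed(ALLOW_ALL, "ExampleBot", homepage))  # True
```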

To block a specific web crawler from a specific folder:

User-agent: Googlebot
Disallow: /abc-folderB/

This syntax tells only Google’s crawler (the user-agent name is Googlebot) not to crawl any page whose URL starts with www.abc.com/abc-folderB/.

To block a specific web crawler from a specific web page:

User-agent: Bingbot
Disallow: /example-subfolder/blocked-page.html

This syntax tells only Bing’s crawler (the user-agent name is Bingbot) to avoid crawling the specific page at www.abc.com/example-subfolder/blocked-page.html.
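To see how crawler-specific rules behave, the two examples above can be placed in one file and queried with Python's urllib.robotparser. This is an illustrative sketch: the extra page names (page.html, other-page.html) and the crawler ExampleBot are invented for the demonstration.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /abc-folderB/

User-agent: Bingbot
Disallow: /example-subfolder/blocked-page.html
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(user_agent: str, path: str) -> bool:
    """Check a path on the example site www.abc.com for one crawler."""
    return parser.can_fetch(user_agent, "https://www.abc.com" + path)


print(allowed("Googlebot", "/abc-folderB/page.html"))                # False
print(allowed("Bingbot", "/abc-folderB/page.html"))                  # True
print(allowed("Bingbot", "/example-subfolder/blocked-page.html"))    # False
print(allowed("Bingbot", "/example-subfolder/other-page.html"))      # True
print(allowed("ExampleBot", "/abc-folderB/page.html"))               # True (no group matches)
```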

There are some important things to know about robots.txt files.

  • This file must be placed in the website’s top-level directory in order to be found by crawlers.

  • The file name is case-sensitive, so the file must be named “robots.txt”, not Robots.txt, robots.TXT, or any other mix of cases.

  • Some user agents (robots) may choose to ignore your robots.txt file. This is especially common with more nefarious crawlers like malware robots or email address scrapers.

  • The /robots.txt file is available to the public: just add /robots.txt to the end of any root domain to see that website’s directives (if that site has a robots.txt file!). This means that anyone can see which pages you do or don’t want to be crawled, so don’t use it to hide private user information.

  • It is a best practice to indicate the location of any sitemaps associated with a domain at the bottom of the robots.txt file.

Why robots.txt?

Crawler access to particular areas of your website is managed via robots.txt files. Although it can be quite risky to unintentionally prevent Googlebot from crawling your entire website, there are several situations where a robots.txt file can be really helpful:

  • To keep duplicate information out of search engine results (note that meta robots are often a better choice for this)

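  • To keep crawlers away from entire sites or sections (like the staging site for your technical team); remember that this does not make them secret, since the file itself is public

  • To keep internal search results pages from appearing on a public SERP

  • To specify the location of your sitemaps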

  • To prevent some files on your website from being indexed by search engines (images, PDFs, etc.)

  • To define a crawl delay that stops crawlers from loading many pieces of content at once and overloading your servers (not every crawler honors this directive; Googlebot, for example, ignores it)
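Several of these uses can be read back programmatically. The sketch below uses Python's urllib.robotparser (the site_maps() method needs Python 3.8 or later). The 10-second delay, the /search/ path, the sitemap URL, and the crawler name ExampleBot are all made-up values for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: slow crawlers down, hide internal search results,
# and point to the sitemap.
ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 10
Disallow: /search/

Sitemap: https://www.abc.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

print(parser.crawl_delay("ExampleBot"))  # 10
print(parser.site_maps())                # ['https://www.abc.com/sitemap.xml']
print(parser.can_fetch("ExampleBot", "https://www.abc.com/search/?q=shoes"))  # False
```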

How to create a robots.txt file

You can use almost any text editor to create a robots.txt file, for example Notepad, TextEdit, Vi, or Vim. Most importantly, do not use a word processor such as Microsoft Word. Word processors often save files in a proprietary format and can add unexpected characters, such as curly quotes (“ ”), which can cause problems for crawlers. Save the file with UTF-8 encoding if prompted in the save dialog.
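If you prefer to generate the file with a script instead of a text editor, a minimal Python sketch looks like this. The helper name write_robots_txt is our own, and the example writes into a temporary directory that stands in for your site's root folder.

```python
import tempfile
from pathlib import Path


def write_robots_txt(directory: str, lines: list[str]) -> Path:
    """Write robots.txt into directory as UTF-8 text, one rule per line."""
    target = Path(directory) / "robots.txt"  # the file name must be lowercase
    target.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return target


# A temporary directory stands in for the real site root here.
with tempfile.TemporaryDirectory() as site_root:
    path = write_robots_txt(site_root, ["User-agent: *", "Disallow: /cgi-bin/"])
    print(path.name)  # robots.txt
    print(path.read_text(encoding="utf-8"))
```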

These are the format and location rules:

  • The file must be named “robots.txt”.

  • The robots.txt file must be located at the root of the website host to which it applies. For instance, to control crawling on all URLs below https://www.abc.com/, the robots.txt file must be located at https://www.abc.com/robots.txt. It cannot be placed in a subdirectory (for example, at https://abc.com/pages/robots.txt).

  • A robots.txt file can be posted on a subdomain (for example, https://website.abc.com/robots.txt) or on non-standard ports (for example, http://abc.com:8181/robots.txt).

  • A robots.txt file applies only to paths within the protocol, host, and port where it is posted. That is, rules in https://abc.com/robots.txt apply only to files in https://abc.com/, not to subdomains such as https://m.abc.com/, or alternate protocols, such as http://abc.com/.

  • A robots.txt file must be a UTF-8 encoded text file (which includes ASCII).
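The scope rule above (same protocol, host, and port) can be expressed as a short check. This is a sketch with helper names of our own choosing; it treats a missing port as the default port of the protocol (80 for http, 443 for https).

```python
from urllib.parse import urlsplit

DEFAULT_PORTS = {"http": 80, "https": 443}


def _origin(url: str) -> tuple:
    """Return (protocol, host, port) for a URL, filling in default ports."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    port = parts.port or DEFAULT_PORTS.get(scheme)
    return (scheme, parts.hostname, port)


def robots_applies_to(robots_url: str, page_url: str) -> bool:
    """True if page_url shares protocol, host, and port with robots_url."""
    return _origin(robots_url) == _origin(page_url)


print(robots_applies_to("https://abc.com/robots.txt", "https://abc.com/pages/a.html"))  # True
print(robots_applies_to("https://abc.com/robots.txt", "https://m.abc.com/"))            # False
print(robots_applies_to("https://abc.com/robots.txt", "http://abc.com/"))               # False
```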

Adding instructions to the file

  • A robots.txt file consists of one or more groups.

  • Each group consists of multiple rules or directives (instructions), one directive per line. Each group begins with a User-agent line that specifies the target of the group.

  • A group gives the following information:

    • Who the group applies to (the user agent).

    • Which directories or files the agent can access.

    • Which directories or files the agent cannot access.

  • Crawlers process groups from top to bottom. A user agent can match only one rule set, which is the first, most specific group that matches a given user agent.

  • The default assumption is that a user agent can crawl any page or directory not blocked by a disallow rule.

  • Rules are case-sensitive. For instance, disallow: /file.asp applies to https://www.abc.com/file.asp, but not https://www.abc.com/FILE.asp.

  • The # character marks the beginning of a comment.
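The group rules above can be tried out with Python's urllib.robotparser. Treat this as a sketch: Python's parser uses the first group whose name matches and falls back to the * group, which gives the same result as the "most specific group" rule in this example but is not guaranteed to agree with every search engine's parser. The paths and the crawler name ExampleBot are invented.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
# Group 1: applies only to Googlebot
User-agent: Googlebot
Disallow: /private/

# Group 2: applies to every other crawler
User-agent: *
Disallow: /tmp/
Disallow: /file.asp
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(user_agent: str, path: str) -> bool:
    """Check a path on the example site www.abc.com for one crawler."""
    return parser.can_fetch(user_agent, "https://www.abc.com" + path)


# Googlebot matches only its own group, so the /tmp/ rule does not apply to it.
print(allowed("Googlebot", "/private/a.html"))  # False
print(allowed("Googlebot", "/tmp/a.html"))      # True
# Other crawlers fall back to the * group.
print(allowed("ExampleBot", "/tmp/a.html"))     # False
# Rules are case-sensitive.
print(allowed("ExampleBot", "/file.asp"))       # False
print(allowed("ExampleBot", "/FILE.asp"))       # True
```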

Once you have written your rules and saved the robots.txt file to your computer, you are ready to make it available to search engine crawlers.

However, there isn’t a single tool that can help you with this, because how you upload the robots.txt file depends on your server and website architecture. Get in touch with your hosting company or consult its documentation to complete this step. You can then check that the file is in place using the method mentioned earlier in this article: add /robots.txt to the end of your homepage URL.

Fun facts

Robots.txt Easter eggs

Occasionally a robots.txt file will contain Easter eggs – humorous messages that the developers included because they know these files are rarely seen by users. For example, the YouTube robots.txt file reads, “Created in the distant future (the year 2000) after the robotic uprising of the mid-’90s which wiped out all humans.” The Cloudflare robots.txt file asks, “Dear robot, be nice.”

#    .__________________________.
#    | .___________________. |==|
#    | | ................. | |  |
#    | | ::[ Dear robot ]: | |  |
#    | | ::::[ be nice ]:: | |  |
#    | | ::::::::::::::::: | |  |
#    | | ::::::::::::::::: | |  |
#    | | ::::::::::::::::: | |  |
#    | | ::::::::::::::::: | | ,|
#    | !___________________! |(c|
#    !_______________________!__!
#   /                            \
#  /  [][][][][][][][][][][][][]  \
# /  [][][][][][][][][][][][][][]  \
#(  [][][][][____________][][][][]  )
# \ ------------------------------ /
#  \______________________________/

Google also has a “humans.txt” file at: https://www.google.com/humans.txt

– Extracted from Cloudflare

At Last

Robots.txt is not complicated. You only need to understand how it works and use it on your blog or website. If you have questions or suggestions, please comment down below.