Have you ever wondered how a search engine robot analyzes a website's data for indexing?
Do you own a WordPress website? Do you sometimes want Googlebot to index your site quickly, or to stay away from a specific page? So what should you do?
I can give you the answer right away: create a robots.txt file for WordPress! To help you understand the robots.txt file and how to create one, I have put together the following article.
This article will help you:
- Understand what a robots.txt file is
- Learn the basic structure of a robots.txt file
- Know what to watch out for when creating a WordPress robots.txt file
- See why your website needs a robots.txt file
- Create a complete robots.txt file for your website
What is a robots.txt file?
The robots.txt file is a simple .txt text file. It is part of the Robots Exclusion Protocol (REP), a group of web standards that regulate how web robots (search engine robots) crawl the web, access and index content, and serve that content to users.
REP also includes directives such as Meta Robots, as well as page-, subdirectory-, and site-wide instructions that tell search engines how to treat links (e.g. Follow or Nofollow).
In practice, creating a WordPress robots.txt file gives webmasters more flexibility and control over whether search engine bots may index certain parts of their site.
Syntax of the robots.txt file
The syntax can be thought of as the robots.txt file's own language. There are five common terms you will come across in a robots.txt file:
User-agent: the name of the web crawler the rules apply to (e.g. Googlebot, Bingbot, ...).
Disallow: tells the User-agent not to crawl a specific URL. Only one Disallow line may be used per URL.
Allow (Googlebot only): tells Googlebot that it may visit a page or subdirectory even though its parent page or folder is disallowed.
Crawl-delay: tells the web crawler how many seconds it must wait before loading and crawling the page's content. Note that Googlebot does not recognize this directive; you set the crawl rate in Google Search Console instead.
Sitemap: provides the location of any XML sitemap associated with this URL. Note that this directive is only supported by Google, Ask, Bing, and Yahoo.
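The five directives above can be exercised with Python's standard-library robots.txt parser. Here is a minimal sketch (the host and paths are hypothetical; note that Python's parser applies rules in file order rather than Google's longest-match order, and `site_maps()` needs Python 3.8+):

```python
from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt combining the five directives above.
# Allow is listed before Disallow because Python's parser uses
# first-match-wins, unlike Google's longest-match behavior.
rules = """\
User-agent: Bingbot
Allow: /private/public-page.html
Disallow: /private/
Crawl-delay: 10

Sitemap: https://www.example.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# The Allow exception wins for the listed page; everything else
# under /private/ stays blocked for Bingbot.
print(parser.can_fetch("Bingbot", "https://www.example.com/private/public-page.html"))  # True
print(parser.can_fetch("Bingbot", "https://www.example.com/private/secret.html"))       # False
print(parser.crawl_delay("Bingbot"))  # 10
print(parser.site_maps())             # ['https://www.example.com/sitemap.xml']
```

This is handy for sanity-checking a rules file before you upload it to your server.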
Pattern-Matching
In practice, WordPress robots.txt files can get quite sophisticated at blocking or allowing bots, because they support pattern matching to cover a wide range of URL options.
Google and Bing both honor two regular-expression characters for identifying the pages or subdirectories an SEO wants to exclude: the asterisk (*) and the dollar sign ($).
* is a wildcard representing any sequence of characters (used as a User-agent, it applies a rule to all bots).
$ matches the end of the URL.
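To see how these two characters combine, here is a small Python sketch of the matching logic (my own illustrative helper, not an official parser): * becomes "match anything" and a trailing $ anchors the pattern to the end of the URL path.

```python
import re

def rule_matches(pattern: str, path: str) -> bool:
    """Illustrative sketch of Google-style pattern matching:
    '*' matches any sequence of characters, and a trailing '$'
    anchors the pattern to the end of the URL path."""
    regex = re.escape(pattern).replace(r"\*", ".*")
    if regex.endswith(r"\$"):
        regex = regex[:-2] + "$"  # turn the escaped '$' back into an end anchor
    return re.match(regex, path) is not None

# A rule like 'Disallow: /*.pdf$' would block any URL ending in .pdf:
print(rule_matches("/*.pdf$", "/files/report.pdf"))    # True
print(rule_matches("/*.pdf$", "/files/report.pdf?x"))  # False (does not end in .pdf)
```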
Basic format of the robots.txt file
The robots.txt file has the following basic format:
User-agent:
Disallow:
Allow:
Crawl-delay:
Sitemap:
However, you can still omit the Crawl-delay and Sitemap parts. This is the basic format of a complete WordPress robots.txt file. In reality, though, the file often contains multiple User-agent lines and additional directives such as Disallow, Allow, or Crawl-delay, because you usually specify rules for several different bots. Each directive is written on its own line.
In a WordPress robots.txt file you can specify multiple directives for the bots by writing them consecutively, one per line. If a robots.txt file contains several groups of rules that apply to the same type of bot, the bot will by default follow the group that matches it most specifically and completely.
Standard robots.txt file
To block all web crawlers from crawling any data on the website, including the homepage, use the following syntax:
User-agent: *
Disallow: /
To allow all crawlers to access all content on the website, including the homepage, use the following syntax:
User-agent: *
Disallow:
To block Google's crawler (User-agent: Googlebot) from crawling any page whose URL contains the string www.example.com/example-subfolder/, use the following syntax:
User-agent: Googlebot
Disallow: /example-subfolder/
To block Bing's crawler (User-agent: Bingbot) from crawling the specific page at www.example.com/example-subfolder/blocked-page.html, use the following syntax:
User-agent: Bingbot
Disallow: /example-subfolder/blocked-page.html
Example of a standard robots.txt file
Here is an example of a robots.txt file that works for the site www.example.com:
User-agent: *
Disallow: /wp-admin/
Allow: /
Sitemap: https://www.example.com/sitemap_index.xml
What does this robots.txt structure mean? Let me explain. It tells all bots that they may crawl and index all of the data on your website's pages except the www.example.com/wp-admin/ directory, and it points them to the sitemap at www.example.com/sitemap_index.xml.
Why do you need to create a robots.txt file?
Creating a robots.txt file for your website lets you control bots' access to certain areas of your site. This can be dangerous if a mistake prevents Googlebot from indexing your website at all, but a robots.txt file is still genuinely useful for many reasons:
- Prevent duplicate content from appearing on the website (note that Meta Robots tags are usually a better choice for this)
- Keep some parts of the site private
- Keep internal search results pages from showing up on the SERP
- Specify the location of the sitemap
- Prevent search engines from indexing certain files on your site (images, PDFs, ...)
- Use the Crawl-delay directive to throttle crawling, which keeps your server from being overloaded when crawlers load a lot of content at once
If there is nothing on your website you want to keep web crawlers away from, you don't need a robots.txt file at all.
Limitations of the robots.txt file
1. Some search engines do not support robots.txt directives
Not all search engines support the directives in a robots.txt file, so to keep your data secure, your best bet is to password-protect private files on the server.
2. Each crawler has its own parsing syntax
Reputable crawlers generally follow the standard robots.txt directives, but each search engine may interpret them differently, and some cannot understand every statement in the file. Web developers therefore need to understand the syntax of each crawling tool.
3. A URL blocked by robots.txt can still be indexed by Google
Even if you block a URL with robots.txt, Google can still discover and index it, for example through links from other sites, so the URL may still appear in search results.
For the best protection, delete the URL from your website if its content is not too important, because the content at this URL can still surface when someone searches for it on Google.
Some notes when using the robots.txt file
- You do not need to specify directives for each User-agent, because most User-agents from the same search engine follow the same general rules.
- Never use the robots.txt file to block private data such as user information, because some bots will simply ignore robots.txt directives; its security is low.
- The best way to secure a website's data is to password-protect the files or URLs you do not want accessed. Also, don't overuse robots.txt directives, because sometimes they will not be as effective as you expect.
How does the robots.txt file work?
Search engines have 2 main tasks:
- Crawl (analyze) data on the web to discover content
- Index that content to serve it in response to user searches
To crawl websites, search engines follow links from one page to the next, ultimately crawling through billions of different web pages. This crawling process is also known as "spidering".
After arriving at a website, before spidering it, a search engine bot looks for a robots.txt file. If it finds one, it reads that file first before proceeding to the next steps.
The robots.txt file contains information about how the search engine should crawl your website, and the bots follow these more specific instructions during the process.
If the robots.txt file contains no directives for the bot's User-agent, or if the website has no robots.txt file at all, the bots will simply proceed to crawl the site's other content.
Where is the robots.txt file located on a website?
When you create a WordPress website, WordPress automatically generates a robots.txt file placed right in the server's root directory.
For example, if your site is located at the root of the domain gtvseo.com, you can access the robots.txt file at gtvseo.com/robots.txt, and the initial output will look like this:
User-agent: *
Disallow: /wp-admin/
Disallow: /wp-includes/
As I said above, User-agent: * means the rules apply to every type of bot, everywhere on the website. In this case, the file tells bots that they are not allowed into the wp-admin and wp-includes directories. Very reasonable, because these two folders contain a lot of sensitive files.
Remember that this is a virtual file, which WordPress sets up by default on installation and which cannot be edited directly (although it still works). The standard WordPress robots.txt location is the root directory, often called public_html or www (or the website's name). To create your own robots.txt file, you need to create a new file to replace the old one in that root directory.
In the section below, I will show you several ways to create a new robots.txt file for WordPress very easily. But first, study the rules you should use in this file.
How to check if the website has a robots.txt file?
If you are wondering whether your website has a robots.txt file, enter your root domain and add /robots.txt to the end of the URL. If no .txt page shows up, your website definitely has not created a robots.txt file for WordPress. Very simple! You can check whether my site gtvseo.com has a robots.txt file the same way:
Enter the root domain (gtvseo.com) > add /robots.txt to the end (giving gtvseo.com/robots.txt) > press Enter. Wait a moment and you'll see the result right away!
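This check can also be scripted. Below is a small Python helper (my own illustrative function, not part of any official tool) that builds the robots.txt URL for any site, which you could then fetch to see whether the file exists:

```python
from urllib.parse import urlsplit, urlunsplit

def robots_txt_url(site: str) -> str:
    """Build the robots.txt URL for a site (illustrative helper).

    robots.txt always lives at the root of the host, so any path
    on the input URL is replaced with /robots.txt.
    """
    parts = urlsplit(site)
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))

print(robots_txt_url("https://gtvseo.com"))            # https://gtvseo.com/robots.txt
print(robots_txt_url("https://gtvseo.com/some/page"))  # https://gtvseo.com/robots.txt
```

You could then request that URL (for example with urllib.request) and treat a 404 response as "no robots.txt file".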
What rules should be added in the WordPress robots.txt file?
So far, the examples have all handled one rule at a time. But what if you want to apply different rules to different bots?
You just need to add each set of rules under the User-agent declaration for each bot.
For example, if you want to create one rule that applies to all bots and another that applies only to Bingbot, you can do it like this:
User-agent: *
Disallow: /wp-admin/

User-agent: Bingbot
Disallow: /
Here, all bots will be blocked from accessing /wp-admin/, while Bingbot will be blocked from accessing your entire site.
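You can verify this two-group behavior with Python's standard-library parser (the host is hypothetical; this is a quick sanity check, not a crawler):

```python
from urllib.robotparser import RobotFileParser

# The same two-group robots.txt as in the example above.
rules = """\
User-agent: *
Disallow: /wp-admin/

User-agent: Bingbot
Disallow: /
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# Generic bots fall back to the '*' group: only /wp-admin/ is off limits.
print(parser.can_fetch("Googlebot", "https://www.example.com/blog/"))      # True
print(parser.can_fetch("Googlebot", "https://www.example.com/wp-admin/"))  # False

# Bingbot matches its own group, which blocks the whole site.
print(parser.can_fetch("Bingbot", "https://www.example.com/blog/"))        # False
```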
3 ways to create a simple WordPress robots.txt file
If, after checking, you find that your website has no robots.txt file, or you simply want to change yours, here are 3 ways to create robots.txt for WordPress:
1. Use Yoast SEO
You can edit or create a robots.txt file for WordPress from the WordPress Dashboard itself with a few simple steps. Log in to your website; once logged in, you will see the Dashboard interface.
On the left side of the screen, click SEO > Tools > File editor.
The file editor feature will not appear if your WordPress installation does not have file editing enabled, so enable it first (for example via FTP, the File Transfer Protocol).
You will now see the robots.txt and .htaccess file sections – this is where you can create the robots.txt file.
2. Use the All in One SEO plugin
Alternatively, you can use the All in One SEO plugin to create a WordPress robots.txt file quickly. It is another handy WordPress plugin: simple and easy to use.
To create a WordPress robots.txt file, go to the main interface of the All in One SEO Pack plugin and select All in One SEO > Features Manager > click Activate for Robots.txt.
At this point, many interesting features will appear in the interface.
The robots.txt section will then appear as a new tab in the main All in One SEO menu, where you can create and modify the WordPress robots.txt file.
However, this plugin works a little differently from the Yoast SEO plugin I mentioned above.
All in One SEO grays out the existing content of the robots.txt file instead of letting you edit the file directly the way the Yoast SEO tool does. This can make editing the WordPress robots.txt file feel a bit restrictive. On the positive side, though, it helps limit damage to your website, particularly from malware bots acting in ways you don't expect.
3. Create and upload the robots.txt file via FTP
If you don't want to use a plugin to create a WordPress robots.txt file, then I have another way for you: create the robots.txt file manually.
It only takes a few minutes. Use Notepad or TextEdit to create a WordPress robots.txt file following the rules I introduced at the beginning of the article, then upload the file via FTP. No plugin is needed, the process is very simple, and it won't take you long.
Some rules when creating the robots.txt file
- To be found by bots, the WordPress robots.txt file must be placed in the top-level directory of the site.
- robots.txt is case sensitive, so the file must be named robots.txt (not Robots.txt, robots.TXT, ...).
- You should not put /wp-content/themes/ or /wp-content/plugins/ under Disallow. That would prevent search engines from seeing exactly how your blog or website looks.
- Some User-agents choose to ignore your standard robots.txt file. This is quite common with nefarious User-agents such as:
- Malware robots (bots carrying malicious code)
- Email address scrapers (crawlers that harvest email addresses)
- Robots.txt files are generally public and freely available on the web. You just need to add /robots.txt to the end of any root domain to see that site's directives. This means that anyone can see which pages you do or do not want crawled, so don't use these files to hide users' personal information.
- Each subdomain on a root domain uses its own robots.txt file. This means that blog.example.com and example.com should each have their own file (blog.example.com/robots.txt and example.com/robots.txt). Additionally, it is considered best practice to indicate the location of any sitemaps associated with the domain at the bottom of the robots.txt file.
Some notes when using the robots.txt file
Make sure you're not blocking any content or parts of your site that you want Google to index.
Links on pages blocked by robots.txt will not be followed by bots, unless those links also appear on other pages (pages not blocked by robots.txt, Meta Robots, and so on). Otherwise, the linked resources may not be crawled and indexed.
Link juice will not be passed from blocked pages to destination pages. So if you want link juice to flow through these pages, use another method instead of a WordPress robots.txt file.
The robots.txt file should not be used to keep sensitive data (such as private user information) out of SERP results, because a page containing such information may be linked from many other websites. Bots can then reach and index the page without ever going through the robots.txt directives on your root domain, so it can still be indexed.
If you want to keep a page out of search results, use another method instead of robots.txt, such as password protection or a Noindex meta directive. Some search engines have several User-agents; for example, Google uses Googlebot for web search and Googlebot-Image for image search.
Most User-agents from the same engine follow the same rules, so you do not need to specify directives for each User-agent. Still, doing so lets you fine-tune the way your website's content is indexed.
Search engines cache the content of the WordPress robots.txt file, but they usually refresh the cache at least once a day. If you change the file and want it picked up faster, use the submit function of Google's robots.txt Tester.
Frequently asked questions about robots.txt
Here are some frequently asked questions about robots.txt, which may well be your questions too:
What is the maximum size of a robots.txt file?
About 500 kilobytes.
Where is the WordPress robots.txt file located on the website?
At the location: domain.com/robots.txt.
How do I edit robots.txt in WordPress?
You can do it manually, or use one of the many WordPress SEO plugins, such as Yoast, which lets you edit robots.txt from the WordPress backend.
What happens if I Disallow noindexed content in robots.txt?
Google will never see the Noindex directive, because it cannot crawl the page's data.
I use the same robots.txt file for multiple sites. Can I use a full URL instead of a relative path?
No. The directives in a robots.txt file (except Sitemap:) apply only to relative paths.
How can I suspend all of my site's crawling?
You can suspend all crawling by returning an HTTP 503 result code for every URL, including the robots.txt file. You should not change the robots.txt file to block crawling.
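As a sketch of the 503 approach, a temporary maintenance server that answers every request, including /robots.txt, with HTTP 503 could look like this in Python (illustrative only; in practice you would configure this in your web server rather than run a toy server):

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

class MaintenanceHandler(BaseHTTPRequestHandler):
    """Answer every URL, including /robots.txt, with 503 Service
    Unavailable so crawlers pause instead of caching an empty or
    missing robots.txt."""

    def do_GET(self):
        self.send_response(503)
        # Hint to crawlers when to retry (in seconds); value is illustrative.
        self.send_header("Retry-After", "3600")
        self.end_headers()

    def log_message(self, *args):
        pass  # keep the sketch quiet

# To run: HTTPServer(("0.0.0.0", 8000), MaintenanceHandler).serve_forever()
```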
How do I block all web crawlers?
All you need to do is go to Settings > Reading and check the box next to the Search Engine Visibility option.
Once selected, WordPress adds this line to your site's header:
<meta name='robots' content='noindex,follow' />
WordPress also changes your site's robots.txt file and adds these lines:
User-agent: *
Disallow: /
These lines tell robots (web crawlers) not to index your pages. However, it is entirely up to the search engines to accept this request or ignore it.
Block Google's crawler:
To block Google's crawler (User-agent: Googlebot) from crawling any page whose URL contains the string www.example.com/example-subfolder/, use the following syntax:
User-agent: Googlebot
Disallow: /example-subfolder/
Block Bing's crawler:
Use the following syntax:
User-agent: Bingbot
Disallow: /example-subfolder/blocked-page.html
How are robots.txt, Meta robots and X-robots different?
First, robots.txt is a text file, while Meta Robots and X-Robots are meta directives. Beyond that, the three also function at different levels.
Meta Robots tags are snippets of code that give crawlers instructions on how to crawl or index a web page's content.
The tag is placed in the <head> section of the page and looks like this:
<meta name="robots" content="noindex" />
X-Robots is part of the HTTP header sent by the web server. Unlike the robots meta tag, it is not placed in the page's HTML (i.e. not in the <head> section).
X-Robots-Tag headers can be used to prevent search engines from indexing specific file types such as images or PDFs, including non-HTML files.
Any directive that can be used in a robots meta tag can also be specified as an X-Robots-Tag.
By letting you control how specific file types are indexed, the X-Robots-Tag provides more flexibility than the Meta Robots tag and the robots.txt file.
Creating a robots.txt file governs crawling at the level of the entire site or a directory, while Meta Robots and X-Robots can govern indexing at the level of individual pages.
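As an illustration of that extra flexibility, here is a Python sketch of a server that attaches an X-Robots-Tag header only to PDF responses (the header name is real; the server itself is a toy example, and in production you would normally set this header in Apache or nginx):

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

class NoIndexPdfHandler(BaseHTTPRequestHandler):
    """Serve every path, but tell search engines not to index PDFs
    by adding an X-Robots-Tag HTTP header to those responses."""

    def do_GET(self):
        self.send_response(200)
        if self.path.endswith(".pdf"):
            # A meta robots tag cannot be embedded in a PDF, but the
            # HTTP header works for any file type.
            self.send_header("X-Robots-Tag", "noindex")
        self.send_header("Content-Type", "text/plain")
        self.end_headers()
        self.wfile.write(b"ok")

    def log_message(self, *args):
        pass  # keep the sketch quiet

# To run: HTTPServer(("0.0.0.0", 8000), NoIndexPdfHandler).serve_forever()
```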
Now it's your turn! You now know what a robots.txt file is and have checked whether your website has one. Create and edit your WordPress robots.txt file to your liking, so that search engine bots can crawl and index your site quickly.
If, after reading this detailed article, you still find it difficult to understand, you can completely consider enrolling in an SEO training course or program at GTV!