I recently came across a site that had conflicts between the Robots.txt file and the on-page meta robots tag. It got me wondering which of the two holds ultimate authority when they conflict. This may seem like common sense, but it may not be obvious even to the most advanced SEOs and webmasters. I decided to do some research and consolidate my findings.
First, a short lesson.
What is a Robots.txt file anyway?
The Robots.txt file is a webmaster's way to communicate with search engines and web crawlers and give them directions about how to handle your content. The file is placed in the root directory of the domain and always has the same name. Here is an example URL: http://example.com/robots.txt.
Web site owners use the /robots.txt file to give instructions about their site to web robots; this is called The Robots Exclusion Protocol.
This file can be used to give generic rules for robots, or specific rules for specific pages.
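To make this concrete, here is a minimal sketch of how a crawler might read such a file, using Python's standard `urllib.robotparser` module. The rules in `ROBOTS_TXT`, the `ExampleBot` user agent, and the page URLs are made up for illustration; real crawlers may interpret the file with their own logic.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical contents of http://example.com/robots.txt:
# one generic rule that keeps all robots out of /private/.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_crawl_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# "ExampleBot" is an assumed crawler name, not a real robot.
print(is_crawl_allowed(ROBOTS_TXT, "ExampleBot", "http://example.com/private/page.html"))  # False
print(is_crawl_allowed(ROBOTS_TXT, "ExampleBot", "http://example.com/index.html"))  # True
```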
What is the meta robots tag?
The meta robots tag is used on a page-by-page basis (it’s not a requirement) and gives robots instructions similar to those in the Robots.txt file.
You can use a special HTML tag to tell robots not to index the content of a page, and/or not scan it for links to follow.
The standard meta tag looks like this:
<meta name="robots" content="noindex, nofollow">
You could also explicitly ask robots to index and follow, but since these are the default settings if no meta tag exists, it’s unnecessary.
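As a rough illustration of how a robot could find this tag, the sketch below uses Python's standard `html.parser` module to collect the `content` value of every meta robots tag on a page. The class name `MetaRobotsParser`, the helper `find_meta_robots`, and the sample page are all assumptions made for this example.

```python
from html.parser import HTMLParser


class MetaRobotsParser(HTMLParser):
    """Collect the content value of every <meta name="robots"> tag."""

    def __init__(self):
        super().__init__()
        self.contents = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        # attrs is a list of (name, value) pairs; value may be None.
        attr_map = {name: (value or "") for name, value in attrs}
        if attr_map.get("name", "").lower() == "robots":
            self.contents.append(attr_map.get("content", ""))


def find_meta_robots(html: str) -> list:
    """Return the content strings of all meta robots tags found in html."""
    parser = MetaRobotsParser()
    parser.feed(html)
    return parser.contents


# Hypothetical page carrying the standard tag shown above.
page = '<html><head><meta name="robots" content="noindex, nofollow"></head><body></body></html>'
print(find_meta_robots(page))  # ['noindex, nofollow']
```

An empty list means no meta robots tag exists, which corresponds to the default "index, follow" behavior.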
So what happens if there is a conflict?
Robots.txt requests noindex, meta robots requests index
Result: Bots and web crawlers are instructed not to index the page, and therefore will not index it when crawling your site. However, bots and web crawlers may reach the page through an external link, bypass the Robots.txt file, see the meta robots tag, and decide to index the page. In rare cases, the page could then be included in search results. Google will correctly decide not to index the page, but other search engines may not use the same logic.
Robots.txt requests index, meta robots requests noindex
Result: Search engine bots and web crawlers are given permission to crawl the page by the Robots.txt file, but as soon as they hit the meta robots tag, they are instructed not to index the page and will move on. Web crawlers reaching this page from external sites that link to it will also stop at the meta robots tag, so the page will not show up in search results at all.
meta robots requests noindex, then a second meta robots requests index
This might sound confusing, but what I mean is literally having two conflicting meta robots tags in the head section of your HTML, like below.
<meta name="robots" content="noindex, nofollow">
<meta name="robots" content="index, follow">
Result: The page will not be indexed and will not be shown in any search results. Basically, if there is a “noindex” meta robots tag on the page, you cannot override it, even with another meta robots tag.
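The "restrictive value wins" rule described above can be sketched in a few lines of Python. This is only an illustration of the behavior as stated in this article, not the actual logic of any search engine; the function name `resolve_robots_directives` is made up for the example.

```python
def resolve_robots_directives(contents):
    """Combine several meta robots content strings into one decision.

    A "noindex" or "nofollow" anywhere wins over any "index" or "follow",
    no matter which tag comes first.
    """
    seen = set()
    for content in contents:
        for part in content.split(","):
            token = part.strip().lower()
            if token:
                seen.add(token)
    return {
        "index": "noindex" not in seen,
        "follow": "nofollow" not in seen,
    }


# The two conflicting tags from the example above.
print(resolve_robots_directives(["noindex, nofollow", "index, follow"]))
# {'index': False, 'follow': False}
```

With an empty list (no meta robots tag at all), the function returns the defaults: index and follow.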
I had a client once who had an in-house CMS built for their site, and for whatever reason (I suspect it was left over from a development or staging environment) the template files themselves had a “noindex” meta robots tag in the head section. My client noticed their site wasn’t being indexed (before employing my services) and attempted to fix this by explicitly adding an “index” meta robots tag. They didn’t realize at the time that their efforts were all for naught. This situation might not be as uncommon as you think.
What about following links?
This is where things get a little unclear.
Generally speaking, if a search engine is able to crawl a page, and the meta robots tag is either not set or set to “follow”, then the links will be followed and indexed by search engines. This is the most common case.
If a search engine bot or web crawler is instructed not to index a page, but “follow” is either left unset or set explicitly in the meta robots tag, then the links will still be followed. You would encounter this situation with the “noindex, follow” meta tag:
<meta name="robots" content="noindex, follow">
Matt Cutts addresses this situation in a video that he posted to YouTube.
So basically, the page will not be indexed, but the links will be followed. This is good for SEOs to know: even if our links appear on pages that are not indexed (like archives, category pages, etc.), they are still followed links.
On the most basic level, neither the meta robots tag nor the Robots.txt file has authority over the other; rather, the “noindex” request has authority over the “index” request.
What I’ve suggested to development teams I’ve worked with in the past is to set the Robots.txt file to allow all crawlers, then use the meta robots tag to request that a page not be indexed on a page-by-page basis. This is easy to remember, easy to update, and makes it easy to avoid conflicts and confusion.
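Here is one way that setup could look, sketched in Python. The allow-all file uses an empty `Disallow:` line, which the standard `urllib.robotparser` module treats as "nothing is disallowed". The `NOINDEX_PATHS` set, the `meta_robots_tag` helper, and the paths in it are hypothetical stand-ins for whatever per-page setting your CMS or templates provide.

```python
from urllib.robotparser import RobotFileParser

# An allow-all robots.txt: every robot, nothing disallowed.
ALLOW_ALL_ROBOTS_TXT = """\
User-agent: *
Disallow:
"""

# Hypothetical per-page setting: paths that should stay out of the index.
NOINDEX_PATHS = {"/archive/", "/thank-you.html"}


def meta_robots_tag(path: str) -> str:
    """Return the tag to place in the page's head section, or "" for the default."""
    if path in NOINDEX_PATHS:
        return '<meta name="robots" content="noindex">'
    return ""


# Check that the file really lets an (assumed) crawler reach every page.
parser = RobotFileParser()
parser.parse(ALLOW_ALL_ROBOTS_TXT.splitlines())
for path in ["/", "/archive/", "/thank-you.html"]:
    print(path, parser.can_fetch("ExampleBot", path), repr(meta_robots_tag(path)))
```

Because crawling is never blocked, every robot can reach the page and read its meta robots tag, so the indexing decision lives in exactly one place.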