How to Control Crawling and Indexing on Your Website?
Managing how search engines and AI crawlers interact with your site is crucial for performance, SEO, and privacy. Whether you’re looking to block sensitive content, improve crawl efficiency, or support AI agents with documentation, this guide walks you through your options.
There are two primary methods for controlling crawling and indexing:
1. Using robots.txt
The robots.txt file is a standard used to instruct web crawlers (also known as bots or spiders) which parts of your website they can or cannot access. This file sits at the root of your domain and acts as your site’s first line of communication with crawlers.
What robots.txt can do:
- Control crawler traffic to reduce server load
- Prevent indexing of specific resources
- Block access to scripts or stylesheets
- Prevent rich media from appearing in search results
Note: Blocking a URL via robots.txt doesn’t guarantee it won’t appear in search results. Search engines may still index it if other sites link to it.
Use Cases of the robots.txt File:
| Resource Type | Can Be Blocked by robots.txt? | Notes |
|---|---|---|
| Web Pages | ✅ | Blocks crawling, not necessarily indexing. |
| Media Files | ✅ | Hidden from search but can still be linked directly. |
| Resource Files | ✅ | Ensure functionality is not broken. |
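For illustration, here is a short robots.txt sketch that blocks one example of each resource type from the table (all paths are hypothetical):
User-agent: *
# Block a specific web page
Disallow: /private-page.html
# Block a media file
Disallow: /images/private-image.png
# Block a resource file (make sure no public page depends on it)
Disallow: /assets/internal-script.js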
Limitations
- Not all crawlers obey robots.txt -> well-known crawlers (such as Googlebot) follow the instructions in the file, but other crawlers might not.
- Syntax differences across crawlers -> even well-behaved crawlers don't all interpret the same syntax, so pay attention to the syntax each crawler expects. You can find more information about the different syntaxes in the following article.
- Pages may still be indexed if linked externally -> even if the robots.txt file prevents a page from being crawled, the URL can still be indexed when other sites link to it.
For stronger control, consider using noindex tags, password protection, or removing the page entirely.
How to Create a robots.txt File
- Create a plain text file using UTF-8 encoding
- Name it robots.txt and place it at your site’s root
- Test its visibility by visiting https://yourdomain.com/robots.txt
Note:
- You can have only one robots.txt file per site
- If, for example, the file should apply to the URL https://www.domain.com/, it must be located at https://www.domain.com/robots.txt and not at https://www.domain.com/pages/robots.txt.
- The file can also be used on a subdomain (https://www.example.domain.com/robots.txt) or on non-standard ports (https://www.domain.com:8181/robots.txt)
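To verify that the file is reachable at the root of your domain, a minimal Python sketch (standard library only; yourdomain.com is a placeholder):
from urllib.request import urlopen

# Placeholder domain; replace with your own site.
url = "https://yourdomain.com/robots.txt"

with urlopen(url) as response:
    # A 200 status means the file is reachable at the site root.
    print(response.status)
    # Print the file contents to confirm the expected rules were uploaded.
    print(response.read().decode("utf-8"))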
Rules
A rule is an instruction that tells a crawler which parts of your site it can or cannot crawl.
Guidelines that need to be observed:
- You can have one or more groups
- Each group contains multiple directives, one per line, and begins with a User-agent line that specifies which crawler the group applies to
- A group gives the following information:
  - Who the group applies to (the user agent)
  - Which directories or files the agent can access
  - Which directories or files the agent cannot access
- Groups are processed from top to bottom. A user agent matches only one rule set: the first, most specific group that applies to it
- By default, a user agent can crawl any page or directory that is not blocked by a disallow rule
- All rules are case-sensitive
- The beginning of a comment is marked by #
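For illustration, a sketch that applies these guidelines (the paths are hypothetical): Googlebot matches the first, more specific group and ignores the second one, while every other crawler falls back to the group for *. The # lines are comments.
# This group applies only to Googlebot
User-agent: Googlebot
Disallow: /archive/
Allow: /archive/public/

# This group applies to every other crawler
User-agent: *
Disallow: /tmp/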
Supported directives for Google's crawlers
- user-agent (required)
- disallow / allow (at least one disallow or allow entry per group)
- sitemap (optional)
Upload the file
Once saved on your computer, the file is ready to be uploaded to the server. There is no dedicated tool for this step; how you upload it depends on the server and hosting setup you use.
Test the robots.txt markup
Google offers two options for this:
- Use the robots.txt tester in Google Search Console.
- Google's open-source robots.txt library for developers.
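If you prefer to check rules programmatically, the sketch below uses Python's standard urllib.robotparser module. This is not Google's library, and its handling of edge cases may differ slightly from Googlebot's; the expected results in the comments assume the example robots.txt shown in the next section.
from urllib.robotparser import RobotFileParser

# Placeholder URL; point this at your own robots.txt.
parser = RobotFileParser()
parser.set_url("https://www.example.com/robots.txt")
parser.read()

# Ask whether a given user agent may fetch a given URL.
# With the example file below, Googlebot is blocked from /nogooglebot/ -> False
print(parser.can_fetch("Googlebot", "https://www.example.com/nogooglebot/page.html"))
# Any other crawler falls back to the * group, which allows everything -> True
print(parser.can_fetch("SomeOtherBot", "https://www.example.com/any-page.html"))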
Submit it to Google
Once all the steps above are done, Google crawlers will be able to find and use your robots.txt file. There is no other action needed on your side unless you update the file. In this case, you need to refresh Google's cached copy. You can read more about this in the following article.
File Structure Example
# Googlebot may not crawl anything under /nogooglebot/
User-agent: Googlebot
Disallow: /nogooglebot/

# All other crawlers may crawl the entire site
User-agent: *
Allow: /

# Location of the sitemap
Sitemap: https://www.example.com/sitemap.xml
You can find more information about robots.txt rules in the following Google article.
2. Using Meta Tags
Meta tags let you control crawling and indexing at the page level. They are placed in the HTML <head> section.
Example
<meta name="robots" content="noindex, nofollow">
Best Practices
- Supported in both HTML and XHTML
- Not case-sensitive (except some tags)
- Unsupported tags will be ignored by Google
You can find the complete list of meta tags processed by Google here.
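For instance, robots rules can also be addressed to a single crawler by using its user agent name in place of robots. This sketch asks only Googlebot not to index the page, while other crawlers remain unaffected:
<meta name="googlebot" content="noindex">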
Exclude Content with data-nosnippet:
Use the data-nosnippet attribute to prevent part of a page from showing in search result snippets:
<p>
This text can be shown.
<span data-nosnippet>This will not be shown.</span>
</p>
De-indexing Pages from Google
1. Use the noindex Meta Tag
The noindex meta tag needs to be added to the <head> of every page that should be de-indexed.
The change takes effect the next time Googlebot crawls the page. How long that takes depends on how often Google crawls the page, so this process may take a while.
Add the following to your HTML <head>:
<meta name="robots" content="noindex">
2. Manually de-index the pages
This can be done in Google Search Console (formerly Google Webmaster Tools) using the Removals tool. The pages can be submitted manually, and Google will recrawl them and see the noindex meta tag.
Note: The robots.txt file can also be used to tell Google not to crawl certain pages. This is generally used for directories that should not be crawled; keep in mind that blocking crawling does not guarantee the pages won’t appear in the index.
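For example (the directory name is hypothetical):
User-agent: *
Disallow: /internal-reports/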
To read more about how to ensure machine readability, have a look at this section.
+1 Using llms.txt and llms-full.txt with LLMs
With the rapid development of AI and large language models (LLMs), a new method is emerging to make technical content more accessible to these systems: the use of llms.txt and llms-full.txt files.
These two text files can improve how LLMs access and understand your API or programming documentation.
Key Differences:
- llms.txt is a specially formatted text file designed to help large language models (LLMs) and AI agents find, access, and understand technical documentation, such as API references or programming guides. It contains a list of links with short summaries that LLMs can follow to access full content. Using this file is a simple yet effective way to make your documentation more LLM-friendly (see the sketch after the bot list below).
  - What it’s used for: llms.txt works like a sitemap for LLMs, listing key documentation links with brief descriptions. It helps developer tools and IDEs (like Cursor or Windsurf) guide LLMs to accurate, relevant resources, improving their performance and reducing errors.
- llms-full.txt is a plain text file that contains the entire content of your technical documentation in one place. It’s designed to be directly consumed by large language models (LLMs), improving their ability to understand and answer questions about your product, API, or codebase without needing to follow external links.
  - What it’s used for: llms-full.txt is ideal when you want to give LLMs direct access to complete documentation. It’s especially useful in environments that support large context windows or Retrieval-Augmented Generation (RAG), enabling models to generate more accurate and context-rich answers.
Common LLM Bots Using These Files:
Some of the LLM user agents that can leverage these files include:
- OAI-SearchBot
- ChatGPT
- GPTBot
- ClaudeBot
- Amazonbot
- Perplexity
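As an illustration, here is a minimal llms.txt sketch following the commonly referenced llmstxt.org format (the project name, URLs, and descriptions are made up): an H1 title, a short blockquote summary, and H2 sections listing links with one-line descriptions.
# Example Project
> Developer documentation for the Example Project API.

## Docs
- [Quickstart](https://docs.example.com/quickstart.md): Set up a client and make a first request
- [API Reference](https://docs.example.com/api-reference.md): Endpoints, parameters, and error codes

## Optional
- [Changelog](https://docs.example.com/changelog.md): Release history and deprecations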