How to Control Crawling and Indexing on Your Website?

Managing how search engines and AI crawlers interact with your site is crucial for performance, SEO, and privacy. Whether you’re looking to block sensitive content, improve crawl efficiency, or support AI agents with documentation, this guide walks you through your options.

There are two primary methods for controlling crawling and indexing:

1. Using robots.txt

The robots.txt file is a standard used to instruct web crawlers (also known as bots or spiders) which parts of your website they can or cannot access. This file sits at the root of your domain and acts as your site’s first line of communication with crawlers.

What `robots.txt` can do:

Control crawler traffic to reduce server load
Prevent crawling of specific resources
Block access to scripts or stylesheets
Prevent rich media from appearing in search results

Note: Blocking a URL via robots.txt doesn’t guarantee it won’t appear in search results. Search engines may still index it if other sites link to it.

Use Cases of robots.txt file:

Resource Type	Can Be Blocked by robots.txt?	Notes
Web Pages	✅	Blocks crawling, not necessarily indexing.
Media Files	✅	Hidden from search but can still be linked directly.
Resource Files	✅	Ensure functionality is not broken.

Limitations

Not all crawlers obey robots.txt -> well-known crawlers (such as Googlebot) follow the instructions in the robots.txt file, but other crawlers might not.
Syntax differences across crawlers -> even reputable crawlers do not all interpret robots.txt syntax the same way, so check the syntax that each crawler you target expects.
You can find more information about the different syntaxes in the following article.
Pages may still be indexed if linked externally -> robots.txt stops a compliant crawler from fetching the content, but the URL itself can still be indexed, without its content, if other pages link to it.

For stronger control, consider using noindex tags, password protection, or removing the page entirely.

How to Create a robots.txt File

Create a plain text file using UTF-8 encoding
Name it robots.txt and place it at your site’s root
Test its visibility by visiting https://yourdomain.com/robots.txt

Note:

You can only have one robots.txt file per site
If, for example, you need to use it for the URL https://www.domain.com/, the file must be located at https://www.domain.com/robots.txt and not at https://www.domain.com/pages/robots.txt.
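A robots.txt file can also be placed at the root of a subdomain (https://www.example.domain.com/robots.txt) or on a non-standard port (https://www.domain.com:8181/robots.txt). Each file applies only to the host, protocol, and port it is served from.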
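
As a rough illustration of the location rules above, the following sketch derives where a crawler would look for robots.txt given any page URL. The helper name robots_txt_url is an assumption made for this example; it only rewrites the URL and does not fetch anything.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url: str) -> str:
    """Return the robots.txt location that governs page_url (hypothetical helper)."""
    parts = urlsplit(page_url)
    # Keep the scheme and host (including any port); replace the path and
    # drop the query string and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_txt_url("https://www.domain.com/pages/about.html"))
print(robots_txt_url("https://www.domain.com:8181/docs/index.html?lang=en"))
```

The first call yields https://www.domain.com/robots.txt, never a path under /pages/, and the second keeps the port because a file on port 8181 is separate from the one on the default port.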

Rules

A rule is an instruction given to the crawler telling it which part of your site can be crawled.
Guidelines that need to be observed:

You can have one or more groups
Each group consists of multiple instructions, one per line. Each group begins with a User-agent line that specifies the crawler the group applies to.
A group gives the following information:
- Who the group applies to (the user agent).
- Which directories or files that agent can access.
- Which directories or files that agent cannot access.
Groups are processed from top to bottom. A user agent can match only one rule set: the first, most specific group that matches it.
By default, a user agent can crawl any page or directory that is not blocked by a disallow rule.
Rules are case-sensitive: for example, Disallow: /file.asp applies to /file.asp but not to /FILE.asp.
The beginning of a comment is marked by #
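
To make the "one group per user agent" guideline concrete, here is a minimal sketch of how a crawler might pick its group. It assumes that "most specific" means the longest User-agent token that is a prefix of the crawler's name, with * as the fallback; the function select_group and that matching rule are illustrative assumptions, not a description of any particular crawler's implementation.

```python
from typing import Iterable, Optional


def select_group(group_tokens: Iterable[str], crawler_name: str) -> Optional[str]:
    """Pick the User-agent token whose group the crawler should follow.

    Assumption for this sketch: the longest token that prefixes the
    crawler name wins; "*" is used only when nothing else matches.
    """
    tokens = list(group_tokens)
    name = crawler_name.lower()
    matching = [t for t in tokens if t != "*" and name.startswith(t.lower())]
    if matching:
        return max(matching, key=len)
    return "*" if "*" in tokens else None


groups = ["googlebot-news", "*", "googlebot"]
for crawler in ("Googlebot-News", "Googlebot-Image", "OtherBot"):
    print(crawler, "->", select_group(groups, crawler))
```

Under these assumptions a crawler follows exactly one group, even when several tokens match its name, and rules from the other groups are not merged in.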

Supported directives for Google's crawlers

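user-agent (required, one or more per group)
disallow (at least one disallow or allow entry per group)
allow (at least one disallow or allow entry per group)
sitemap (optional, zero or more per file)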

Upload the file

Once you have saved the file on your computer, upload it to the root of your site. There is no dedicated tool for this, because the upload method depends on your site and server setup, so check your hosting provider's documentation for how to do it.

Test robots.txt markup & Submit

Google offers two options for this:

Use the robots.txt tester in Google Search Console.
Google's open-source robots.txt library for developers.

Submit it to Google

Once all the steps above are done, Google crawlers will be able to find and use your robots.txt file. There is no other action needed on your side unless you update the file. In this case, you need to refresh Google's cached copy. You can read more about this in the following article.

File Structure Example

User-agent: Googlebot
Disallow: /nogooglebot/

User-agent: *
Allow: /
Sitemap: https://www.example.com/sitemap.xml

You can find more information about robots.txt rules in the following Google article.
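
One way to sanity-check a file like the example above before uploading it is Python's standard urllib.robotparser module. The sketch below parses the same rules and asks which URLs each crawler may fetch. Note that this is Python's parser, not Google's, so edge cases may be interpreted differently; the "OtherBot" name and the page paths are made up for the example, and site_maps() requires Python 3.8 or later.

```python
from urllib.robotparser import RobotFileParser

# The example file from this section, inlined so nothing is fetched.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /nogooglebot/

User-agent: *
Allow: /
Sitemap: https://www.example.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(user_agent: str, path: str) -> bool:
    """Return True if user_agent may crawl the given path on example.com."""
    return parser.can_fetch(user_agent, "https://www.example.com" + path)


print(allowed("Googlebot", "/nogooglebot/page.html"))  # blocked for Googlebot
print(allowed("Googlebot", "/index.html"))             # not covered by a disallow rule
print(allowed("OtherBot", "/nogooglebot/page.html"))   # falls under the * group
print(parser.site_maps())                              # sitemap URLs listed in the file
```

Only the first check is refused: Googlebot is kept out of /nogooglebot/, while every other crawler matches the * group and may crawl the whole site.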

2. Using Meta Tags

Meta tags let you control crawling and indexing at the page level. They are placed in the HTML <head> section.

Example

<meta name="robots" content="noindex, nofollow">

You can find the complete list of meta tags processed by Google here.
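
To check which directives a page actually declares, a small script can read the robots meta tag back out of the HTML. The following is a minimal sketch using Python's built-in html.parser; the class and function names are assumptions for illustration, and it looks only at name="robots" (not crawler-specific names such as googlebot, and not HTTP headers).

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the directives found in <meta name="robots"> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        # Tag and attribute names arrive lowercased; normalize the values too.
        if (attributes.get("name") or "").lower() == "robots":
            content = attributes.get("content") or ""
            self.directives.update(
                part.strip().lower() for part in content.split(",") if part.strip()
            )


def robots_directives(html: str) -> set:
    """Return the set of robots directives declared in the page (hypothetical helper)."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


page = '<html><head><meta name="robots" content="noindex, nofollow"></head></html>'
print(sorted(robots_directives(page)))
```

For the example tag above, the script reports the two directives noindex and nofollow.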

Best Practices

Supported in both HTML and XHTML
Not case-sensitive (except some tags)
Unsupported tags will be ignored by Google

Exclude Content with data-nosnippet:

Use the data-nosnippet attribute to prevent part of a page from showing in search result snippets:

<p>
  This text can be shown.
  <span data-nosnippet>This will not be shown.</span>
</p>
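
To preview which text remains eligible for snippets, the sketch below walks the HTML and drops anything nested inside an element that carries data-nosnippet. It is an approximation under stated assumptions: the names are hypothetical, the markup is expected to be well formed (every non-void element has a closing tag), and it does not reproduce how a search engine actually builds snippets.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class SnippetTextExtractor(HTMLParser):
    """Collect text that lies outside any data-nosnippet element."""

    def __init__(self):
        super().__init__()
        self.hidden_depth = 0      # > 0 while inside a data-nosnippet element
        self.visible_parts = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.hidden_depth > 0:
            self.hidden_depth += 1  # nested element inside a hidden region
        elif any(name == "data-nosnippet" for name, _ in attrs):
            self.hidden_depth = 1   # entering a hidden region

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.hidden_depth > 0:
            self.hidden_depth -= 1

    def handle_data(self, data):
        if self.hidden_depth == 0:
            self.visible_parts.append(data)


def snippet_eligible_text(html: str) -> str:
    """Return the page text with data-nosnippet regions removed (hypothetical helper)."""
    extractor = SnippetTextExtractor()
    extractor.feed(html)
    extractor.close()
    # Collapse runs of whitespace left behind by the removed regions.
    return " ".join("".join(extractor.visible_parts).split())


html = """<p>
  This text can be shown.
  <span data-nosnippet>This will not be shown.</span>
</p>"""
print(snippet_eligible_text(html))
```

Run on the example above, only "This text can be shown." survives.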

De-indexing Pages from Google

1. Use the noindex Meta Tag

The noindex meta tag must be added to the <head> of every page that should be de-indexed.
The change takes effect the next time Googlebot crawls the page, so how quickly a page drops out of the index depends on how often Google crawls it. This may take a while.

Add the following to your HTML <head>:

<meta name="robots" content="noindex">

2. Manually de-index the pages

This can be done in the Google Webmaster Tools account -> Crawl menu -> 'remove a page from the index' link. Pages can be added manually, after which Google will immediately crawl them and see the noindex meta tag.

Note: The robots.txt file can also be used to tell Google not to crawl pages. This is generally used for directories that should not be crawled. Keep in mind that a page blocked by robots.txt cannot be recrawled, so Google will not see a noindex tag placed on it.

Good to know

If you escape tags with HTML entities (&lt; and &gt;), they are treated as plain text rather than markup: &lt;p&gt; is displayed as <p>.
To read more about how to ensure machine readability, have a look at this section.

+1 Using llms.txt and llms-full.txt with LLMs

With the rapid development of AI and large language models (LLMs), a new method is emerging to make technical content, such as API or programming documentation, easier for these systems to access and understand: the llms.txt and llms-full.txt files.

Key Differences:

llms.txt is a specially formatted text file designed to help large language models (LLMs) and AI agents find, access, and understand technical documentation, such as API references or programming guides. It contains a list of links with short summaries that LLMs can follow to access full content. Using this file is a simple yet effective way to make your documentation more LLM-friendly.
- What it’s used for:
  llms.txt works like a sitemap for LLMs, listing key documentation links with brief descriptions. It helps developer tools and IDEs (like Cursor or Windsurf) guide LLMs to accurate, relevant resources—improving their performance and reducing errors.
llms-full.txt is a plain text file that contains the entire content of your technical documentation in one place. It’s designed to be directly consumed by large language models (LLMs), improving their ability to understand and answer questions about your product, API, or codebase—without needing to follow external links.
- What it’s used for:
  llms-full.txt is ideal when you want to give LLMs direct access to complete documentation. It’s especially useful in environments that support large context windows or Retrieval-Augmented Generation (RAG), enabling models to generate more accurate and context-rich answers.

Common LLM Bots Using These Files:

Some of the LLM user agents that can leverage these files include:
- OAI-SearchBot
- ChatGPT
- GPTBot
- ClaudeBot
- Amazonbot
- Perplexity
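
To show what the "list of links with short summaries" in llms.txt might look like in practice, here is a minimal generator sketch. This article does not specify the file layout, so the Markdown structure used here (a title, a quoted summary, then sections of described links) follows the community llms.txt proposal as commonly described and should be treated as an assumption. The function name, section names, and example.com URLs are all hypothetical.

```python
from typing import Dict, List, Tuple

# (link title, URL, short description)
Link = Tuple[str, str, str]


def build_llms_txt(title: str, summary: str, sections: Dict[str, List[Link]]) -> str:
    """Render an llms.txt body: title, summary, then sections of described links."""
    lines = [f"# {title}", "", f"> {summary}", ""]
    for section_name, links in sections.items():
        lines.append(f"## {section_name}")
        lines.append("")
        for link_title, url, description in links:
            lines.append(f"- [{link_title}]({url}): {description}")
        lines.append("")
    return "\n".join(lines)


content = build_llms_txt(
    title="Example API",
    summary="Reference and guides for the Example API.",
    sections={
        "Docs": [
            ("Quickstart", "https://www.example.com/docs/quickstart.md",
             "Install the client and make a first request."),
            ("API reference", "https://www.example.com/docs/api.md",
             "Endpoints, parameters, and error codes."),
        ],
    },
)
print(content)
```

The resulting text would be saved as llms.txt at the site root, next to robots.txt. An llms-full.txt file, by contrast, would contain the full text of those documents rather than links to them.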
