How to control crawling and indexing
Three methods for telling AI crawlers and search engines what to read, block, or ignore on your site.
TL;DR
There are three main ways to control how AI crawlers and search engines access and index your website: robots.txt for site-wide crawl rules, meta tags for page-level index control, and llms.txt files for structured access by AI agents. Each method solves a different problem, and they can be used together. Choose based on how granular the control needs to be.
robots.txt
A robots.txt file tells AI crawlers and search engines which parts of your site they can and cannot access. Place it at the root of your domain so crawlers find it on their first visit.
What robots.txt controls:
- Crawler traffic volume, to reduce unnecessary server load.
- Access to specific pages, directories, scripts, or stylesheets.
- Whether media files appear in AI-generated answers or search results.
⚠️ Blocking a URL in robots.txt does not guarantee it will be excluded from AI-generated answers or search results. If the page is linked from an external site, crawlers may still index it. For stronger control, use a noindex meta tag, password protection, or remove the page entirely.
Setting up your robots.txt file
Create a plain text file named robots.txt using UTF-8 encoding, and place it at the root of your domain. For example, if your site is https://www.yourdomain.com/, the file must live at https://www.yourdomain.com/robots.txt. Verify it's accessible by opening that URL in a browser.
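Because the file has to sit at the root of the host, its URL can be derived from any page URL on the same site. The following is a minimal Python sketch using only the standard library; the helper name robots_txt_url is hypothetical and not part of any tool mentioned in this article.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url: str) -> str:
    """Return the root-level robots.txt URL for the host that serves page_url."""
    parts = urlsplit(page_url)
    # Keep the scheme and host (including any port); drop path, query, and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_txt_url("https://www.yourdomain.com/blog/post?id=1"))
```

Opening the printed URL in a browser is the same accessibility check described above.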
Rules to know before you write your file:
- One robots.txt file per site. You can host it on a subdomain or non-standard port.
- All rules are case-sensitive.
- Comments start with #.
- Each group begins with a User-agent line, followed by Allow and Disallow directives.
- Crawlers process groups top to bottom. One crawler matches only one group.
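The case-sensitivity and comment rules above can be checked locally. The sketch below uses Python's standard-library urllib.robotparser, which is only one parser implementation and may not match every crawler's behavior. The /Private/ path and the ExampleBot user agent are made-up placeholders.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: one comment line and one case-sensitive path.
RULES = """\
# Keep crawlers out of the private area (comments start with #)
User-agent: *
Disallow: /Private/
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())

# The rule matches the path only when the capitalization is identical.
exact_case_allowed = parser.can_fetch(
    "ExampleBot", "https://www.example.com/Private/report.html"
)
lower_case_allowed = parser.can_fetch(
    "ExampleBot", "https://www.example.com/private/report.html"
)

print(exact_case_allowed)  # False: /Private/ is disallowed
print(lower_case_allowed)  # True: /private/ is a different path
```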
Directives supported by Google's crawlers:
| Directive | Required? |
|---|---|
| user-agent | Required |
| disallow | Required (at least one disallow or allow rule per group) |
| sitemap | Optional |
Here is an example file that blocks Googlebot from one directory and allows all other crawlers:
User-agent: Googlebot
Disallow: /nogooglebot/
User-agent: *
Allow: /
Sitemap: https://www.example.com/sitemap.xml
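Before deploying a file like this one, it can help to confirm that the rules do what you expect. The sketch below feeds the example to Python's standard-library urllib.robotparser. This is an approximation: the Python parser may resolve overlapping Allow and Disallow rules differently from Googlebot, and OtherBot is a hypothetical user agent used only for illustration.

```python
from urllib.robotparser import RobotFileParser

EXAMPLE_ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /nogooglebot/

User-agent: *
Allow: /

Sitemap: https://www.example.com/sitemap.xml
"""


def can_crawl(user_agent: str, url: str) -> bool:
    """Parse the example file and report whether user_agent may fetch url."""
    parser = RobotFileParser()
    parser.parse(EXAMPLE_ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


blocked_url = "https://www.example.com/nogooglebot/page.html"
print(can_crawl("Googlebot", blocked_url))                      # False
print(can_crawl("Googlebot", "https://www.example.com/about"))  # True
print(can_crawl("OtherBot", blocked_url))                       # True
```

Googlebot matches its own group and is kept out of /nogooglebot/, while any other crawler falls through to the wildcard group.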
For the complete syntax reference, see Google's robots.txt guide.
Limitations of robots.txt
- Not all crawlers obey robots.txt. Well-known crawlers like Googlebot follow it; others may not.
- Crawlers may interpret syntax differently. Check the robots.txt syntax reference for per-crawler differences.
- Pages linked from external sites may still be indexed, even if blocked here.
Testing and submitting your file
Test your robots.txt rules using the robots.txt report in Google Search Console, which replaced the earlier robots.txt tester. Developers can also use Google's open-source robots.txt library.
Google crawlers will find and use the file automatically. No manual submission is required unless you update the file. If you do update it, refresh Google's cached copy so the new rules take effect promptly.
ℹ️ Prerender.io respects your robots.txt configuration. For more on how Prerender.io identifies itself to your server, see the overview of Prerender.io crawlers.
Meta tags
Meta tags let you control crawling and indexing at the page level. Add them in the <head> section of your HTML.
The example below tells AI crawlers and search engines not to index the page and not to follow any links on it:
<meta name="robots" content="noindex, nofollow">
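To audit which directives a page actually declares, you can parse its HTML and read the robots meta tag. The following is a rough Python sketch using the standard-library html.parser module. The class and function names are hypothetical, and it only reads the generic robots meta tag, not crawler-specific variants.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect directives from <meta name="robots" content="..."> tags."""

    def __init__(self) -> None:
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        # Compare the name case-insensitively and split the comma-separated list.
        if attr.get("name", "").lower() == "robots":
            for directive in attr.get("content", "").split(","):
                directive = directive.strip().lower()
                if directive:
                    self.directives.add(directive)


def robots_directives(html_text: str) -> set:
    """Return the set of robots directives declared in html_text."""
    parser = RobotsMetaParser()
    parser.feed(html_text)
    parser.close()
    return parser.directives


page = '<html><head><meta name="robots" content="noindex, nofollow"></head></html>'
print(sorted(robots_directives(page)))  # ['nofollow', 'noindex']
```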
For the full list of supported directives, see Google's special tags reference.
Key properties:
- Supported in HTML and XHTML.
- Not case-sensitive (with some exceptions).
- Unsupported tags are ignored by Google.
Excluding content from search snippets
Use the data-nosnippet attribute to prevent specific text from appearing in AI-generated answers or search result snippets:
<p> This text can appear in snippets.
<span data-nosnippet>This text will not appear in snippets.</span>
</p>
See the data-nosnippet documentation for more detail.
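To preview which text remains eligible for snippets, you can walk the HTML and skip everything nested inside an element that carries data-nosnippet. The Python sketch below illustrates what the attribute marks; it is not a description of how Google processes pages. It assumes well-formed HTML with matched tags, and the class and function names are hypothetical.

```python
from html.parser import HTMLParser

# Void elements never get a closing tag, so they must not change the nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class SnippetTextParser(HTMLParser):
    """Collect text that is not inside a data-nosnippet element."""

    def __init__(self) -> None:
        super().__init__()
        self.chunks = []
        self.skip_depth = 0  # greater than 0 while inside a data-nosnippet element

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.skip_depth:
            self.skip_depth += 1
        elif any(name == "data-nosnippet" for name, _ in attrs):
            self.skip_depth = 1

    def handle_startendtag(self, tag, attrs):
        pass  # self-closing tags open no nesting level

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth:
            self.chunks.append(data)


def snippet_eligible_text(html_text: str) -> str:
    parser = SnippetTextParser()
    parser.feed(html_text)
    parser.close()
    # Collapse runs of whitespace left behind by the removed elements.
    return " ".join("".join(parser.chunks).split())


sample = """<p> This text can appear in snippets.
<span data-nosnippet>This text will not appear in snippets.</span>
</p>"""
print(snippet_eligible_text(sample))  # This text can appear in snippets.
```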
⚠️ If your tags are HTML-escaped, they render as plain text and Googlebot will not read them. The meta tag must be valid HTML inside your <head>.
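The difference between a real tag and an escaped one is easy to see with an HTML parser. In the Python sketch below, html.escape stands in for whatever templating step might escape your markup (an assumption for illustration), and the TagCollector class is a hypothetical helper.

```python
import html
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Record the name of every start tag the parser recognizes."""

    def __init__(self) -> None:
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def start_tags(markup: str) -> list:
    collector = TagCollector()
    collector.feed(markup)
    collector.close()
    return collector.tags


raw_tag = '<meta name="robots" content="noindex">'
escaped_tag = html.escape(raw_tag)  # the angle brackets become &lt; and &gt;

print(start_tags(raw_tag))      # ['meta']
print(start_tags(escaped_tag))  # []
```

The escaped version is parsed as text content, so a parser never sees a meta element at all.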
De-indexing pages from Google
Use one of two approaches to remove a page from Google's index.
Option 1: noindex meta tag
Add the following tag to the <head> of any page you want de-indexed:
<meta name="robots" content="noindex">
Googlebot will remove the page the next time it crawls. The timeline depends on crawl frequency, so the change may not be reflected immediately.
ℹ️ robots.txt can also prevent crawling of a directory. For individual pages, the noindex meta tag is the more precise approach. See best practices for Prerender.io integration for guidance on combining these controls effectively.
Option 2: manual removal via Google Search Console
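Open Google Search Console, go to Indexing, and use the Removals tool. A removal request hides the URL from Google Search results temporarily (for about six months), so keep the noindex tag on the page. Googlebot drops the page from the index once it recrawls the page and sees the tag.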
llms.txt and llms-full.txt
llms.txt and llms-full.txt are a newer way to make your content accessible to AI agents. These files don't control indexing. Instead, they help AI agents understand and navigate your documentation more accurately.
ℹ️ These files are most relevant if your site includes API references, technical documentation, or content targeted at developer tools like Cursor or Windsurf.
| File | Purpose |
|---|---|
| llms.txt | A structured list of links with short descriptions. Works like a sitemap for AI agents, helping them locate the most relevant documentation pages. |
| llms-full.txt | A single plain text file containing your complete documentation. Designed for direct consumption by models with large context windows or Retrieval-Augmented Generation (RAG) systems. |
AI agents that may use these files include GPTBot, ClaudeBot, OAI-SearchBot, Amazonbot, and Perplexity.
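As an illustration of the "structured list of links with short descriptions", the Python sketch below generates a small llms.txt body. The layout (a title heading, a quoted summary, then sections of linked entries) follows the public llms.txt proposal as commonly described and is an assumption here, not something this article specifies. Every name, URL, and description is a placeholder.

```python
from dataclasses import dataclass


@dataclass
class DocLink:
    title: str
    url: str
    description: str


def build_llms_txt(site_name: str, summary: str, sections: dict) -> str:
    """Render a link index: title, quoted summary, then one link list per section."""
    lines = [f"# {site_name}", "", f"> {summary}", ""]
    for heading, links in sections.items():
        lines.append(f"## {heading}")
        lines.append("")
        for link in links:
            lines.append(f"- [{link.title}]({link.url}): {link.description}")
        lines.append("")
    # End the file with exactly one trailing newline.
    return "\n".join(lines).rstrip() + "\n"


example = build_llms_txt(
    "Example Docs",
    "Placeholder summary of what the documentation covers.",
    {
        "Docs": [
            DocLink(
                "Getting started",
                "https://www.example.com/docs/start",
                "Install and configure the service.",
            )
        ]
    },
)
print(example)
```

Saving the returned text at the root of your site as llms.txt would make it discoverable the same way robots.txt is, assuming the agent in question looks for it.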
💬 Still need help? If you have questions about controlling crawling, indexing, or how Prerender.io interacts with your robots.txt setup, our support team can help. → Contact us at support@prerender.io