RFC 9309, published September 2022, is the first formal internet standard for the Robots Exclusion Protocol. It standardizes robots.txt parsing, officially recognizes the Allow directive, defines a 500 KiB file size limit, and specifies crawler behavior for different HTTP status codes.

Robots.txt Complete Guide 2026: Control Crawlers

Q: What is robots.txt and why does it matter for SEO?

Robots.txt is a plain text file at your website root that tells search engine crawlers and AI bots which pages to access. It controls crawl budget allocation, prevents indexing of private pages, and manages AI crawler access to your content.

Q: Does robots.txt prevent a page from being indexed?

No. Robots.txt only blocks crawling, not indexing. Google can still index a blocked page if external links point to it. To prevent indexing, use the noindex meta tag or X-Robots-Tag HTTP header instead.

Q: Should I block AI crawlers like GPTBot and ClaudeBot?

It depends on your content strategy. Block them if you want to protect premium or paywalled content from AI training. Allow them if you want visibility in AI-powered search results like ChatGPT search or Google AI Overviews.

Q: What happens if robots.txt returns a 5xx server error?

Search engine crawlers treat a 5xx error on robots.txt as a temporary full-site block. No pages will be crawled until the error resolves. This can effectively remove your site from search results.

Q: What is the Crawl-delay directive and does Google support it?

Crawl-delay tells crawlers to wait a specified number of seconds between requests. Google completely ignores this directive. Bing and some other crawlers respect it. Use Google Search Console crawl rate settings instead.

Q: Can I use wildcards in robots.txt?

Yes. Google and most modern crawlers support * (match any characters) and $ (match end of URL) wildcards. For example, Disallow: /*.pdf$ blocks all PDF files. These are not part of RFC 9309 but are widely supported.

1. What is Robots.txt and the RFC 9309 Standard

The robots.txt file is a plain text file located at the root of your website (e.g., https://example.com/robots.txt) that communicates with web crawlers using the Robots Exclusion Protocol. Originally proposed by Martijn Koster in 1994, the protocol operated without a formal standard for nearly three decades until the Internet Engineering Task Force (IETF) published RFC 9309 in September 2022.

RFC 9309 brought several important clarifications and rules that every webmaster should understand in 2026:

Official recognition of the Allow directive: Previously informal, Allow is now a standard directive for creating exceptions within Disallow rules.
500 KiB file size limit: Crawlers may ignore content beyond 500 KiB (approximately 500 KB). Most robots.txt files are under 10 KB.
UTF-8 encoding required: The file must be served as UTF-8 encoded plain text with Content-Type: text/plain.
Longest-match-wins rule: When multiple Allow and Disallow rules match the same path, the most specific (longest) rule takes precedence.
HTTP status code handling: A 5xx response means “assume full disallow” while a 4xx response means “assume full allow.”
Group structure: Rules are organized in groups starting with one or more User-agent lines, followed by Allow/Disallow rules.

Important Note

Robots.txt is a polite request, not an access control mechanism. Well-behaved crawlers (Googlebot, Bingbot, GPTBot) honor it, but malicious bots can and do ignore it. For true access control, use server-side authentication, IP blocking, or HTTP authentication.

Figure 1: Robots.txt file structure showing directive groups, sitemap placement, and key RFC 9309 rules

2. All Robots.txt Directives Explained

User-agent

The User-agent directive specifies which crawler a set of rules applies to. Use * to target all crawlers, or specify individual bot names for targeted rules. A crawler looks for its specific User-agent block first; if none exists, it falls back to the * block.

# Rules for all crawlers
User-agent: *
Disallow: /admin/

# Specific rules for Googlebot
User-agent: Googlebot
Allow: /

# Multiple user-agents can share rules (RFC 9309)
User-agent: Googlebot
User-agent: Bingbot
Disallow: /internal/

Disallow

The Disallow directive tells crawlers which URL paths they should not access. Paths are case-sensitive and match from the start of the URL path. An empty Disallow: value means “allow everything” per RFC 9309.

User-agent: *
Disallow: /admin/        # Blocks /admin/ and all sub-paths
Disallow: /private/      # Blocks /private/ and all sub-paths
Disallow: /search        # Blocks /search, /search?q=test, etc.
Disallow:                # Empty = allow everything (per RFC 9309)

Allow

The Allow directive permits access to specific paths within an otherwise disallowed directory. It is most useful when you want to block a directory but allow access to certain files inside it. Per RFC 9309, the longest (most specific) matching rule wins.

User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php  # Allow AJAX handler within blocked dir

# The longer path wins:
# /wp-admin/admin-ajax.php -> ALLOWED (17 chars > 10 chars)
# /wp-admin/options.php    -> BLOCKED

Crawl-delay

The Crawl-delay directive requests crawlers to wait a specified number of seconds between requests. This can help servers that struggle under crawl pressure. However, Google completely ignores Crawl-delay. Bing, Yandex, and some other crawlers respect it.

Warning

Setting a high Crawl-delay (e.g., 30 seconds) for Bingbot can dramatically slow Bing's ability to discover and index your pages. For Google, use the Google Search Console crawl rate setting instead.

Sitemap

The Sitemap directive points crawlers to your XML sitemap location. Unlike other directives, Sitemap is not bound to any User-agent group -- it applies globally. You can include multiple Sitemap directives. Place them at the bottom of your robots.txt file.

# Sitemap directives (always at the bottom)
Sitemap: https://example.com/sitemap.xml
Sitemap: https://example.com/sitemap-news.xml
Sitemap: https://example.com/sitemap-images.xml.gz

Directive	RFC 9309	Google	Bing	Purpose
User-agent	Yes	Yes	Yes	Identify which crawler rules apply to
Disallow	Yes	Yes	Yes	Block access to specific paths
Allow	Yes	Yes	Yes	Override Disallow for specific paths
Crawl-delay	No	Ignored	Yes	Seconds between requests
Sitemap	Informational	Yes	Yes	XML sitemap discovery

3. AI Crawler Management in 2026

The explosion of large language models has introduced a new category of web crawlers that scrape content for AI training and AI-powered search. According to Dark Visitors, there are now over 200 known AI crawlers active on the web. The most important ones to manage in your robots.txt are those from major AI companies that respect the protocol.

The critical distinction in 2026 is between AI training crawlers (which scrape content to build models) and AI search crawlers (which fetch content to power AI search results). Blocking training crawlers protects your content, but blocking search crawlers reduces your visibility in AI-powered search engines.

Bot Name	Company	Purpose	Respects Robots.txt
GPTBot	OpenAI	AI training	Yes
OAI-SearchBot	OpenAI	ChatGPT search	Yes
ClaudeBot	Anthropic	AI training	Yes
PerplexityBot	Perplexity	AI search	Yes
Google-Extended	Google	Gemini training	Yes
Bytespider	ByteDance	AI training	Partially
CCBot	Common Crawl	Dataset collection	Yes
Applebot-Extended	Apple	Apple Intelligence	Yes

A 2024 Originality.ai study found that among the top 1,000 websites, 35% block GPTBot, 28% block Google-Extended, and 20% block CCBot. The trend toward selective blocking is growing: publishers want to protect training data while maintaining visibility in AI-powered search results.

Figure 3: AI crawler decision matrix -- allow or block based on your content type and business model

How to Block AI Training Crawlers While Allowing AI Search

The optimal 2026 strategy for most content-driven websites is to block AI training crawlers (which build models from your content) while allowing AI search crawlers (which display your content in search results):

# Block AI training crawlers
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Bytespider
Disallow: /

# ALLOW AI search crawlers for visibility
User-agent: OAI-SearchBot
Allow: /

User-agent: PerplexityBot
Allow: /

# Regular search engines always allowed
User-agent: Googlebot
Allow: /

User-agent: *
Allow: /
Disallow: /admin/

Sitemap: https://example.com/sitemap.xml

Best Practice

Review your AI crawler rules quarterly. New AI search engines and crawlers appear regularly. Check Dark Visitors for an updated list of active AI crawlers and their user-agent strings.

4. Crawl Budget Optimization

Crawl budget is the number of URLs a search engine will crawl on your site within a given timeframe. It is determined by two factors: crawl capacity (how fast your server can respond without degrading user experience) and crawl demand (how valuable and fresh Google perceives your content to be). According to Google's own documentation, crawl budget is primarily a concern for sites with more than 10,000 URLs.

Figure 2: Crawl budget funnel -- robots.txt determines which pages receive crawling resources from Google

What to Block for Better Crawl Efficiency

Block these low-value URL patterns in robots.txt to preserve crawl budget for your important content:

Search results pages: Internal site search creates infinite URL variations with no unique content (/search?q=*)
Faceted navigation and filters: E-commerce filter combinations generate thousands of near-duplicate pages (/*?sort=, /*?filter=)
User account pages: Cart, checkout, profile, and dashboard pages have no SEO value (/cart, /my-account)
Admin and staging areas: CMS admin panels, staging directories, and development environments (/wp-admin/, /staging/)
Calendar and archive pages: Date-based archives and calendar views that duplicate content (/*?year=, /calendar/)
Print and PDF versions: Duplicate print-friendly versions of existing pages (/*?print=1)

Critical: Never Block These

Never block JavaScript or CSS files in robots.txt. Google requires these resources to render your pages for mobile-first indexing. Blocking /*.js$ or /*.css$ causes Google to see a blank or broken page, resulting in severe ranking drops. Sites that block JS/CSS routinely score worse in Core Web Vitals assessments, because Google can't render the page the way real users see it.

Crawl Budget and Server Response

Server response codes for robots.txt itself have major consequences for your crawl budget:

Status Code	Crawler Behavior	SEO Impact
200 OK	Reads and follows directives	Normal operation
404 Not Found	Assumes no restrictions (full allow)	Lose crawl control, AI bots access everything
5xx Error	Temporarily blocks entire site	No crawling until resolved -- critical
429 Rate Limited	Same as 5xx (full block)	No crawling until rate limit resets

5. Testing Your Robots.txt

Testing your robots.txt before deploying changes is essential. A single typo can block your entire site. Here are the best tools for validation in 2026:

InstaRank SEO Robots.txt Checker

Our free robots.txt checker performs the most comprehensive analysis available. It detects AI crawler blocking, validates sitemap accessibility, checks RFC 9309 compliance including BOM detection and line-ending validation, and includes a generator with presets for all the major bots.

Google Search Console

Google Search Console's URL Inspection tool shows whether Googlebot can access specific pages. Check the Coverage report for pages blocked by robots.txt. Note that Google deprecated its standalone robots.txt tester tool in 2023, so the URL Inspection approach is now the primary method.

Third-Party Tools

Screaming Frog SEO Spider: Crawls your site and shows which URLs are blocked by robots.txt (free version: up to 500 URLs)
Ahrefs Site Audit: Flags robots.txt errors as part of comprehensive site auditing
Google Rich Results Test: Shows if resources needed for rendering are blocked by robots.txt

Manual Validation Checklist

Open https://yourdomain.com/robots.txt in your browser -- verify it shows plain text, not HTML
Check the Content-Type header in browser DevTools (Network tab) -- must be text/plain
Verify every Allow/Disallow has a User-agent above it (no orphaned directives)
Confirm no Disallow: / under User-agent: * (this blocks everything)
Verify no JS/CSS patterns are blocked (search for .js and .css in Disallow rules)
Test each Sitemap URL -- click and verify it returns valid XML or gzipped content
Verify AI crawler rules match your intended content strategy

6. Real Examples for Different Site Types

E-Commerce Site

# E-Commerce Robots.txt (2026 Best Practice)
User-agent: *
Allow: /
Disallow: /cart
Disallow: /checkout
Disallow: /my-account
Disallow: /wishlist
Disallow: /search
Disallow: /*?sort=
Disallow: /*?filter=
Disallow: /*?color=
Disallow: /*?size=
Disallow: /*?page=          # Pagination handled by rel=next/prev
Disallow: /cdn-cgi/         # Cloudflare internals

# Block AI training, allow AI search
User-agent: GPTBot
Disallow: /

User-agent: OAI-SearchBot
Allow: /products/
Allow: /categories/

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /

Sitemap: https://store.example.com/sitemap.xml
Sitemap: https://store.example.com/sitemap-products.xml

Blog / Content Site

# Blog/Content Site Robots.txt (2026 Best Practice)
User-agent: *
Allow: /
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php
Disallow: /search
Disallow: /tag/             # Tag archives (often thin content)
Disallow: /*?replytocom=    # Comment reply links

# Allow all AI crawlers (content marketing = visibility)
# Only block aggressive scrapers
User-agent: Bytespider
Disallow: /

User-agent: CCBot
Disallow: /

Sitemap: https://blog.example.com/sitemap.xml
Sitemap: https://blog.example.com/sitemap-news.xml

SaaS Application

# SaaS Application Robots.txt (2026 Best Practice)
User-agent: *
Allow: /
Allow: /docs/
Allow: /blog/
Allow: /pricing
Allow: /features
Disallow: /app/             # Protected application
Disallow: /api/             # API endpoints
Disallow: /dashboard/       # User dashboards
Disallow: /settings/
Disallow: /admin/

# Allow all crawlers for docs/blog visibility
# Block training on customer data areas
User-agent: GPTBot
Allow: /docs/
Allow: /blog/
Disallow: /

User-agent: ClaudeBot
Allow: /docs/
Allow: /blog/
Disallow: /

Sitemap: https://saas.example.com/sitemap.xml

Next.js Application (App Router)

// app/robots.ts — Next.js 14+ dynamic robots.txt
import { MetadataRoute } from 'next'

export default function robots(): MetadataRoute.Robots {
  return {
    rules: [
      {
        userAgent: '*',
        allow: '/',
        disallow: ['/api/', '/admin/', '/_next/'],
      },
      {
        userAgent: 'GPTBot',
        disallow: '/',
      },
      {
        userAgent: 'ClaudeBot',
        disallow: '/',
      },
      {
        userAgent: 'Google-Extended',
        disallow: '/',
      },
    ],
    sitemap: 'https://example.com/sitemap.xml',
  }
}

Figure 4: Before and after -- a misconfigured robots.txt fixed to score 100/100 with proper structure

7. Common Mistakes and How to Fix Them

Mistake 1: Blocking Everything with Disallow: /

Accidentally placing Disallow: / under User-agent: * blocks your entire site from all crawlers. This is the most catastrophic robots.txt error and we see it on approximately 3% of websites we audit. It most commonly happens during development (staging environments) when the rule is accidentally deployed to production.

Fix: Replace Disallow: / with Allow: / and add specific Disallow rules for paths you want to block.

Mistake 2: Blocking JavaScript and CSS

Rules like Disallow: /*.js$ or Disallow: /assets/ prevent Google from rendering your pages. Since Google switched to 100% mobile-first indexing in July 2024, rendering is essential -- Google must execute JavaScript and load CSS to understand your page content, layout, and Core Web Vitals.

Mistake 3: Orphaned Directives

Allow or Disallow rules that appear before any User-agent line are “orphaned” and silently ignored by RFC 9309-compliant crawlers. This is a sneaky bug because the rules look correct but have no effect. Always start with a User-agent line before any access rules.

Mistake 4: Broken Sitemap References

Referencing sitemaps that return 404 or 5xx errors wastes crawler attention and provides a negative signal. When you move or rename sitemaps, always update your robots.txt. Our tool tests each referenced sitemap and reports inaccessible ones.

Mistake 5: Serving HTML Instead of Plain Text

Single-page applications (React, Vue, Angular) often catch all routes and serve the HTML shell for /robots.txt. Crawlers see HTML, not robots.txt directives, and ignore the file entirely. Configure your server or CDN to serve a static robots.txt file before the SPA catch-all route.

Mistake 6: Ignoring AI Crawlers Entirely

In 2026, if your robots.txt has no rules for AI crawlers, you are implicitly allowing all of them to access your entire site. This means GPTBot, ClaudeBot, Bytespider, CCBot, and dozens of others are free to scrape your content for AI training. Whether you allow or block them should be a deliberate decision, not an oversight.

Mistake 7: Oversized Files

Per RFC 9309, crawlers may truncate files larger than 500 KiB. If your robots.txt exceeds this limit (usually from listing thousands of individual blocked URLs), rules at the end may be silently ignored. Use directory-level blocking and wildcard patterns instead.

8. Frequently Asked Questions

What is robots.txt and why does it matter for SEO?

Robots.txt is a plain text file placed at your website's root directory that instructs search engine crawlers and AI bots which pages they can and cannot access. It matters for SEO because it controls crawl budget allocation (ensuring Google spends its crawl budget on your important pages), prevents sensitive or low-value pages from consuming crawl resources, manages AI crawler access to your content, and helps search engines discover your sitemap. Without it, crawlers access everything with no guidance, which can waste crawl budget and expose private content.

Does robots.txt prevent a page from being indexed?

No. This is one of the most common misconceptions in SEO. Robots.txt prevents crawling, not indexing. If external sites link to a page that your robots.txt blocks, Google can still index that page based on the link's anchor text -- it just cannot crawl the page's content. To prevent indexing, use the noindex meta tag or the X-Robots-Tag: noindex HTTP header. Paradoxically, if you block a page in robots.txt AND use noindex on the page, Google cannot see the noindex directive (because it cannot crawl the page), so the page may still appear in search results.

Should I block AI crawlers like GPTBot and ClaudeBot?

It depends on your content strategy and business model. Block them if: you have premium or paywalled content, you do not want your content used for AI model training, or you are a news publisher concerned about content attribution. Allow them selectively if: you want visibility in AI-powered search (ChatGPT search, Perplexity), you produce public documentation or educational content, or your business benefits from AI citation. The nuanced approach is to block training crawlers (GPTBot, ClaudeBot, Google-Extended) while allowing search crawlers (OAI-SearchBot, PerplexityBot).

What is RFC 9309 and why does it matter?

RFC 9309, published by the IETF in September 2022, is the first formal internet standard for the Robots Exclusion Protocol. Before RFC 9309, robots.txt behavior was based on a 1994 informal agreement, and different crawlers interpreted the file inconsistently. RFC 9309 standardizes the Allow directive, defines the longest-match-wins rule for conflicting directives, establishes the 500 KiB file size limit, specifies UTF-8 encoding, and defines crawler behavior for different HTTP status codes. Following RFC 9309 ensures your robots.txt works consistently across all compliant crawlers.

What happens if robots.txt returns a 5xx server error?

When search engine crawlers receive a 5xx error (500, 502, 503, etc.) or a 429 (rate limited) response when fetching robots.txt, they temporarily treat your entire site as blocked. No pages will be crawled until the error resolves. This behavior is defined in RFC 9309 and is designed to protect against accidentally crawling sites that might have removed their access restrictions. If your robots.txt endpoint goes down during a deployment, Google will stop crawling your site for up to 24 hours. Always ensure your robots.txt endpoint is highly available.

How does crawl budget work with robots.txt?

Crawl budget is the number of pages Google will crawl on your site within a given period. It is determined by crawl capacity (your server's ability to handle requests) and crawl demand (how important and fresh Google considers your content). Robots.txt directly controls which pages receive that budget. By blocking low-value pages (internal search, faceted navigation, admin areas, user accounts), you preserve crawl budget for your money pages -- product pages, blog posts, and landing pages that drive traffic and revenue. For large sites (10,000+ pages), this optimization can significantly improve how quickly Google discovers and indexes new content.

What is the Crawl-delay directive and does Google support it?

The Crawl-delay directive tells crawlers to wait a specified number of seconds between requests. For example, Crawl-delay: 2 requests a 2-second gap between crawl requests. Google completely ignores this directive. To control Google's crawl rate, use the crawl rate setting in Google Search Console. Bing, Yandex, and some other crawlers do respect Crawl-delay. Note that Crawl-delay is not part of RFC 9309 -- it is an informal extension used by some crawlers. Setting a very high Crawl-delay for Bingbot can severely limit Bing's ability to discover your content.

Can I use wildcards in robots.txt?

Yes. Google, Bing, and most modern crawlers support two wildcard characters: * (matches any sequence of characters) and $ (matches the end of a URL). For example, Disallow: /*.pdf$ blocks all URLs ending in .pdf, and Disallow: /*?* blocks all URLs with query parameters. Wildcards are not part of the original RFC 9309 specification but are widely supported as a de facto standard. They are especially useful for blocking file types and query parameter patterns without listing every individual URL.

Audit Your Robots.txt Now

Use InstaRank SEO's free robots.txt checker to instantly analyze your file against 7 critical parameters, detect AI crawler blocking, validate sitemap accessibility, and generate a fixed, RFC 9309-compliant version.

Check Robots.txt Free Full Website Audit

1. What is Robots.txt and the RFC 9309 Standard

RFC 9309 brought several important clarifications and rules that every webmaster should understand in 2026:

Official recognition of the Allow directive: Previously informal, Allow is now a standard directive for creating exceptions within Disallow rules.
500 KiB file size limit: Crawlers may ignore content beyond 500 KiB (approximately 500 KB). Most robots.txt files are under 10 KB.
UTF-8 encoding required: The file must be served as UTF-8 encoded plain text with Content-Type: text/plain.
Longest-match-wins rule: When multiple Allow and Disallow rules match the same path, the most specific (longest) rule takes precedence.
HTTP status code handling: A 5xx response means “assume full disallow” while a 4xx response means “assume full allow.”
Group structure: Rules are organized in groups starting with one or more User-agent lines, followed by Allow/Disallow rules.

Important Note

Figure 1: Robots.txt file structure showing directive groups, sitemap placement, and key RFC 9309 rules

2. All Robots.txt Directives Explained

User-agent

# Rules for all crawlers
User-agent: *
Disallow: /admin/

# Specific rules for Googlebot
User-agent: Googlebot
Allow: /

# Multiple user-agents can share rules (RFC 9309)
User-agent: Googlebot
User-agent: Bingbot
Disallow: /internal/

Disallow

User-agent: *
Disallow: /admin/        # Blocks /admin/ and all sub-paths
Disallow: /private/      # Blocks /private/ and all sub-paths
Disallow: /search        # Blocks /search, /search?q=test, etc.
Disallow:                # Empty = allow everything (per RFC 9309)

Allow

User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php  # Allow AJAX handler within blocked dir

# The longer path wins:
# /wp-admin/admin-ajax.php -> ALLOWED (17 chars > 10 chars)
# /wp-admin/options.php    -> BLOCKED

Crawl-delay

Warning

Sitemap

# Sitemap directives (always at the bottom)
Sitemap: https://example.com/sitemap.xml
Sitemap: https://example.com/sitemap-news.xml
Sitemap: https://example.com/sitemap-images.xml.gz

Directive	RFC 9309	Google	Bing	Purpose
User-agent	Yes	Yes	Yes	Identify which crawler rules apply to
Disallow	Yes	Yes	Yes	Block access to specific paths
Allow	Yes	Yes	Yes	Override Disallow for specific paths
Crawl-delay	No	Ignored	Yes	Seconds between requests
Sitemap	Informational	Yes	Yes	XML sitemap discovery

3. AI Crawler Management in 2026

Bot Name	Company	Purpose	Respects Robots.txt
GPTBot	OpenAI	AI training	Yes
OAI-SearchBot	OpenAI	ChatGPT search	Yes
ClaudeBot	Anthropic	AI training	Yes
PerplexityBot	Perplexity	AI search	Yes
Google-Extended	Google	Gemini training	Yes
Bytespider	ByteDance	AI training	Partially
CCBot	Common Crawl	Dataset collection	Yes
Applebot-Extended	Apple	Apple Intelligence	Yes

Figure 3: AI crawler decision matrix -- allow or block based on your content type and business model

How to Block AI Training Crawlers While Allowing AI Search

# Block AI training crawlers
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Bytespider
Disallow: /

# ALLOW AI search crawlers for visibility
User-agent: OAI-SearchBot
Allow: /

User-agent: PerplexityBot
Allow: /

# Regular search engines always allowed
User-agent: Googlebot
Allow: /

User-agent: *
Allow: /
Disallow: /admin/

Sitemap: https://example.com/sitemap.xml

Best Practice

Review your AI crawler rules quarterly. New AI search engines and crawlers appear regularly. Check Dark Visitors for an updated list of active AI crawlers and their user-agent strings.

4. Crawl Budget Optimization

Figure 2: Crawl budget funnel -- robots.txt determines which pages receive crawling resources from Google

What to Block for Better Crawl Efficiency

Block these low-value URL patterns in robots.txt to preserve crawl budget for your important content:

Search results pages: Internal site search creates infinite URL variations with no unique content (/search?q=*)
Faceted navigation and filters: E-commerce filter combinations generate thousands of near-duplicate pages (/*?sort=, /*?filter=)
User account pages: Cart, checkout, profile, and dashboard pages have no SEO value (/cart, /my-account)
Admin and staging areas: CMS admin panels, staging directories, and development environments (/wp-admin/, /staging/)
Calendar and archive pages: Date-based archives and calendar views that duplicate content (/*?year=, /calendar/)
Print and PDF versions: Duplicate print-friendly versions of existing pages (/*?print=1)

Critical: Never Block These

Crawl Budget and Server Response

Server response codes for robots.txt itself have major consequences for your crawl budget:

Status Code	Crawler Behavior	SEO Impact
200 OK	Reads and follows directives	Normal operation
404 Not Found	Assumes no restrictions (full allow)	Lose crawl control, AI bots access everything
5xx Error	Temporarily blocks entire site	No crawling until resolved -- critical
429 Rate Limited	Same as 5xx (full block)	No crawling until rate limit resets

5. Testing Your Robots.txt

Testing your robots.txt before deploying changes is essential. A single typo can block your entire site. Here are the best tools for validation in 2026:

InstaRank SEO Robots.txt Checker

Google Search Console

Third-Party Tools

Screaming Frog SEO Spider: Crawls your site and shows which URLs are blocked by robots.txt (free version: up to 500 URLs)
Ahrefs Site Audit: Flags robots.txt errors as part of comprehensive site auditing
Google Rich Results Test: Shows if resources needed for rendering are blocked by robots.txt

Manual Validation Checklist

Open https://yourdomain.com/robots.txt in your browser -- verify it shows plain text, not HTML
Check the Content-Type header in browser DevTools (Network tab) -- must be text/plain
Verify every Allow/Disallow has a User-agent above it (no orphaned directives)
Confirm no Disallow: / under User-agent: * (this blocks everything)
Verify no JS/CSS patterns are blocked (search for .js and .css in Disallow rules)
Test each Sitemap URL -- click and verify it returns valid XML or gzipped content
Verify AI crawler rules match your intended content strategy

6. Real Examples for Different Site Types

E-Commerce Site

# E-Commerce Robots.txt (2026 Best Practice)
User-agent: *
Allow: /
Disallow: /cart
Disallow: /checkout
Disallow: /my-account
Disallow: /wishlist
Disallow: /search
Disallow: /*?sort=
Disallow: /*?filter=
Disallow: /*?color=
Disallow: /*?size=
Disallow: /*?page=          # Pagination handled by rel=next/prev
Disallow: /cdn-cgi/         # Cloudflare internals

# Block AI training, allow AI search
User-agent: GPTBot
Disallow: /

User-agent: OAI-SearchBot
Allow: /products/
Allow: /categories/

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /

Sitemap: https://store.example.com/sitemap.xml
Sitemap: https://store.example.com/sitemap-products.xml

Blog / Content Site

# Blog/Content Site Robots.txt (2026 Best Practice)
User-agent: *
Allow: /
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php
Disallow: /search
Disallow: /tag/             # Tag archives (often thin content)
Disallow: /*?replytocom=    # Comment reply links

# Allow all AI crawlers (content marketing = visibility)
# Only block aggressive scrapers
User-agent: Bytespider
Disallow: /

User-agent: CCBot
Disallow: /

Sitemap: https://blog.example.com/sitemap.xml
Sitemap: https://blog.example.com/sitemap-news.xml

SaaS Application

# SaaS Application Robots.txt (2026 Best Practice)
User-agent: *
Allow: /
Allow: /docs/
Allow: /blog/
Allow: /pricing
Allow: /features
Disallow: /app/             # Protected application
Disallow: /api/             # API endpoints
Disallow: /dashboard/       # User dashboards
Disallow: /settings/
Disallow: /admin/

# Allow all crawlers for docs/blog visibility
# Block training on customer data areas
User-agent: GPTBot
Allow: /docs/
Allow: /blog/
Disallow: /

User-agent: ClaudeBot
Allow: /docs/
Allow: /blog/
Disallow: /

Sitemap: https://saas.example.com/sitemap.xml

Next.js Application (App Router)

// app/robots.ts — Next.js 14+ dynamic robots.txt
import { MetadataRoute } from 'next'

export default function robots(): MetadataRoute.Robots {
  return {
    rules: [
      {
        userAgent: '*',
        allow: '/',
        disallow: ['/api/', '/admin/', '/_next/'],
      },
      {
        userAgent: 'GPTBot',
        disallow: '/',
      },
      {
        userAgent: 'ClaudeBot',
        disallow: '/',
      },
      {
        userAgent: 'Google-Extended',
        disallow: '/',
      },
    ],
    sitemap: 'https://example.com/sitemap.xml',
  }
}

Figure 4: Before and after -- a misconfigured robots.txt fixed to score 100/100 with proper structure

7. Common Mistakes and How to Fix Them

Mistake 1: Blocking Everything with Disallow: /

Fix: Replace Disallow: / with Allow: / and add specific Disallow rules for paths you want to block.

TL;DR Summary

1. What is Robots.txt and the RFC 9309 Standard

Important Note

2. All Robots.txt Directives Explained

User-agent

Disallow

Allow

Crawl-delay

Warning

Sitemap

3. AI Crawler Management in 2026

How to Block AI Training Crawlers While Allowing AI Search

Best Practice

4. Crawl Budget Optimization

What to Block for Better Crawl Efficiency

Critical: Never Block These

Crawl Budget and Server Response

5. Testing Your Robots.txt

InstaRank SEO Robots.txt Checker

Google Search Console

Third-Party Tools

Manual Validation Checklist

6. Real Examples for Different Site Types

E-Commerce Site

Blog / Content Site

SaaS Application

Next.js Application (App Router)

7. Common Mistakes and How to Fix Them

Mistake 1: Blocking Everything with Disallow: /

Mistake 2: Blocking JavaScript and CSS

Mistake 3: Orphaned Directives

Mistake 4: Broken Sitemap References

Mistake 5: Serving HTML Instead of Plain Text

Mistake 6: Ignoring AI Crawlers Entirely

Mistake 7: Oversized Files

8. Frequently Asked Questions

What is robots.txt and why does it matter for SEO?

Does robots.txt prevent a page from being indexed?

Should I block AI crawlers like GPTBot and ClaudeBot?

What is RFC 9309 and why does it matter?

What happens if robots.txt returns a 5xx server error?

How does crawl budget work with robots.txt?

What is the Crawl-delay directive and does Google support it?

Can I use wildcards in robots.txt?

Audit Your Robots.txt Now

Related guides

TL;DR Summary

1. What is Robots.txt and the RFC 9309 Standard

Important Note

2. All Robots.txt Directives Explained

User-agent

Disallow

Allow

Crawl-delay

Warning

Sitemap

3. AI Crawler Management in 2026

How to Block AI Training Crawlers While Allowing AI Search

Best Practice

4. Crawl Budget Optimization

What to Block for Better Crawl Efficiency

Critical: Never Block These

Crawl Budget and Server Response

5. Testing Your Robots.txt

InstaRank SEO Robots.txt Checker

Google Search Console

Third-Party Tools

Manual Validation Checklist

6. Real Examples for Different Site Types

E-Commerce Site

Blog / Content Site

SaaS Application

Next.js Application (App Router)

7. Common Mistakes and How to Fix Them

Mistake 1: Blocking Everything with Disallow: /

Mistake 2: Blocking JavaScript and CSS

Mistake 3: Orphaned Directives

Mistake 4: Broken Sitemap References

Mistake 5: Serving HTML Instead of Plain Text

Mistake 6: Ignoring AI Crawlers Entirely