robots.txt in the Age of AI Agents: Why Your Crawl Policy Needs a Rethink
The File That Controls Who Sees Your Product
There's a 600-byte text file on your web server that increasingly determines whether AI agents can discover and evaluate your product. It's robots.txt, and it was created in 1994 by Martijn Koster to solve a specific problem: web crawlers were consuming too much bandwidth. The solution was a simple convention — put a text file at your domain root that tells bots which URLs they can and can't access.
More than thirty years later, this file is more consequential than ever. But the "robots" reading it aren't just search engine crawlers anymore. They're AI agents evaluating whether to sign up for your product.
How AI Agents Interpret robots.txt
When a search engine crawler like Googlebot encounters a Disallow directive, it simply skips that URL and moves on. The crawler's job is indexing — it doesn't care about individual pages beyond their SEO value.
AI agents interpret robots.txt differently. They're not indexing — they're evaluating. When an agent encounters a restrictive robots.txt, it draws conclusions:
- Disallowed documentation pages → "I can't evaluate this product's API. Moving to a competitor."
- Disallowed signup pages → "I can't access the onboarding flow. This product doesn't support programmatic access."
- A blanket Disallow: / → "This product actively blocks automated access. It's not agent-friendly."
The agent doesn't just skip the URL — it forms an opinion about your product's agent-readiness. And that opinion influences recommendations to the human who asked it to evaluate your product.
The robots.txt Mistakes That Kill Agent Adoption
Here are the most common robots.txt configurations we see that inadvertently block legitimate agent traffic:
Mistake 1: Blocking Documentation
User-agent: *
Disallow: /docs/
Disallow: /api/reference/
Some companies block documentation from crawlers to prevent content scraping or to keep docs behind authentication. This makes sense for proprietary internal documentation, but for public API docs, it's self-defeating. Agents that can't read your docs will recommend a competitor whose docs are accessible.
Fix: Allow crawling of public documentation while protecting authenticated or internal docs.
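The fixed policy can be sanity-checked with Python's stdlib robots.txt parser. Note one caveat baked into the sketch: urllib.robotparser applies the first matching rule (unlike Google's longest-match semantics), so the more specific Disallow must come before the broader Allow. The paths here are illustrative.

```python
from urllib import robotparser

# Sketch of the fix: public docs crawlable, internal docs not.
# urllib.robotparser uses first-match semantics, so the specific
# Disallow line must precede the broader Allow line.
fixed_policy = """\
User-agent: *
Disallow: /docs/internal/
Allow: /docs/
Disallow: /api/reference/private/
Allow: /api/reference/
"""

rp = robotparser.RobotFileParser()
rp.parse(fixed_policy.splitlines())

print(rp.can_fetch("GPTBot", "/docs/quickstart"))        # True: public docs
print(rp.can_fetch("GPTBot", "/docs/internal/roadmap"))  # False: internal docs
```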
Mistake 2: Blocking Signup and Onboarding Pages
User-agent: *
Disallow: /signup
Disallow: /register
Disallow: /onboarding
This is often done to prevent bots from creating spam accounts. The intent is reasonable, but the implementation is too broad. It prevents agents from even seeing what your signup process looks like — which is information they need to evaluate whether programmatic access is possible.
Fix: Instead of blocking signup pages entirely, add a machine-readable agent access endpoint that agents can use as an alternative to the human signup form.
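What such an endpoint might return is sketched below. The endpoint path and field names are hypothetical, not a published standard; the point is that an agent gets a structured answer instead of a blocked signup form.

```python
import json

# Hypothetical response from a machine-readable agent access endpoint
# (e.g. GET /api/agent-access). Field names are illustrative only.
def agent_access_response() -> str:
    return json.dumps({
        "signup_supported": True,                         # agents may request credentials
        "request_endpoint": "/api/agent-access/request",  # hypothetical path
        "auth_methods": ["api_key"],
        "approval_required": True,                        # a human approves each grant
    })

info = json.loads(agent_access_response())
print("programmatic signup:", info["signup_supported"])
```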
Mistake 3: Blocking All Non-Google Crawlers
User-agent: GPTBot
Disallow: /
User-agent: ClaudeBot
Disallow: /
User-agent: anthropic-ai
Disallow: /
In 2024-2025, many companies added blanket blocks against AI training crawlers (GPTBot, ClaudeBot, etc.) to prevent their content from being used as training data. This is a legitimate concern for content publishers. But for SaaS products, blocking AI crawlers also blocks the agent systems built on those same platforms.
When you block ClaudeBot, you're not just preventing Anthropic from training on your content — you're preventing Claude Code from reading your documentation when a developer asks it to evaluate your product.
Fix: Use ai.txt or path-based rules in robots.txt to let AI agents reach documentation and public-facing pages while still blocking proprietary content.
Mistake 4: Overly Aggressive Rate Limiting on Crawlers
Some companies don't block crawlers in robots.txt but implement aggressive rate limiting that effectively blocks them:
User-agent: *
Crawl-delay: 30
A 30-second crawl delay means an agent needs 5+ minutes to read through 10 pages of documentation. In agent time, that's an eternity. The agent will time out and move on.
Fix: Set reasonable crawl delays (1-5 seconds) for known AI agent user agents, or better yet, provide a structured API spec (OpenAPI) that lets agents get all the information in a single request.
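The arithmetic behind that "eternity" is easy to check with the stdlib parser, which reads Crawl-delay directives directly:

```python
from urllib import robotparser

# The mistaken config above: a 30-second blanket crawl delay.
slow = robotparser.RobotFileParser()
slow.parse("User-agent: *\nCrawl-delay: 30\n".splitlines())

delay = slow.crawl_delay("GPTBot")  # no GPTBot section, so the * default applies
pages = 10
print(f"{pages} doc pages at {delay}s each: {pages * delay / 60:.0f} minutes")
```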
Building an Agent-Aware Crawl Policy
A modern robots.txt should reflect the reality that non-human visitors fall into multiple categories with different access needs:
# Search engines — standard crawl access
User-agent: Googlebot
Allow: /
Disallow: /api/internal/
Disallow: /admin/
User-agent: Bingbot
Allow: /
Disallow: /api/internal/
Disallow: /admin/
# AI agents — allow documentation and public pages
User-agent: GPTBot
Allow: /docs/
Allow: /api/reference/
Allow: /pricing
Allow: /llms.txt
Disallow: /api/internal/
Disallow: /admin/
Disallow: /user/
User-agent: ClaudeBot
Allow: /docs/
Allow: /api/reference/
Allow: /pricing
Allow: /llms.txt
Disallow: /api/internal/
Disallow: /admin/
Disallow: /user/
User-agent: anthropic-ai
Allow: /docs/
Allow: /api/reference/
Allow: /pricing
Allow: /llms.txt
Disallow: /api/internal/
Disallow: /admin/
Disallow: /user/
# Default — restrictive for unknown bots
User-agent: *
Allow: /
Disallow: /api/internal/
Disallow: /admin/
Disallow: /user/
Crawl-delay: 5
This configuration gives AI agents access to the information they need (docs, API reference, pricing) while protecting internal endpoints and user data.
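A condensed spot-check of this policy, assuming Python's stdlib parser. One wrinkle: urllib.robotparser applies the first matching rule rather than Google's longest-match semantics, so in this sketch the Disallow lines in the default section are placed before Allow: /.

```python
from urllib import robotparser

# Condensed version of the agent-aware policy above, reordered for
# urllib.robotparser's first-match rule evaluation.
policy = """\
User-agent: GPTBot
Allow: /docs/
Allow: /pricing
Disallow: /api/internal/
Disallow: /admin/
Disallow: /user/

User-agent: *
Disallow: /admin/
Allow: /
Crawl-delay: 5
"""

rp = robotparser.RobotFileParser()
rp.parse(policy.splitlines())

print(rp.can_fetch("GPTBot", "/docs/quickstart"))  # True: docs open to agents
print(rp.can_fetch("GPTBot", "/admin/settings"))   # False: internals protected
print(rp.crawl_delay("SomeUnknownBot"))            # 5: default stays polite
```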
Beyond robots.txt: The Emerging Standards
The industry is moving toward more expressive machine-readable policies that go beyond the binary allow/disallow of robots.txt:
llms.txt
As mentioned in the documentation article, llms.txt provides structured metadata about your product specifically for LLM consumption. It's a complement to robots.txt, not a replacement.
agent-access.json
A newer convention, agent-access.json (placed at /.well-known/agent-access.json) declares what types of programmatic access your product supports:
{
"version": "1.0",
"product": "YourProduct",
"agent_signup": {
"supported": true,
"endpoint": "https://yourproduct.com/api/agent-access",
"auth_methods": ["api_key"],
"scopes_available": ["read", "write", "admin"],
"approval_required": true
},
"documentation": {
"api_reference": "https://yourproduct.com/docs/api",
"openapi_spec": "https://yourproduct.com/api/openapi.json"
}
}
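Since this convention is still emerging, it's worth validating the file you publish. A minimal structural check of the example above (the required-field rules here are assumptions, not part of any spec):

```python
import json

# The agent-access.json example from above.
doc = json.loads("""{
  "version": "1.0",
  "product": "YourProduct",
  "agent_signup": {
    "supported": true,
    "endpoint": "https://yourproduct.com/api/agent-access",
    "auth_methods": ["api_key"],
    "scopes_available": ["read", "write", "admin"],
    "approval_required": true
  },
  "documentation": {
    "api_reference": "https://yourproduct.com/docs/api",
    "openapi_spec": "https://yourproduct.com/api/openapi.json"
  }
}""")

def validate(doc: dict) -> list:
    """Return a list of problems; empty means structurally valid.
    These checks are illustrative, not a published schema."""
    problems = []
    if "version" not in doc:
        problems.append("missing version")
    signup = doc.get("agent_signup", {})
    if signup.get("supported") and not signup.get("endpoint"):
        problems.append("agent_signup.supported without endpoint")
    return problems

print(validate(doc))  # [] means the example passes
```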
ai.txt
Proposed by several industry groups as a more granular alternative to the AI-specific entries in robots.txt, ai.txt allows you to specify different policies for different types of AI access (training, inference, agent operation).
The Audit You Should Do Today
Pull up your robots.txt right now (yourdomain.com/robots.txt) and ask these questions:
- Can an AI agent read your documentation? If your docs are behind a Disallow, agents can't evaluate your product.
- Can an AI agent see your pricing page? Agents make cost-benefit recommendations. If they can't see pricing, they can't recommend you.
- Are you blocking AI-specific user agents? If you have Disallow: / for GPTBot or ClaudeBot, you're blocking agent evaluations.
- Do you have an llms.txt file? This takes five minutes to create and dramatically improves agent discoverability.
- Is your OpenAPI spec accessible? If it's behind authentication, agents can't evaluate your API without signing up first.
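The first checks above can be scripted against your own robots.txt body. This is a heuristic sketch using the stdlib parser; the agent names and paths are the common ones, not an exhaustive list:

```python
from urllib import robotparser

def audit_robots(robots_txt: str) -> dict:
    """Check whether common AI agents can reach key public pages."""
    rp = robotparser.RobotFileParser()
    rp.parse(robots_txt.splitlines())
    agents = ("GPTBot", "ClaudeBot")
    paths = ("/docs/", "/pricing")
    return {
        f"{agent} {path}": rp.can_fetch(agent, path)
        for agent in agents
        for path in paths
    }

# Example: a config that blocks GPTBot outright but leaves other bots open.
report = audit_robots("User-agent: GPTBot\nDisallow: /\n\nUser-agent: *\nAllow: /\n")
for check, ok in report.items():
    print(check, "allowed" if ok else "BLOCKED")
```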
For a comprehensive score, run your domain through AgentGate's benchmark tool. It checks all of these factors and more, giving you a specific score and actionable recommendations.
The Strategic Implication
Here's the meta-point: robots.txt was designed for a web where non-human visitors were a nuisance to be managed. In the agent economy, non-human visitors are customers to be welcomed. The companies that update their access policies to reflect this reality will capture agent-driven growth. The ones that don't will watch from the sidelines, wondering why their competitor's product keeps getting recommended.
Your robots.txt is a policy document. Make sure it says what you mean.
Free Tool
How agent-ready is your website?
Run a free scan to see how AI agents experience your signup flow, robots.txt, API docs, and LLM visibility.
Run a free scan →