Developer Experience · 12 min read · February 27, 2026

What 1,000 robots.txt files tell us about the internet's stance on AI agents

Anon Team

When we built the AgentGate Benchmark, we scored nearly 1,000 SaaS companies across seven dimensions of agent readiness. One of those dimensions is crawl access — what does your robots.txt file say about AI?

We expected a clear split: companies that welcome AI and companies that block it. What we actually found was five distinct policy approaches, ranging from explicit invitations to existential protests. And the correlation between robots.txt policy and overall agent readiness tells a story about where the industry is heading.

Here's the full analysis.

The dataset

We analyzed the robots.txt files of 997 unique SaaS domains from our benchmark results. These aren't random websites — they're developer tools, business software, APIs, and platforms that AI agents are most likely to interact with. Think Stripe, GitHub, Notion, Datadog, HubSpot, Supabase, and 991 others.

For each domain, we checked:

  • Whether a robots.txt file exists
  • Whether it mentions any AI-specific crawlers (GPTBot, ClaudeBot, CCBot, Google-Extended, PerplexityBot, Bytespider, Applebot-Extended, Meta-ExternalAgent)
  • Whether it blocks, allows, or selectively restricts AI crawlers
  • Whether it uses emerging standards like Crawl-delay or Cloudflare's Content-Signal

Then we cross-referenced these policies against overall agent-readiness scores.
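The per-domain check can be sketched with Python's standard-library robots.txt parser. The bot list comes from the survey above; the `classify` helper and its field names are illustrative assumptions, not the benchmark's actual code.

```python
import urllib.robotparser

# The AI-specific user agents the survey looked for
AI_BOTS = ["GPTBot", "ClaudeBot", "CCBot", "Google-Extended",
           "PerplexityBot", "Bytespider", "Applebot-Extended",
           "Meta-ExternalAgent"]

def classify(robots_txt: str) -> dict:
    """Classify one robots.txt body along the survey's dimensions."""
    lower = robots_txt.lower()
    mentioned = [bot for bot in AI_BOTS if bot.lower() in lower]

    # Reuse the stdlib parser to answer "is this bot blocked at the root?"
    parser = urllib.robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())
    blocked = [bot for bot in mentioned
               if not parser.can_fetch(bot, "https://example.com/")]

    return {"mentions_ai": bool(mentioned),
            "mentioned_bots": mentioned,
            "blocks": blocked}
```

A domain with `User-agent: GPTBot` / `Disallow: /` would come back with `mentions_ai` true and GPTBot in `blocks`, while a file with only wildcard rules would land in the "silent majority" bucket.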

The headline numbers

Metric                                  Count   Percentage
Total unique domains                      997        100%
Has robots.txt                            870       87.3%
No robots.txt at all                      127       12.7%
Mentions any AI bot                        78        7.8%
Blocks AI crawlers                         40        4.0%
Blocks GPTBot specifically                 49        4.9%
Blocks all bots (wildcard disallow)         9        0.9%

The first surprise: only 4% of SaaS companies actively block AI crawlers. For all the headlines about publishers waging war against GPTBot, the SaaS industry has largely taken the opposite stance.

But the second surprise is more interesting: only 1.9% explicitly welcome AI crawlers with allow directives. The vast majority — nearly 80% — simply have no AI-specific rules at all. They're allowing AI by default, but not by design.

Five policy approaches

We categorized every domain into one of five distinct policy types. The taxonomy reveals very different attitudes toward AI — and very different outcomes.

1. The Open Welcome (1.9% of domains)

What it looks like: Explicit Allow: / directives for AI-specific user agents.

These companies don't just tolerate AI crawlers — they roll out the red carpet. Some go as far as adding comments explaining their AI-friendly stance.

Examples from our data:

# windsor.ai — Welcomes AI crawlers with explicit Allow
User-agent: GPTBot
Allow: /
Disallow: /ppc/*
Disallow: /wp-admin

# deel.com — Explicitly allows GPTBot and Google-Extended
User-agent: GPTBot
Allow: /

User-agent: Google-Extended
Allow: /

# weaviate.io — Welcome mat for all AI
User-agent: GPTBot
Allow: /

User-agent: ChatGPT-User
Allow: /

The top scorers in this category:

Domain          Agent Readiness Score
cloudzero.com   84
deel.com        84
spider.cloud    75
windsor.ai      74
bitwarden.com   70
getport.io      69
weaviate.io     69
descript.com    69
windmill.dev    69

Average agent readiness score: 66.9

These companies understand that AI crawlers represent visibility in the emerging AI-powered search layer. Weaviate is an AI-native vector database — of course they want GPTBot indexing their docs. Deel, a global payroll platform, is betting that AI agents will increasingly be the ones researching and recommending HR tools.

2. The Silent Majority (79.7% of domains)

What it looks like: A robots.txt file exists with standard rules for search crawlers and SEO bots, but zero mention of any AI-specific user agents.

This is the default position of the internet: allow by omission. These companies haven't actively decided to welcome or block AI crawlers. They simply haven't updated their robots.txt since before GPTBot existed.

# Typical silent majority robots.txt
User-agent: *
Disallow: /admin/
Disallow: /api/internal/

Sitemap: https://example.com/sitemap.xml

No GPTBot directive. No ClaudeBot mention. Just standard rules that have been there since 2019.

Average agent readiness score: 55.8

The silent majority's lack of AI-specific policy isn't a problem today — robots.txt defaults to "allow everything not explicitly disallowed." But it becomes a problem when you need nuanced control. Do you want training crawlers indexing your pricing page? Do you want user-initiated AI agents accessing your docs? A wildcard Allow: / doesn't distinguish between these use cases.

3. The Selective Gatekeepers (1.6% of domains)

What it looks like: Mentions AI crawlers with a mix of allow and disallow rules — different bots get different access, or specific paths are blocked while others are open.

This is the most sophisticated approach, and arguably the right one. These companies have thought about what AI access means for their business and drawn deliberate boundaries.

Examples from our data:

# ringly.io — Public marketing content: yes. App dashboard: no.
User-agent: GPTBot
Allow: /
Disallow: /app/

User-agent: ChatGPT-user
Allow: /
Disallow: /app/

# cockroachlabs.com — Docs yes, navigation pages no
User-Agent: GPTBot
Allow: /
Disallow: /tags/
Disallow: /page/
Disallow: /animation-test/
Disallow: /categories/

# investopedia.com — Nuanced per-bot rules
User-agent: ChatGPT-User
Disallow: /thmb/           # Block image thumbnails

User-agent: OAI-SearchBot
Disallow: /thmb/           # Same for search

User-agent: GPTBot
Disallow: /thmb/           # Training crawler: block thumbnails only

User-agent: anthropic-ai
Disallow: /                 # Block Anthropic entirely

User-agent: ClaudeBot
Disallow: /                 # Block Claude entirely

Investopedia's policy is particularly revealing: they allow OpenAI's crawlers to access article content but block image thumbnails (protecting visual assets), while completely blocking Anthropic's crawlers. This suggests a business relationship with OpenAI that doesn't extend to competitors.
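This kind of per-bot divergence is easy to verify with Python's standard-library parser. The rules below are trimmed from the investopedia.com example above; the URLs are placeholders.

```python
import urllib.robotparser

# Rules modeled on the investopedia.com example: OpenAI's user-initiated
# crawler loses only thumbnails, while Anthropic's training crawler loses
# the whole site.
RULES = """\
User-agent: ChatGPT-User
Disallow: /thmb/

User-agent: anthropic-ai
Disallow: /
"""

parser = urllib.robotparser.RobotFileParser()
parser.parse(RULES.splitlines())

print(parser.can_fetch("ChatGPT-User", "https://example.com/articles/x"))   # True
print(parser.can_fetch("ChatGPT-User", "https://example.com/thmb/img.jpg")) # False
print(parser.can_fetch("anthropic-ai", "https://example.com/articles/x"))   # False
```

The same URL yields different answers depending on which user agent asks, which is exactly the nuance a single wildcard rule can't express.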

Average agent readiness score: 51.2

Selective gatekeepers have the most intentional policies, but they score lower on average because selectivity often means restriction. A benchmark system sees blocked crawlers as a negative signal, even if the policy is well-reasoned.

4. The AI Blockers (4.0% of domains)

What it looks like: Explicit Disallow: / for one or more AI-specific user agents, without corresponding allow rules.

These companies have made a conscious decision to keep AI crawlers out — entirely.

Who's blocking, by bot:

AI Bot                    Domains mentioning it   % of all domains
GPTBot (OpenAI)                            56                5.6%
ClaudeBot (Anthropic)                      47                4.7%
CCBot (Common Crawl)                       33                3.3%
PerplexityBot                              28                2.8%
Google-Extended                            24                2.4%
Bytespider (ByteDance)                     16                1.6%
Applebot-Extended                          14                1.4%
Meta-ExternalAgent                          1                0.1%

GPTBot is the most commonly mentioned — and most commonly blocked — AI crawler. This tracks with the HTTP Archive's broader findings: across 12 million websites, GPTBot is the most referenced AI user agent in robots.txt files, appearing on almost 21% of the top 1,000 websites.

The most aggressive blockers in our dataset:

# quora.com — The nuclear option with a manifesto
# NOTICE: All crawlers and bots, regardless of whether or not they are
# specified below, are strictly prohibited from using Quora platform content 
# for the purposes of training AI models or similar machine learning systems

Quora doesn't just block AI bots — they include a legal notice in their robots.txt asserting that all crawlers, even those not blocked, are prohibited from using content for AI training. This is robots.txt as legal document, not just technical configuration.

Other notable blockers:

Domain             Score   What they blocked
quora.com              3   All AI crawlers + legal notice
investopedia.com      25   Anthropic entirely, OpenAI images only
sourceforge.net       24   All major AI crawlers
bandcamp.com          29   AI crawlers (protecting creator content)
gizmodo.com           35   Uses Content-Signal: ai-train=no
palantir.com          49   AI crawlers (defense/intelligence context)
calendly.com          68   AI crawlers (unusual for a SaaS tool)

Average agent readiness score: 36.7

Here's the critical finding: companies that block AI crawlers score 36.7 on agent readiness, versus 56.9 for non-blockers. That's a 55% gap. Blocking AI isn't just a robots.txt policy — it correlates with a broader stance against AI integration. These companies tend to also lack API documentation optimized for agents, have CAPTCHAs on their signup flows, and offer no machine-readable interface descriptions.

5. The Absent (12.7% of domains)

What it looks like: No robots.txt file at all. The request returns a 404.

127 domains in our dataset have no robots.txt — not even a default one. This includes some surprising names:

Domain       Score   Notes
box.com         68   Enterprise file storage — no robots.txt!
drift.com       67   Conversational marketing platform
vonage.com      48   Communications API
npmjs.com      n/a   JavaScript package registry
wave.com       n/a   Financial services

Average agent readiness score: 46.6

The absence of a robots.txt file technically means "allow everything" — but it signals a lack of awareness about crawl management. It's different from an intentional open policy.

The scorecard: policy vs. agent readiness

Here's the finding that surprised us most. We plotted average agent-readiness scores against robots.txt policy type:

Policy          Count   Avg Score   vs. Overall Avg
Open Welcome       19        66.9   +19%
Open Default      795        55.8   baseline
Selective          16        51.2   -8%
Block AI           40        36.7   -34%
No robots.txt     127        46.6   -16%

Companies that explicitly welcome AI crawlers score 82% higher than those that block them (66.9 vs. 36.7). This isn't because robots.txt is a large scoring factor — it's one dimension of seven. The correlation exists because robots.txt policy reflects a broader organizational stance toward AI integration.
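The headline ratio reduces to simple arithmetic; a quick sanity check using the averages reported above:

```python
# Average agent-readiness scores by robots.txt policy (from the table above)
avg = {"open_welcome": 66.9, "open_default": 55.8, "selective": 51.2,
       "block_ai": 36.7, "no_robots": 46.6}

# "Open Welcome" domains score roughly 82% higher than "Block AI" domains
welcome_vs_block = (avg["open_welcome"] - avg["block_ai"]) / avg["block_ai"]
print(f"{welcome_vs_block:.0%}")  # 82%
```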

Companies that block AI crawlers also tend to:

  • Have CAPTCHAs on their signup flows (blocking agents at the front door)
  • Lack developer documentation (no API for agents to use)
  • Have no OAuth or machine-to-machine auth options
  • Offer no llms.txt or machine-readable API descriptions

The robots.txt is the canary in the coal mine for agent readiness.

The emerging standard: Content-Signal

Seven domains in our dataset use Cloudflare's new Content-Signal directive — an extension to robots.txt that provides more granular control over AI usage:

# gizmodo.com — The new standard in action
User-agent: *
Content-Signal: ai-train=no, search=yes, ai-input=yes

This is a three-part signal:

  • search=yes — Allow indexing for search results (traditional + AI search)
  • ai-train=no — Don't use our content to train models
  • ai-input=yes — OK to use as input for AI-generated responses (like citations in ChatGPT)

This is the future of robots.txt for AI. Instead of a binary "block/allow" per crawler, publishers can express intent-based preferences. "Use my content to answer user questions, but don't train your next model on it."
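Mechanically, a Content-Signal line is just comma-separated key=value pairs, so consuming one takes a few lines. This helper and its return shape are illustrative, not part of any official library.

```python
def parse_content_signal(line: str) -> dict:
    """Parse a Content-Signal value like 'ai-train=no, search=yes' into booleans."""
    _, _, value = line.partition(":")
    signals = {}
    for part in value.split(","):
        key, _, flag = part.strip().partition("=")
        if key:
            signals[key] = flag.strip().lower() == "yes"
    return signals

# The gizmodo.com line above parses to per-intent booleans:
print(parse_content_signal("Content-Signal: ai-train=no, search=yes, ai-input=yes"))
```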

Other Content-Signal adopters in our data:

# ko-fi.com — Same pattern: search yes, training no
User-agent: *
Content-Signal: search=yes, ai-train=no

# valve.com — Even Valve uses it
User-agent: *
Content-Signal: search=yes, ai-train=no

Cloudflare announced Content-Signal in September 2025 and set the default for its customers to search=yes, ai-train=no. Given that Cloudflare protects roughly 20% of all websites, this single default may be the largest shift in AI crawl policy in history — applied to millions of domains simultaneously.

What robots.txt can't do

Before drawing too many conclusions, it's worth noting what robots.txt doesn't control:

1. Robots.txt is advisory, not enforceable. Compliance is voluntary. A well-behaved crawler like GPTBot respects Disallow: /. A malicious scraper doesn't care. This is why Cloudflare, Akamai, and other CDNs are building enforcement layers (like AI Audit) that actually block non-compliant crawlers at the network level.

2. Robots.txt doesn't distinguish between AI use cases. GPTBot is used for training data collection. ChatGPT-User fetches pages in real-time when a user asks a question. Blocking GPTBot might make sense (you don't want your content training models without compensation), but blocking ChatGPT-User means your product won't appear in ChatGPT's responses when users ask about tools in your category.

Anthropic recently formalized this distinction, splitting their crawlers into three separate user agents:

  • ClaudeBot — Training data collection
  • Claude-User — User-initiated page fetches
  • Claude-SearchBot — Search indexing

This lets site owners block training while allowing citation. It's a pattern all AI companies should follow.

3. Robots.txt doesn't cover AI agents. Here's the biggest gap: robots.txt was designed for crawlers — bots that index content. But the new wave of AI agents (OpenAI's Operator, Claude's computer use, Browser Use, Genspark) use full browser automation. They don't read robots.txt. They launch Chrome, navigate to your site, and interact with it like a human.

From our signup flow research, AI agents using Playwright or Puppeteer are indistinguishable from a human at the HTTP level. They use standard Chrome user-agent strings, execute JavaScript, render CSS, and interact with form elements. Your robots.txt is irrelevant to them.

This is why robots.txt is necessary but not sufficient. It handles the crawl layer. For the agent layer, you need authentication, rate limiting, and behavioral analysis — the topics we've covered in our posts on agent-aware logging and rate limiting.

Five things you should do with your robots.txt today

Based on our analysis of 997 SaaS companies, here's our recommendation for each policy type:

If you're in the Silent Majority (no AI rules)

Add explicit rules. Not having a policy isn't a strategy. At minimum:

# Explicitly allow AI search and citation
User-agent: ChatGPT-User
Allow: /

User-agent: Claude-User
Allow: /

User-agent: PerplexityBot
Allow: /

# Decide on training crawlers
User-agent: GPTBot
Allow: /           # or Disallow: / — your call

User-agent: ClaudeBot
Allow: /           # or Disallow: /

User-agent: Google-Extended
Allow: /           # or Disallow: /

If you're blocking everything

Reconsider. The data shows a strong correlation between AI blocking and low agent readiness. You may be protecting your content, but you're also becoming invisible to the AI-powered search layer that McKinsey projects will drive $750 billion in consumer spend by 2028.

At minimum, allow user-initiated crawlers (ChatGPT-User, Claude-User, Perplexity-User) even if you block training crawlers.

If you want nuanced control

Use Content-Signal if you're on Cloudflare. If not, split your rules by crawler purpose:

# Allow user-initiated AI access (citations, search)
User-agent: ChatGPT-User
Allow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: Claude-User
Allow: /

User-agent: Claude-SearchBot
Allow: /

# Block training crawlers
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: CCBot
Disallow: /

Add an llms.txt for AI-optimized content

Robots.txt controls access. But you also want to optimize what AI sees. An llms.txt file at your domain root provides a structured summary of your product for AI consumption:

# Your Product Name

> One-paragraph summary of the product (category: Developer Tools).

## Key links
- Pricing: https://yoursite.com/pricing
- API Docs: https://yoursite.com/docs/api
- Signup: https://yoursite.com/signup

From our benchmark data, only 2% of SaaS companies have an llms.txt file. It's a massive differentiator.

Consider the agent layer separately

Robots.txt governs crawlers. For AI agents that use full browsers, you need:

  • Agent-friendly authentication — OAuth with machine-to-machine flows
  • Discoverable APIs — OpenAPI specs, MCP endpoints, agent-readable documentation
  • Adaptive rate limiting — Higher limits for authenticated agents, lower for unknown traffic
  • Structured error responses — JSON errors with actionable hints, not HTML error pages
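The adaptive rate-limiting bullet can be sketched as a token bucket whose refill rate depends on whether the caller is an authenticated agent. The class name, rates, and keying scheme are illustrative assumptions, not a prescribed design.

```python
import time

class AdaptiveLimiter:
    """Token bucket with a higher refill rate for authenticated agents (sketch)."""

    def __init__(self, authed_rate=10.0, anon_rate=1.0, burst=20):
        self.rates = {"authed": authed_rate, "anon": anon_rate}  # tokens/second
        self.burst = burst                                        # bucket capacity
        self.buckets = {}                                         # key -> (tokens, last_ts)

    def allow(self, key, authed, now=None):
        """Spend one token for `key` (e.g. an IP or API token) if available."""
        now = time.monotonic() if now is None else now
        tokens, last = self.buckets.get(key, (float(self.burst), now))
        rate = self.rates["authed" if authed else "anon"]
        tokens = min(self.burst, tokens + (now - last) * rate)  # refill since last call
        if tokens >= 1.0:
            self.buckets[key] = (tokens - 1.0, now)
            return True
        self.buckets[key] = (tokens, now)
        return False
```

Keying by API token rather than IP is what makes the "higher limits for authenticated agents" policy enforceable: an unknown browser-automation agent falls into the slow anonymous bucket by default.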

The trajectory

Cloudflare's data shows AI crawler traffic growing 18% year-over-year. The HTTP Archive recorded GPTBot in the robots.txt files of 560,000+ websites by mid-2025, up from zero in July 2023. The conversation has shifted from "should we have an AI policy" to "what should our AI policy be."

The SaaS companies in our dataset that score highest on agent readiness are the ones that moved past the binary block/allow debate. They're thinking about AI access as a spectrum — training vs. citation vs. search vs. agent commerce — and building policies that match.

Your robots.txt is the first thing any AI system sees when it visits your domain. Make sure it says what you mean.


Want to see how your robots.txt and overall agent readiness stack up? Run the free AgentGate Benchmark on your domain. Check the Leaderboard to compare against 1,000+ SaaS companies.
