Developer Experience · 13 min read · February 27, 2026

How we built the AgentGate benchmark: scoring methodology and technical decisions

Anon Team

Why build a benchmark?

When we started AgentGate, we had a hypothesis: most SaaS websites are not ready for AI agents. We had anecdotal evidence — CAPTCHAs blocking automated signups, APIs with no machine-to-machine auth, documentation that requires JavaScript rendering to read.

But "most websites aren't ready" doesn't tell you anything useful. How many? In what ways? Which categories hurt the most? Is a payment platform more agent-friendly than a CRM? Does having great API docs matter if your signup requires a CAPTCHA?

We needed data. So we built a benchmark that programmatically evaluates how easy it is for an AI agent to discover, sign up for, authenticate with, and use a SaaS product — then assigns a score from 0 to 100.

After scoring 844 domains, here's what the data looks like:

Metric                       Value
Total domains scored         844
Mean score                   53.0
Median score                 57.0
Standard deviation           19.1
Highest score                104*
Domains scoring 70+          15.6%
Domains scoring below 30     16.1%

*Scores above 100 result from an LLM scoring edge case we'll address below.

This post breaks down exactly how the benchmark works — the scoring rubric, the data gathering pipeline, the technical decisions we made, and the lessons we learned from having an LLM evaluate 844 websites.

The scoring rubric: 7 categories + 1 penalty

We needed a rubric that was consistent, interpretable, and actionable. A score should tell you not just "how ready" a site is, but what specifically needs to change. After several iterations, we settled on 7 categories plus a binary penalty:

1. Crawl Access (0–20 points)

Can an AI agent even read your website?

robots.txt allows all bots:                20 pts
No robots.txt at all:                      15 pts (permissive by default)
Blocks specific AI crawlers only:          10 pts
Blocks all bots (Disallow: /):              3 pts

This is the front door. If your robots.txt blocks GPTBot, ClaudeBot, or CCBot, an agent using those crawlers can't index your documentation. We give 15 points for no robots.txt at all — the absence of a policy is more permissive than an explicitly restrictive one.

What the data says: Average score 18.4/20. Most SaaS companies don't block AI crawlers. This category is rarely the bottleneck.
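The crawl-access tiers above can be sketched as a small classifier. This is an illustrative sketch, not our pipeline code: the function names are hypothetical, the crawler list is a sample, and the robots.txt parsing is deliberately simplified (it only looks for full `Disallow: /` rules, which is all this tier check needs).

```javascript
// Hypothetical sketch of the crawl-access tier check. Only full-site
// disallows matter for classification, so partial rules are ignored.
const AI_CRAWLERS = ["gptbot", "claudebot", "ccbot", "google-extended", "perplexitybot"];

function parseBlockedAgents(robotsTxt) {
  const blocked = new Set();
  let agents = [];      // user-agents in the current group
  let inRules = false;  // true once we've seen a rule line for this group
  for (const raw of robotsTxt.split(/\r?\n/)) {
    const line = raw.split("#")[0].trim();
    if (!line) continue;
    const m = line.match(/^([a-z-]+)\s*:\s*(.*)$/i);
    if (!m) continue;
    const key = m[1].toLowerCase();
    const value = m[2].trim().toLowerCase();
    if (key === "user-agent") {
      if (inRules) { agents = []; inRules = false; } // a new group starts
      agents.push(value);
    } else {
      inRules = true;
      if (key === "disallow" && value === "/") {
        for (const a of agents) blocked.add(a); // fully disallowed agents
      }
    }
  }
  return blocked;
}

function scoreCrawlAccess(robotsTxt) {
  if (robotsTxt == null) return 15;                       // no robots.txt
  const blocked = parseBlockedAgents(robotsTxt);
  if (blocked.has("*")) return 3;                         // blocks all bots
  if (AI_CRAWLERS.some((a) => blocked.has(a))) return 10; // blocks AI crawlers
  return 20;                                              // allows all bots
}
```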

2. Signup Friction (0–20 points)

Can an agent create an account without human intervention?

No CAPTCHA + OAuth/SSO + public signup:    20 pts
No CAPTCHA + public signup, no OAuth:      15 pts
CAPTCHA present but OAuth available:        8 pts
CAPTCHA present, no OAuth:                  4 pts
No public signup (invite/waitlist/sales):   0 pts

This is where most of the variance lives. Average score is 14.3/20, but the median is 20 — meaning a large chunk of sites score full marks while a significant minority get hammered by CAPTCHAs or invite-only flows.

Design decision: We originally treated OAuth as a binary bonus. But there's a subtlety: OAuth for end users (like "Sign in with Google") still requires browser interaction. It helps agents that can puppet a browser, but it's not truly headless. We kept it as a positive signal because it eliminates password creation and reduces friction, even if it's not fully autonomous.

3. API & Developer Docs (0–20 points)

Does the site have comprehensive, accessible API documentation?

Comprehensive API docs + reference + SDK:  20 pts
API docs exist but limited (basic only):   14 pts
Developer content but no API reference:     8 pts
No developer docs found:                    0 pts

Average score: 13.3/20. Most SaaS companies have some documentation, but there's a wide gap between "we have a /docs page" and "we have a comprehensive API reference with examples, SDKs, and quickstart guides."

Technical challenge: SPA detection. Many modern doc sites (built with Docusaurus, GitBook, Mintlify) render entirely client-side. A basic fetch() returns an empty HTML shell with a <div id="root"></div> and a bunch of JavaScript. Our initial scoring under-counted docs for these sites until we added Brave Search enrichment (more on this below).

4. Bot Protection (0–15 points)

How aggressively does the site block automated access?

No bot protection:                         15 pts
Basic (simple rate limiting):              12 pts
Moderate (Cloudflare standard, WAF):        7 pts
Aggressive (Under Attack mode, fingerprinting): 2 pts

Average: 14.0/15. This surprised us — most SaaS companies don't have aggressive bot protection on their marketing sites and documentation. The aggressive blockers tend to be in specific verticals: security products, enterprise platforms, and companies that have been scraped heavily.

Data source: We classify bot protection by comparing the response from a basic fetch() against the response from Bright Data's Web Unlocker (which uses residential proxies and browser-grade rendering). If the basic fetch gets a 403/503 but Bright Data succeeds, we know there's protection in play. If both fail, it's aggressive.

5. Agent-Specific Features (0–10 points)

Does the site explicitly support AI agent access?

Has llms.txt, ai-plugin.json, or agent endpoint: 10 pts
Machine-readable API spec (OpenAPI/Swagger):       6 pts
API mentioned but no machine-readable spec:        2 pts
No agent-specific features:                        0 pts

Average: 4.2/10. This is the biggest growth area. Only 30.2% of scored domains have explicit agent-specific features (like llms.txt or an ai-plugin.json file). The llms.txt standard is gaining traction — we've seen adoption increase week over week — but the majority of sites haven't adopted any agent-specific standards yet.
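The agent-features tiers reduce to a simple priority check over the probe results from Stage 1. A minimal sketch, assuming the well-known files have already been fetched; the path list mirrors the post, but the function and field names are illustrative:

```javascript
// Paths probed for agent-specific files (from the Stage 1 fetch list).
const AGENT_FILE_PATHS = [
  "/llms.txt",
  "/llms-full.txt",
  "/.well-known/ai-plugin.json",
  "/.well-known/agent-access.json",
  "/.well-known/ai-agent.json",
];

// Hypothetical scorer over probe results: highest-value signal wins.
function scoreAgentFeatures(probe) {
  if (probe.llmsTxt || probe.aiPluginJson || probe.agentEndpoint) return 10;
  if (probe.openApiSpec) return 6; // machine-readable API spec found
  if (probe.mentionsApi) return 2; // API mentioned, no spec found
  return 0;
}
```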

6. Pricing Transparency (0–8 points)

Can an agent determine what access will cost?

Public pricing page with clear tiers:      8 pts
Pricing behind signup/login:               4 pts
"Contact sales" only:                      0 pts

Average: 6.5/8. Most SaaS companies publish pricing. The ones that don't tend to be enterprise-focused platforms where the sales cycle is inherently human-driven.

7. Onboarding Automation (0–7 points)

How quickly can an agent go from "found this product" to "making API calls"?

Self-serve signup + instant API key:        7 pts
Self-serve signup + manual key approval:    4 pts
Requires demo call or sales contact:        1 pt
No self-serve path:                         0 pts

Average: 3.1/7. Most sites have self-serve signup but require dashboard navigation to find API keys. Instant programmatic credential issuance is rare.

The -25 Penalty: No Autonomous Agent Auth

This is the single most impactful scoring element. It's a binary check:

Does the site offer a fully programmatic path where an AI agent can create an account, authenticate, and receive API credentials — all without any human-in-the-loop steps?

Human-in-the-loop includes: CAPTCHAs, email verification clicks, phone verification, manual dashboard navigation, human approval, or any step requiring a browser session with human interaction.

If the answer is no: -25 points.

98.7% of the 820 uniquely scored domains received this penalty.

This is deliberate. The penalty represents the fundamental gap between "agent-friendly features" and "actually usable by an autonomous agent." A site can score 20/20 on crawl access, have beautiful documentation, and publish pricing transparently — but if an agent can't get credentials without a human clicking through a signup flow, it's not truly agent-ready.
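The penalty is a single boolean predicate over signals the pipeline already collects. A sketch with illustrative flag names; one true human-in-the-loop flag is enough to trigger the -25:

```javascript
// Hypothetical penalty check: any human-in-the-loop step, or the absence
// of a programmatic signup path, triggers the full -25.
function agentAuthPenalty(signals) {
  const humanInLoop =
    Boolean(signals.hasCaptcha) ||
    Boolean(signals.requiresEmailClick) ||
    Boolean(signals.requiresPhoneVerification) ||
    Boolean(signals.manualKeyApproval) ||
    !signals.programmaticSignup;
  return humanInLoop ? -25 : 0;
}
```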

The penalty is what makes our scores feel low. Without it, the mean score would be ~78 instead of 53. We considered removing it to make the numbers look better. We decided the honest number was more useful.

The data gathering pipeline

Each domain goes through a multi-stage data collection process before the LLM sees it:

Stage 1: Direct fetch (raw signals)

For each domain:
  1. GET /robots.txt → full text
  2. GET / → homepage HTML (first 50KB)
  3. GET /docs, /developers, /api, /api-reference, /documentation,
     /api-docs, /developer, /reference, /api/v1, /dev,
     /guides, /quickstart, /getting-started → status codes + redirect URLs
  4. GET first found docs page → HTML (first 10KB)
  5. GET /.well-known/ai-plugin.json, /.well-known/agent-access.json,
     /llms.txt, /llms-full.txt, /.well-known/ai-agent.json → content

We use a standard browser user agent (Chrome/120.0.0) rather than identifying as a bot. This gives us the same view a typical agent framework would see when not using custom user agent strings.
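The doc-path probe in step 3 is a loop over candidate paths, recording status codes and final URLs after redirects. A sketch with the fetcher injected so it can be tested without network access; the path list is from the post, the rest is illustrative:

```javascript
// Candidate documentation paths from the Stage 1 fetch list.
const DOC_PATHS = ["/docs", "/developers", "/api", "/api-reference",
  "/documentation", "/api-docs", "/developer", "/reference",
  "/api/v1", "/dev", "/guides", "/quickstart", "/getting-started"];

// Hypothetical probe: the fetcher is injected (e.g. global fetch in
// production, a stub in tests) and failures are recorded, not thrown.
async function probeDocPaths(domain, fetcher) {
  const results = [];
  for (const path of DOC_PATHS) {
    try {
      const res = await fetcher(`https://${domain}${path}`);
      results.push({ path, status: res.status, finalUrl: res.url });
    } catch {
      results.push({ path, status: null, finalUrl: null });
    }
  }
  return results;
}
```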

SPA detection: After fetching the homepage, we check for SPA shells — HTML under 2KB with 3+ script tags, or frameworks like Next.js/Nuxt returning minimal server-rendered content. When detected, we flag it so the LLM knows direct HTML analysis is unreliable.
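The SPA-shell check can be sketched as a pair of heuristics: a tiny payload with several script tags, or a framework mount point in an otherwise empty page. Thresholds are the ones stated above; the function name and the framework-id regex are illustrative:

```javascript
// Hypothetical SPA-shell detector: tiny HTML plus many scripts, or a bare
// framework mount point (#root, #__next, #app), suggests client-side rendering.
function looksLikeSpaShell(html) {
  const scriptTags = (html.match(/<script\b/gi) || []).length;
  const tinyShell = html.length < 2048 && scriptTags >= 3;
  const frameworkShell =
    /id="(root|__next|app)"/.test(html) && html.length < 2048;
  return tinyShell || frameworkShell;
}
```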

Stage 2: Brave Search enrichment

Direct fetching misses a lot. SPAs return empty shells. Some sites block server-side requests. Documentation lives on subdomains.

For each domain, we run three Brave Search queries:

  • site:{domain} API documentation
  • site:{domain} developer docs
  • {domain} API reference pricing signup

This gives us ground truth for what content actually exists, even when we can't fetch it directly. It's particularly valuable for SPA-heavy sites where the rendered content is invisible to a basic HTTP request.
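The enrichment step is three queries per domain, sent to Brave's Search API. A sketch under the assumption that the endpoint and subscription-token header match Brave's current API docs; verify against their documentation before relying on this:

```javascript
// Query templates from the post; this helper is pure and testable.
function braveQueriesFor(domain) {
  return [
    `site:${domain} API documentation`,
    `site:${domain} developer docs`,
    `${domain} API reference pricing signup`,
  ];
}

// Sketch of one search call. Endpoint and header names follow Brave's
// Search API as we understand it; treat them as assumptions.
async function braveSearch(query, apiKey) {
  const url = new URL("https://api.search.brave.com/res/v1/web/search");
  url.searchParams.set("q", query);
  const res = await fetch(url, {
    headers: { "X-Subscription-Token": apiKey, Accept: "application/json" },
  });
  if (!res.ok) throw new Error(`Brave Search failed: ${res.status}`);
  return (await res.json()).web?.results ?? [];
}
```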

Stage 3: Bright Data Web Unlocker

For sites where basic fetching fails or returns suspiciously little content, we use Bright Data's Web Unlocker — a residential proxy service that renders pages in a real browser environment, solves CAPTCHAs, and returns the fully rendered page.

This gives us three critical signals:

  1. Bot protection classification — by comparing basic fetch results to Bright Data results, we can classify protection as none/basic/moderate/aggressive
  2. Rendered content — the actual page as a human would see it, converted to markdown
  3. Signup flow analysis — whether the signup page has email signup, OAuth, CAPTCHAs, phone verification, or invite-only gating

We cap Bright Data at 3 concurrent requests to control costs — each request costs roughly $0.01-0.05 depending on the site's complexity.
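The 3-request cap is a plain concurrency limiter. This is a generic semaphore sketch, not Bright Data's client code; `fetchViaBrightData` in the usage comment is a hypothetical name:

```javascript
// Minimal promise-based concurrency limiter: at most `max` tasks in flight,
// the rest queue up and start as earlier tasks settle.
function limiter(max) {
  let active = 0;
  const queue = [];
  const next = () => {
    if (active >= max || queue.length === 0) return;
    active++;
    const { task, resolve, reject } = queue.shift();
    task().then(resolve, reject).finally(() => { active--; next(); });
  };
  return (task) => new Promise((resolve, reject) => {
    queue.push({ task, resolve, reject });
    next();
  });
}

const limit = limiter(3);
// usage (hypothetical): const page = await limit(() => fetchViaBrightData(url));
```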


Stage 4: LLM scoring (Claude Sonnet)

All the raw data from stages 1-3 gets summarized and sent to Claude Sonnet for analysis. The LLM receives:

  • Full robots.txt text
  • Summarized homepage signals (title, meta, auth signals, dev signals, nav links, headings)
  • Documentation path results
  • AI-specific file contents
  • Brave Search enrichment
  • Bright Data signals (bot protection level, rendered content, signup flow details)

Why LLM-as-judge? We tried pure heuristic scoring first. It was fast and cheap but produced wildly inconsistent results. The heuristic couldn't distinguish between a site that mentions "API" in a marketing context ("our API-first platform") and one that has a comprehensive API reference. It couldn't evaluate documentation quality — only presence.

The LLM reads the actual page content and evaluates quality, not just presence. It can tell the difference between "Stripe-quality docs with code examples in 6 languages" and "a single /docs page with three paragraphs."

Why Sonnet, not Opus? Cost. Each domain evaluation costs approximately $0.02-0.05 in API tokens. At 844 domains, that's $17-42 per full batch run with Sonnet. Opus would be 5-10x more expensive. We tested both — Sonnet's scoring correlated >0.95 with Opus scores, so the accuracy tradeoff was minimal.

Temperature 0: We use temperature 0 for deterministic scoring. Given the same raw data, the same domain should produce the same score. In practice, there's still some variance across runs (~±3 points), but it's significantly better than temperature >0.
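The scoring call itself can be sketched against Anthropic's Messages API. The request shape below follows that API as documented; the model id, prompt wiring, and function names are illustrative, not our production code:

```javascript
// Pure request builder (testable without network). The model id is a
// placeholder; substitute whichever Sonnet snapshot is current.
function buildScoringRequest(summary, rubricPrompt) {
  return {
    model: "claude-sonnet-4-5",  // illustrative model id
    max_tokens: 2048,
    temperature: 0,              // deterministic scoring
    system: rubricPrompt,        // rubric with exact point values
    messages: [{ role: "user", content: JSON.stringify(summary) }],
  };
}

// Sketch of the call via Anthropic's Messages API.
async function scoreDomain(summary, rubricPrompt, apiKey) {
  const res = await fetch("https://api.anthropic.com/v1/messages", {
    method: "POST",
    headers: {
      "x-api-key": apiKey,
      "anthropic-version": "2023-06-01",
      "content-type": "application/json",
    },
    body: JSON.stringify(buildScoringRequest(summary, rubricPrompt)),
  });
  if (!res.ok) throw new Error(`scoring failed: ${res.status}`);
  const data = await res.json();
  return JSON.parse(data.content[0].text); // expects { score, scoreBreakdown, ... }
}
```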

Keeping the LLM honest

LLMs are good at analysis but need guardrails. Here's how we constrain the scoring:

Structured rubric enforcement

The system prompt includes the exact point values for each category. The LLM must return a scoreBreakdown object with values for all 7 categories plus the penalty, and the score field must equal their sum. This makes scoring auditable: if a domain reports a score of 68 but its breakdown sums to 20 (crawl) + 8 (signup) + 20 (docs) + 15 (bot) + 10 (agent) + 8 (pricing) + 4 (onboarding) - 25 (penalty) = 60, we know the LLM made an arithmetic error and can flag the result for re-scoring.
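The arithmetic audit is a one-liner over the breakdown. A sketch with illustrative category keys (the real field names may differ):

```javascript
// Hypothetical audit: the reported score must equal the category sum plus
// the penalty, or the result gets flagged for re-scoring.
const CATEGORIES = ["crawlAccess", "signupFriction", "apiDocs",
  "botProtection", "agentFeatures", "pricing", "onboarding"];

function auditBreakdown(result) {
  const expected = CATEGORIES.reduce((t, c) => t + result.scoreBreakdown[c], 0)
    + result.scoreBreakdown.penalty; // penalty is 0 or -25
  return { valid: expected === result.score, expected };
}
```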

HTML summarization

We don't send raw HTML to the LLM. Instead, we extract structured signals:

  • Title and meta description
  • Authentication signals (CAPTCHA markers, OAuth elements, signup forms)
  • Developer signals (API, SDK, GraphQL, REST, webhooks, OpenAPI mentions)
  • Key navigation links (docs, API, pricing, signup paths)
  • Top headings for content context
  • Script count and HTML size

This reduces token usage by ~80% while preserving the signals the LLM needs for scoring. A typical homepage HTML is 50KB; the summary is 500-2000 characters.
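The signal extraction can be sketched at the regex level. A real pipeline would use an HTML parser; this illustrative version just shows the shape of the summary object:

```javascript
// Hypothetical homepage summarizer: regex-level signal extraction, not a
// full HTML parse. Output is the compact object sent to the LLM.
function summarizeHomepage(html) {
  const title = (html.match(/<title[^>]*>([^<]*)<\/title>/i) || [])[1] || "";
  const hasCaptcha = /recaptcha|hcaptcha|turnstile|data-sitekey/i.test(html);
  const hasOAuth = /sign in with (google|github|microsoft)|oauth/i.test(html);
  const devSignals = ["api", "sdk", "graphql", "rest", "webhook", "openapi"]
    .filter((kw) => new RegExp(`\\b${kw}\\b`, "i").test(html));
  const scriptCount = (html.match(/<script\b/gi) || []).length;
  return { title, hasCaptcha, hasOAuth, devSignals, scriptCount, htmlBytes: html.length };
}
```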

Competitor context

Each benchmark run evaluates the target domain alongside 3-6 competitors. We find competitors via Brave Search ({domain} competitors, {domain} alternatives) and maintain a local map of known competitive sets for popular categories.

This gives the LLM comparative context — it can calibrate scores relative to peers. A payment platform with decent docs but no OAuth scores differently when compared to Stripe (which has exceptional docs) versus when compared in isolation.

Edge cases we've hit

Scores above 100: Our rubric maxes at 100 (the seven category maximums sum to exactly 100, minus 25 if the penalty applies), but we've seen LLM scores of 102 and 104. This happens when the LLM gives full marks in categories where our rubric is ambiguous. CloudZero and PostHog both scored above 100 in some runs — they genuinely max out every category. We clamp to 100 on display.

Review site contamination: When running competitor discovery via Brave Search, we sometimes get review sites (G2, Capterra, TrustRadius) instead of actual competitors. These domains aren't SaaS products and score poorly. We maintain an exclusion list of ~30 non-product domains to filter these out.

Subdomain docs: Companies like Stripe (docs.stripe.com) and Twilio (www.twilio.com/docs) host docs on subdomains or subpaths. Our doc path checking hits /docs on the main domain, which may redirect. We follow redirects and record the final URL, but we don't independently crawl subdomains — this occasionally undervalues companies with extensive docs on separate subdomains.

Score distribution: what 844 domains look like

Here's the full distribution:

Score Range    Count    Percentage
0–9                4        0.5%
10–19             52        6.2%
20–29             79        9.4%
30–39             67        7.9%
40–49            137       16.2%
50–59            154       18.2%
60–69            220       26.1%
70–79             96       11.4%
80–89             30        3.6%
90–99              3        0.4%
100+               2        0.2%

The distribution is roughly normal with a slight left skew, centered around 53-57. The penalty drags everything down — without it, the peak would shift to the 70s.

Where the points are lost

Average category scores reveal the patterns:

Category                 Max Points    Average    % of Max
Crawl Access                 20          18.4        92%
Bot Protection               15          14.0        93%
Pricing Transparency          8           6.5        81%
API & Developer Docs         20          13.3        67%
Signup Friction              20          14.3        72%
Agent Features               10           4.2        42%
Onboarding Automation         7           3.1        44%
No Agent Auth Penalty     0 (best)     -24.7          —

The top three categories (crawl access, bot protection, pricing) are near max. Companies aren't actively hostile to agents — they're just not building for them. The gap is in the active measures: agent-specific features (42% of max), onboarding automation (44%), and the universal auth penalty.

Heuristic fallback

The LLM analysis costs money and takes time (~5-10 seconds per domain). For real-time benchmark requests on the website, we can't always wait for Claude. We built a heuristic scorer that runs instantly as a fallback:

// Heuristic crawl access score
let crawlAccess = 15; // default: no robots.txt
if (robots.exists) {
  if (robots.blocksAllBots) crawlAccess = 3;
  else if (robots.blocksAI) crawlAccess = 10;
  else crawlAccess = 20;
}

// Heuristic signup friction
let signupFriction = 10; // default
if (!signup.hasSignupLink) signupFriction = 0;
else if (!signup.hasCaptcha && signup.hasOAuth) signupFriction = 20;
else if (!signup.hasCaptcha) signupFriction = 15;
else if (signup.hasOAuth) signupFriction = 8;
else signupFriction = 4;

The heuristic detects CAPTCHAs via regex (recaptcha|hcaptcha|turnstile|data-sitekey), OAuth via keyword matching, and documentation via path probing. It agrees with the LLM scorer ~80% of the time (within ±10 points). The biggest divergence is in docs quality — the heuristic can tell you docs exist but can't evaluate their quality.

When Bright Data signals are available, the heuristic gets significantly more accurate. Browser-grade signup flow analysis (detecting CAPTCHAs, OAuth buttons, and phone verification on the actual rendered signup page) closes much of the gap with LLM scoring.

Lessons learned

1. The penalty is the most important design decision

Removing the -25 penalty would make our scores higher, our charts prettier, and our pitch to potential customers more flattering ("your score is 78 — but it could be 95!"). We kept it because the penalty represents a real, fundamental problem: there is almost no SaaS product today where an AI agent can go from zero to API access without a human touching something.

This is the market gap AgentGate exists to close.

2. LLM scoring requires a rigid rubric

Our first version of the scoring prompt was: "Score this website from 0-100 on agent readiness." The results were inconsistent — the same site would get 45 in one run and 72 in another. The LLM had no anchor for what "agent readiness" meant quantitatively.

The structured rubric with fixed point values solved this. By decomposing the score into 7 concrete categories with explicit criteria, we reduced run-to-run variance from ±15 to ±3 points. The LLM still exercises judgment (is this API documentation "comprehensive" or "limited"?), but the judgment is bounded within a 6-point range, not a 27-point range.

3. Brave Search is better than fetching for content verification

For verifying what content exists on a website, Brave Search outperforms direct HTTP requests. Direct fetching fails on SPAs, gets blocked by WAFs, and misses content on subdomains. Brave's index reflects the fully rendered, indexed reality of what pages exist and what they contain.

We now treat Brave snippets as ground truth and use direct fetching primarily for robots.txt, AI-specific files, and structured data that search engines don't index.

4. Bot protection matters less than expected

Before building the benchmark, we assumed bot protection would be a major barrier. In practice, only 7% of domains have moderate-or-higher bot protection. The real barriers are more mundane: CAPTCHAs on signup forms, lack of M2M auth flows, and documentation that assumes a human reader.

5. The benchmark is a snapshot, not a verdict

A score of 45 doesn't mean a product is bad. It means that, at the time of measurement, the specific signals we check suggest room for improvement in AI agent accessibility. Companies change their practices, adopt new standards, and improve their developer experience. The benchmark should be re-run periodically.

We're building toward automated re-scoring — running the full batch monthly and tracking score changes over time. The most interesting data will be the trend lines, not the individual snapshots.


Try it yourself

The AgentGate benchmark is live at anon-dev.com/benchmark. Enter any domain and see how it scores across all 7 categories. The leaderboard shows the current top-scoring companies from our batch analysis.

If you're a SaaS company looking to improve your agent-readiness score, the breakdown tells you exactly where to focus. The highest-impact changes:

  1. Remove CAPTCHAs from signup (or add an OAuth option that bypasses them)
  2. Publish an llms.txt at your domain root
  3. Offer OAuth Client Credentials for M2M authentication
  4. Make API keys available programmatically via an API endpoint, not just a dashboard

Every point matters when an agent is deciding which service to integrate with.
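For item 3 in that list, the client-credentials flow is the standard OAuth 2.0 pattern for machine-to-machine auth: one POST, no browser. A generic sketch; the token URL and field names vary per provider, and the helper names here are illustrative:

```javascript
// Pure helper building the standard client_credentials form body.
function clientCredentialsBody(clientId, clientSecret) {
  return new URLSearchParams({
    grant_type: "client_credentials",
    client_id: clientId,
    client_secret: clientSecret,
  }).toString();
}

// Sketch of the token request an agent would make. Some providers expect
// the credentials in a Basic auth header instead; check the provider's docs.
async function getM2MToken(tokenUrl, clientId, clientSecret) {
  const res = await fetch(tokenUrl, {
    method: "POST",
    headers: { "content-type": "application/x-www-form-urlencoded" },
    body: clientCredentialsBody(clientId, clientSecret),
  });
  if (!res.ok) throw new Error(`token request failed: ${res.status}`);
  return (await res.json()).access_token;
}
```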


Interested in the raw data? We'll be publishing the full dataset of scored domains on the leaderboard. Want to discuss methodology? Reach out — we're always refining the rubric based on how the agent ecosystem evolves.
