Rate limiting for agents vs humans: why your 429s are killing conversion
The 429 problem
A human user hits your API maybe 10-50 times during a session. They click a button, wait for a response, think about what to do next, click another button. The gaps between requests are measured in seconds.
An AI agent hits your API 500 times in the first minute. It reads your entire API reference, enumerates available resources, runs a discovery scan, then starts making the calls it actually needs. The gaps between requests are measured in milliseconds.
Your rate limiter — designed for humans — sees 500 requests in 60 seconds and does exactly what it was built to do: return 429 Too Many Requests. The agent backs off, retries, gets another 429, backs off again. Within 3 minutes, a legitimate agent that was trying to integrate with your product has given up and moved to a competitor with more permissive limits.
You never see this in your analytics because the agent doesn't file a support ticket. It just leaves.
According to Cloudera's 2025 enterprise survey, 96% of IT leaders plan to expand their use of AI agents in the next 12 months. If your rate limiting is calibrated for human traffic, you're about to lose a lot of potential integrations.
How agent traffic differs from human traffic
Before redesigning rate limits, you need to understand what you're designing for. Agent traffic has five characteristics that break traditional rate limiting assumptions:
1. Burst-then-settle pattern
Human traffic is relatively steady — a few requests per second, sustained over minutes. Agent traffic is bimodal: an intense initial burst (discovery, authentication, schema fetching) followed by a lower, steady-state pattern (actual API usage).
Human traffic pattern:

```text
▂▃▃▂▃▃▂▃▂▃▃▂▃▃▂▃▂▃▃▂  (~1 req/sec, steady)
```

Agent traffic pattern:

```text
█████████▃▂▂▁▁▂▁▂▁▂▁▁  (50 req/sec burst → 2 req/sec steady)
```
A fixed-window rate limit of 100 requests/minute handles the human fine (roughly 60 requests in the minute) but cuts the agent off about 20 seconds into its burst phase, once it blows through the 100-request budget — even though the agent's total requests over 10 minutes might be lower than the human's.
2. Parallel request patterns
Humans are serial: click, wait, click, wait. Agents are parallel. A well-built agent pipeline will fire 10-20 concurrent requests to fetch related resources, process them simultaneously, and then issue the next batch.
This means per-second rate limits hit agents harder than per-minute limits. An agent sending 20 parallel requests in 100ms looks like a DDoS attack to a per-second limiter, but it's well within a 1000 req/minute budget.
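The arithmetic is easy to see in a short sketch. The fixed-window counter below is a toy model written for this illustration, not a production limiter, and the request timings are hypothetical:

```javascript
// Toy fixed-window counter: bucket requests by window, reject overflow.
function countRejected(timestampsMs, limit, windowMs) {
  const counts = new Map();
  let rejected = 0;
  for (const t of timestampsMs) {
    const w = Math.floor(t / windowMs);
    const n = (counts.get(w) || 0) + 1;
    counts.set(w, n);
    if (n > limit) rejected++;
  }
  return rejected;
}

// 20 parallel requests spread over 100ms, as an agent batch might send
const batch = Array.from({ length: 20 }, (_, i) => i * 5);

console.log(countRejected(batch, 10, 1000));    // 10 rejected by a 10 req/sec limit
console.log(countRejected(batch, 1000, 60000)); // 0 rejected by a 1000 req/min limit
```

Half the batch dies under the per-second limit even though the whole batch consumes 2% of the per-minute budget.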
3. Retry amplification
This is the most insidious pattern. When an agent gets a 429:
- It retries (often immediately — not all agents implement backoff)
- The retry also gets 429'd (it's still inside the rate limit window)
- Each retry counts against the limit, keeping the counter saturated
- Multiple agents hitting the limit simultaneously create a retry storm

A single 429 can cascade into dozens of additional requests — the rate limiter creates the very problem it was designed to prevent.
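A deliberately simplified simulation makes the amplification visible. The numbers here (100 requests per window, an agent that needs 120, and a threshold standing in for the window eventually resetting under naive retries) are illustrative, not drawn from any real client:

```javascript
// Toy model: compare total requests sent by a client that waits for the
// window to reset after a 429 vs one that retries into the same window.
function simulate(needed, limit, honorsRetryAfter) {
  let sent = 0, served = 0, windowCount = 0;
  while (served < needed) {
    sent++;
    if (windowCount < limit) { windowCount++; served++; }
    else if (honorsRetryAfter) { windowCount = 0; }       // waits for reset; window is fresh
    else if (sent - served > 50) { windowCount = 0; }     // naive retries burn until reset
  }
  return sent;
}

console.log(simulate(120, 100, true));  // 121 sent: 120 served, 1 rate-limited
console.log(simulate(120, 100, false)); // 171 sent: 120 served, 51 wasted retries
```

The naive client sends 50 extra requests that do nothing but keep the limiter busy.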
4. Discovery-heavy first sessions
The first time an agent interacts with your API, it needs to discover what's available. This means fetching:
- API schema or OpenAPI spec
- Authentication endpoints
- Available resources and their relationships
- Pagination metadata for collection endpoints
- Rate limit policies (if documented)
This initial discovery phase can generate 50-200 requests before the agent makes a single "real" API call. If your rate limit is 100 req/minute globally, the agent never gets past discovery.
5. Multi-agent synchronization
When a popular AI framework releases a new feature or a triggering event occurs (market data change, scheduled task), hundreds of agents may hit your API simultaneously. Unlike human traffic spikes that ramp up gradually, agent traffic spikes are instantaneous — every agent acts on the same signal at the same time.
What your rate limit headers should look like
Before changing your limits, make sure agents can read them. The IETF draft for RateLimit header fields (draft-ietf-httpapi-ratelimit-headers) defines a standard that agents can parse programmatically.
The minimum viable rate limit response
Every API response should include these headers:
```http
HTTP/1.1 200 OK
Content-Type: application/json
RateLimit-Limit: 1000
RateLimit-Remaining: 847
RateLimit-Reset: 1709078400
RateLimit-Policy: 1000;w=3600, 50;w=1
```
Let's break these down:
| Header | Meaning | Why agents need it |
|---|---|---|
| `RateLimit-Limit` | Max requests in the current window | Agent knows its budget |
| `RateLimit-Remaining` | Requests left in this window | Agent can pace itself |
| `RateLimit-Reset` | Unix timestamp when the window resets | Agent knows when to resume |
| `RateLimit-Policy` | Limit structure (1000/hour, 50/second) | Agent can plan request scheduling |
The 429 response
When an agent does hit the limit, give it everything it needs to recover:
```http
HTTP/1.1 429 Too Many Requests
Content-Type: application/json
Retry-After: 30
RateLimit-Limit: 1000
RateLimit-Remaining: 0
RateLimit-Reset: 1709078400
RateLimit-Policy: 1000;w=3600, 50;w=1

{
  "error": {
    "type": "rate_limit_exceeded",
    "message": "Rate limit exceeded. 1000 requests per hour allowed.",
    "retry_after": 30,
    "limit": 1000,
    "window": "1h",
    "reset_at": "2024-02-28T00:00:00Z",
    "docs_url": "https://docs.yourapp.com/rate-limits"
  }
}
```
The Retry-After header is critical. Without it, agents guess — and they usually guess wrong (either too aggressive or too conservative). GitHub's API is the gold standard here:
```http
x-ratelimit-limit: 5000
x-ratelimit-remaining: 4999
x-ratelimit-reset: 1709078400
x-ratelimit-resource: core
x-ratelimit-used: 1
```
GitHub returns rate limit headers on every response, not just 429s. The x-ratelimit-resource field is particularly useful — it tells the agent which rate limit bucket the request counted against, so it can manage different quotas independently.
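Client-side, a resource field like this lets an agent keep a separate budget per bucket. The QuotaTracker class below is a hypothetical sketch of that bookkeeping, keyed on GitHub-style header names; adapt the names to whatever your API emits:

```javascript
// Track one { remaining, reset } budget per rate limit bucket.
class QuotaTracker {
  constructor() {
    this.buckets = new Map(); // resource name → { remaining, reset }
  }
  update(headers) {
    const resource = headers['x-ratelimit-resource'] || 'default';
    this.buckets.set(resource, {
      remaining: parseInt(headers['x-ratelimit-remaining'], 10),
      reset: parseInt(headers['x-ratelimit-reset'], 10), // epoch seconds
    });
  }
  canRequest(resource, nowEpochSec) {
    const b = this.buckets.get(resource);
    if (!b) return true;              // unknown bucket: be optimistic
    if (b.remaining > 0) return true; // budget left
    return nowEpochSec >= b.reset;    // otherwise wait for the reset
  }
}

const tracker = new QuotaTracker();
tracker.update({
  'x-ratelimit-resource': 'core',
  'x-ratelimit-remaining': '0',
  'x-ratelimit-reset': '1709078400',
});
console.log(tracker.canRequest('core', 1709078000));   // false: core exhausted
console.log(tracker.canRequest('search', 1709078000)); // true: separate bucket
```

Exhausting one bucket never stalls requests that count against a different one.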
Three rate limiting architectures, compared
1. Fixed window (what most APIs use today)
```text
Window: 1 minute
Limit: 100 requests
Counter resets at: start of each minute

[ 0:00 ─────────────── 1:00 ][ 1:00 ─────────────── 2:00 ]
Human: ▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃       ▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃      ✅ (60/min)
Agent: █████████░░░░░░░░      ██████░░░░░░░░░░░      ❌ (100 in 20s)
```
Problem for agents: An agent hitting 100 requests in the first 20 seconds gets blocked for 40 seconds, even though its total load over the full minute would've been under 100. Worse: the boundary problem. If an agent sends 90 requests in the last 10 seconds of one window and 90 in the first 10 seconds of the next, it effectively sends 180 requests in 20 seconds — double the intended limit — because counters reset at the boundary.
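The boundary problem is easy to demonstrate. The helper below, written for this illustration, finds the busiest span of a given length in a list of request timestamps:

```javascript
// Sliding scan: largest number of requests inside any spanSec-second window.
function maxInSpan(timestampsSec, spanSec) {
  const sorted = [...timestampsSec].sort((a, b) => a - b);
  let best = 0;
  for (let i = 0, j = 0; j < sorted.length; j++) {
    while (sorted[j] - sorted[i] >= spanSec) i++;
    best = Math.max(best, j - i + 1);
  }
  return best;
}

// 90 requests in seconds 50–59 of window 1, 90 in seconds 60–69 of window 2:
// each fixed window sees only 90 requests, comfortably under a 100/min limit.
const burst = [
  ...Array.from({ length: 90 }, (_, i) => 50 + (i % 10)),
  ...Array.from({ length: 90 }, (_, i) => 60 + (i % 10)),
];

console.log(maxInSpan(burst, 20)); // 180 requests inside one 20-second span
```

Neither window ever rejects a request, yet the server absorbs nearly double the intended rate across the boundary.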
Implementation (Redis):
```javascript
async function fixedWindowCheck(clientId, limit, windowSec) {
  const window = Math.floor(Date.now() / 1000 / windowSec);
  const key = `ratelimit:${clientId}:${window}`;

  const current = await redis.incr(key);
  if (current === 1) {
    await redis.expire(key, windowSec);
  }

  return {
    allowed: current <= limit,
    remaining: Math.max(0, limit - current),
    resetAt: (window + 1) * windowSec,
  };
}
```
Verdict: Simple to implement. Bad for agents. Use only for the most basic APIs where agent traffic is minimal.
2. Token bucket (better for burst tolerance)
```text
Bucket capacity: 100 tokens
Refill rate: 10 tokens/second
Each request costs 1 token

[ 0:00 ─────────────────────────── 10:00 ]
Agent burst:    ██████████ (100 tokens spent in 2s)
Bucket refills: ░░░░░░░░░░░░░░░░░░ (10/sec)
Agent resumes:  ▃▃▃▃▃▃▃▃▃▃▃▃▃▃ (steady 8/sec) ✅
```
The token bucket allows bursts up to the bucket capacity, then throttles to the refill rate. This matches the agent's burst-then-settle pattern perfectly.
Implementation (Redis + Lua for atomicity):
```lua
-- Token bucket rate limiter (Lua script for Redis)
local key = KEYS[1]
local capacity = tonumber(ARGV[1])    -- e.g. 100
local refill_rate = tonumber(ARGV[2]) -- e.g. 10 tokens/sec
local now = tonumber(ARGV[3])         -- current timestamp (ms)
local requested = tonumber(ARGV[4])   -- tokens needed (usually 1)

-- Get current bucket state
local bucket = redis.call('HMGET', key, 'tokens', 'last_refill')
local tokens = tonumber(bucket[1]) or capacity
local last_refill = tonumber(bucket[2]) or now

-- Calculate refilled tokens since last request
local elapsed = (now - last_refill) / 1000 -- convert ms to seconds
local refilled = math.floor(elapsed * refill_rate)
tokens = math.min(capacity, tokens + refilled)

-- Check if request is allowed
local allowed = tokens >= requested
if allowed then
  tokens = tokens - requested
end

-- Update bucket state; expire once a full refill has certainly happened
redis.call('HMSET', key, 'tokens', tokens, 'last_refill', now)
redis.call('EXPIRE', key, math.ceil(capacity / refill_rate) + 1)

-- {allowed flag, tokens left, ms until enough tokens refill}
return {allowed and 1 or 0, tokens, math.ceil((requested - tokens) / refill_rate * 1000)}
```
Node.js wrapper:

```javascript
const fs = require('fs');

const tokenBucketScript = fs.readFileSync('./token-bucket.lua', 'utf8');

async function tokenBucketCheck(clientId, capacity, refillRate) {
  const [allowed, remaining, retryAfterMs] = await redis.eval(
    tokenBucketScript,
    1,                       // number of keys
    `ratelimit:${clientId}`, // KEYS[1]
    capacity,                // ARGV[1]
    refillRate,              // ARGV[2]
    Date.now(),              // ARGV[3]
    1                        // ARGV[4] - tokens per request
  );

  return {
    allowed: allowed === 1,
    remaining,
    retryAfterMs: allowed === 1 ? 0 : retryAfterMs,
  };
}
```
Verdict: Great for agent traffic. Tolerates the initial burst, then enforces a sustainable rate. The capacity parameter directly controls how much burst you'll accept.
3. Sliding window with agent-aware tiers (recommended)
The ideal approach combines a sliding window counter (no boundary problems) with per-client tier configuration that recognizes agents as a distinct traffic class:
```javascript
const TIERS = {
  // Human-oriented: browser sessions, interactive use
  free: {
    requestsPerMinute: 60,
    requestsPerHour: 1000,
    burstCapacity: 20, // max concurrent
    retryAfterSeconds: 60,
  },
  // Standard agent tier: registered API clients
  agent_standard: {
    requestsPerMinute: 300,
    requestsPerHour: 10000,
    burstCapacity: 100, // agents burst
    retryAfterSeconds: 10,
  },
  // Premium agent tier: paid integrations
  agent_premium: {
    requestsPerMinute: 1000,
    requestsPerHour: 50000,
    burstCapacity: 500,
    retryAfterSeconds: 5,
  },
};
```
Sliding window implementation:
```javascript
// Accepts an optional config override so an adaptive limiter can pass
// adjusted limits; defaults to the static tier configuration.
async function slidingWindowCheck(clientId, tier, config = TIERS[tier]) {
  const now = Date.now();
  const windowMs = 60_000; // 1 minute
  const windowStart = now - windowMs;
  const key = `ratelimit:sliding:${clientId}`;

  // Atomic operation: remove old entries, add new, count
  const pipeline = redis.pipeline();
  pipeline.zremrangebyscore(key, 0, windowStart);     // Remove expired
  pipeline.zadd(key, now, `${now}:${Math.random()}`); // Add current
  pipeline.zcard(key);                                // Count in window
  pipeline.expire(key, 120);                          // TTL safety

  const results = await pipeline.exec();
  const requestCount = results[2][1];
  const allowed = requestCount <= config.requestsPerMinute;

  if (!allowed) {
    // Remove the request we just added so it doesn't count against the client
    await redis.zremrangebyscore(key, now, now);
  }

  return {
    allowed,
    limit: config.requestsPerMinute,
    remaining: Math.max(0, config.requestsPerMinute - requestCount),
    resetAt: Math.ceil((windowStart + windowMs) / 1000),
    retryAfter: allowed ? 0 : config.retryAfterSeconds,
    tier,
  };
}
```
Verdict: Most accurate rate limiting. Per-agent tiers let you offer higher limits to registered agent clients without changing anything for human users. The sliding window eliminates boundary exploits.
Designing agent-aware rate limit tiers
The key insight: agents and humans should not share the same rate limit pool. Here's a practical tier design:
Tier 1: Anonymous / unidentified traffic
```yaml
# No API key, no auth — could be anyone
requests_per_minute: 30
requests_per_hour: 500
burst: 10
applies_to: requests without API key or token
```
This is your DDoS protection tier. Low limits, aggressive throttling. Agents that haven't authenticated yet land here during their discovery phase.
Tier 2: Authenticated human users
```yaml
# Logged-in users making API calls from your dashboard/app
requests_per_minute: 100
requests_per_hour: 3000
burst: 30
applies_to: requests with session cookie or user access token
```
Standard human limits. Most SaaS APIs already have something like this.
Tier 3: Registered agent clients
```yaml
# OAuth client credentials, API keys marked as agent/service
requests_per_minute: 500
requests_per_hour: 20000
burst: 200
initial_burst_bonus: 500 # Extra burst for first 5 minutes
applies_to: requests with client_credentials token or agent-flagged API key
```
This is the critical tier. Registered agent clients get 5x the human limit, plus an initial burst bonus for the discovery phase. The initial_burst_bonus gives agents 500 extra requests in their first 5 minutes — enough to fetch your OpenAPI spec, enumerate resources, and start working.
Tier 4: Premium / enterprise agent clients
```yaml
requests_per_minute: 2000
requests_per_hour: 100000
burst: 1000
dedicated_pool: true # Doesn't share capacity with other tiers
applies_to: enterprise contracts, high-volume integrations
```
Enterprise agents get dedicated capacity that doesn't compete with other traffic. This is table stakes for any SaaS company selling to enterprises that use AI agents for automation.
Implementation: detecting agent traffic
How do you know which requests come from agents? Several signals:
```javascript
function classifyClient(req) {
  // 1. OAuth client credentials = definitely an agent
  if (req.auth?.grantType === 'client_credentials') {
    return req.auth.tier || 'agent_standard';
  }

  // 2. API key with agent flag
  if (req.apiKey?.type === 'agent' || req.apiKey?.type === 'service') {
    return 'agent_standard';
  }

  // 3. User-Agent heuristics (fallback)
  const ua = req.headers['user-agent'] || '';
  const agentPatterns = [
    /^python-requests/i,
    /^axios/i,
    /^node-fetch/i,
    /^Go-http-client/i,
    /langchain/i,
    /openai-agent/i,
    /anthropic-sdk/i,
    /^curl/i,
  ];
  if (agentPatterns.some(p => p.test(ua))) {
    return 'agent_standard'; // Treat SDK traffic as agent
  }

  // 4. Behavioral signals
  if (req.session?.requestsInLastMinute > 50) {
    return 'agent_standard'; // Upgraded mid-session based on behavior
  }

  return 'free'; // Default to human tier
}
```
This isn't about blocking agents — it's about serving them better. Agents classified into the agent_standard tier get higher limits than the default free tier.
The retry-after contract
The Retry-After header is a contract between your API and the agent. When you return it, you're saying: "If you wait this long, I guarantee your next request will succeed."
Most APIs break this contract. They return Retry-After: 60 but the rate limit resets in 45 seconds, or worse — the agent retries after 60 seconds and gets another 429 because the window calculation doesn't align.
Implementing honest Retry-After
```javascript
function buildRateLimitResponse(req, res, rateLimitResult) {
  const { limit, remaining, resetAt, retryAfter, tier } = rateLimitResult;

  // Always include rate limit headers — even on 200 responses
  res.set({
    'RateLimit-Limit': limit,
    'RateLimit-Remaining': remaining,
    'RateLimit-Reset': resetAt,
    'RateLimit-Policy': `${limit};w=60`,
    'X-RateLimit-Tier': tier,
  });

  if (!rateLimitResult.allowed) {
    // Calculate EXACT seconds until the agent can retry
    const exactRetryAfter = Math.max(1, resetAt - Math.floor(Date.now() / 1000));
    res.set('Retry-After', exactRetryAfter);

    return res.status(429).json({
      error: {
        type: 'rate_limit_exceeded',
        message: `Rate limit exceeded for tier "${tier}". ` +
          `${limit} requests per minute allowed.`,
        retry_after: exactRetryAfter,
        limit,
        remaining: 0,
        reset_at: new Date(resetAt * 1000).toISOString(),
        tier,
        upgrade_url: tier === 'free'
          ? 'https://yourapp.com/pricing#agent-tier'
          : undefined,
      },
    });
  }
}
```
Note the upgrade_url in the error response. When a free-tier client gets rate limited, the response tells the agent (or the developer building the agent) exactly where to go to get higher limits. This turns a 429 from a dead end into a conversion opportunity.
Adaptive rate limiting: letting the system breathe
Static rate limits — even well-designed ones — can't handle all scenarios. What happens when your API is running hot and even the allowed traffic is causing latency? Or when it's 3 AM and your servers are idle — why not let agents burst higher?
Adaptive rate limiting adjusts limits based on real-time system health:
```javascript
class AdaptiveRateLimiter {
  constructor(baseConfig) {
    this.baseConfig = baseConfig;
    this.healthMultiplier = 1.0;

    // Monitor system health every 10 seconds
    setInterval(() => this.updateHealth(), 10_000);
  }

  async updateHealth() {
    const metrics = await this.getSystemMetrics();

    // Scale limits based on system load
    if (metrics.p99Latency > 2000 || metrics.errorRate > 0.05) {
      // System stressed: tighten limits
      this.healthMultiplier = 0.5;
    } else if (metrics.p99Latency > 1000 || metrics.errorRate > 0.02) {
      // System warm: slight reduction
      this.healthMultiplier = 0.75;
    } else if (metrics.cpuUtilization < 0.3) {
      // System idle: allow more traffic
      this.healthMultiplier = 1.5;
    } else {
      // Normal operation
      this.healthMultiplier = 1.0;
    }
  }

  getEffectiveLimit(tier) {
    const base = this.baseConfig[tier].requestsPerMinute;
    return Math.floor(base * this.healthMultiplier);
  }

  async check(clientId, tier) {
    // Pass the scaled limit as a config override to the sliding window check
    const effectiveLimit = this.getEffectiveLimit(tier);
    return slidingWindowCheck(clientId, tier, {
      ...this.baseConfig[tier],
      requestsPerMinute: effectiveLimit,
    });
  }
}
```
The healthMultiplier scales all rate limits based on system conditions:
- System stressed (high latency or errors): Cut limits by 50% to protect the service
- System warm: Reduce by 25% as a precaution
- System idle: Boost limits by 50% — let agents use available capacity
- Normal: Apply base limits
This means an agent hitting your API at 3 AM might get 750 req/minute instead of 500, while the same agent during a traffic spike might get 250. Both are fair — and both are better than a static limit that's either too low during quiet times or too high during load.
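Applied to the agent_standard base of 500 req/minute, the multiplier arithmetic works out as follows (state names here are just labels for the four branches above):

```javascript
// Effective per-minute limits for a 500 req/min base tier
const base = 500;
const multipliers = { idle: 1.5, normal: 1.0, warm: 0.75, stressed: 0.5 };
const effective = Object.fromEntries(
  Object.entries(multipliers).map(([state, m]) => [state, Math.floor(base * m)])
);

console.log(effective); // { idle: 750, normal: 500, warm: 375, stressed: 250 }
```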
What this looks like in practice
Here's how the complete middleware fits together:
```javascript
const rateLimiter = new AdaptiveRateLimiter(TIERS);

app.use(async (req, res, next) => {
  const clientId = req.auth?.clientId || req.ip;
  const tier = classifyClient(req);

  const result = await rateLimiter.check(clientId, tier);

  // Sets rate limit headers on every response (even on success)
  // and sends the 429 body when the limit is exceeded
  buildRateLimitResponse(req, res, result);

  if (!result.allowed) return;
  next();
});
```
And the agent-side handling:
```python
import asyncio
import time
from typing import Optional

import httpx


class AgentAPIClient:
    def __init__(self, base_url: str, api_key: str):
        self.base_url = base_url
        self.client = httpx.AsyncClient(
            headers={"Authorization": f"Bearer {api_key}"}
        )
        self.rate_limit_remaining: Optional[int] = None
        self.rate_limit_reset: Optional[float] = None

    async def request(self, method: str, path: str, **kwargs):
        # Pre-check: if we know we're out of budget, wait for the reset
        if self.rate_limit_remaining == 0 and self.rate_limit_reset:
            wait = self.rate_limit_reset - time.time()  # reset is a Unix timestamp
            if wait > 0:
                await asyncio.sleep(wait)

        for attempt in range(5):
            response = await self.client.request(
                method, f"{self.base_url}{path}", **kwargs
            )

            # Update rate limit state from headers
            self.rate_limit_remaining = int(
                response.headers.get("RateLimit-Remaining", 999)
            )
            reset = response.headers.get("RateLimit-Reset")
            if reset:
                self.rate_limit_reset = float(reset)

            if response.status_code != 429:
                return response

            # 429: use Retry-After if available, else exponential backoff
            retry_after = response.headers.get("Retry-After")
            if retry_after:
                await asyncio.sleep(float(retry_after))
            else:
                await asyncio.sleep(2 ** attempt)  # 1, 2, 4, 8, 16s

        raise Exception(f"Rate limited after 5 retries on {path}")
```
The agent reads RateLimit-Remaining on every response and proactively waits when it knows it's about to hit the limit. When it does get a 429, it respects the Retry-After header exactly. No guessing, no retry storms.
The conversion math
Let's put real numbers to this. Assume:
- 100 AI agent integrations attempt your API per month
- 30% are blocked by rate limits during the discovery phase (common with default configs)
- Each successful agent integration generates $200/month in API usage
- Average agent lifetime: 8 months
With default rate limits (calibrated for humans):
- 70 agents succeed → $200 × 70 × 8 = $112,000 lifetime revenue
With agent-aware rate limits:
- 95 agents succeed → $200 × 95 × 8 = $152,000 lifetime revenue
The difference: $40,000 in recovered revenue from changing a configuration file. No new features. No new code. Just acknowledging that your fastest-growing user segment needs different limits than a human clicking around a dashboard.
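Spelled out in code, with the assumptions above ($200/month per integration, 8-month average lifetime):

```javascript
// Lifetime revenue as a function of how many agent integrations succeed
const monthlyRevenue = 200;
const lifetimeMonths = 8;
const lifetimeValue = agents => agents * monthlyRevenue * lifetimeMonths;

console.log(lifetimeValue(70)); // 112000 — default, human-calibrated limits
console.log(lifetimeValue(95)); // 152000 — agent-aware limits
console.log(lifetimeValue(95) - lifetimeValue(70)); // 40000 recovered
```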
The takeaway
Rate limiting exists to protect your infrastructure. But protection that blocks legitimate traffic isn't protection — it's a conversion killer.
The fix is straightforward:
- Add rate limit headers to every response (not just 429s)
- Use token bucket or sliding window instead of fixed window
- Create separate agent tiers with higher burst capacity and limits
- Return honest Retry-After values so agents can schedule retries precisely
- Consider adaptive limits that scale with system health
- Include an upgrade_url in 429 responses to convert rate-limited agents into paying customers
Your rate limiter is the first thing every AI agent interacts with on your platform. Make it an onramp, not a wall.
Want to see how your API's rate limiting affects your agent-readiness score? Run your domain through the AgentGate benchmark.