Written by Gareth Simono, Founder and CEO of Agentik {OS}. Full-stack developer and AI architect with years of experience shipping production applications across SaaS, mobile, and enterprise platforms. Gareth orchestrates 267 specialized AI agents to deliver production software 10x faster than traditional development teams.
Someone will abuse your AI API. Without rate limiting, you face a financial emergency. Token bucket, per-user quotas, and circuit breakers prevent this.

Someone will abuse your AI API. Not if. When.
Maybe it is a developer who accidentally puts an API call inside an infinite loop. Maybe it is a bot scraping your service by submitting thousands of requests per minute. Maybe it is a legitimate power user who has no idea their workflow generates 50x the normal request volume. Maybe it is you, testing at 2am, forgetting you left a script running.
Without rate limiting, any of these scenarios turns into a financial emergency. A single user can generate thousands of dollars in AI provider costs before anyone notices. I have watched it happen. Three times in the last year alone. Each time, the team said the same thing afterward: "We knew we needed rate limiting. We just hadn't built it yet."
The cost profile of AI APIs is fundamentally different from traditional APIs. A traditional API request costs fractions of a cent. An AI API request can cost multiple cents or even dollars for complex queries with large context windows. That difference changes everything about how you design protection systems.
This is not optional infrastructure. This is the foundation you build on day one, before you write a single feature.
Most rate limiting tutorials cover the basics: requests per minute, requests per user, standard sliding windows. That knowledge transfers to AI APIs, but it is not sufficient.
The problem is the cost multiplier. When I send an HTTP request to a traditional REST API, the server runs a database query that costs milliseconds and fractions of a cent. When I send a request to an AI API, the server runs inference that costs compute time and real money. The cost scales with the request, not just the count.
A user sending ten requests with 50-token contexts costs almost nothing. A user sending ten requests with 100,000-token contexts might cost you $20. Same number of requests. Completely different financial exposure.
This means you need to rate limit on multiple dimensions simultaneously:
Request count keeps burst attacks manageable. Token consumption keeps the cost bounded. Concurrent connection limits prevent thread exhaustion. Time-windowed spending catches abuse that stays under request limits by using expensive prompts.
Ignore any of these dimensions and you have a gap. Gaps get exploited. Sometimes accidentally, sometimes deliberately.
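The four dimensions can be evaluated as a single gate. Here is a minimal sketch; the `UsageSnapshot` and limit shapes are my own illustrative assumptions, not a prescribed structure:

```typescript
// Illustrative shapes - the article tracks these four dimensions but does not
// prescribe this exact structure.
interface DimensionLimits {
  maxRequestsPerMin: number;
  maxTokensPerMin: number;
  maxConcurrent: number;
  maxCentsPerHour: number;
}

interface UsageSnapshot {
  requestsThisMin: number;
  tokensThisMin: number;
  inFlight: number;
  centsThisHour: number;
}

// Evaluate every dimension and report which ones block, so callers can tell
// users exactly which limit they hit.
export function checkAllDimensions(
  usage: UsageSnapshot,
  limits: DimensionLimits
): { allowed: boolean; blockedBy: string[] } {
  const blockedBy: string[] = [];
  if (usage.requestsThisMin >= limits.maxRequestsPerMin) blockedBy.push('request-count');
  if (usage.tokensThisMin >= limits.maxTokensPerMin) blockedBy.push('token-consumption');
  if (usage.inFlight >= limits.maxConcurrent) blockedBy.push('concurrency');
  if (usage.centsThisHour >= limits.maxCentsPerHour) blockedBy.push('spending');
  return { allowed: blockedBy.length === 0, blockedBy };
}
```

Returning the list of blocking dimensions, rather than a bare boolean, makes the error messages later in this article possible.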
The first financial emergency I witnessed was a developer testing an integration. They accidentally sent the entire contents of a 500-page PDF as context on every request, in a loop. Thirty minutes and $400 later, someone noticed. Rate limiting would have capped the damage at $2.
Most rate limiting articles explain five different algorithms and leave you confused about which to pick. Leaky bucket, fixed window, sliding window, token bucket, concurrency limits. They all have their place.
For AI applications, start with token bucket. It is almost always the right choice.
Here is how it works. Each user has a virtual bucket. The bucket fills with tokens at a constant rate. Each API request consumes tokens from the bucket. When the bucket is empty, requests get queued or rejected until more tokens accumulate.
Why this works perfectly for AI applications: it allows burst usage while enforcing average rates. A user can send ten requests in rapid succession to populate a dashboard. Then they sit idle for a few minutes. The bucket refills. They burst again. This matches how humans actually use AI tools. Nobody sends requests at a perfectly steady rate.
The two parameters you configure per tier are fill rate and bucket size. Fill rate determines the sustained request rate. Bucket size determines the maximum burst. Get these numbers from your usage analytics, not from guessing.
import Redis from 'ioredis';
const redis = new Redis(process.env.REDIS_URL!);
type Tier = 'free' | 'pro' | 'enterprise';
interface RateLimitConfig {
fillRate: number; // requests per minute
bucketSize: number; // max burst
}
const TIER_CONFIG: Record<Tier, RateLimitConfig> = {
free: { fillRate: 10, bucketSize: 20 },
pro: { fillRate: 60, bucketSize: 100 },
enterprise: { fillRate: 300, bucketSize: 500 },
};
export async function checkRateLimit(
userId: string,
tier: Tier
): Promise<{ allowed: boolean; remaining: number; retryAfter?: number }> {
const config = TIER_CONFIG[tier];
const key = `rl:${userId}`;
const now = Date.now();
// Atomic read of current state
const [rawTokens, rawLastRefill] = await redis.hmget(key, 'tokens', 'lastRefill');
const currentTokens = Number(rawTokens ?? config.bucketSize);
const lastRefillTime = Number(rawLastRefill ?? now);
// Calculate tokens accumulated since last check
const elapsedMinutes = (now - lastRefillTime) / 1000 / 60;
const accruedTokens = elapsedMinutes * config.fillRate;
const newTokens = Math.min(config.bucketSize, currentTokens + accruedTokens);
if (newTokens < 1) {
const secondsToWait = Math.ceil((1 - newTokens) / config.fillRate * 60);
return { allowed: false, remaining: 0, retryAfter: secondsToWait };
}
// Consume one token
await redis
.multi()
.hmset(key, { tokens: String(newTokens - 1), lastRefill: String(now) })
.expire(key, 3600)
.exec();
return { allowed: true, remaining: Math.floor(newTokens - 1) };
}

Redis makes this trivially scalable. The key detail is writing token count and timestamp together in a single hmset, so the two fields never drift apart. Without that, concurrent requests create race conditions where two requests both think the bucket has tokens.
For production, wrap this in a Lua script for true atomic operations. The pattern above is correct for most load levels but will have occasional over-allowance under extreme concurrency.
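One way to do the Lua version with ioredis is `defineCommand`, which registers a script as a custom command and runs the whole read-refill-consume cycle atomically on the server. A sketch under the same fill-rate/bucket-size parameters as above; the command name `tokenBucket` and the wrapper are my own:

```typescript
// The full read-refill-consume cycle as one server-side script, so there is
// no race window between reading the bucket state and writing it back.
export const TOKEN_BUCKET_LUA = `
local fill = tonumber(ARGV[1])   -- tokens per minute
local size = tonumber(ARGV[2])   -- max burst
local now  = tonumber(ARGV[3])   -- current time, ms
local state = redis.call('HMGET', KEYS[1], 'tokens', 'lastRefill')
local tokens = tonumber(state[1]) or size
local last   = tonumber(state[2]) or now
tokens = math.min(size, tokens + ((now - last) / 60000) * fill)
if tokens < 1 then
  return {0, tostring(tokens)}
end
redis.call('HMSET', KEYS[1], 'tokens', tokens - 1, 'lastRefill', now)
redis.call('EXPIRE', KEYS[1], 3600)
return {1, tostring(tokens - 1)}
`;

// Minimal structural type so this sketch compiles without the ioredis package.
interface ScriptableRedis {
  defineCommand(name: string, def: { numberOfKeys: number; lua: string }): void;
}

// Call once at startup; ioredis then exposes client.tokenBucket(key, fill, size, now).
export function registerTokenBucket(client: ScriptableRedis): void {
  client.defineCommand('tokenBucket', { numberOfKeys: 1, lua: TOKEN_BUCKET_LUA });
}
```

After registration, the per-request check collapses to a single round trip: `await (redis as any).tokenBucket(`rl:${userId}`, config.fillRate, config.bucketSize, Date.now())`, with no over-allowance under concurrency.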
Token bucket handles request rate. Quotas handle total usage over time.
This distinction matters enormously for AI applications. A user might stay well within your per-minute rate limits while consuming massive amounts of tokens per request. Ten requests per minute, each with a 100,000-token context window, adds up fast. Rate limits catch burst attacks. Quotas catch expensive-but-polite usage.
Track both request counts and token consumption. Display both to users in their dashboard. People who can see their usage self-manage. People who cannot see their usage blast through limits and then complain about being cut off without warning.
The structure I recommend:
interface UsageRecord {
userId: string;
requestCount: number;
tokenCount: number; // total tokens (input + output)
costEstimate: number; // in USD cents
windowStart: Date; // when this window started
windowType: 'daily' | 'monthly';
}
const QUOTA_LIMITS = {
free: {
daily: { requests: 50, tokens: 100_000, costCents: 20 },
monthly: { requests: 500, tokens: 1_000_000, costCents: 150 },
},
pro: {
daily: { requests: 500, tokens: 2_000_000, costCents: 400 },
monthly: { requests: 10000, tokens: 40_000_000, costCents: 6000 },
},
};

Set daily and monthly quotas per tier. Daily quotas prevent a single bad day from exhausting a monthly budget. Monthly quotas provide the overall cost ceiling.
When a user hits 80% of their quota, send a notification. Not an error. A heads-up. When they hit 100%, downgrade to a restricted mode rather than cutting them off entirely. Completely blocking a user creates support tickets and churn. Instead, reduce their context window size, throttle response length, or queue their requests with lower priority.
The goal is graceful degradation, not hard cutoffs. A user who experiences slower responses stays. A user who gets unexplained errors cancels and tweets about it.
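One way to sketch that degradation policy as code; the thresholds and settings below are illustrative defaults, not recommendations:

```typescript
// Map quota utilization to a service mode instead of a binary allow/deny.
type DegradationMode = 'normal' | 'warn' | 'restricted';

interface RequestSettings {
  maxContextTokens: number;
  maxOutputTokens: number;
  priority: 'normal' | 'low';
}

export function degradeGracefully(quotaUsedFraction: number): {
  mode: DegradationMode;
  settings: RequestSettings;
} {
  const full: RequestSettings = { maxContextTokens: 100_000, maxOutputTokens: 4096, priority: 'normal' };
  if (quotaUsedFraction < 0.8) return { mode: 'normal', settings: full };
  // 80-100%: notify the user, but do not restrict anything yet
  if (quotaUsedFraction < 1.0) return { mode: 'warn', settings: full };
  // Over quota: shrink context, cap output, deprioritize - never hard-block
  return {
    mode: 'restricted',
    settings: { maxContextTokens: 8_000, maxOutputTokens: 1024, priority: 'low' },
  };
}
```

The 'warn' mode is where the 80% notification fires; the 'restricted' settings implement the slower-but-alive experience instead of an error.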
Implement a cost estimation step before processing. Before calling the AI provider, estimate the token count and display the approximate cost. Users who see "this request will consume approximately 50,000 tokens" make different decisions than users operating with no visibility.
// Rough token estimation before making the actual call
function estimateTokenCost(prompt: string, maxOutputTokens: number): {
estimatedInputTokens: number;
estimatedOutputTokens: number;
estimatedCostCents: number;
} {
// Claude tokenization is roughly 4 chars per token
const estimatedInputTokens = Math.ceil(prompt.length / 4);
const estimatedOutputTokens = maxOutputTokens;
// claude-sonnet-4-6 pricing: $3/MTok input, $15/MTok output
const inputCost = (estimatedInputTokens / 1_000_000) * 300; // cents
const outputCost = (estimatedOutputTokens / 1_000_000) * 1500; // cents
return {
estimatedInputTokens,
estimatedOutputTokens,
estimatedCostCents: Math.ceil(inputCost + outputCost),
};
}

Rate limiting and quotas handle normal abuse scenarios. Circuit breakers handle the catastrophic ones.
A circuit breaker monitors system health and automatically stops processing when something is clearly wrong. In the context of AI API costs, the trigger is abnormal spending velocity.
Here is the pattern:
interface CircuitBreakerState {
status: 'closed' | 'open' | 'half-open';
openedAt?: number;
failureCount: number;
spendingRate: number; // USD per hour
}
async function checkCircuitBreaker(): Promise<boolean> {
const state = await redis.get('circuit_breaker:ai_spending');
const parsed: CircuitBreakerState = state
? JSON.parse(state)
: { status: 'closed', failureCount: 0, spendingRate: 0 };
if (parsed.status === 'open') {
// Check if enough time has passed to try half-open
const openDuration = Date.now() - (parsed.openedAt || 0);
if (openDuration > 5 * 60 * 1000) { // 5 minutes
parsed.status = 'half-open';
await redis.set('circuit_breaker:ai_spending', JSON.stringify(parsed));
return true; // Allow one test request
}
return false; // Circuit is open, block all requests
}
return true; // Circuit closed or half-open, allow request
}
async function recordSpending(costCents: number): Promise<void> {
const hourlySpending = await getHourlySpendingRate(); // assumed helper; should include the costCents just recorded
// If spending rate exceeds 3x normal, open the circuit
const normalHourlyRate = await getNormalHourlyRate(); // historical average
if (hourlySpending > normalHourlyRate * 3) {
const state: CircuitBreakerState = {
status: 'open',
openedAt: Date.now(),
failureCount: 0,
spendingRate: hourlySpending,
};
await redis.set('circuit_breaker:ai_spending', JSON.stringify(state));
await alertEngineering(`Circuit breaker opened. Spending rate: $${hourlySpending}/hour`);
}
}

The circuit breaker trips when your AI spending rate exceeds 3x the normal rate. In production, require the elevated rate to persist for ten minutes or so before tripping, so a brief legitimate spike does not halt processing. Once open, the breaker automatically halts non-critical AI processing, keeps essential features running, and alerts engineering.
This is your emergency stop. Rate limits are the normal safeguard. The circuit breaker is what saves you when someone finds a bug in your rate limiting code.
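The check above only decides whether a request may proceed; the breaker also needs transitions out of half-open. A pure sketch of that state machine, with my own helper names rather than the Redis-backed code above (`spendingOk` means the hourly rate is back under the 3x threshold):

```typescript
// Pure state-transition sketch for the circuit breaker.
type BreakerStatus = 'closed' | 'open' | 'half-open';

interface BreakerState {
  status: BreakerStatus;
  openedAt?: number;
}

export function nextBreakerState(
  state: BreakerState,
  spendingOk: boolean,
  now: number
): BreakerState {
  if (state.status === 'half-open') {
    // The single probe request completed: close on success, re-open otherwise
    return spendingOk ? { status: 'closed' } : { status: 'open', openedAt: now };
  }
  if (state.status === 'closed' && !spendingOk) {
    return { status: 'open', openedAt: now };
  }
  // 'open' stays open (the timed move to half-open happens in checkCircuitBreaker),
  // and 'closed' with healthy spending stays closed
  return state;
}
```

Keeping the transitions pure makes them trivial to unit-test, which matters for the one piece of code that is supposed to save you when everything else fails.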
Sophisticated abusers learn your rate limits. They stay just under the threshold while still extracting value. Anomaly detection catches behavior that is technically within limits but clearly abnormal.
Patterns worth monitoring:
Sudden usage spikes. A user who averaged 50 requests per day for two months suddenly sending 5,000 is suspicious. Compromised API key, bot takeover, or an accidentally deployed loop.
Off-hours activity. A business account sending thousands of requests at 3am probably has an automated process running, which may or may not be intentional.
Unusual prompt patterns. Requests that are all nearly identical (scraping), or requests that are structured to maximize output length (cost farming), or prompts that look like injection attempts.
Cost per request outliers. If 99% of requests cost under 10 cents and one request costs $5, that is worth investigating.
interface UsageAnomaly {
userId: string;
type: 'spike' | 'off-hours' | 'high-cost' | 'pattern-match';
severity: 'low' | 'medium' | 'high';
details: string;
detectedAt: Date;
}

// Minimal request shape (left implicit in the original)
interface AIRequest {
prompt: string;
maxTokens: number;
}

async function detectAnomalies(userId: string, request: AIRequest): Promise<UsageAnomaly[]> {
const anomalies: UsageAnomaly[] = [];
const history = await getUserHistory(userId, 7); // 7 days of history
// Check for usage spike
const todayCount = await getTodayRequestCount(userId);
const avgDailyCount = history.avgDailyRequests;
if (todayCount > avgDailyCount * 10) {
anomalies.push({
userId,
type: 'spike',
severity: 'high',
details: `${todayCount} requests today vs ${avgDailyCount} average`,
detectedAt: new Date(),
});
}
// Check for high-cost individual request
const estimatedCost = estimateTokenCost(request.prompt, request.maxTokens);
if (estimatedCost.estimatedCostCents > 500) { // $5 per single request
anomalies.push({
userId,
type: 'high-cost',
severity: 'medium',
details: `Estimated cost: $${estimatedCost.estimatedCostCents / 100}`,
detectedAt: new Date(),
});
}
return anomalies;
}

Do not automatically block on anomalies. Automatic blocks cause false positives and generate support tickets. Instead, flag for review, throttle automatically, and notify the user that their account is under review. Most anomalies are accidents.
All of the above assumes your rate limiting code works perfectly. It will not. Every system has bugs. Every developer has tired late-night coding sessions.
Set hard spending limits per API key at the provider level. This is your last line of defense. If everything else fails, the hard limit prevents catastrophic bills.
Set it at 2x your expected maximum monthly cost. Check it monthly and adjust upward as you grow. Every major AI provider offers this in its billing settings.
Do this right now, before you finish reading this article. It takes five minutes and will save you real money at some point.
Also set up email alerts at 50%, 75%, and 90% of your monthly budget. You want to know about abnormal spending with enough time to investigate before hitting the cap.
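A hedged sketch of the once-per-threshold alert check; the 50/75/90 numbers come from the paragraph above, everything else is illustrative:

```typescript
// Hypothetical helper: report which budget thresholds a new charge just
// crossed, so each alert fires exactly once per billing window.
const ALERT_THRESHOLDS = [0.5, 0.75, 0.9];

export function crossedThresholds(
  prevSpendCents: number,
  newSpendCents: number,
  budgetCents: number
): number[] {
  return ALERT_THRESHOLDS.filter(
    (t) => prevSpendCents < budgetCents * t && newSpendCents >= budgetCents * t
  );
}
```

For example, `crossedThresholds(5500, 9200, 10_000)` returns `[0.75, 0.9]`: send those two alerts, then record the new spend so the next request compares against it.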
If you are starting from scratch, here is the sequence that maximizes protection per hour of engineering time:
Day one: Hard spending limits at the provider level. Emergency brake. Non-negotiable.
Week one: Token bucket rate limiting per user. Redis-backed. Prevents accidental abuse and runaway loops. This alone prevents 90% of financial incidents.
Week two: Per-user quotas with usage dashboards. Daily and monthly limits. Usage visibility for users. This handles the expensive-but-slow abuse cases.
Month one: Anomaly detection and circuit breakers. Cost estimation per request. Spending velocity alerts.
Month two: Tiered limits based on user plans. Graduated degradation instead of hard cutoffs. Detailed analytics on usage patterns. Fine-tuning based on real data.
I have seen teams skip this sequence and build everything at once. They spend three months building sophisticated anomaly detection while never setting provider-level spending limits. Those teams are the ones with financial emergencies.
Start with the emergency brake. Add sophistication incrementally.
The hard spending limit at the provider level has personally saved me from at least three incidents where a bug or a bot would have generated thousands in unexpected costs. Set it first. Everything else is optimization.
Rate limiting is only as good as the feedback you give users when they hit it. Bad feedback creates support tickets. Good feedback creates self-managing users.
Return proper HTTP status codes: 429 for rate limit exceeded, with Retry-After headers. Include in the response body: current usage, limit, when it resets, and for quota exhaustion, what they need to upgrade to.
// What a good rate limit response looks like
const rateLimitResponse = {
error: 'rate_limit_exceeded',
message: 'You have exceeded your request rate limit.',
details: {
limit: 60,
remaining: 0,
resetAt: '2026-02-15T14:32:00Z',
retryAfter: 45, // seconds
upgradeUrl: 'https://yourapp.com/pricing',
},
};
// For quota exhaustion
const quotaExhaustedResponse = {
error: 'quota_exceeded',
message: 'You have used 100% of your monthly token quota.',
details: {
used: 1_000_000,
limit: 1_000_000,
resetsAt: '2026-03-01T00:00:00Z',
plan: 'free',
upgradeUrl: 'https://yourapp.com/pricing',
},
};

Users who understand why they are being limited and know exactly when they can retry do not file support tickets. Users who see a cryptic error do.
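Wiring those bodies to actual responses means setting the headers too. A framework-agnostic sketch; the `RateLimitResult` shape is an assumption, and the `X-RateLimit-*` header names are a common convention rather than a standard:

```typescript
// Turn a limiter decision into status, headers, and body.
interface RateLimitResult {
  allowed: boolean;
  limit: number;
  remaining: number;
  resetAt: string;     // ISO timestamp
  retryAfter?: number; // seconds
}

export function toHttpResponse(result: RateLimitResult): {
  status: number;
  headers: Record<string, string>;
  body?: object;
} {
  const headers: Record<string, string> = {
    'X-RateLimit-Limit': String(result.limit),
    'X-RateLimit-Remaining': String(result.remaining),
    'X-RateLimit-Reset': result.resetAt,
  };
  if (result.allowed) return { status: 200, headers };
  // 429 with Retry-After tells well-behaved clients exactly when to come back
  headers['Retry-After'] = String(result.retryAfter ?? 60);
  return {
    status: 429,
    headers,
    body: {
      error: 'rate_limit_exceeded',
      message: 'You have exceeded your request rate limit.',
      details: {
        limit: result.limit,
        remaining: 0,
        resetAt: result.resetAt,
        retryAfter: result.retryAfter,
      },
    },
  };
}
```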
A production rate limiting stack for an AI API looks like this: hard spending caps at the provider level as the backstop, a Redis-backed token bucket check per user, daily and monthly quota checks, per-request cost estimation, a spending-velocity circuit breaker, and anomaly detection running alongside.
Every step adds latency. Keep it under 5ms total for the rate limiting layer. Redis operations are sub-millisecond. The anomaly detection can run asynchronously after the request is dispatched to avoid blocking.
For AI applications with security requirements, the rate limiting layer also serves as the first defense against prompt injection attacks and API key abuse.
For monitoring your AI systems, connect your rate limiting metrics to your observability stack. Alerts on unusual patterns, dashboards showing quota utilization, and cost attribution per user are all valuable.
Q: What is API rate limiting for AI applications?
API rate limiting controls how many AI API calls your application makes within time windows to prevent cost emergencies, stay within provider limits, and ensure fair resource allocation. For AI applications, this is critical because a single runaway loop can generate thousands of expensive API calls in minutes.
Q: What rate limiting patterns work best for AI APIs?
Use token bucket for bursty AI workloads (allowing short bursts while enforcing long-term limits), sliding window for consistent rate enforcement, per-user limits for multi-tenant applications, and circuit breakers that stop all calls when error rates spike. Implement at both the application and infrastructure layers.
Q: How do you prevent AI cost emergencies?
Prevent cost emergencies through hard spending caps per day/month, per-request cost tracking, alerts at 50% and 80% of budget thresholds, automatic degradation to cheaper models when nearing limits, and circuit breakers that halt AI calls when anomalous usage is detected.