Written by Gareth Simono, Founder and CEO of Agentik {OS}. Full-stack developer and AI architect with years of experience shipping production applications across SaaS, mobile, and enterprise platforms. Gareth orchestrates 267 specialized AI agents to deliver production software 10x faster than traditional development teams.
Calling an AI API in a loop is not a production pattern. Here are the architectural patterns that handle rate limits, failures, and costs.

The difference between a toy AI integration and a production one is how you handle the 20% of the time when things go wrong.
Rate limits hit. Models time out. Costs explode because a loop ran longer than expected. Concurrent requests corrupt shared state. These are not edge cases. They are the normal operating conditions of any AI system that handles real traffic at scale.
Most tutorials show you the happy path. This guide shows you the production patterns.
AI APIs have different failure characteristics than typical REST APIs.
Latency is variable and high. A simple database query takes 5-50ms. An LLM completion might take 2-30 seconds depending on output length and model load. Patterns designed for low-latency APIs do not translate.
Errors are often transient. Rate limits reset. Overloaded servers recover. Model degradation is temporary. Retry logic matters more than with databases or simple APIs.
Costs are consumption-based and unbounded. A bug in a loop can generate a $5,000 API bill in minutes. Cost controls need to be baked into the integration, not bolted on later.
Outputs are stochastic. The same input produces different outputs across calls. Caching strategies need to account for acceptable similarity thresholds, not exact key matching.
Each of these differences requires a specific pattern.
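The latency point deserves its own guard before any of the patterns below: every call needs a deadline. A minimal sketch of a per-request timeout wrapper (the `withTimeout` name and the 30-second default are illustrative, not part of any SDK):

```typescript
// Reject any promise that takes longer than timeoutMs.
// Hypothetical helper; tune the default to your own latency budget.
async function withTimeout<T>(
  fn: () => Promise<T>,
  timeoutMs: number = 30_000
): Promise<T> {
  let timer!: ReturnType<typeof setTimeout>;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(
      () => reject(new Error(`Timed out after ${timeoutMs}ms`)),
      timeoutMs
    );
  });
  try {
    // Whichever settles first wins; the loser is ignored.
    return await Promise.race([fn(), timeout]);
  } finally {
    clearTimeout(timer);
  }
}
```

Wrapping every AI call this way turns an unbounded 30-second hang into a typed error your retry logic can handle.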
When an AI API is degraded, naive code retries indefinitely. Each retry adds load to an already struggling service, making recovery slower. The circuit breaker stops calling a failing service and resumes cautiously after a timeout.
enum CircuitState {
CLOSED, // Normal operation
OPEN, // Failing, reject calls immediately
HALF_OPEN, // Testing recovery
}
class AICircuitBreaker {
private state = CircuitState.CLOSED;
private failureCount = 0;
private lastFailureTime?: number;
private halfOpenSuccesses = 0;
constructor(
private readonly failureThreshold: number = 5,
private readonly recoveryTimeoutMs: number = 60_000,
private readonly halfOpenSuccessThreshold: number = 2
) {}
async execute<T>(fn: () => Promise<T>): Promise<T> {
if (this.state === CircuitState.OPEN) {
const elapsed = Date.now() - (this.lastFailureTime ?? 0);
if (elapsed < this.recoveryTimeoutMs) {
throw new Error(
`Circuit OPEN. Service unavailable. Retry after ${
Math.ceil((this.recoveryTimeoutMs - elapsed) / 1000)
}s.`
);
}
// Timeout elapsed: try cautiously
this.state = CircuitState.HALF_OPEN;
this.halfOpenSuccesses = 0;
}
try {
const result = await fn();
this.recordSuccess();
return result;
} catch (error) {
this.recordFailure();
throw error;
}
}
private recordSuccess() {
if (this.state === CircuitState.HALF_OPEN) {
this.halfOpenSuccesses++;
if (this.halfOpenSuccesses >= this.halfOpenSuccessThreshold) {
this.state = CircuitState.CLOSED;
this.failureCount = 0;
}
} else {
this.failureCount = Math.max(0, this.failureCount - 1);
}
}
private recordFailure() {
this.failureCount++;
this.lastFailureTime = Date.now();
if (
this.state === CircuitState.HALF_OPEN ||
this.failureCount >= this.failureThreshold
) {
this.state = CircuitState.OPEN;
}
}
}

The circuit breaker protects your system from cascading failures and gives a degraded service time to recover without being overwhelmed.
When the circuit is closed but a call fails, retry with increasing delays. The jitter prevents thundering herd: when many clients fail simultaneously and all retry at the same interval, they hit the service in synchronized waves.
async function withExponentialBackoff<T>(
fn: () => Promise<T>,
options: {
maxAttempts?: number;
initialDelayMs?: number;
maxDelayMs?: number;
retryableStatuses?: number[];
} = {}
): Promise<T> {
const {
maxAttempts = 4,
initialDelayMs = 1_000,
maxDelayMs = 32_000,
retryableStatuses = [429, 500, 502, 503, 529],
} = options;
let lastError: Error;
for (let attempt = 0; attempt < maxAttempts; attempt++) {
try {
return await fn();
} catch (error: any) {
lastError = error;
const isRetryable =
retryableStatuses.includes(error?.status) ||
error?.message?.includes("timeout") ||
error?.message?.includes("network");
if (!isRetryable || attempt === maxAttempts - 1) {
throw error;
}
// Full jitter: random delay between 0 and the exponential cap
const exponentialCap = Math.min(
initialDelayMs * Math.pow(2, attempt),
maxDelayMs
);
const delay = Math.random() * exponentialCap;
console.log(
`Attempt ${attempt + 1} failed (${
error?.status || error?.message
}). Retrying in ${Math.round(delay)}ms`
);
await sleep(delay);
}
}
throw lastError!;
}
const sleep = (ms: number) => new Promise(r => setTimeout(r, ms));

Firing 50 simultaneous LLM requests hits rate limits immediately and produces unpredictable latency. A queue with controlled concurrency smooths the load.
class AIRequestQueue {
private queue: Array<{
fn: () => Promise<any>;
resolve: (value: any) => void;
reject: (error: any) => void;
}> = [];
private running = 0;
constructor(private readonly maxConcurrent: number) {}
async add<T>(fn: () => Promise<T>): Promise<T> {
return new Promise<T>((resolve, reject) => {
this.queue.push({ fn, resolve, reject });
this.processNext();
});
}
private processNext() {
while (this.running < this.maxConcurrent && this.queue.length > 0) {
const task = this.queue.shift()!;
this.running++;
task.fn()
.then(task.resolve)
.catch(task.reject)
.finally(() => {
this.running--;
this.processNext();
});
}
}
get queueDepth(): number {
return this.queue.length;
}
get activeRequests(): number {
return this.running;
}
}
// Process 500 items with controlled concurrency
const queue = new AIRequestQueue(8); // Max 8 concurrent requests
const results = await Promise.all(
largeItemList.map(item =>
queue.add(() =>
withExponentialBackoff(() => callAnthropicAPI(item))
)
)
);
console.log(`Queue depth: ${queue.queueDepth}, Active: ${queue.activeRequests}`);

Waiting for a 2,000-word AI response before showing anything produces a terrible user experience. Stream the response as it generates.
import Anthropic from "@anthropic-ai/sdk";
const anthropic = new Anthropic();
async function* streamAIResponse(
prompt: string,
systemPrompt?: string
): AsyncGenerator<string, void, unknown> {
const stream = anthropic.messages.stream({
model: "claude-sonnet-4-6",
max_tokens: 2000,
system: systemPrompt,
messages: [{ role: "user", content: prompt }],
});
for await (const event of stream) {
if (
event.type === "content_block_delta" &&
event.delta.type === "text_delta"
) {
yield event.delta.text;
}
}
}
// Next.js App Router streaming response
export async function GET(request: Request) {
const { searchParams } = new URL(request.url);
const prompt = searchParams.get("prompt") ?? "";
const encoder = new TextEncoder();
const stream = new ReadableStream({
async start(controller) {
try {
for await (const chunk of streamAIResponse(prompt)) {
controller.enqueue(encoder.encode(chunk));
}
// Close only on success; closing after error() throws
controller.close();
} catch (error) {
controller.error(error);
}
},
});
return new Response(stream, {
headers: {
"Content-Type": "text/plain; charset=utf-8",
"Transfer-Encoding": "chunked",
"X-Content-Type-Options": "nosniff",
},
});
}

Some AI calls receive near-identical inputs repeatedly. Standard key-value caching misses these because the text is not exactly identical. Semantic caching returns cached responses for similar but not identical queries.
class SemanticResponseCache {
private entries: Array<{
queryEmbedding: number[];
originalQuery: string;
response: string;
hitCount: number;
cachedAt: Date;
}> = [];
constructor(
private readonly similarityThreshold: number = 0.95,
private readonly maxEntries: number = 1000,
private readonly ttlMs: number = 24 * 60 * 60 * 1000 // 24 hours
) {}
async get(query: string): Promise<string | null> {
// embedText wraps an embeddings API; cosineSimilarity compares vectors
const queryEmbedding = await embedText(query);
const now = Date.now();
// Remove expired entries
this.entries = this.entries.filter(
e => now - e.cachedAt.getTime() < this.ttlMs
);
// Find best semantic match above threshold
let bestMatch = null;
let bestScore = 0;
for (const entry of this.entries) {
const score = cosineSimilarity(queryEmbedding, entry.queryEmbedding);
if (score > this.similarityThreshold && score > bestScore) {
bestScore = score;
bestMatch = entry;
}
}
if (bestMatch) {
bestMatch.hitCount++;
console.log(`Cache hit (score: ${bestScore.toFixed(3)}): "${bestMatch.originalQuery}"`);
return bestMatch.response;
}
return null;
}
async set(query: string, response: string): Promise<void> {
const queryEmbedding = await embedText(query);
// Evict oldest if at capacity
if (this.entries.length >= this.maxEntries) {
this.entries.sort((a, b) => a.cachedAt.getTime() - b.cachedAt.getTime());
this.entries.splice(0, Math.floor(this.maxEntries * 0.1));
}
this.entries.push({
queryEmbedding,
originalQuery: query,
response,
hitCount: 0,
cachedAt: new Date(),
});
}
}

For a customer support bot, "What is your return policy?" and "Can you explain how returns work?" will share a cached response. That is a significant cost reduction on high-traffic, FAQ-adjacent queries.
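The cache relies on two helpers that are not shown. `cosineSimilarity` is plain vector math; `embedText` would wrap whichever embeddings API you use (the commented call shape is illustrative, not a real client):

```typescript
// Cosine similarity between two equal-length embedding vectors.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  return denom === 0 ? 0 : dot / denom;
}

// embedText is a sketch -- endpoint and response shape depend on the provider:
// async function embedText(text: string): Promise<number[]> {
//   const res = await embeddingsClient.embed({ input: text });
//   return res.embedding;
// }
```

Identical directions score 1.0, orthogonal queries score 0, which is why the 0.95 threshold above catches rephrasings without conflating unrelated questions.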
AI API costs can grow unexpectedly. Build hard limits at every level.
class CostAwareAPIClient {
private sessionTokens = 0;
private sessionCost = 0;
// Approximate costs per million tokens (adjust as rates change)
private readonly COSTS_PER_MILLION: Record<string, { input: number; output: number }> = {
"claude-opus-4-6": { input: 15, output: 75 },
"claude-sonnet-4-6": { input: 3, output: 15 },
"claude-haiku-4-5": { input: 0.25, output: 1.25 },
};
constructor(
private readonly maxSessionCost: number = 5.00,
private readonly maxRequestCost: number = 0.50
) {}
async call(
params: Anthropic.MessageCreateParams
): Promise<Anthropic.Message> {
// Estimate cost before calling (rough heuristic: ~4 characters per token)
const estimatedInputTokens = Math.ceil(
JSON.stringify(params.messages).length / 4
);
const costs = this.COSTS_PER_MILLION[params.model] ??
this.COSTS_PER_MILLION["claude-sonnet-4-6"];
const estimatedCost = (estimatedInputTokens / 1_000_000) * costs.input;
if (estimatedCost > this.maxRequestCost) {
throw new Error(
`Estimated request cost ($${estimatedCost.toFixed(4)}) exceeds limit ($${this.maxRequestCost})`
);
}
if (this.sessionCost + estimatedCost > this.maxSessionCost) {
throw new Error(
`Session budget exhausted. Spent: $${this.sessionCost.toFixed(4)}, limit: $${this.maxSessionCost}`
);
}
const response = await anthropic.messages.create(params);
// Track actual costs
const actualInputTokens = response.usage.input_tokens;
const actualOutputTokens = response.usage.output_tokens;
const actualCost =
(actualInputTokens / 1_000_000) * costs.input +
(actualOutputTokens / 1_000_000) * costs.output;
this.sessionTokens += actualInputTokens + actualOutputTokens;
this.sessionCost += actualCost;
return response;
}
getSessionStats() {
return {
totalTokens: this.sessionTokens,
totalCost: this.sessionCost,
budgetRemaining: this.maxSessionCost - this.sessionCost,
};
}
}

These patterns compose into a single production call path:

const breaker = new AICircuitBreaker(5, 60_000);
const queue = new AIRequestQueue(8);
const cache = new SemanticResponseCache(0.95);
const costClient = new CostAwareAPIClient(10.00, 0.75);
async function productionAICall(
prompt: string,
options: { useCache?: boolean; model?: string } = {}
): Promise<string> {
const { useCache = true, model = "claude-sonnet-4-6" } = options;
// Check semantic cache first
if (useCache) {
const cached = await cache.get(prompt);
if (cached) return cached;
}
// Queue, circuit-break, and retry
const result = await queue.add(() =>
breaker.execute(() =>
withExponentialBackoff(() =>
costClient.call({
model,
max_tokens: 1000,
messages: [{ role: "user", content: prompt }],
})
)
)
);
const text = result.content[0].type === "text" ? result.content[0].text : "";
// Cache the result
if (useCache) await cache.set(prompt, text);
return text;
}

For the broader architectural patterns that govern how agents use APIs throughout complex multi-step workflows, see tool use patterns for AI agents.
Instrumentation is not optional at scale.
const metrics = {
callsTotal: 0,
callsSucceeded: 0,
callsFailed: 0,
cacheHits: 0,
circuitBreakerTrips: 0,
totalLatencyMs: 0,
totalCostUSD: 0,
};
// Track these metrics and alert on:
// - Error rate > 5% over 5 minutes
// - p99 latency > 30 seconds
// - Hourly cost > $50
// - Circuit breaker trips > 3 in 10 minutes

Monitoring AI-driven applications covers the full observability stack.
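A thin wrapper can keep those counters current without touching call sites. A sketch, assuming a metrics object shaped like the one above (the `instrumented` name is illustrative):

```typescript
// Minimal metrics shape, matching the counters tracked above.
type Metrics = {
  callsTotal: number;
  callsSucceeded: number;
  callsFailed: number;
  totalLatencyMs: number;
};

// Hypothetical instrumentation wrapper: records outcome and latency
// for every call it wraps, then re-throws failures unchanged.
async function instrumented<T>(
  metrics: Metrics,
  fn: () => Promise<T>
): Promise<T> {
  const start = Date.now();
  metrics.callsTotal++;
  try {
    const result = await fn();
    metrics.callsSucceeded++;
    return result;
  } catch (error) {
    metrics.callsFailed++;
    throw error;
  } finally {
    metrics.totalLatencyMs += Date.now() - start;
  }
}
```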
Q: What are the best patterns for AI API integration?
Best patterns include retry with exponential backoff (handling transient failures), circuit breakers (preventing cascade failures), request queuing (managing rate limits), response caching (reducing costs and latency), and fallback chains (degrading to simpler models when primary fails). Layer these for robust production integrations.
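The fallback chain is the one pattern in that list not shown earlier. A minimal sketch: try each model in order, degrading to cheaper ones (the `callModel` signature is illustrative, standing in for your API client):

```typescript
// Try each model in order; return the first successful result.
// Hypothetical callModel(model, prompt) stands in for a real client call.
async function withFallbackChain(
  prompt: string,
  models: string[],
  callModel: (model: string, prompt: string) => Promise<string>
): Promise<string> {
  let lastError: unknown;
  for (const model of models) {
    try {
      return await callModel(model, prompt);
    } catch (error) {
      lastError = error;
      console.warn(`Model ${model} failed, falling back`);
    }
  }
  // Every model failed: surface the last error to the caller.
  throw lastError;
}
```

Ordering the chain from most to least capable trades quality for availability only when the primary model is actually down.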
Q: How do you handle AI API rate limits?
Handle rate limits through request queuing with priority levels, token bucket rate limiting at the application layer, automatic retries with backoff when limits are hit, pre-emptive throttling based on usage patterns, and fallback to cached responses or alternative models during rate limit periods.
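The token bucket mentioned here is straightforward to implement at the application layer. A minimal sketch (capacity and refill rate are illustrative; a production version would also need to be shared across processes):

```typescript
// Token bucket: allows bursts up to `capacity`, refills continuously
// at `refillPerSecond`. tryAcquire() is non-blocking.
class TokenBucket {
  private tokens: number;
  private lastRefill: number;

  constructor(
    private readonly capacity: number,
    private readonly refillPerSecond: number
  ) {
    this.tokens = capacity;
    this.lastRefill = Date.now();
  }

  tryAcquire(count: number = 1): boolean {
    // Lazily refill based on elapsed time, capped at capacity.
    const now = Date.now();
    const elapsedSeconds = (now - this.lastRefill) / 1000;
    this.tokens = Math.min(
      this.capacity,
      this.tokens + elapsedSeconds * this.refillPerSecond
    );
    this.lastRefill = now;
    if (this.tokens >= count) {
      this.tokens -= count;
      return true;
    }
    return false;
  }
}
```

Callers that fail to acquire a token can queue the request or fall back to cache rather than hitting the provider's limit and eating a 429.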
Q: What is the cost of AI API integration mistakes?
Common mistakes — missing retry logic, no rate limiting, unbounded loops, and missing error handling — can cause cost emergencies ($1000s in unexpected API charges), cascading failures, poor user experience, and data loss. Investing in robust integration patterns upfront prevents these issues and pays back immediately.