
Every agent demo works flawlessly.
Inputs are well-chosen. Services respond normally. Edge cases have been mentally excluded. Everything flows smoothly because the demo was designed to flow smoothly.
Every production agent fails constantly.
APIs time out. Models return malformed responses. Rate limits hit at the worst moments. Database connections drop. External services go dark at 2 AM without notice. Users send inputs nobody anticipated.
The difference between a successful production agent and a failed one isn't intelligence or architecture quality. It's whether the system was designed to handle failure as a constant rather than an exception.
This is that design.
Different failures need different responses. Applying the wrong recovery strategy to a failure makes it worse.
Transient failures are temporary by nature. Rate limits, network blips, momentary service unavailability. They resolve with time. Correct response: wait and retry.
Capacity failures occur when resources are exhausted. Token budget depleted. Model API at capacity. Database connection pool full. Won't resolve by waiting. Correct responses: queue, degrade gracefully, or prioritize.
Data failures originate in inputs or context. Malformed user input. Corrupted context from a previous step. Missing required information. Conflicting instructions. Correct responses: validate and transform, request clarification, or fail the step with a useful error.
Logic failures mean the agent did the wrong thing. Misunderstood the task. Called the wrong tool. Produced output violating constraints. Correct responses: reformulate and retry with adjusted instructions.
Catastrophic failures mean fundamental unavailability. Model API down for hours. Critical dependency gone. Correct responses: graceful degradation, operator notification, user communication.
enum FailureCategory {
  TRANSIENT = "transient",
  CAPACITY = "capacity",
  DATA = "data",
  LOGIC = "logic",
  CATASTROPHIC = "catastrophic",
}

function classifyFailure(error: AgentError): FailureCategory {
  if (error.code === "RATE_LIMIT" || error.code === "TIMEOUT") return FailureCategory.TRANSIENT;
  if (error.code === "BUDGET_EXHAUSTED") return FailureCategory.CAPACITY;
  if (error.code === "INVALID_INPUT" || error.code === "MISSING_CONTEXT") return FailureCategory.DATA;
  if (error.code === "CONSTRAINT_VIOLATION") return FailureCategory.LOGIC;
  return FailureCategory.CATASTROPHIC;
}

Misclassification is expensive. Treating a logic failure as transient burns retries on the same flawed approach. Get the taxonomy right first.
Naive retrying sends the exact same request again. Usually fails again for the same reason.
A retry hierarchy escalates through different recovery strategies:
Level 1: Retry with backoff. Same request, increasing delays. 1s, 2s, 4s, 8s. Handles transient failures. Maximum 3-5 attempts.
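A Level 1 loop can be sketched in a few lines; the delay schedule and the `sleep` helper here are illustrative, not a prescribed implementation:

```typescript
// Sketch: retry with exponential backoff. Delays double each attempt
// (1s, 2s, 4s, 8s with the defaults below).
const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

async function retryWithBackoff<T>(
  fn: () => Promise<T>,
  maxAttempts = 5,
  baseDelayMs = 1000,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await fn();
    } catch (error) {
      lastError = error;
      // No delay after the final attempt; just fall through and throw.
      if (attempt < maxAttempts - 1) {
        await sleep(baseDelayMs * 2 ** attempt);
      }
    }
  }
  throw lastError;
}
```

In production you would retry only errors classified as transient, and add jitter so many clients don't retry in lockstep.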
Level 2: Retry with reformulation. Same intent, different approach.
class ReformulatingRetryStrategy {
  async executeWithReformulation(task: AgentTask, maxAttempts: number = 3): Promise<AgentResult> {
    let instructions = task.instructions;
    let lastError: Error | undefined;
    for (let attempt = 0; attempt < maxAttempts; attempt++) {
      try {
        return await this.agent.execute({ ...task, instructions });
      } catch (error) {
        lastError = error as Error;
        // Analyze why this attempt failed and adjust the instructions
        // before retrying with the same intent.
        const analysis = await this.failureAnalyzer.analyze(error, task);
        instructions = this.instructionAdjuster.adjust(instructions, analysis.recommendation);
      }
    }
    throw new MaxRetriesExceededError(lastError);
  }
}

Level 3: Retry with model fallback. Same task, different model. Quality may differ, but a response beats no response for most use cases.
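A Level 3 fallback can be sketched as an ordered chain. The `CallModel` signature and the model names here are placeholders, not a real provider API:

```typescript
// Sketch: try models in order of preference until one responds.
type CallModel = (model: string, prompt: string) => Promise<string>;

async function completeWithFallback(
  callModel: CallModel,
  prompt: string,
  // Ordered from preferred to last resort; names are illustrative.
  models: string[] = ["primary-large", "secondary-large", "small-fast"],
): Promise<{ model: string; output: string }> {
  let lastError: unknown;
  for (const model of models) {
    try {
      return { model, output: await callModel(model, prompt) };
    } catch (error) {
      lastError = error; // Fall through to the next model in the chain.
    }
  }
  throw lastError;
}
```

Returning which model actually answered lets downstream code flag responses that came from a lower-quality fallback.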
Level 4: Retry with scope reduction. Cannot generate a comprehensive report? Generate a summary. Cannot complete all steps? Complete what you can and note what remains.
Level 5: Graceful failure. All retries exhausted. Return what was accomplished with a clear explanation: what was completed, what failed, and what the user can do next.
The hierarchy ensures maximum value delivery before final failure. Partial results with honest explanation beat generic errors.
class FailurePatternLearner {
  async getRecommendedRecovery(input: AgentInput): Promise<string | null> {
    const signature = this.computeSignature(input);
    const pattern = this.patternDatabase.get(signature);
    if (pattern && pattern.successfulRecoveries.length > 0) {
      return pattern.successfulRecoveries[pattern.successfulRecoveries.length - 1];
    }
    return null;
  }

  async recordOutcome(input: AgentInput, strategy: string, succeeded: boolean): Promise<void> {
    const signature = this.computeSignature(input);
    const pattern = this.patternDatabase.get(signature) ?? this.createPattern(signature);
    if (succeeded) pattern.successfulRecoveries.push(strategy);
    pattern.lastSeen = new Date();
    this.patternDatabase.set(signature, pattern);
  }
}

Inputs that consistently fail with a specific pattern get routed directly to the recovery strategy that works. This compounds as the pattern database grows.
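The class above leaves `computeSignature` undefined. One plausible scheme, sketched here as an assumption (the input shape and hashing choices are not from the original), is to hash a normalized view of the failing input so similar failures collide on the same key:

```typescript
import { createHash } from "node:crypto";

// Hypothetical signature: hash the task type, the error code, and the
// sorted input keys, so order-insensitive variants of the same failure
// map to one pattern-database entry.
function computeSignature(input: {
  taskType: string;
  errorCode?: string;
  inputKeys: string[];
}): string {
  const normalized = [
    input.taskType,
    input.errorCode ?? "none",
    [...input.inputKeys].sort().join(","),
  ].join("|");
  return createHash("sha256").update(normalized).digest("hex").slice(0, 16);
}
```

The signature should be coarse enough to group recurrences but fine enough that unrelated failures don't share a recovery recommendation.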
async function preemptivelyValidate(
  operation: AgentOperation,
  context: ExecutionContext
): Promise<ValidationResult> {
  const checks = await Promise.all([
    checkApiHealth(operation.requiredApis),
    validateInputBounds(operation.input, operation.expectedBounds),
    checkTokenBudget(operation.estimatedTokens, context.remainingBudget),
  ]);
  const failures = checks.filter(c => !c.passed);
  return {
    canProceed: failures.length === 0,
    blockers: failures,
    recommendation: failures.length > 0 ? generateRecommendation(failures) : undefined,
  };
}

Checking preconditions before attempting avoids predictable failures and produces much better error information when they do occur.
For long-running workflows, losing work to a failure is the most common source of user frustration.
class CheckpointingWorkflowEngine {
  private completedSteps: CompletedStep[] = [];
  private context: Record<string, unknown> = {};
  private results: StepResult[] = [];

  async execute(input: WorkflowInput, config: WorkflowConfig): Promise<WorkflowResult> {
    const workflowId = generateId();
    for (let stepIndex = 0; stepIndex < config.steps.length; stepIndex++) {
      try {
        await this.executeStep(config.steps[stepIndex], this.buildStepContext(stepIndex));
        // Persist progress after every successful step.
        await this.checkpointStore.save({
          workflowId,
          currentStep: stepIndex + 1,
          completedSteps: this.completedSteps,
          accumulatedContext: this.context,
          intermediateResults: this.results,
        });
      } catch (error) {
        await this.checkpointStore.saveFailure(workflowId, stepIndex, error);
        throw error;
      }
    }
    return this.buildFinalResult();
  }

  async resumeFrom(workflowId: string): Promise<WorkflowResult> {
    const checkpoint = await this.checkpointStore.loadLatest(workflowId);
    this.context = checkpoint.accumulatedContext;
    this.results = checkpoint.intermediateResults;
    return this.executeFromStep(checkpoint.currentStep);
  }
}

Checkpoint storage must survive process crashes. Database, message queue, or object storage. Not in-memory.
Design every step to be idempotent: if resuming re-executes a step, one that sends an email or charges a payment must not perform the side effect twice.
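A common way to get that guarantee is an idempotency key checked against durable storage before the side effect runs. This is a sketch under assumptions: the `IdempotencyStore` interface and key format are illustrative, and a real store would live in a database, not memory:

```typescript
// Sketch: skip a side effect if its idempotency key is already recorded.
interface IdempotencyStore {
  has(key: string): Promise<boolean>;
  record(key: string): Promise<void>;
}

async function runOnce(
  store: IdempotencyStore,
  key: string, // e.g. `${workflowId}:${stepIndex}:send-email` (illustrative format)
  sideEffect: () => Promise<void>,
): Promise<"executed" | "skipped"> {
  if (await store.has(key)) return "skipped"; // A resume re-ran this step.
  await sideEffect();
  await store.record(key); // Record only after the effect succeeds.
  return "executed";
}
```

Note the ordering trade-off: recording after the effect means a crash between the two can cause a duplicate, so truly critical effects (payments) also need provider-side idempotency support.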
When a dependency starts failing, continuing to attempt calls wastes resources and worsens recovery time.
class AgentCircuitBreaker {
  private state: "closed" | "open" | "half-open" = "closed";
  private failureCount = 0;
  private openedAt = 0;

  constructor(
    private dependencyName: string,
    private config: { failureThreshold: number; resetTimeoutMs: number },
  ) {}

  async execute<T>(operation: () => Promise<T>): Promise<T> {
    if (this.state === "open") {
      if (this.shouldAttemptReset()) {
        this.state = "half-open";
      } else {
        throw new CircuitOpenError(`${this.dependencyName} circuit is open`);
      }
    }
    try {
      const result = await operation();
      this.onSuccess();
      return result;
    } catch (error) {
      this.onFailure();
      throw error;
    }
  }

  private shouldAttemptReset(): boolean {
    return Date.now() - this.openedAt >= this.config.resetTimeoutMs;
  }

  private onSuccess() {
    this.failureCount = 0;
    this.state = "closed";
  }

  private onFailure() {
    this.failureCount++;
    // A failure in half-open reopens immediately; in closed, only after
    // the threshold is reached.
    if (this.state === "half-open" || this.failureCount >= this.config.failureThreshold) {
      this.state = "open";
      this.openedAt = Date.now();
    }
  }
}

Apply circuit breakers to every external dependency independently. Each has its own circuit tuned to normal error rates and calling costs. This pairs with scaling architecture patterns for production-grade resilience.
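A per-dependency setup can be sketched as a small registry. The `SimpleBreaker` here is a stripped-down stand-in for the class above, and the dependency names and thresholds are assumptions for illustration:

```typescript
// Sketch: one breaker per dependency, each with its own threshold.
interface BreakerConfig { failureThreshold: number; }

class SimpleBreaker {
  state: "closed" | "open" = "closed";
  private failures = 0;
  constructor(private config: BreakerConfig) {}
  recordFailure() {
    this.failures++;
    if (this.failures >= this.config.failureThreshold) this.state = "open";
  }
}

class BreakerRegistry {
  private breakers = new Map<string, SimpleBreaker>();

  // Returns the existing breaker for a dependency, or creates one.
  // Tune each threshold to that dependency's normal error rate.
  breakerFor(name: string, config: BreakerConfig = { failureThreshold: 5 }): SimpleBreaker {
    let breaker = this.breakers.get(name);
    if (!breaker) {
      breaker = new SimpleBreaker(config);
      this.breakers.set(name, breaker);
    }
    return breaker;
  }
}
```

Keeping breakers in one registry also gives you a single place to export circuit state to dashboards.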
How agents communicate failures matters as much as the technical recovery.
Bad: "An error occurred. Please try again."
Good: what was accomplished before the failure, what specifically failed, what the user can do now, whether retry is likely to help.
function buildFailureCommunication(
  completedSteps: CompletedStep[],
  totalSteps: number,
  failure: FailureRecord,
): FailureCommunication {
  return {
    summary: completedSteps.length > 0
      ? `I completed ${completedSteps.length} of ${totalSteps} steps before hitting an issue.`
      : "I wasn't able to get started on this task.",
    partialResults: extractPartialResults(completedSteps),
    explanation: translateToUserLanguage(failure),
    recommendation: generateUserRecommendation(failure),
    canRetry: failure.category !== FailureCategory.CATASTROPHIC,
  };
}

Users who receive clear failure communication know what happened and what to do next. Users who receive opaque errors file support tickets and churn.
Error recovery is untested in most agent systems. Happy path: extensive testing. Recovery paths: discovered in production.
describe("Rate limit recovery", () => {
  it("completes task after rate limit with backoff", async () => {
    const mockLLM = new RateLimitingMockLLM({ failFirstN: 2 });
    const agent = new ResilientAgent({ llm: mockLLM });
    const result = await agent.execute(testTask);
    expect(result.status).toBe("success");
    expect(mockLLM.totalAttempts).toBe(3);
  });

  it("returns partial results when all retries exhausted", async () => {
    const mockLLM = new PermanentlyFailingMockLLM();
    const agent = new ResilientAgent({ llm: mockLLM });
    const result = await agent.execute(multiStepTask);
    expect(result.status).toBe("partial_completion");
    expect(result.completedSteps.length).toBeGreaterThan(0);
    expect(result.failureExplanation).toBeDefined();
  });
});

Run recovery tests in CI/CD alongside functional tests. Monitor recovery metrics via agent observability.
The goal isn't an agent that never fails. Impossible.
The goal is an agent that fails so gracefully users barely notice. Partial results with clear explanation beat silent failures. Transparent retrying beats opaque delays. Honest "I couldn't complete this" beats confident wrong answers.
Design every failure mode as if a user is watching. What would you want them to see? Work backwards from that to the recovery strategy.
The agents that earn long-term trust fail gracefully when they fail and recover cleanly when they can. That's a harder design problem than an agent that works when conditions are perfect. It's also the problem that determines whether users stick around.
Q: How do AI agents recover from errors?
AI agents recover through a structured loop: detect the error, diagnose the root cause, attempt a fix, validate the fix works, and resume the original task. Sophisticated agents maintain checkpoints so they can revert to a known-good state rather than starting over. The recovery loop continues until success or a retry limit triggers human escalation.
Q: What error recovery patterns work best for AI agents?
The most effective patterns are retry with exponential backoff (transient failures), checkpoint-and-resume (long-running tasks), fallback to simpler approaches (model failures), human escalation (ambiguous errors), and circuit breakers (preventing cascade failures). Layer these patterns for comprehensive coverage.
Q: How many retries should an AI agent attempt before escalating?
Typically 3 retries for transient errors (API timeouts, rate limits) and 1-2 retries for logic errors (with different approaches each time). If the agent cannot resolve after retries, escalate to a human with full context: what was attempted, what failed, and why. Over-retrying wastes resources and delays resolution.
Full-stack developer and AI architect with years of experience shipping production applications across SaaS, mobile, and enterprise. Gareth built Agentik {OS} to prove that one person with the right AI system can outperform an entire traditional development team. He has personally architected and shipped 7+ production applications using AI-first workflows.
