Written by Gareth Simono, Founder and CEO of Agentik {OS}. Full-stack developer and AI architect with years of experience shipping production applications across SaaS, mobile, and enterprise platforms. Gareth orchestrates 267 specialized AI agents to deliver production software 10x faster than traditional development teams.
AI agents fail differently than traditional software. Silent hallucinations. Cost explosions. Loops. The monitoring setup that catches them first.

Traditional software fails loudly: exceptions, 500 errors, timeouts. The monitoring you built for it does not catch most AI agent failures.
AI agents fail silently. An agent that's hallucinating still returns 200 OK. An agent that's entered a reasoning loop still returns responses. An agent whose costs have exploded 10x still appears functional. Your Datadog dashboard is green while your AI system is doing something completely wrong.
This guide covers the monitoring layer you actually need for AI agents in production.
Understanding the failure modes shapes the monitoring strategy.
| Failure Type | Traditional Monitoring Catches It | What Catches It |
|---|---|---|
| API timeout | Yes | Traditional monitoring |
| Rate limit exceeded | Partially | Cost/rate monitoring |
| Hallucination | No | Output quality monitoring |
| Off-topic responses | No | Intent accuracy monitoring |
| Cost explosion | No | Cost monitoring |
| Reasoning loops | Partially | Latency + cost monitoring |
| Context window overflow | No | Input length monitoring |
| Prompt injection | No | Security monitoring |
| Model degradation | No | Quality trend monitoring |
Every "No" or "Partially" in the middle column is a gap. The rightmost column lists the monitoring that closes it.
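Reasoning loops, for instance, only show up indirectly through latency and cost. A cheaper, more direct signal is repetition: an agent that issues the same tool call with the same arguments several times in one turn is almost certainly stuck. A minimal sketch (the `ToolCall` shape and the threshold of 3 are illustrative assumptions, not part of any SDK):

```typescript
// Detect a likely reasoning loop by counting repeated identical tool calls.
interface ToolCall {
  name: string;
  input: unknown;
}

export function detectToolLoop(calls: ToolCall[], threshold = 3): boolean {
  const counts = new Map<string, number>();
  for (const call of calls) {
    // Serialize name + arguments so "same call, same args" collide on one key.
    const key = `${call.name}:${JSON.stringify(call.input)}`;
    const count = (counts.get(key) ?? 0) + 1;
    if (count >= threshold) return true;
    counts.set(key, count);
  }
  return false;
}
```

Run this over the tool calls collected per turn and escalate or abort when it fires, rather than waiting for the cost alert an hour later.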
Every AI agent interaction should be logged with enough data to diagnose failures and improve performance.
```typescript
// lib/ai-telemetry.ts
export interface AgentTelemetry {
  // Identity
  traceId: string;
  sessionId: string;
  agentId: string;
  userId?: string;
  // Input
  inputTokens: number;
  inputLength: number;
  systemPromptLength: number;
  contextDocuments: number; // For RAG agents
  // Processing
  model: string;
  temperature: number;
  latencyMs: number;
  retryCount: number;
  // Output
  outputTokens: number;
  outputLength: number;
  finishReason: "end_turn" | "max_tokens" | "stop_sequence" | "tool_use";
  toolsUsed: string[];
  toolCallCount: number;
  // Cost
  inputCostUsd: number;
  outputCostUsd: number;
  totalCostUsd: number;
  // Quality signals
  userFeedback?: "positive" | "negative";
  escalated: boolean;
  errorCode?: string;
  errorMessage?: string;
  // Timestamps
  timestamp: number;
}

// Cost calculation per model
const COST_PER_MILLION_TOKENS = {
  "claude-haiku-4-20250514": { input: 0.25, output: 1.25 },
  "claude-sonnet-4-20250514": { input: 3.0, output: 15.0 },
  "claude-opus-4-20250514": { input: 15.0, output: 75.0 },
} as const;

export function calculateCost(
  model: string,
  inputTokens: number,
  outputTokens: number
): { inputCostUsd: number; outputCostUsd: number; totalCostUsd: number } {
  const costs =
    COST_PER_MILLION_TOKENS[model as keyof typeof COST_PER_MILLION_TOKENS] ??
    { input: 3.0, output: 15.0 }; // Default to Sonnet pricing
  const inputCostUsd = (inputTokens / 1_000_000) * costs.input;
  const outputCostUsd = (outputTokens / 1_000_000) * costs.output;
  return {
    inputCostUsd,
    outputCostUsd,
    totalCostUsd: inputCostUsd + outputCostUsd,
  };
}
```

Wrap your AI calls to automatically collect telemetry:
```typescript
// lib/monitored-agent.ts
import Anthropic from "@anthropic-ai/sdk";
import { nanoid } from "nanoid";
import { calculateCost, AgentTelemetry } from "./ai-telemetry.js";

const anthropic = new Anthropic();

interface CreateMessageOptions {
  model: string;
  system?: string;
  messages: Anthropic.MessageParam[];
  max_tokens: number;
  temperature?: number;
  tools?: Anthropic.Tool[];
  agentId: string;
  sessionId: string;
  userId?: string;
}

export async function createMonitoredMessage(
  options: CreateMessageOptions,
  onTelemetry: (telemetry: AgentTelemetry) => Promise<void>
): Promise<Anthropic.Message> {
  const traceId = nanoid();
  const startTime = Date.now();
  let retryCount = 0;
  let lastError: Error | null = null;
  const { agentId, sessionId, userId, ...claudeOptions } = options;

  while (retryCount < 3) {
    try {
      const response = await anthropic.messages.create(claudeOptions);
      const latencyMs = Date.now() - startTime;
      const costs = calculateCost(
        claudeOptions.model,
        response.usage.input_tokens,
        response.usage.output_tokens
      );

      const toolsUsed: string[] = [];
      let toolCallCount = 0;
      for (const block of response.content) {
        if (block.type === "tool_use") {
          toolsUsed.push(block.name);
          toolCallCount++;
        }
      }

      const telemetry: AgentTelemetry = {
        traceId,
        sessionId,
        agentId,
        userId,
        inputTokens: response.usage.input_tokens,
        inputLength: claudeOptions.messages.reduce(
          (sum, m) => sum + (typeof m.content === "string" ? m.content.length : 0),
          0
        ),
        systemPromptLength: claudeOptions.system?.length ?? 0,
        contextDocuments: 0,
        model: claudeOptions.model,
        temperature: claudeOptions.temperature ?? 1,
        latencyMs,
        retryCount,
        outputTokens: response.usage.output_tokens,
        outputLength: response.content
          .filter(b => b.type === "text")
          .reduce((sum, b) => sum + (b as Anthropic.TextBlock).text.length, 0),
        finishReason: response.stop_reason as AgentTelemetry["finishReason"],
        toolsUsed,
        toolCallCount,
        ...costs,
        escalated: false,
        timestamp: Date.now(),
      };

      await onTelemetry(telemetry);
      return response;
    } catch (error) {
      lastError = error as Error;
      // Don't retry on non-transient errors
      if (error instanceof Anthropic.APIError && error.status === 400) {
        break;
      }
      retryCount++;
      if (retryCount < 3) {
        await new Promise(resolve => setTimeout(resolve, 1000 * retryCount));
      }
    }
  }

  // Log the failed attempt
  await onTelemetry({
    traceId,
    sessionId,
    agentId,
    userId,
    inputTokens: 0,
    inputLength: 0,
    systemPromptLength: 0,
    contextDocuments: 0,
    model: claudeOptions.model,
    temperature: claudeOptions.temperature ?? 1,
    latencyMs: Date.now() - startTime,
    retryCount,
    outputTokens: 0,
    outputLength: 0,
    finishReason: "end_turn",
    toolsUsed: [],
    toolCallCount: 0,
    inputCostUsd: 0,
    outputCostUsd: 0,
    totalCostUsd: 0,
    escalated: false,
    errorCode: lastError?.name,
    errorMessage: lastError?.message,
    timestamp: Date.now(),
  });
  throw lastError;
}
```

Cost explosions are one of the most common production AI failures. Set hard limits:
```typescript
// lib/cost-guard.ts
export interface CostLimits {
  perSessionUsd: number; // Max cost per conversation
  perUserDayUsd: number; // Max daily cost per user
  globalHourUsd: number; // Max hourly total cost
}

const DEFAULT_LIMITS: CostLimits = {
  perSessionUsd: 0.50, // $0.50 per conversation max
  perUserDayUsd: 2.00, // $2.00 per user per day
  globalHourUsd: 50.00, // $50 per hour total
};

export class CostGuard {
  // Note: these maps grow unbounded; evict finished sessions and past days in production.
  private sessionCosts = new Map<string, number>();
  private userDayCosts = new Map<string, number>();
  private globalHourlyCost = 0;
  private hourWindowStart = Date.now();

  check(
    sessionId: string,
    userId: string | undefined,
    cost: number,
    limits: CostLimits = DEFAULT_LIMITS
  ): { allowed: boolean; reason?: string } {
    // Reset hourly window
    if (Date.now() - this.hourWindowStart > 3_600_000) {
      this.globalHourlyCost = 0;
      this.hourWindowStart = Date.now();
    }

    const sessionCost = (this.sessionCosts.get(sessionId) ?? 0) + cost;
    if (sessionCost > limits.perSessionUsd) {
      return { allowed: false, reason: "Session cost limit exceeded" };
    }

    let userKey: string | undefined;
    let userCost = 0;
    if (userId) {
      userKey = `${userId}:${new Date().toDateString()}`;
      userCost = (this.userDayCosts.get(userKey) ?? 0) + cost;
      if (userCost > limits.perUserDayUsd) {
        return { allowed: false, reason: "Daily user cost limit exceeded" };
      }
    }

    if (this.globalHourlyCost + cost > limits.globalHourUsd) {
      return { allowed: false, reason: "Global hourly cost limit exceeded" };
    }

    // Record costs only after every limit check passes, so a request
    // rejected by one limit doesn't still count against the others.
    this.sessionCosts.set(sessionId, sessionCost);
    if (userKey) {
      this.userDayCosts.set(userKey, userCost);
    }
    this.globalHourlyCost += cost;
    return { allowed: true };
  }
}

export const costGuard = new CostGuard();
```

Detect when your agent is going off the rails:
```typescript
// lib/quality-monitor.ts
import Anthropic from "@anthropic-ai/sdk";

export interface QualityCheck {
  passed: boolean;
  issues: string[];
  score: number; // 0-1
}

const anthropic = new Anthropic();

export async function checkResponseQuality(
  userMessage: string,
  agentResponse: string,
  context: {
    agentPurpose: string;
    expectedTopics: string[];
  }
): Promise<QualityCheck> {
  const response = await anthropic.messages.create({
    model: "claude-haiku-4-20250514", // Use cheaper model for monitoring
    max_tokens: 512,
    system: `You are a quality evaluator for AI agent responses. Return JSON only.`,
    messages: [
      {
        role: "user",
        content: `Evaluate this AI response for quality issues.

Agent purpose: ${context.agentPurpose}
Expected topics: ${context.expectedTopics.join(", ")}

User message: ${userMessage}
Agent response: ${agentResponse}

Check for:
1. Is the response on-topic?
2. Does it contain hallucinated facts?
3. Is it appropriate (no harmful content)?
4. Does it answer the user's actual question?
5. Is it coherent?

Return: { passed: boolean, issues: string[], score: 0-1 }`,
      },
    ],
  });

  const text = response.content[0].type === "text" ? response.content[0].text : "{}";
  try {
    return JSON.parse(text);
  } catch {
    return { passed: true, issues: [], score: 0.8 }; // Default to passing on unparseable output
  }
}
```

Build dashboards around these metrics:
| Metric | Alert Threshold | What It Indicates |
|---|---|---|
| Avg response latency | > 5s p95 | Slow model, long prompts, loops |
| Cost per session | > 2x baseline | Prompt injection, loops, misuse |
| Quality score | < 0.7 | Model degradation, prompt issues |
| Error rate | > 1% | API issues, invalid requests |
| Token ratio (output/input) | > 3x | Verbosity issues, loops |
| Escalation rate | > 20% | Agent capability gaps |
| finish_reason: max_tokens | > 5% | Responses being cut off |
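The table's thresholds can be encoded as a single alert pass over a window of recent telemetry records. A sketch, assuming a subset of the telemetry fields defined earlier (the specific thresholds come from the table above and should be tuned to your own baselines):

```typescript
// Evaluate a window of telemetry records against the alert thresholds above.
interface TelemetrySample {
  latencyMs: number;
  totalCostUsd: number;
  inputTokens: number;
  outputTokens: number;
  errorCode?: string;
  finishReason: string;
}

export function evaluateWindow(samples: TelemetrySample[], baselineCostUsd: number): string[] {
  const alerts: string[] = [];
  if (samples.length === 0) return alerts;

  // p95 latency: sort ascending and index into the 95th percentile.
  const latencies = samples.map(s => s.latencyMs).sort((a, b) => a - b);
  const p95 = latencies[Math.min(latencies.length - 1, Math.floor(latencies.length * 0.95))];
  if (p95 > 5000) alerts.push(`p95 latency ${p95}ms exceeds 5s`);

  // Cost per interaction vs baseline.
  const avgCost = samples.reduce((s, t) => s + t.totalCostUsd, 0) / samples.length;
  if (avgCost > 2 * baselineCostUsd) alerts.push(`avg cost $${avgCost.toFixed(4)} is >2x baseline`);

  // Error rate.
  const errorRate = samples.filter(s => s.errorCode).length / samples.length;
  if (errorRate > 0.01) alerts.push(`error rate ${(errorRate * 100).toFixed(1)}% exceeds 1%`);

  // Output/input token ratio across the window.
  const tokenRatio =
    samples.reduce((s, t) => s + t.outputTokens, 0) /
    Math.max(1, samples.reduce((s, t) => s + t.inputTokens, 0));
  if (tokenRatio > 3) alerts.push(`output/input token ratio ${tokenRatio.toFixed(1)} exceeds 3x`);

  // Truncation rate.
  const truncated = samples.filter(s => s.finishReason === "max_tokens").length / samples.length;
  if (truncated > 0.05) alerts.push(`${(truncated * 100).toFixed(1)}% of responses hit max_tokens`);

  return alerts;
}
```

Run it on a schedule over the last hour of records and forward any returned strings to your alerting channel.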
For a simple setup that works immediately:
```typescript
// lib/log-telemetry.ts
// Save telemetry to a file or simple database
import * as fs from "fs/promises";
import { AgentTelemetry } from "./ai-telemetry.js";

export async function logTelemetry(telemetry: AgentTelemetry): Promise<void> {
  // Append to a JSONL file (one JSON object per line).
  // Assumes ./logs exists; create it once with fs.mkdir("./logs", { recursive: true }).
  await fs.appendFile(
    "./logs/agent-telemetry.jsonl",
    JSON.stringify(telemetry) + "\n"
  );

  // Alert on cost anomalies
  if (telemetry.totalCostUsd > 0.10) {
    console.warn(`High cost session: $${telemetry.totalCostUsd.toFixed(4)} for ${telemetry.sessionId}`);
    // Add Slack/email notification here
  }

  // Alert on slow responses
  if (telemetry.latencyMs > 10000) {
    console.warn(`Slow response: ${telemetry.latencyMs}ms for ${telemetry.traceId}`);
  }
}
```

Query the JSONL file with tools like DuckDB for ad-hoc analysis, or pipe it into a time-series database for dashboards.
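If you'd rather stay in Node, the same JSONL file can be aggregated directly. A sketch (the field names follow the telemetry shape used in this guide; the path is whatever you logged to):

```typescript
import * as fs from "fs/promises";

// Aggregate per-session cost and average latency from a JSONL telemetry log.
export async function summarizeTelemetry(path: string) {
  const lines = (await fs.readFile(path, "utf8")).split("\n").filter(Boolean);
  const records = lines.map(
    l => JSON.parse(l) as { sessionId: string; totalCostUsd: number; latencyMs: number }
  );

  const costPerSession = new Map<string, number>();
  let totalLatency = 0;
  for (const r of records) {
    costPerSession.set(r.sessionId, (costPerSession.get(r.sessionId) ?? 0) + r.totalCostUsd);
    totalLatency += r.latencyMs;
  }

  return {
    sessions: costPerSession.size,
    avgLatencyMs: records.length ? totalLatency / records.length : 0,
    // The five most expensive sessions, sorted descending by cost.
    topSessions: [...costPerSession.entries()].sort((a, b) => b[1] - a[1]).slice(0, 5),
  };
}
```

The "most expensive sessions" list is usually the fastest route to finding a loop or a prompt-injection attempt after the fact.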
Q: How do you set up monitoring for AI agents?
Set up monitoring in three layers: infrastructure (uptime, latency, error rates with tools like Vercel Analytics or Datadog), application (user interactions, conversion funnels, feature usage), and AI-specific (model latency, token usage, output quality scores, cost per request). Alert on anomalies in all three layers.
Q: What should you monitor in AI agent applications?
Monitor model response latency, token consumption, cost per interaction, error rates by type, output quality scores, user satisfaction, hallucination rates, and escalation frequency. These AI-specific metrics reveal issues that standard application monitoring misses.
Q: How do you detect AI quality degradation?
Detect degradation through automated quality scoring on a sample of AI outputs, trend analysis on user satisfaction metrics, comparison against baseline benchmarks, alert thresholds for key metrics (latency, error rate, cost), and regular human evaluation of random AI interactions.
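The trend-analysis part of that answer can be made concrete: compare a rolling window of recent quality scores against a baseline window and alert when the mean drops. A minimal sketch (the window sizes and the 10% drop threshold are illustrative assumptions, not recommendations):

```typescript
// Flag quality degradation: recent mean score drops materially below the baseline mean.
export function detectDegradation(
  scores: number[],      // chronological quality scores, 0-1
  baselineSize = 100,    // how many early scores form the baseline
  recentSize = 20,       // how many trailing scores to compare
  maxDropRatio = 0.1     // alert if the recent mean falls >10% below baseline
): boolean {
  if (scores.length < baselineSize + recentSize) return false; // not enough data yet
  const mean = (xs: number[]) => xs.reduce((a, b) => a + b, 0) / xs.length;
  const baseline = mean(scores.slice(0, baselineSize));
  const recent = mean(scores.slice(-recentSize));
  return recent < baseline * (1 - maxDropRatio);
}
```

Feed it the `score` values from the quality monitor above, per agent, and you have a degradation alarm that fires on trend rather than on any single bad response.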