Written by Gareth Simono, Founder and CEO of Agentik {OS}. Full-stack developer and AI architect with years of experience shipping production applications across SaaS, mobile, and enterprise platforms. Gareth orchestrates 267 specialized AI agents to deliver production software 10x faster than traditional development teams.
Founder & CEO, Agentik {OS}
Request received. Response sent. 200 OK. Latency 3.2s. None of that tells you why the agent gave wrong advice. Here's what actually does.

Your agent is in production. Requests are coming in. Responses are going out. Everything looks fine on the dashboard.
Then a user reports completely wrong advice. You open monitoring. Here is what you see:
Request received: 14:23:41. Response sent: 14:23:44. HTTP 200. Latency: 3.2 seconds.
Absolutely nothing in that data tells you why the agent gave wrong advice. You don't know what it considered. What alternatives it weighed. What context it used. Why it produced that particular output.
This is the observability gap that kills agent systems in production. You can see that things happened. You cannot see why.
Traditional monitoring was designed for deterministic systems. Same input, same output. Something wrong? Find the error, trace the code path, fix it.
Agent systems are non-deterministic. The same input produces different outputs on different runs. Behavior emerges from instructions, model reasoning, and specific context. There is no single code path to trace.
What you actually need: visibility into the context the agent retrieved, the reasoning steps it followed, the alternatives it weighed, and how confident it was in the output. None of this appears in standard monitoring. You build it deliberately or you fly blind.
Distributed tracing adapted for AI is the foundation of agent observability. Not optional.
```typescript
import { trace, Span, SpanStatusCode } from "@opentelemetry/api";

const tracer = trace.getTracer("agent-orchestrator");

async function executeAgentWorkflow(
  task: AgentTask,
  workflowId: string
): Promise<WorkflowResult> {
  return tracer.startActiveSpan(
    `workflow.${task.type}`,
    { attributes: { workflowId, taskType: task.type, userId: task.userId } },
    async (workflowSpan: Span) => {
      try {
        const result = await runWorkflow(task, workflowSpan);
        workflowSpan.setAttributes({
          "workflow.quality_score": result.qualityScore,
          "workflow.total_tokens": result.totalTokens,
          "workflow.total_cost_usd": result.totalCostUsd,
        });
        return result;
      } catch (error) {
        workflowSpan.setStatus({ code: SpanStatusCode.ERROR, message: String(error) });
        throw error;
      } finally {
        workflowSpan.end();
      }
    }
  );
}

async function runLLMCall(agent: string, prompt: string, model: string): Promise<LLMResponse> {
  return tracer.startActiveSpan(`llm.call.${agent}`, {}, async (llmSpan: Span) => {
    try {
      const inputTokens = countTokens(prompt);
      const response = await callLLM(prompt, model);
      const outputTokens = countTokens(response.content);
      llmSpan.setAttributes({
        "llm.model": model,
        "llm.input_tokens": inputTokens,
        "llm.output_tokens": outputTokens,
        "llm.cost_usd": calculateCost(inputTokens, outputTokens, model),
      });
      // Store full prompt and response for reasoning analysis
      await reasoningStore.save({ spanId: llmSpan.spanContext().spanId, agent, prompt, response: response.content });
      return response;
    } finally {
      // End the span even if the LLM call throws, so failed calls still trace
      llmSpan.end();
    }
  });
}
```

Every significant step gets a span. Orchestrator decisions. LLM calls with full input and output. Tool executions with parameters and results. Agent handoffs. The trace gives you the complete execution map.
When something goes wrong, drill from the top-level workflow span through nested spans to find exactly where the failure occurred and what state the system was in.
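That drill-down can also be scripted against exported trace data. A minimal sketch, assuming spans come back as a flat list linked by `parentSpanId` — the `ExportedSpan` shape and field names here are illustrative, not any specific backend's API:

```typescript
// Illustrative exported-span shape; real backends differ in field names.
interface ExportedSpan {
  spanId: string;
  parentSpanId?: string;
  name: string;
  status: "OK" | "ERROR";
  attributes: Record<string, unknown>;
}

// Walk from the workflow root down the chain of ERROR spans.
// The deepest one is usually where the real failure state lives.
function findDeepestError(spans: ExportedSpan[], rootId: string): ExportedSpan | undefined {
  const children = (id: string) => spans.filter(s => s.parentSpanId === id);
  let current = spans.find(s => s.spanId === rootId && s.status === "ERROR");
  while (current) {
    const next = children(current.spanId).find(s => s.status === "ERROR");
    if (!next) return current;
    current = next;
  }
  return undefined;
}
```

Running this over a failed workflow trace jumps you straight to the span whose attributes describe the broken state, instead of eyeballing a waterfall view.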
Latency, throughput, and error rate tell you about system health. Nothing about agent quality. You need both, and the quality metrics are harder and more important.
```typescript
const AGENT_QUALITY_METRICS = [
  {
    name: "factual_accuracy",
    measuredBy: "llm_evaluator" as const,
    alertThreshold: 0.85,
    description: "Factual claims are correct and supported by context"
  },
  {
    name: "task_completion_rate",
    measuredBy: "human" as const,
    alertThreshold: 0.75,
    description: "User accomplished their goal through the interaction"
  },
  {
    name: "response_relevance",
    measuredBy: "llm_evaluator" as const,
    alertThreshold: 0.90,
    description: "Response directly addresses the user question"
  },
  {
    name: "hallucination_rate",
    measuredBy: "deterministic" as const,
    alertThreshold: 0.03,
    description: "Rate of responses containing verifiable false claims"
  }
];
```

Track each metric per agent, per behavioral version, per task type, per time period. Granular tracking lets you identify exactly where quality changes when a deployment happens.
Cost per task matters equally and is equally undertracked. Total spend grows with usage (expected). Cost per task should be stable or declining. Growing cost-per-task signals inefficiency before it shows up in the total bill.
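The rollup itself is simple. A sketch, assuming you already record a cost figure per completed task — the `TaskCost` shape is illustrative:

```typescript
// Illustrative per-task cost record; in practice this would come from
// the "llm.cost_usd" span attributes aggregated per workflow.
interface TaskCost {
  taskType: string;
  costUsd: number;
}

// Average cost per task, grouped by task type. Watch this number per
// window: total spend can grow with usage while this stays flat.
function costPerTask(records: TaskCost[]): Map<string, number> {
  const totals = new Map<string, { cost: number; count: number }>();
  for (const r of records) {
    const t = totals.get(r.taskType) ?? { cost: 0, count: 0 };
    t.cost += r.costUsd;
    t.count += 1;
    totals.set(r.taskType, t);
  }
  return new Map([...totals].map(([type, t]) => [type, t.cost / t.count]));
}
```

Compare the same task type across consecutive windows: a rising average flags inefficiency while the total bill still looks normal.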
The piece most teams don't build. The most important one for debugging.
Every execution should produce a reasoning log capturing the complete chain:
```typescript
interface ReasoningLog {
  executionId: string;
  agentId: string;
  timestamp: Date;
  retrievedContext: ContextItem[];
  contextScores: Record<string, number>;
  reasoningSteps: Array<{
    type: "analysis" | "decision" | "synthesis" | "validation";
    content: string;
    basedOn: string[];
    confidence: "high" | "medium" | "low";
    alternatives?: string[];
  }>;
  toolCalls: Array<{
    tool: string;
    params: unknown;
    result: unknown;
    agentInterpretation: string;
  }>;
  confidenceIndicators: string[];
  uncertaintyFlags: string[];
}
```

Store reasoning logs linked to traces. User reports a bad response? Pull the reasoning log. See exactly what went wrong.
Was context retrieval off? Was reasoning sound but context incomplete? Was a tool result misinterpreted? Each failure mode has a different fix. The reasoning log tells you which one.
Without reasoning logs, you're guessing at root cause. With them, you're diagnosing with evidence. The debugging speed difference is an order of magnitude.
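The first pass of that diagnosis can even be automated as a triage function over the log's fields. A sketch mirroring the `ReasoningLog` shape above — the thresholds and failure-mode labels are illustrative assumptions, not a fixed taxonomy:

```typescript
type FailureMode = "retrieval" | "incomplete_context" | "tool_misread" | "reasoning";

// Rough triage: each branch points at a different fix.
// Threshold of 0.5 for "weak retrieval" is an illustrative choice.
function triageFailure(log: {
  contextScores: Record<string, number>;
  uncertaintyFlags: string[];
  toolCalls: Array<{ result: unknown; agentInterpretation: string }>;
}): FailureMode {
  const scores = Object.values(log.contextScores);
  const avgScore = scores.length
    ? scores.reduce((a, b) => a + b, 0) / scores.length
    : 0;
  // Weak retrieval scores: the agent never saw the right context.
  if (avgScore < 0.5) return "retrieval";
  // The agent itself flagged gaps: context was relevant but incomplete.
  if (log.uncertaintyFlags.length > 0) return "incomplete_context";
  // A tool returned an error payload the agent read as a normal result.
  if (log.toolCalls.some(c =>
    typeof c.result === "object" && c.result !== null && "error" in c.result &&
    !c.agentInterpretation.toLowerCase().includes("error")
  )) return "tool_misread";
  // Inputs look sound: the reasoning itself is the suspect.
  return "reasoning";
}
```

Routing bad responses through a classifier like this turns a pile of user complaints into a ranked list of failure modes, each with a known fix.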
Agent systems generate massive observability data. Without careful strategy, you drown in noise and start ignoring everything.
```typescript
const ALERT_RULES: AlertRule[] = [
  // Immediate: page someone
  {
    metric: "agent.quality.factual_accuracy",
    condition: "below_threshold",
    threshold: 0.80,
    windowMinutes: 15,
    severity: "critical",
    channel: "pagerduty"
  },
  {
    metric: "agent.cost.daily_spend_usd",
    condition: "above_threshold",
    threshold: DAILY_BUDGET * 0.90,
    windowMinutes: 60,
    severity: "critical",
    channel: "pagerduty"
  },
  // Warning: team channel notification
  {
    metric: "agent.quality.task_completion_rate",
    condition: "declining_trend",
    thresholdChange: -0.05,
    windowHours: 48,
    severity: "warning",
    channel: "slack-engineering"
  },
];
```

Three tiers: immediate action (page on-call), needs attention today (team channel), review in weekly metrics (dashboard only). Route to the right people. Quality alerts to the ML team. Infrastructure to ops. Security anomalies to security. A single channel for everything means everything gets ignored.
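The routing logic itself can stay small. A sketch, using the metric prefixes from the rules above — the channel names are hypothetical placeholders for your own destinations:

```typescript
type Severity = "critical" | "warning" | "info";

// Three-tier routing: critical pages on-call, warning goes to the owning
// team's channel, everything else lands on the weekly dashboard only.
function routeAlert(metric: string, severity: Severity): string {
  if (severity === "critical") return "pagerduty";
  if (severity === "warning") {
    // Route by metric prefix so the right team sees it.
    if (metric.startsWith("agent.quality.")) return "slack-ml-team";
    if (metric.startsWith("agent.cost.")) return "slack-engineering";
    return "slack-ops";
  }
  return "weekly-dashboard";
}
```

The prefix convention does the work here: as long as metric names encode their domain, new metrics route correctly without touching the router.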
Continuous evaluation of production traffic separates teams that catch problems early from teams that find out from user complaints.
```typescript
class ProductionEvaluationPipeline {
  constructor(
    private evaluator: Evaluator,
    private criteria: EvaluationCriteria,
    private store: MetricsStore,
    private reviewQueue: ReviewQueue,
    private evalDataset: EvalDataset,
    private noveltyDetector: NoveltyDetector,
    private config: { reviewThreshold: number }
  ) {}

  async processSample(interaction: ProductionInteraction): Promise<void> {
    const result = await this.evaluator.evaluate(
      interaction.input,
      interaction.output,
      this.criteria
    );
    await this.store.record(result, interaction.metadata);
    // Low scores go to humans for review
    if (result.overallScore < this.config.reviewThreshold) {
      await this.reviewQueue.enqueue({ interaction, result });
    }
    // Novel failures grow the eval dataset
    if (await this.noveltyDetector.isNovel(result)) {
      await this.evalDataset.addCase(interaction, result);
    }
  }
}
```

Sample 10-20% of production interactions. Run through your evaluation pipeline. Compare to deployment baseline. Catch regression as it emerges rather than when users notice.
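For the sampling step itself, a deterministic hash keeps the decision stable across retries and replays — the same interaction always gets the same answer. A minimal sketch; the hash function and default rate are illustrative:

```typescript
// Deterministic sampling: hash the interaction id instead of rolling
// Math.random(), so replays and retries sample consistently.
// A rate of 0.15 sits in the 10-20% band.
function shouldSample(interactionId: string, rate = 0.15): boolean {
  let h = 0;
  for (const ch of interactionId) {
    // Simple 31-based rolling hash, kept in unsigned 32-bit range.
    h = (h * 31 + ch.charCodeAt(0)) >>> 0;
  }
  return (h % 10000) / 10000 < rate;
}
```

Bumping the rate for a newly deployed agent version and lowering it once quality stabilizes is a common refinement of this pattern.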
The production evaluation pipeline is also how your eval system learns. Static test suites only test what you anticipated. Production samples expose what you missed. Every novel failure added to the test suite makes future deployments safer.
Tracing: OpenTelemetry as vendor-neutral instrumentation. Custom spans for LLM calls and tool executions. Backend of choice: Jaeger, Grafana Tempo, Honeycomb, or Datadog.
Reasoning logs: Structured JSON in a searchable backend. Elasticsearch for full-text search. Minimum 30-day retention.
Metrics: Prometheus for custom metrics, Grafana for dashboards. Key views: quality metrics by agent and version, cost per task by type, tool health by endpoint.
Evaluation pipeline: Separate service sampling production interactions and running quality evaluation. Stores to metrics system for trend analysis.
Specialized tools: LangSmith for LangChain stacks. Helicone as an LLM gateway with built-in logging. Braintrust for evaluation-focused workflows. These accelerate setup significantly.
Start with distributed tracing and reasoning logs. Most debugging value fastest. Layer in quality metrics once basics are solid. Build continuous evaluation last but don't skip it.
For teams setting up initial deployment infrastructure, the agent deployment patterns article covers the deployment side that observability connects to.
Q: How do you monitor AI agents in production?
Monitor agents across four dimensions: operational health (uptime, response times, error rates), output quality (task completion rate, accuracy, user satisfaction), cost efficiency (tokens per task, cost per outcome), and safety (boundary violations, escalation frequency, anomalous behavior).
Q: What observability tools work for AI agents?
Use OpenTelemetry for distributed tracing across agent interactions, structured logging for decision audits, custom dashboards for AI-specific metrics, alerting on quality degradation, and cost tracking per agent and task type. Combine standard APM tools with custom AI observability layers.
Q: What metrics matter most for AI agent observability?
The critical metrics are task completion rate, output quality score, cost per task, latency (time from request to result), error recovery success rate, escalation frequency, and user satisfaction. Track these over time to detect quality degradation before it impacts users.