Written by Gareth Simono, Founder and CEO of Agentik {OS}. Full-stack developer and AI architect with years of experience shipping production applications across SaaS, mobile, and enterprise platforms. Gareth orchestrates 267 specialized AI agents to deliver production software 10x faster than traditional development teams.
Founder & CEO, Agentik {OS}
Request received. Response sent. 200 OK. Latency 3.2s. None of that tells you why the agent gave wrong advice. Here's what actually does.

Your agent is in production. Requests are coming in. Responses are going out. Everything looks fine on the dashboard.
Then a user reports completely wrong advice. You open monitoring. Here is what you see:
Request received: 14:23:41. Response sent: 14:23:44. HTTP 200. Latency: 3.2 seconds.
Absolutely nothing in that data tells you why the agent gave wrong advice. You don't know what it considered. What alternatives it weighed. What context it used. Why it produced that particular output.
This is the observability gap that kills agent systems in production. You can see that things happened. You cannot see why.
Traditional monitoring was designed for deterministic systems. Same input, same output. Something wrong? Find the error, trace the code path, fix it.
Agent systems are non-deterministic. The same input produces different outputs on different runs. Behavior emerges from instructions, model reasoning, and specific context. There is no single code path to trace.
What you actually need: visibility into the context the agent retrieved, the reasoning steps it followed, the alternatives it weighed, and how confident it was in the output. None of this appears in standard monitoring. You build it deliberately or you fly blind.
Distributed tracing adapted for AI is the foundation of agent observability. Not optional.
```typescript
import { trace, Span, SpanStatusCode } from "@opentelemetry/api";

const tracer = trace.getTracer("agent-orchestrator");

async function executeAgentWorkflow(
  task: AgentTask,
  workflowId: string
): Promise<WorkflowResult> {
  return tracer.startActiveSpan(
    `workflow.${task.type}`,
    { attributes: { workflowId, taskType: task.type, userId: task.userId } },
    async (workflowSpan: Span) => {
      try {
        const result = await runWorkflow(task, workflowSpan);
        workflowSpan.setAttributes({
          "workflow.quality_score": result.qualityScore,
          "workflow.total_tokens": result.totalTokens,
          "workflow.total_cost_usd": result.totalCostUsd,
        });
        return result;
      } catch (error) {
        workflowSpan.setStatus({ code: SpanStatusCode.ERROR, message: String(error) });
        throw error;
      } finally {
        workflowSpan.end();
      }
    }
  );
}

async function runLLMCall(agent: string, prompt: string, model: string): Promise<LLMResponse> {
  return tracer.startActiveSpan(`llm.call.${agent}`, {}, async (llmSpan: Span) => {
    try {
      const inputTokens = countTokens(prompt);
      const response = await callLLM(prompt, model);
      const outputTokens = countTokens(response.content);
      llmSpan.setAttributes({
        "llm.model": model,
        "llm.input_tokens": inputTokens,
        "llm.output_tokens": outputTokens,
        "llm.cost_usd": calculateCost(inputTokens, outputTokens, model),
      });
      // Store full prompt and response for reasoning analysis
      await reasoningStore.save({ spanId: llmSpan.spanContext().spanId, agent, prompt, response: response.content });
      return response;
    } finally {
      // End the span even if the LLM call throws, so failed calls still trace
      llmSpan.end();
    }
  });
}
```

Every significant step gets a span. Orchestrator decisions. LLM calls with full input and output. Tool executions with parameters and results. Agent handoffs. The trace gives you the complete execution map.
When something goes wrong, drill from the top-level workflow span through nested spans to find exactly where the failure occurred and what state the system was in.
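That drill-down can also be scripted against exported trace data. A minimal sketch, assuming spans come back as a flat list linked by `parentSpanId` — the `ExportedSpan` shape and field names here are illustrative, not any specific backend's API:

```typescript
// Illustrative exported-span shape; real backends differ in field names.
interface ExportedSpan {
  spanId: string;
  parentSpanId?: string;
  name: string;
  status: "OK" | "ERROR";
  attributes: Record<string, unknown>;
}

// Walk from the workflow root down the chain of ERROR spans.
// The deepest one is usually where the real failure state lives.
function findDeepestError(spans: ExportedSpan[], rootId: string): ExportedSpan | undefined {
  const children = (id: string) => spans.filter(s => s.parentSpanId === id);
  let current = spans.find(s => s.spanId === rootId && s.status === "ERROR");
  while (current) {
    const next = children(current.spanId).find(s => s.status === "ERROR");
    if (!next) return current;
    current = next;
  }
  return undefined;
}
```

Running this over a failed workflow trace jumps you straight to the span whose attributes describe the broken state, instead of eyeballing a waterfall view.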
Latency, throughput, and error rate tell you about system health. Nothing about agent quality. You need both, and the quality metrics are harder and more important.
```typescript
const AGENT_QUALITY_METRICS = [
  {
    name: "factual_accuracy",
    measuredBy: "llm_evaluator" as const,
    alertThreshold: 0.85,
    description: "Factual claims are correct and supported by context"
  },
  {
    name: "task_completion_rate",
    measuredBy: "human" as const,
    alertThreshold: 0.75,
    description: "User accomplished their goal through the interaction"
  },
  {
    name: "response_relevance",
    measuredBy: "llm_evaluator" as const,
    alertThreshold: 0.90,
    description: "Response directly addresses the user question"
  },
  {
    name: "hallucination_rate",
    measuredBy: "deterministic" as const,
    alertThreshold: 0.03,
    description: "Rate of responses containing verifiable false claims"
  }
];
```

Track each metric per agent, per behavioral version, per task type, per time period. Granular tracking lets you identify exactly where quality changes when a deployment happens.
Cost per task matters equally and is equally undertracked. Total spend grows with usage (expected). Cost per task should be stable or declining. Growing cost-per-task signals inefficiency before it shows up in the total bill.
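The rollup itself is simple. A sketch, assuming you already record a cost figure per completed task — the `TaskCost` shape is illustrative:

```typescript
// Illustrative per-task cost record; in practice this would come from
// the "llm.cost_usd" span attributes aggregated per workflow.
interface TaskCost {
  taskType: string;
  costUsd: number;
}

// Average cost per task, grouped by task type. Watch this number per
// window: total spend can grow with usage while this stays flat.
function costPerTask(records: TaskCost[]): Map<string, number> {
  const totals = new Map<string, { cost: number; count: number }>();
  for (const r of records) {
    const t = totals.get(r.taskType) ?? { cost: 0, count: 0 };
    t.cost += r.costUsd;
    t.count += 1;
    totals.set(r.taskType, t);
  }
  return new Map([...totals].map(([type, t]) => [type, t.cost / t.count]));
}
```

Compare the same task type across consecutive windows: a rising average flags inefficiency while the total bill still looks normal.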
The piece most teams don't build. The most important one for debugging.
Every execution should produce a reasoning log capturing the complete chain:
```typescript
interface ReasoningLog {
  executionId: string;
  agentId: string;
  timestamp: Date;
  retrievedContext: ContextItem[];
  contextScores: Record<string, number>;
  reasoningSteps: Array<{
    type: "analysis" | "decision" | "synthesis" | "validation";
    content: string;
    basedOn: string[];
    confidence: "high" | "medium" | "low";
    alternatives?: string[];
  }>;
  toolCalls: Array<{
    tool: string;
    params: unknown;
    result: unknown;
    agentInterpretation: string;
  }>;
  confidenceIndicators: string[];
  uncertaintyFlags: string[];
}
```

Store reasoning logs linked to traces. User reports a bad response? Pull the reasoning log. See exactly what went wrong.
Was context retrieval off? Was reasoning sound but context incomplete? Was a tool result misinterpreted? Each failure mode has a different fix. The reasoning log tells you which one.
Without reasoning logs, you're guessing at root cause. With them, you're diagnosing with evidence. The debugging speed difference is an order of magnitude.
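The first pass of that diagnosis can even be automated as a triage function over the log's fields. A sketch mirroring the `ReasoningLog` shape above — the thresholds and failure-mode labels are illustrative assumptions, not a fixed taxonomy:

```typescript
type FailureMode = "retrieval" | "incomplete_context" | "tool_misread" | "reasoning";

// Rough triage: each branch points at a different fix.
// Threshold of 0.5 for "weak retrieval" is an illustrative choice.
function triageFailure(log: {
  contextScores: Record<string, number>;
  uncertaintyFlags: string[];
  toolCalls: Array<{ result: unknown; agentInterpretation: string }>;
}): FailureMode {
  const scores = Object.values(log.contextScores);
  const avgScore = scores.length
    ? scores.reduce((a, b) => a + b, 0) / scores.length
    : 0;
  // Weak retrieval scores: the agent never saw the right context.
  if (avgScore < 0.5) return "retrieval";
  // The agent itself flagged gaps: context was relevant but incomplete.
  if (log.uncertaintyFlags.length > 0) return "incomplete_context";
  // A tool returned an error payload the agent read as a normal result.
  if (log.toolCalls.some(c =>
    typeof c.result === "object" && c.result !== null && "error" in c.result &&
    !c.agentInterpretation.toLowerCase().includes("error")
  )) return "tool_misread";
  // Inputs look sound: the reasoning itself is the suspect.
  return "reasoning";
}
```

Routing bad responses through a classifier like this turns a pile of user complaints into a ranked list of failure modes, each with a known fix.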
Agent systems generate massive observability data. Without careful strategy, you drown in noise and start ignoring everything.
```typescript
const ALERT_RULES: AlertRule[] = [
  // Immediate: page someone
  {
    metric: "agent.quality.factual_accuracy",
    condition: "below_threshold",
    threshold: 0.80,
    windowMinutes: 15,
    severity: "critical",
    channel: "pagerduty"
  },
  {
    metric: "agent.cost.daily_spend_usd",
    condition: "above_threshold",
    threshold: DAILY_BUDGET * 0.90,
    windowMinutes: 60,
    severity: "critical",
    channel: "pagerduty"
  },
  // Warning: team channel notification
  {
    metric: "agent.quality.task_completion_rate",
    condition: "declining_trend",
    thresholdChange: -0.05,
    windowHours: 48,
    severity: "warning",
    channel: "slack-engineering"
  },
];
```

Three tiers: immediate action (page on-call), needs attention today (team channel), review in weekly metrics (dashboard only). Route to the right people. Quality alerts to the ML team. Infrastructure to ops. Security anomalies to security. A single channel for everything means everything gets ignored.
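The routing logic itself can stay small. A sketch, using the metric prefixes from the rules above — the channel names are hypothetical placeholders for your own destinations:

```typescript
type Severity = "critical" | "warning" | "info";

// Three-tier routing: critical pages on-call, warning goes to the owning
// team's channel, everything else lands on the weekly dashboard only.
function routeAlert(metric: string, severity: Severity): string {
  if (severity === "critical") return "pagerduty";
  if (severity === "warning") {
    // Route by metric prefix so the right team sees it.
    if (metric.startsWith("agent.quality.")) return "slack-ml-team";
    if (metric.startsWith("agent.cost.")) return "slack-engineering";
    return "slack-ops";
  }
  return "weekly-dashboard";
}
```

The prefix convention does the work here: as long as metric names encode their domain, new metrics route correctly without touching the router.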
Continuous evaluation of production traffic separates teams that catch problems early from teams that find out from user complaints.
```typescript
class ProductionEvaluationPipeline {
  constructor(
    private evaluator: Evaluator,
    private criteria: EvaluationCriteria,
    private store: MetricsStore,
    private reviewQueue: ReviewQueue,
    private evalDataset: EvalDataset,
    private noveltyDetector: NoveltyDetector,
    private config: { reviewThreshold: number }
  ) {}

  async processSample(interaction: ProductionInteraction): Promise<void> {
    const result = await this.evaluator.evaluate(
      interaction.input,
      interaction.output,
      this.criteria
    );
    await this.store.record(result, interaction.metadata);
    // Low scores go to humans for review
    if (result.overallScore < this.config.reviewThreshold) {
      await this.reviewQueue.enqueue({ interaction, result });
    }
    // Novel failures grow the eval dataset
    if (await this.noveltyDetector.isNovel(result)) {
      await this.evalDataset.addCase(interaction, result);
    }
  }
}
```

Sample 10-20% of production interactions. Run through your evaluation pipeline. Compare to deployment baseline. Catch regression as it emerges rather than when users notice.
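For the sampling step itself, a deterministic hash keeps the decision stable across retries and replays — the same interaction always gets the same answer. A minimal sketch; the hash function and default rate are illustrative:

```typescript
// Deterministic sampling: hash the interaction id instead of rolling
// Math.random(), so replays and retries sample consistently.
// A rate of 0.15 sits in the 10-20% band.
function shouldSample(interactionId: string, rate = 0.15): boolean {
  let h = 0;
  for (const ch of interactionId) {
    // Simple 31-based rolling hash, kept in unsigned 32-bit range.
    h = (h * 31 + ch.charCodeAt(0)) >>> 0;
  }
  return (h % 10000) / 10000 < rate;
}
```

Bumping the rate for a newly deployed agent version and lowering it once quality stabilizes is a common refinement of this pattern.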
The production evaluation pipeline is also how your eval system learns. Static test suites only test what you anticipated. Production samples expose what you missed. Every novel failure added to the test suite makes future deployments safer.
Tracing: OpenTelemetry as vendor-neutral instrumentation. Custom spans for LLM calls and tool executions. Backend of choice: Jaeger, Grafana Tempo, Honeycomb, or Datadog.
Reasoning logs: Structured JSON in a searchable backend. Elasticsearch for full-text search. Minimum 30-day retention.
Metrics: Prometheus for custom metrics, Grafana for dashboards. Key views: quality metrics by agent and version, cost per task by type, tool health by endpoint.
Evaluation pipeline: Separate service sampling production interactions and running quality evaluation. Stores to metrics system for trend analysis.
Specialized tools: LangSmith for LangChain stacks. Helicone as an LLM gateway with built-in logging. Braintrust for evaluation-focused workflows. These accelerate setup significantly.
Start with distributed tracing and reasoning logs. Most debugging value fastest. Layer in quality metrics once basics are solid. Build continuous evaluation last but don't skip it.
For teams setting up initial deployment infrastructure, the agent deployment patterns article covers the deployment side that observability connects to.
Q: How do you monitor AI agents in production?
Monitor agents across four dimensions: operational health (uptime, response times, error rates), output quality (task completion rate, accuracy, user satisfaction), cost efficiency (tokens per task, cost per outcome), and safety (boundary violations, escalation frequency, anomalous behavior).
Q: What observability tools work for AI agents?
Use OpenTelemetry for distributed tracing across agent interactions, structured logging for decision audits, custom dashboards for AI-specific metrics, alerting on quality degradation, and cost tracking per agent and task type. Combine standard APM tools with custom AI observability layers.
Q: What metrics matter most for AI agent observability?
The critical metrics are task completion rate, output quality score, cost per task, latency (time from request to result), error recovery success rate, escalation frequency, and user satisfaction. Track these over time to detect quality degradation before it impacts users.