Written by Gareth Simono, Founder and CEO of Agentik {OS}. Full-stack developer and AI architect with years of experience shipping production applications across SaaS, mobile, and enterprise platforms. Gareth orchestrates 267 specialized AI agents to deliver production software 10x faster than traditional development teams.
97% accuracy sounds impressive until you ask what was measured. Most evaluation frameworks produce numbers without insight. Build ones that work.

You cannot improve what you cannot measure. Except with AI agents, most teams are measuring the wrong things.
"Our agent has 97% accuracy" is the most meaningless sentence in AI development. Accuracy at what? Measured against what baseline? Evaluated by whom? On which test set? With which edge cases excluded? Without answers to those questions, the number is not just useless. It is actively harmful because it gives teams false confidence that something is working when it might be quietly failing on the exact inputs that matter most to users.
I have watched three teams ship agents to production with high benchmark scores that fell apart in the first week. Not because the models were bad. Because the evaluation frameworks were measuring what was easy to measure, not what actually determined whether users trusted the system. Task completion rate looks great. But what the numbers hid was that half of "completed" tasks required the user to correct the agent mid-flow, and a third of completions produced outputs the user immediately discarded.
Agent quality is not a scalar. It is a multi-dimensional space of tradeoffs. Your evaluation framework needs to capture those dimensions, surface the tradeoffs clearly, and connect directly to the behaviors you care about in production.
Yes, track task completion. You need to know whether the agent finishes what it starts. But raw completion rate is the least informative metric past an 80% baseline.
What makes completion rate deceptive: agents can "complete" a task by doing the easy version of it, omitting the hard parts, or producing output that technically satisfies the task description but fails the user's actual intent.
A customer service agent that closes tickets by saying "I understand your frustration, please contact us at support@company.com" has a 100% completion rate and is completely useless.
The metrics that matter around completion:
Completion efficiency. How many steps and tokens did it take to complete the task? An agent that finishes in 3 tool calls and 800 tokens is better than one that finishes the same task in 11 tool calls and 4,200 tokens, even if both show 100% completion. Efficiency tells you whether the agent is planning effectively or flailing toward the right answer.
Autonomous completion rate. What fraction of completions required zero human intervention? Track this separately from total completion rate. An agent where 95% of completions are fully autonomous is excellent. One where 99% complete but 40% required mid-flow corrections is actually a problem masquerading as a good metric.
Time to completion. Separate from step count. A fast three-step process is better than a slow three-step process. Latency is a quality dimension, not just a performance dimension.
```typescript
interface CompletionMetrics {
  taskId: string;
  completed: boolean;
  autonomousCompletion: boolean; // No human intervention
  stepCount: number;
  tokenCount: number;
  latencyMs: number;
  humanInterventions: number; // Corrections, clarifications, overrides
  outputQuality?: number; // 1-5 if human-rated
  userDiscarded: boolean; // Did user immediately discard the output?
}
```

Look at the distribution of these metrics, not just averages. The mean step count might be fine, but if there is a long tail of tasks taking 30+ steps, you have an instability problem that averages hide.
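One way to surface that long tail is to report a high percentile alongside the mean and flag the specific tasks driving it. A minimal sketch, assuming the `CompletionMetrics` shape above; the 30-step threshold and the p95 choice are illustrative, not prescriptive:

```typescript
// Flag the instability that averages hide: report p95 alongside the mean
// and list the tasks that blew past a step budget.
function percentile(values: number[], p: number): number {
  const sorted = [...values].sort((a, b) => a - b);
  const idx = Math.min(sorted.length - 1, Math.ceil((p / 100) * sorted.length) - 1);
  return sorted[Math.max(0, idx)];
}

interface StepDistributionReport {
  meanSteps: number;
  p95Steps: number;
  longTailTaskIds: string[]; // Tasks far beyond the typical step count
}

function analyzeStepDistribution(
  metrics: { taskId: string; stepCount: number }[],
  tailThreshold = 30 // Steps beyond which a task counts as unstable (assumed budget)
): StepDistributionReport {
  const steps = metrics.map(m => m.stepCount);
  const meanSteps = steps.reduce((a, b) => a + b, 0) / steps.length;
  return {
    meanSteps,
    p95Steps: percentile(steps, 95),
    longTailTaskIds: metrics
      .filter(m => m.stepCount >= tailThreshold)
      .map(m => m.taskId),
  };
}
```

A dashboard that shows `meanSteps: 6, p95Steps: 34` tells you far more than the mean alone: most runs are fine, but a meaningful slice of them is flailing.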
Reasoning quality is the evaluation dimension I care most about, and the one almost no team measures.
An agent can produce the correct output with flawed reasoning. Pattern-matched to training examples. Got lucky on this particular input. The reasoning steps were wrong or missing, but the output happened to be right. Your evaluation marks it as a pass. You ship it with confidence.
Two weeks later, a slightly different input triggers the same flawed reasoning and produces a wrong output. You cannot understand why the behavior changed because nothing changed. The reasoning was always broken. You just did not catch it.
Chain-of-thought evaluation catches this. You evaluate not just whether the final output is correct but whether the reasoning steps leading to it are valid:
```typescript
interface ReasoningEvaluation {
  taskId: string;
  chainOfThought: ReasoningStep[];
  finalOutput: string;
  evaluation: {
    outputCorrect: boolean;
    reasoningSound: boolean; // Was the reasoning valid even if the output happened to be right?
    assumptionsValid: boolean; // Did the agent make assumptions that were true in this case?
    informationUsedCorrectly: boolean; // Did it use available context correctly?
    problemDecompositionQuality: number; // 1-5
  };
}

interface ReasoningStep {
  step: number;
  action: string;
  reasoning: string;
  toolCalled?: string;
  toolOutput?: string;
  conclusionDrawn: string;
}
```

Evaluating reasoning quality requires human review or LLM evaluation of the reasoning chain. Both are more expensive than automated output comparison. Both are worth the cost.
I use a lightweight LLM-as-judge approach for this at scale:
```typescript
import Anthropic from "@anthropic-ai/sdk";

const anthropic = new Anthropic(); // Reads ANTHROPIC_API_KEY from the environment

async function evaluateReasoning(
  task: string,
  steps: ReasoningStep[],
  output: string
): Promise<ReasoningEvaluation> {
  const response = await anthropic.messages.create({
    model: "claude-sonnet-4-5",
    max_tokens: 500,
    system: `You are evaluating AI agent reasoning quality. You will see a task, the agent's reasoning steps, and its final output.

Evaluate:
1. Is the output correct? (true/false)
2. Is the reasoning sound? (true/false) - mark false for flawed reasoning even if the output happened to be right by luck
3. Were the agent's assumptions valid? (true/false)
4. Did the agent use available information correctly? (true/false)
5. Rate problem decomposition quality 1-5

Return JSON only.`,
    messages: [
      {
        role: "user",
        content: `Task: ${task}\n\nReasoning steps:\n${JSON.stringify(steps, null, 2)}\n\nFinal output: ${output}`
      }
    ]
  });

  return JSON.parse(
    response.content[0].type === "text" ? response.content[0].text : "{}"
  );
}
```

An agent that produces correct outputs with flawed reasoning is a liability waiting to become an incident. The correct output is coincidental. The flawed reasoning will surface eventually, at the worst possible time.
A number without a reference point is not information. Your agent achieving 85% task success means nothing without knowing what the baseline is.
Every evaluation should include three baselines.
Rule-based baseline. Can a simple if-else system do this task at a meaningful success rate? If your agent achieves 85% success and a rule-based system achieves 80%, you have a 5-point improvement at significant cost and complexity. Worth it? Maybe not. If the rule-based system achieves 40%, you have a genuine breakthrough.
Previous version baseline. Every evaluation run should compare against the previous version. Catches regressions before users report them. A prompt change that improves performance on one task type while degrading performance on another shows up in baseline comparison.
Human performance baseline. Measure how well a human expert performs on the same evaluation set. This gives you both a ceiling (what is theoretically achievable) and context (the agent is operating at 60% of human expert level on this task type). Useful for communicating with stakeholders who want to understand what "good" looks like.
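Comparing a run against these baselines is easy to automate. A minimal sketch, assuming per-task-type success rates as a flat record; the `ResultsByCategory` shape and the simple averaging against the human baseline are illustrative assumptions:

```typescript
// Per-task-type success rates for one evaluation run (assumed shape).
type ResultsByCategory = Record<string, number>;

interface BaselineComparison {
  regressions: string[];  // Categories where current < previous
  improvements: string[]; // Categories where current > previous
  vsHumanPct: number;     // Current average success as a % of human baseline
}

function compareToBaselines(
  current: ResultsByCategory,
  previous: ResultsByCategory,
  human: ResultsByCategory
): BaselineComparison {
  const categories = Object.keys(current);
  const avg = (r: ResultsByCategory) =>
    categories.reduce((sum, c) => sum + (r[c] ?? 0), 0) / categories.length;
  return {
    regressions: categories.filter(c => current[c] < (previous[c] ?? 0)),
    improvements: categories.filter(c => current[c] > (previous[c] ?? 0)),
    vsHumanPct: (avg(current) / avg(human)) * 100,
  };
}
```

Running this on every evaluation pass means a prompt change that helps one task type while hurting another shows up as a named regression, not a wash in the aggregate score.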
```typescript
interface EvaluationReport {
  currentVersion: EvaluationResults;
  previousVersion: EvaluationResults;
  ruleBasedBaseline: EvaluationResults;
  humanBaseline: EvaluationResults;
  regressions: string[]; // Task types where current < previous
  improvements: string[]; // Task types where current > previous
  vsHuman: { percentage: number; gapAreas: string[] };
}
```

A good test suite is curated, not comprehensive. You want representative coverage of task types, not the maximum number of tests.
Stratified sampling. Categorize your production tasks into types (factual lookup, multi-step reasoning, tool orchestration, creative generation, refusals). Ensure your test suite has representation from each category proportional to production frequency.
Edge case inclusion. Explicitly include edge cases from known failure modes. If your agent has a history of failing on queries with ambiguous time references, include 5-10 of those. Test suites without edge cases produce artificially high scores that do not predict production behavior.
Golden set creation. For each test case, define what a correct output looks like. Not an exact string match (too brittle), but a rubric. "Correct output should contain the resolution deadline, reference the correct policy section, and not include the deprecated escalation path." This allows automated comparison against a standard.
```typescript
interface TestCase {
  id: string;
  category: TaskCategory;
  input: string;
  context?: Record<string, any>; // Injected context/tools state
  correctOutputRubric: {
    mustContain: string[]; // Required elements
    mustNotContain: string[]; // Prohibited elements
    sentiment?: "positive" | "neutral" | "negative";
    maxLength?: number;
    requiresCitation?: boolean;
  };
  isEdgeCase: boolean;
  difficulty: "easy" | "medium" | "hard";
  humanBaselineTime?: number; // Seconds a human expert takes
}
```

Aim for 100-200 test cases for a production agent. More is better if you can afford the evaluation runtime, but 100 well-chosen cases outperform 1,000 random ones.
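The string-matchable parts of the rubric can be checked automatically. A minimal sketch of what such a check could look like; the `sentiment` and `requiresCitation` fields are deliberately left out here because they need an LLM judge rather than substring matching, and all names are illustrative:

```typescript
interface OutputRubric {
  mustContain: string[];    // Required elements
  mustNotContain: string[]; // Prohibited elements
  maxLength?: number;
}

interface RubricResult {
  passed: boolean;
  missing: string[];   // Required elements not found in the output
  forbidden: string[]; // Prohibited elements found in the output
  tooLong: boolean;
}

// Case-insensitive substring checks. Real rubrics often need semantic
// matching ("references the correct policy section") via an LLM judge;
// this covers only the mechanically checkable fields.
function checkRubric(output: string, rubric: OutputRubric): RubricResult {
  const haystack = output.toLowerCase();
  const missing = rubric.mustContain.filter(s => !haystack.includes(s.toLowerCase()));
  const forbidden = rubric.mustNotContain.filter(s => haystack.includes(s.toLowerCase()));
  const tooLong = rubric.maxLength !== undefined && output.length > rubric.maxLength;
  return {
    passed: missing.length === 0 && forbidden.length === 0 && !tooLong,
    missing,
    forbidden,
    tooLong,
  };
}
```

The payoff of returning `missing` and `forbidden` rather than a bare boolean: a failing test case tells you which rubric element broke, not just that something did.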
Evaluation as a quarterly or pre-release activity is insufficient. Agent behavior drifts. Upstream data changes. User patterns shift. Problems that did not exist last month exist today.
Build continuous evaluation into the production pipeline:
```typescript
// Log every production interaction for evaluation
interface ProductionLog {
  sessionId: string;
  taskInput: string;
  agentSteps: AgentStep[];
  finalOutput: string;
  timestamp: Date;
  userId?: string;
  userFeedback?: { helpful: boolean; comment?: string };
  latencyMs: number;
  totalTokens: number;
  estimatedCostUsd: number;
}

// Sample 5% of production logs for automated evaluation
async function runContinuousEvaluation(logs: ProductionLog[]) {
  const sample = sampleLogs(logs, 0.05);
  const results = await Promise.all(
    sample.map(log => evaluateAgainstRubric(log))
  );
  const report = aggregateResults(results);

  // Alert if key metrics drop below thresholds
  if (report.autonomousCompletionRate < 0.85) {
    await alertTeam("Autonomous completion rate below threshold", report);
  }
  if (report.reasoningSoundnessRate < 0.90) {
    await alertTeam("Reasoning quality degradation detected", report);
  }
}
```

This connects directly to agent monitoring and observability, where continuous evaluation feeds into the alerting infrastructure that tells you when something is going wrong before users start complaining.
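The `sampleLogs` helper above is left undefined in the snippet. A sketch of one way to implement it, plus a stratified variant that guarantees rare task categories are never missed; the `minPerCategory` floor and the category accessor are assumptions:

```typescript
// Uniform random sample of a fraction of logs. Math.random() is fine for
// monitoring; use a seeded RNG if you need reproducible evaluation runs.
function sampleLogs<T>(logs: T[], fraction: number): T[] {
  return logs.filter(() => Math.random() < fraction);
}

// Stratified variant: take at least `minPerCategory` logs from every task
// category, so rare-but-important types always appear in the sample.
function stratifiedSample<T>(
  logs: T[],
  categoryOf: (log: T) => string,
  fraction: number,
  minPerCategory = 3 // Assumed floor; tune to your evaluation budget
): T[] {
  const byCategory = new Map<string, T[]>();
  for (const log of logs) {
    const cat = categoryOf(log);
    const bucket = byCategory.get(cat) ?? [];
    bucket.push(log);
    byCategory.set(cat, bucket);
  }
  const sampled: T[] = [];
  for (const bucket of byCategory.values()) {
    const n = Math.max(minPerCategory, Math.round(bucket.length * fraction));
    sampled.push(...bucket.slice(0, Math.min(n, bucket.length)));
  }
  return sampled;
}
```

With a plain 5% uniform sample, a category that makes up 1% of traffic shows up roughly once per 2,000 logs; the stratified version keeps it in every run.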
All technical metrics are proxies for the thing that actually matters: whether users find the agent useful.
Capture satisfaction signals at every opportunity.
Explicit feedback. Post-interaction rating (thumbs up/down or 1-5 stars). Low friction. High value. Even 20-30% response rates give you meaningful signal.
Implicit signals. Task abandonment rate. Session length after agent interaction. Return usage. Whether users copy agent outputs or retype them. These reveal satisfaction without requiring the user to rate anything.
Correction rate. How often do users edit agent outputs before using them? A high correction rate means the output is close but not quite right. A low correction rate means either the output is excellent or it is being used uncritically.
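Correction rate can be computed from logs that capture both what the agent produced and what the user actually used. A sketch using a crude word-overlap heuristic; real systems might use edit distance or embeddings, and the field names and the 0.9 threshold are illustrative assumptions:

```typescript
// Assumed log shape: the agent's output and the text the user finally used.
interface OutputUsage {
  agentOutput: string;
  finalUsedText: string;
}

// Fraction of shared words relative to the larger of the two word sets.
function wordOverlap(a: string, b: string): number {
  const wordsA = new Set(a.toLowerCase().split(/\s+/).filter(Boolean));
  const wordsB = new Set(b.toLowerCase().split(/\s+/).filter(Boolean));
  if (wordsA.size === 0 && wordsB.size === 0) return 1;
  let shared = 0;
  for (const w of wordsA) if (wordsB.has(w)) shared++;
  return shared / Math.max(wordsA.size, wordsB.size);
}

// A usage counts as "corrected" when the text the user kept differs
// materially from what the agent produced.
function correctionRate(usages: OutputUsage[], threshold = 0.9): number {
  const corrected = usages.filter(
    u => wordOverlap(u.agentOutput, u.finalUsedText) < threshold
  );
  return corrected.length / usages.length;
}
```

Trend matters more than the absolute number here: a correction rate drifting from 10% to 25% after a prompt change is a signal even if the threshold itself is rough.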
Satisfaction often reveals problems that technical metrics miss. An agent might have 90% task completion and 75% user satisfaction. The gap is a UX problem, not an AI problem. Maybe the output format is confusing. Maybe the agent asks too many clarifying questions. Maybe it is technically correct but communicates in a way that makes users feel talked down to.
User satisfaction is the metric you optimize for. Technical metrics are leading indicators that help you understand why satisfaction is where it is.
The human-in-the-loop patterns that you design will significantly affect satisfaction metrics. Getting those patterns right, knowing when to ask for help versus act autonomously, is often the difference between a tool users enjoy and one they tolerate.
All of these metrics are only useful if someone looks at them. Build a dashboard that surfaces actionable information, not just data.
The metrics worth tracking on a daily dashboard:
| Metric | Why It Matters | Alert Threshold |
|---|---|---|
| Autonomous completion rate | Core measure of reliability | < 85% |
| Avg completion steps | Efficiency signal | > 2 SD from baseline |
| Reasoning soundness rate | Quality of decisions | < 90% |
| User satisfaction (7-day) | What users actually think | < 4.0/5.0 |
| Edge case success rate | Robustness signal | < 70% |
| Regression count vs. prev version | Detects degradation | > 0 |
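The thresholds in the table can be encoded once and checked by the same continuous-evaluation job. A sketch; the step-count check is omitted because "2 SD from baseline" needs a rolling baseline rather than a fixed bound, and the metric names are illustrative:

```typescript
// One snapshot of the daily dashboard metrics (assumed shape).
interface MetricSnapshot {
  autonomousCompletionRate: number;
  reasoningSoundnessRate: number;
  userSatisfaction7d: number;
  edgeCaseSuccessRate: number;
  regressionCount: number;
}

// Thresholds from the dashboard table above.
const ALERT_THRESHOLDS: Array<{
  metric: keyof MetricSnapshot;
  below?: number; // Alert when the value drops under this
  above?: number; // Alert when the value exceeds this
  label: string;
}> = [
  { metric: "autonomousCompletionRate", below: 0.85, label: "Autonomous completion rate" },
  { metric: "reasoningSoundnessRate", below: 0.90, label: "Reasoning soundness rate" },
  { metric: "userSatisfaction7d", below: 4.0, label: "User satisfaction (7-day)" },
  { metric: "edgeCaseSuccessRate", below: 0.70, label: "Edge case success rate" },
  { metric: "regressionCount", above: 0, label: "Regression count vs. previous version" },
];

function checkThresholds(snapshot: MetricSnapshot): string[] {
  const alerts: string[] = [];
  for (const t of ALERT_THRESHOLDS) {
    const value = snapshot[t.metric];
    if (t.below !== undefined && value < t.below) alerts.push(`${t.label}: ${value} < ${t.below}`);
    if (t.above !== undefined && value > t.above) alerts.push(`${t.label}: ${value} > ${t.above}`);
  }
  return alerts;
}
```

Keeping the thresholds in data rather than scattered through `if` statements means the dashboard table and the alerting logic cannot silently drift apart.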
Evaluation is a steering wheel, not a speedometer. The goal is not to produce impressive numbers. It is to understand where the agent is failing and why, so you can fix the right things.
Q: How do you evaluate AI agent performance?
Evaluate agents across multiple dimensions: task completion rate (does it finish the job?), output quality (is the result correct and useful?), efficiency (cost and time per task), reliability (consistency across runs), and safety (does it stay within boundaries?). Use automated benchmarks plus human evaluation for subjective quality.
Q: What benchmarks exist for AI coding agents?
Key benchmarks include SWE-bench (real GitHub issue resolution), HumanEval (code generation), MBPP (Python programming), and custom project-specific evaluations. SWE-bench is the most relevant for production coding agents as it tests real-world software engineering tasks end-to-end.
Q: How often should you evaluate AI agents?
Evaluate continuously: automated quality checks on every agent output, weekly aggregate metrics review, monthly benchmark comparisons, and quarterly capability assessments. Model updates, prompt changes, and new tool additions all warrant immediate re-evaluation against your benchmark suite.