Written by Gareth Simono, Founder and CEO of Agentik {OS}. Full-stack developer and AI architect with years of experience shipping production applications across SaaS, mobile, and enterprise platforms. Gareth orchestrates 267 specialized AI agents to deliver production software 10x faster than traditional development teams.
You cannot assertEquals your way through agent testing. Here's how to build evaluation frameworks that actually measure quality in non-deterministic systems.

You cannot assertEquals your way through AI agent testing. I've watched teams try. Beautiful test suites. Hundreds of carefully crafted assertions. Deterministic expectations applied to a non-deterministic system.
Different failure every run. Different output every run, actually. And all of them valid.
The fundamental problem with applying traditional QA to agents: the same input can produce different but equally correct outputs. "Summarize this document" yields ten valid summaries. "Write a function to sort this array" returns merge sort, quicksort, bubble sort, or something creative. Traditional testing assumes one right answer. Agent testing cannot.
This isn't a limitation you work around. It's a property you design for. And once you accept it, a completely different and actually better testing methodology emerges.
The first and most important shift: stop asking "does the output match this expected string" and start asking "does the output satisfy these requirements."
For a customer support agent, the requirements might include answering the question that was actually asked, staying grounded in the provided knowledge base, respecting length and format constraints, and never revealing internal information.
Each requirement becomes a scoring function. Some automate easily: length validation, format checking, forbidden content detection, presence of required elements. Others require an LLM evaluator that scores output against detailed rubrics. The best systems use both.
```typescript
interface EvaluationCriteria {
  id: string;
  name: string;
  weight: number; // Relative importance
  evaluator: CriteriaEvaluator;
}

type CriteriaEvaluator =
  | DeterministicEvaluator // Regex, length checks, format validation
  | HeuristicEvaluator // Pattern matching, contradiction detection
  | LLMEvaluator; // LLM-graded against rubric

interface EvaluationResult {
  criteriaId: string;
  score: number; // 0-1
  reasoning: string; // Why this score
  examples: string[]; // Specific evidence from the output
}

async function evaluateAgentOutput(
  input: AgentInput,
  output: AgentOutput,
  criteria: EvaluationCriteria[]
): Promise<EvaluationReport> {
  const results = await Promise.all(
    criteria.map(c => c.evaluator.evaluate(input, output, c))
  );
  const weightedScore = results.reduce(
    (sum, r, i) => sum + r.score * criteria[i].weight,
    0
  ) / criteria.reduce((sum, c) => sum + c.weight, 0);
  return {
    overallScore: weightedScore,
    criteriaResults: results,
    passed: weightedScore >= PASSING_THRESHOLD,
    timestamp: new Date(),
  };
}
```

This is harder to write than expect(output).toBe(expected). It's also fundamentally more honest about what you're actually testing: not whether the agent produces identical output, but whether it produces good output.
A production-ready evaluation framework has three distinct layers, each serving a different purpose.
Layer 1: deterministic checks. Fast, cheap, run on every single output. These catch obvious failures instantly.
These should run in milliseconds. They should never fail in ways that require investigation. Either the output is valid JSON or it isn't.
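As a sketch, Layer 1 might look like this in TypeScript. The check names, the 4,000-character ceiling, and the forbidden-content patterns are illustrative assumptions, not part of any specific framework:

```typescript
interface Layer1Result {
  check: string;
  passed: boolean;
}

function runLayer1Checks(output: string): Layer1Result[] {
  const checks: Array<[string, (o: string) => boolean]> = [
    // Output must be non-empty and under an assumed length ceiling.
    ["length", o => o.length > 0 && o.length <= 4000],
    // Output must parse as JSON (assumes a JSON-producing agent).
    ["valid-json", o => { try { JSON.parse(o); return true; } catch { return false; } }],
    // Internal markers must never leak into a response.
    ["no-forbidden-content", o => !/INTERNAL_ONLY|api[_-]?key/i.test(o)],
  ];
  return checks.map(([check, fn]) => ({ check, passed: fn(output) }));
}
```

Each check is a pure predicate, so the whole layer runs in microseconds and its verdicts never need interpretation.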
Layer 2: heuristic checks. Slower and more nuanced, but still automated. These catch systematic problems.
Run these on every output, but they can be async. Failure doesn't block immediate response but triggers review and alerting.
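A minimal sketch of what Layer 2 heuristics can look like. The repetition, boilerplate, and contradiction rules below are illustrative stand-ins; real suites accumulate dozens of these patterns over time:

```typescript
interface HeuristicFlag {
  rule: string;
  detail: string;
}

function runLayer2Heuristics(output: string): HeuristicFlag[] {
  const flags: HeuristicFlag[] = [];
  // Degeneration check: the same sentence repeated verbatim.
  const sentences = output
    .split(/[.!?]+/)
    .map(s => s.trim().toLowerCase())
    .filter(s => s.length > 0);
  const seen = new Set<string>();
  for (const s of sentences) {
    if (seen.has(s)) {
      flags.push({ rule: "repetition", detail: s });
      break;
    }
    seen.add(s);
  }
  // Boilerplate self-reference that rarely belongs in a final answer.
  if (/as an ai (language )?model/i.test(output)) {
    flags.push({ rule: "boilerplate", detail: "model self-reference" });
  }
  // Crude contradiction marker: refusing and complying in one response.
  if (/\bnot possible\b/i.test(output) && /\bhere is how\b/i.test(output)) {
    flags.push({ rule: "contradiction", detail: "refuses, then performs the task" });
  }
  return flags;
}
```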
Layer 3: LLM-graded evaluation. The most expensive layer: a separate LLM evaluates output against detailed rubrics. This is where you catch nuanced quality issues that patterns miss.
```typescript
const FACTUAL_ACCURACY_RUBRIC = `
You are evaluating whether an AI agent's response is factually accurate.
Given the context provided to the agent and the agent's response:
Score 5: All factual claims are accurate and well-supported by the context.
Score 4: Minor inaccuracies that don't affect the core answer.
Score 3: Some inaccuracies present but the overall answer is directionally correct.
Score 2: Significant inaccuracies that could mislead the user.
Score 1: Primarily inaccurate or contradicts the provided context directly.
Context provided: {context}
Agent response: {response}
Provide your score (1-5) and cite specific claims and whether they are supported by the context.
`;

async function gradedEvaluation(
  context: string,
  response: string,
  rubric: string
): Promise<GradedResult> {
  const evalPrompt = rubric
    .replace("{context}", context)
    .replace("{response}", response);
  const evaluation = await evaluatorModel.complete(evalPrompt);
  return parseGradedResult(evaluation);
}
```

Run Layer 3 on samples rather than every output. 10-20% sampling gives solid signal at manageable cost. Run full evaluation on every output before deployments and on flagged outputs from Layers 1 and 2.
The three layers work as a funnel. Layer 1 filters garbage. Layer 2 catches systematic problems. Layer 3 measures quality. Run cheapest first, escalate when needed.
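The funnel can be sketched as a single escalation function. The layer1/layer2/layer3 callbacks and the sampling logic below are assumptions about how the pieces plug together, not a fixed API:

```typescript
type LayerVerdict = { layer: 1 | 2 | 3; passed: boolean };

async function evaluateFunnel(
  output: string,
  layer1: (o: string) => boolean,
  layer2: (o: string) => boolean,
  layer3: (o: string) => Promise<boolean>,
  sampleRate: number,
  rng: () => number = Math.random // injectable for deterministic tests
): Promise<LayerVerdict> {
  // Layer 1: cheapest checks first. Garbage stops here.
  if (!layer1(output)) return { layer: 1, passed: false };
  // Layer 2: heuristic checks. A failure escalates straight to Layer 3.
  const escalate = !layer2(output);
  // Layer 3: LLM grading, run on escalations plus a random sample.
  if (escalate || rng() < sampleRate) {
    return { layer: 3, passed: await layer3(output) };
  }
  return { layer: 2, passed: true };
}
```

Most outputs exit at Layer 2 having cost nothing but CPU time; only escalations and the sampled fraction ever pay for an LLM call.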
Your evaluation is only as good as your test cases. This is where most teams dramatically underinvest, then wonder why their eval suite gives them false confidence.
Production-quality evaluation requires several hundred test cases minimum. Each test case should include the input itself, any relevant context, the evaluation criteria that apply, a difficulty rating, category tags, and a record of where it came from and why it was added.
Build from real usage, not imagination.
Production logs are your best source. Real user queries expose scenarios you'd never think to create. Every user-reported problem becomes a test case. Every surprisingly good response gets captured. Every discovered failure mode spawns multiple variants testing the same edge.
Failure mode mining is especially valuable. If the agent failed on input X, what's the class of inputs similar to X? Create 5-10 variants testing that class. Your eval suite should map failure modes as exhaustively as it maps success cases.
Adversarial inputs need their own category. Inputs designed to confuse, manipulate, or break the agent. Prompt injection attempts. Contradictory instructions. Ambiguous requests where the agent should ask for clarification rather than guess.
Version your dataset rigorously. Every addition, modification, and removal tracked. When eval scores change, you need to know whether the agent changed or the tests changed.
```typescript
interface TestCase {
  id: string;
  version: number;
  createdAt: Date;
  source: "production" | "manual" | "generated" | "adversarial";
  input: AgentInput;
  context?: RelevantContext;
  criteria: EvaluationCriteria[];
  difficulty: "easy" | "medium" | "hard" | "adversarial";
  categories: string[];
  // Optional: reference of what excellent looks like
  referenceOutput?: string;
  // Tracking
  addedReason: string;
  relatedIncident?: string; // If added because of a failure
}
```

Traditional regression testing: previously working functionality still works after changes. For agents this is complicated by non-determinism. The agent produces different output every run. How do you know if a change is a regression or just normal variation?
The answer: aggregate scoring.
Run the full eval suite before and after any change. Individual test case results will vary. That's expected. What matters is the distribution of scores across the entire suite.
Set threshold policies: decide in advance how much aggregate score movement counts as noise, how much counts as a regression that blocks deployment, and which critical categories tolerate no drop at all.
Run the suite multiple times per change. Non-determinism means single runs are unreliable. Three to five runs give solid signal. Expensive. Cheaper than deploying a regression that affects real users.
```typescript
async function runRegressionTest(
  baseline: EvalResults,
  candidate: AgentVersion,
  suite: TestCase[],
  runs: number = 3
): Promise<RegressionReport> {
  const results: EvalResults[] = [];
  for (let i = 0; i < runs; i++) {
    results.push(await runEvalSuite(candidate, suite));
  }
  const aggregated = aggregateResults(results);
  const comparison = compareToBaseline(baseline, aggregated);
  return {
    approved: comparison.regressions.length === 0,
    improvements: comparison.improvements,
    regressions: comparison.regressions,
    neutral: comparison.neutral,
    recommendation: comparison.regressions.length === 0
      ? "safe to deploy"
      : `blocked by regressions in: ${comparison.regressions.join(", ")}`
  };
}
```

Agents handle the happy path well. Every demo proves this. The production killers are inputs nobody anticipated during development.
Ambiguous queries are the most common. "Fix the thing from yesterday." What thing? Which yesterday? A good agent asks for clarification. A bad one guesses confidently and acts on wrong assumptions.
Build an entire test category around ambiguity. For each ambiguous input, the expected behavior is clarification-seeking. Score the response on whether it identified the ambiguity and what clarifying question it asked.
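A rough heuristic for that scoring, assuming clarification shows up as a question containing common clarifying markers. In practice this is a first-pass filter; nuanced cases go to an LLM grader:

```typescript
// Returns true when a response looks like clarification-seeking rather
// than a confident guess. The marker list is an illustrative assumption.
function scoresAsClarification(response: string): boolean {
  const asksQuestion = response.includes("?");
  const clarifyMarkers =
    /\b(which|what exactly|could you (clarify|specify)|do you mean|can you confirm)\b/i;
  return asksQuestion && clarifyMarkers.test(response);
}
```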
Contradictory instructions expose reasoning quality. "Make it shorter but include all details." "Be casual but professional." The agent needs to recognize the tension and negotiate rather than arbitrarily choosing one side.
Out-of-scope requests test boundary enforcement. A coding agent asked to write poetry. A customer support agent asked for financial advice. A research agent asked to make purchases. These should be graceful, helpful declines, not hallucinated attempts.
Adversarial inputs are where you find the scary stuff. Prompt injection attempts. Social engineering trying to expand permissions. Instructions embedded in data the agent is processing. Build a dedicated adversarial test suite and update it continuously as new attack vectors are discovered. This overlaps directly with agent security considerations.
Long context degradation is insidious. Agent performs well on short inputs. Performance degrades as context grows. This is common with complex multi-turn tasks and RAG-heavy architectures. Test explicitly with varying context lengths.
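One way to sketch those tests: expand a single case into variants at increasing context sizes by padding with distractor documents. The characters-divided-by-four token approximation is a rough assumption:

```typescript
// Build variants of one question at several target context sizes so the
// same eval criteria can be scored across context lengths.
function buildContextLengthVariants(
  question: string,
  relevantContext: string,
  distractors: string[],
  targetTokenSizes: number[]
): Array<{ question: string; context: string; approxTokens: number }> {
  return targetTokenSizes.map(target => {
    let context = relevantContext;
    let i = 0;
    // Pad with distractor documents until roughly the target size.
    while (context.length / 4 < target && distractors.length > 0) {
      context += "\n\n" + distractors[i % distractors.length];
      i += 1;
    }
    return { question, context, approxTokens: Math.round(context.length / 4) };
  });
}
```

If scores fall as approxTokens rises while the question and the relevant context stay fixed, you have isolated long-context degradation rather than a harder task.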
The adversarial test suite isn't optional for production. It's the difference between an agent that's safe and one that will eventually be weaponized.
Pre-deployment testing is necessary but insufficient. The real signal comes from production.
Sample 10-20% of production interactions automatically and run them through the evaluation pipeline. This catches quality drift, novel failure modes, and the scenarios your static suite never anticipated.
Build feedback loops that flow production signal back into evaluation.
```typescript
interface ProductionEvalPipeline {
  // Sampling strategy
  samplingRate: number; // 0.1 = 10% of interactions
  prioritySampling: SamplingRule[]; // Always sample certain categories
  // Evaluation
  evaluators: EvaluationCriteria[];
  batchSize: number;
  // Feedback loops
  flagThreshold: number; // Score below this triggers review
  datasetContribution: boolean; // Auto-add flagged cases to test suite
  alertThresholds: AlertRule[];
}

// Auto-add production failures to eval dataset
async function processProductionSample(
  interaction: ProductionInteraction,
  evalResult: EvaluationReport,
  pipeline: ProductionEvalPipeline
): Promise<void> {
  if (evalResult.overallScore < pipeline.flagThreshold) {
    // Add to review queue
    await flagForReview(interaction, evalResult);
    // Auto-add to dataset if pattern is new
    if (pipeline.datasetContribution) {
      const isNovelFailure = await checkNovelty(interaction, evalResult);
      if (isNovelFailure) {
        await addToEvalDataset(interaction, evalResult);
      }
    }
  }
}
```

The production evaluation pipeline is where your evaluation system learns. Static suites can only test what you anticipated. Production samples expose what you missed.
Using an LLM to evaluate another LLM's output introduces bias. The evaluator model has its own style preferences, its own knowledge cutoffs, its own failure modes.
Several mitigations:
Use a different model family as evaluator. If Claude generates outputs, use GPT-4 as evaluator and vice versa. Different training reduces the chance that generator and evaluator share the same blind spots.
Calibrate evaluators against human judgments. Have humans rate a sample of outputs. Compare evaluator scores to human scores. Calibrate the rubric until evaluator and human scores correlate strongly.
Use multiple evaluator models. Average scores from different evaluators. Different models flag different issues. Ensemble evaluation is more reliable than single-model evaluation.
Make rubrics specific. Vague rubrics produce vague scores that correlate poorly with quality. Specific rubrics with concrete examples produce consistent, useful scores.
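The calibration step can be sketched as a plain Pearson correlation between evaluator and human scores, with an illustrative acceptance threshold:

```typescript
// Pearson correlation coefficient between two equal-length score arrays.
function pearson(xs: number[], ys: number[]): number {
  const mean = (a: number[]) => a.reduce((s, v) => s + v, 0) / a.length;
  const mx = mean(xs);
  const my = mean(ys);
  let num = 0, dx = 0, dy = 0;
  for (let i = 0; i < xs.length; i++) {
    num += (xs[i] - mx) * (ys[i] - my);
    dx += (xs[i] - mx) ** 2;
    dy += (ys[i] - my) ** 2;
  }
  return num / Math.sqrt(dx * dy);
}

// An assumed acceptance gate: trust the evaluator only when it tracks
// human judgment closely. The 0.8 threshold is illustrative.
const isCalibrated = (evalScores: number[], humanScores: number[]): boolean =>
  pearson(evalScores, humanScores) >= 0.8;
```

Re-run the check after every rubric change; if correlation drops, the rubric edit made the evaluator worse at mirroring human judgment regardless of how reasonable it reads.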
The teams I've seen ship and maintain production agents reliably share a common trait: they invested heavily in evaluation infrastructure early and never stopped improving it.
With solid testing: you deploy changes with confidence instead of hope, you diagnose problems from evaluation reports instead of raw logs, and you settle quality debates with score distributions instead of opinions.
Without it: you're deploying changes and hoping. You're diagnosing problems by reading logs. You're arguing about whether things got better based on vibes.
The initial investment in evaluation infrastructure is significant. The compounding returns make it one of the highest-leverage decisions in any agent project.
Pair your testing framework with robust agent monitoring and observability to catch issues that only emerge in production patterns.
Q: How do you test AI agents?
Testing AI agents requires a layered approach: unit tests for individual tool calls and decision logic, integration tests for multi-step workflows, evaluation benchmarks for output quality, chaos tests for error recovery, and human evaluation for judgment-dependent tasks. Unlike traditional software, agent tests must account for non-deterministic outputs.
Q: What makes testing AI agents different from testing traditional software?
Agent testing differs because outputs are non-deterministic (same input can produce different valid outputs), agents make autonomous decisions that affect subsequent steps, and quality is subjective for many tasks. Effective agent testing uses evaluation criteria (did the output meet the goal?) rather than exact output matching.
Q: What is an agent evaluation framework?
An agent evaluation framework measures agent performance across multiple dimensions: task completion rate, output quality, cost efficiency, latency, error recovery success, and safety compliance. Frameworks like SWE-bench for coding or custom evaluation suites provide standardized benchmarks for comparing agent performance.