Written by Gareth Simono, Founder and CEO of Agentik {OS}. Full-stack developer and AI architect with years of experience shipping production applications across SaaS, mobile, and enterprise platforms. Gareth orchestrates 267 specialized AI agents to deliver production software 10x faster than traditional development teams.
You cannot assertEquals your way through agent testing. Here's how to build evaluation frameworks that actually measure quality in non-deterministic systems.

You cannot assertEquals your way through AI agent testing. I've watched teams try. Beautiful test suites. Hundreds of carefully crafted assertions. Deterministic expectations applied to a non-deterministic system.
Different failure every run. Different output every run, actually. And all of them valid.
The fundamental problem with applying traditional QA to agents: the same input can produce different but equally correct outputs. "Summarize this document" yields ten valid summaries. "Write a function to sort this array" returns merge sort, quicksort, bubble sort, or something creative. Traditional testing assumes one right answer. Agent testing cannot.
This isn't a limitation you work around. It's a property you design for. And once you accept it, a completely different and actually better testing methodology emerges.
The first and most important shift: stop asking "does the output match this expected string" and start asking "does the output satisfy these requirements."
For a customer support agent, the requirements might include answering the question that was actually asked, staying grounded in the provided knowledge base, respecting length and format constraints, and never revealing internal information.
Each requirement becomes a scoring function. Some automate easily: length validation, format checking, forbidden content detection, presence of required elements. Others require an LLM evaluator that scores output against detailed rubrics. The best systems use both.
```typescript
interface EvaluationCriteria {
  id: string;
  name: string;
  weight: number; // Relative importance
  evaluator: CriteriaEvaluator;
}

type CriteriaEvaluator =
  | DeterministicEvaluator // Regex, length checks, format validation
  | HeuristicEvaluator // Pattern matching, contradiction detection
  | LLMEvaluator; // LLM-graded against rubric

interface EvaluationResult {
  criteriaId: string;
  score: number; // 0-1
  reasoning: string; // Why this score
  examples: string[]; // Specific evidence from the output
}

async function evaluateAgentOutput(
  input: AgentInput,
  output: AgentOutput,
  criteria: EvaluationCriteria[]
): Promise<EvaluationReport> {
  const results = await Promise.all(
    criteria.map(c => c.evaluator.evaluate(input, output, c))
  );
  const weightedScore = results.reduce(
    (sum, r, i) => sum + r.score * criteria[i].weight,
    0
  ) / criteria.reduce((sum, c) => sum + c.weight, 0);
  return {
    overallScore: weightedScore,
    criteriaResults: results,
    passed: weightedScore >= PASSING_THRESHOLD,
    timestamp: new Date(),
  };
}
```

This is harder to write than expect(output).toBe(expected). It's also fundamentally more honest about what you're actually testing: not whether the agent produces identical output, but whether it produces good output.
A production-ready evaluation framework has three distinct layers, each serving a different purpose.
Layer 1: deterministic checks. Fast, cheap, run on every single output. These catch obvious failures instantly.
These should run in milliseconds. They should never fail in ways that require investigation. Either the output is valid JSON or it isn't.
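As a sketch, Layer 1 might look like this in TypeScript. The check names, the 4,000-character ceiling, and the forbidden-content patterns are illustrative assumptions, not part of any specific framework:

```typescript
interface Layer1Result {
  check: string;
  passed: boolean;
}

function runLayer1Checks(output: string): Layer1Result[] {
  const checks: Array<[string, (o: string) => boolean]> = [
    // Output must be non-empty and under an assumed length ceiling.
    ["length", o => o.length > 0 && o.length <= 4000],
    // Output must parse as JSON (assumes a JSON-producing agent).
    ["valid-json", o => { try { JSON.parse(o); return true; } catch { return false; } }],
    // Internal markers must never leak into a response.
    ["no-forbidden-content", o => !/INTERNAL_ONLY|api[_-]?key/i.test(o)],
  ];
  return checks.map(([check, fn]) => ({ check, passed: fn(output) }));
}
```

Each check is a pure predicate, so the whole layer runs in microseconds and its verdicts never need interpretation.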
Layer 2: heuristic checks. Slower and more nuanced, but still automated. These catch systematic problems.
Run these on every output, but they can be async. Failure doesn't block immediate response but triggers review and alerting.
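A minimal sketch of what Layer 2 heuristics can look like. The repetition, boilerplate, and contradiction rules below are illustrative stand-ins; real suites accumulate dozens of these patterns over time:

```typescript
interface HeuristicFlag {
  rule: string;
  detail: string;
}

function runLayer2Heuristics(output: string): HeuristicFlag[] {
  const flags: HeuristicFlag[] = [];
  // Degeneration check: the same sentence repeated verbatim.
  const sentences = output
    .split(/[.!?]+/)
    .map(s => s.trim().toLowerCase())
    .filter(s => s.length > 0);
  const seen = new Set<string>();
  for (const s of sentences) {
    if (seen.has(s)) {
      flags.push({ rule: "repetition", detail: s });
      break;
    }
    seen.add(s);
  }
  // Boilerplate self-reference that rarely belongs in a final answer.
  if (/as an ai (language )?model/i.test(output)) {
    flags.push({ rule: "boilerplate", detail: "model self-reference" });
  }
  // Crude contradiction marker: refusing and complying in one response.
  if (/\bnot possible\b/i.test(output) && /\bhere is how\b/i.test(output)) {
    flags.push({ rule: "contradiction", detail: "refuses, then performs the task" });
  }
  return flags;
}
```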
Layer 3: LLM-graded evaluation. The most expensive layer: a separate LLM evaluates output against detailed rubrics. This is where you catch nuanced quality issues that patterns miss.
```typescript
const FACTUAL_ACCURACY_RUBRIC = `
You are evaluating whether an AI agent's response is factually accurate.
Given the context provided to the agent and the agent's response:
Score 5: All factual claims are accurate and well-supported by the context.
Score 4: Minor inaccuracies that don't affect the core answer.
Score 3: Some inaccuracies present but the overall answer is directionally correct.
Score 2: Significant inaccuracies that could mislead the user.
Score 1: Primarily inaccurate or contradicts the provided context directly.
Context provided: {context}
Agent response: {response}
Provide your score (1-5) and cite specific claims and whether they are supported by the context.
`;

async function gradedEvaluation(
  context: string,
  response: string,
  rubric: string
): Promise<GradedResult> {
  const evalPrompt = rubric
    .replace("{context}", context)
    .replace("{response}", response);
  const evaluation = await evaluatorModel.complete(evalPrompt);
  return parseGradedResult(evaluation);
}
```

Run Layer 3 on samples rather than every output. 10-20% sampling gives solid signal at manageable cost. Run full evaluation on every output before deployments and on flagged outputs from Layers 1 and 2.
The three layers work as a funnel. Layer 1 filters garbage. Layer 2 catches systematic problems. Layer 3 measures quality. Run cheapest first, escalate when needed.
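The funnel can be sketched as a single escalation function. The layer1/layer2/layer3 callbacks and the sampling logic below are assumptions about how the pieces plug together, not a fixed API:

```typescript
type LayerVerdict = { layer: 1 | 2 | 3; passed: boolean };

async function evaluateFunnel(
  output: string,
  layer1: (o: string) => boolean,
  layer2: (o: string) => boolean,
  layer3: (o: string) => Promise<boolean>,
  sampleRate: number,
  rng: () => number = Math.random // injectable for deterministic tests
): Promise<LayerVerdict> {
  // Layer 1: cheapest checks first. Garbage stops here.
  if (!layer1(output)) return { layer: 1, passed: false };
  // Layer 2: heuristic checks. A failure escalates straight to Layer 3.
  const escalate = !layer2(output);
  // Layer 3: LLM grading, run on escalations plus a random sample.
  if (escalate || rng() < sampleRate) {
    return { layer: 3, passed: await layer3(output) };
  }
  return { layer: 2, passed: true };
}
```

Most outputs exit at Layer 2 having cost nothing but CPU time; only escalations and the sampled fraction ever pay for an LLM call.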
Your evaluation is only as good as your test cases. This is where most teams dramatically underinvest, then wonder why their eval suite gives them false confidence.
Production-quality evaluation requires several hundred test cases minimum. Each test case should include the input itself, any relevant context, the evaluation criteria that apply, a difficulty rating, category tags, and a record of where it came from and why it was added.
Build from real usage, not imagination.
Production logs are your best source. Real user queries expose scenarios you'd never think to create. Every user-reported problem becomes a test case. Every surprisingly good response gets captured. Every discovered failure mode spawns multiple variants testing the same edge.
Failure mode mining is especially valuable. If the agent failed on input X, what's the class of inputs similar to X? Create 5-10 variants testing that class. Your eval suite should map failure modes as exhaustively as it maps success cases.
Adversarial inputs need their own category. Inputs designed to confuse, manipulate, or break the agent. Prompt injection attempts. Contradictory instructions. Ambiguous requests where the agent should ask for clarification rather than guess.
Version your dataset rigorously. Every addition, modification, and removal tracked. When eval scores change, you need to know whether the agent changed or the tests changed.
```typescript
interface TestCase {
  id: string;
  version: number;
  createdAt: Date;
  source: "production" | "manual" | "generated" | "adversarial";
  input: AgentInput;
  context?: RelevantContext;
  criteria: EvaluationCriteria[];
  difficulty: "easy" | "medium" | "hard" | "adversarial";
  categories: string[];
  // Optional: reference of what excellent looks like
  referenceOutput?: string;
  // Tracking
  addedReason: string;
  relatedIncident?: string; // If added because of a failure
}
```

Traditional regression testing: previously working functionality still works after changes. For agents this is complicated by non-determinism. The agent produces different output every run. How do you know if a change is a regression or just normal variation?
The answer: aggregate scoring.
Run the full eval suite before and after any change. Individual test case results will vary. That's expected. What matters is the distribution of scores across the entire suite.
Set threshold policies: decide in advance how much aggregate score movement counts as noise, how much counts as a regression that blocks deployment, and which critical categories tolerate no drop at all.
Run the suite multiple times per change. Non-determinism means single runs are unreliable. Three to five runs give solid signal. Expensive. Cheaper than deploying a regression that affects real users.
```typescript
async function runRegressionTest(
  baseline: EvalResults,
  candidate: AgentVersion,
  suite: TestCase[],
  runs: number = 3
): Promise<RegressionReport> {
  const results: EvalResults[] = [];
  for (let i = 0; i < runs; i++) {
    results.push(await runEvalSuite(candidate, suite));
  }
  const aggregated = aggregateResults(results);
  const comparison = compareToBaseline(baseline, aggregated);
  return {
    approved: comparison.regressions.length === 0,
    improvements: comparison.improvements,
    regressions: comparison.regressions,
    neutral: comparison.neutral,
    recommendation: comparison.regressions.length === 0
      ? "safe to deploy"
      : `blocked by regressions in: ${comparison.regressions.join(", ")}`
  };
}
```

Agents handle the happy path well. Every demo proves this. The production killers are inputs nobody anticipated during development.
Ambiguous queries are the most common. "Fix the thing from yesterday." What thing? Which yesterday? A good agent asks for clarification. A bad one guesses confidently and acts on wrong assumptions.
Build an entire test category around ambiguity. For each ambiguous input, the expected behavior is clarification-seeking. Score the response on whether it identified the ambiguity and what clarifying question it asked.
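A rough heuristic for that scoring, assuming clarification shows up as a question containing common clarifying markers. In practice this is a first-pass filter; nuanced cases go to an LLM grader:

```typescript
// Returns true when a response looks like clarification-seeking rather
// than a confident guess. The marker list is an illustrative assumption.
function scoresAsClarification(response: string): boolean {
  const asksQuestion = response.includes("?");
  const clarifyMarkers =
    /\b(which|what exactly|could you (clarify|specify)|do you mean|can you confirm)\b/i;
  return asksQuestion && clarifyMarkers.test(response);
}
```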
Contradictory instructions expose reasoning quality. "Make it shorter but include all details." "Be casual but professional." The agent needs to recognize the tension and negotiate rather than arbitrarily choosing one side.
Out-of-scope requests test boundary enforcement. A coding agent asked to write poetry. A customer support agent asked for financial advice. A research agent asked to make purchases. These should be graceful, helpful declines, not hallucinated attempts.
Adversarial inputs are where you find the scary stuff. Prompt injection attempts. Social engineering trying to expand permissions. Instructions embedded in data the agent is processing. Build a dedicated adversarial test suite and update it continuously as new attack vectors are discovered. This overlaps directly with agent security considerations.
Long context degradation is insidious. Agent performs well on short inputs. Performance degrades as context grows. This is common with complex multi-turn tasks and RAG-heavy architectures. Test explicitly with varying context lengths.
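One way to sketch those tests: expand a single case into variants at increasing context sizes by padding with distractor documents. The characters-divided-by-four token approximation is a rough assumption:

```typescript
// Build variants of one question at several target context sizes so the
// same eval criteria can be scored across context lengths.
function buildContextLengthVariants(
  question: string,
  relevantContext: string,
  distractors: string[],
  targetTokenSizes: number[]
): Array<{ question: string; context: string; approxTokens: number }> {
  return targetTokenSizes.map(target => {
    let context = relevantContext;
    let i = 0;
    // Pad with distractor documents until roughly the target size.
    while (context.length / 4 < target && distractors.length > 0) {
      context += "\n\n" + distractors[i % distractors.length];
      i += 1;
    }
    return { question, context, approxTokens: Math.round(context.length / 4) };
  });
}
```

If scores fall as approxTokens rises while the question and the relevant context stay fixed, you have isolated long-context degradation rather than a harder task.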
The adversarial test suite isn't optional for production. It's the difference between an agent that's safe and one that will eventually be weaponized.
Pre-deployment testing is necessary but insufficient. The real signal comes from production.
Sample 10-20% of production interactions automatically and run them through the evaluation pipeline. This catches quality drift, novel failure modes, and the scenarios your static suite never anticipated.
Build feedback loops that flow production signal back into evaluation.
```typescript
interface ProductionEvalPipeline {
  // Sampling strategy
  samplingRate: number; // 0.1 = 10% of interactions
  prioritySampling: SamplingRule[]; // Always sample certain categories
  // Evaluation
  evaluators: EvaluationCriteria[];
  batchSize: number;
  // Feedback loops
  flagThreshold: number; // Score below this triggers review
  datasetContribution: boolean; // Auto-add flagged cases to test suite
  alertThresholds: AlertRule[];
}

// Auto-add production failures to eval dataset
async function processProductionSample(
  interaction: ProductionInteraction,
  evalResult: EvaluationReport,
  pipeline: ProductionEvalPipeline
): Promise<void> {
  if (evalResult.overallScore < pipeline.flagThreshold) {
    // Add to review queue
    await flagForReview(interaction, evalResult);
    // Auto-add to dataset if pattern is new
    if (pipeline.datasetContribution) {
      const isNovelFailure = await checkNovelty(interaction, evalResult);
      if (isNovelFailure) {
        await addToEvalDataset(interaction, evalResult);
      }
    }
  }
}
```

The production evaluation pipeline is where your evaluation system learns. Static suites can only test what you anticipated. Production samples expose what you missed.
Using an LLM to evaluate another LLM's output introduces bias. The evaluator model has its own style preferences, its own knowledge cutoffs, its own failure modes.
Several mitigations:
Use a different model family as evaluator. If Claude generates outputs, use GPT-4 as evaluator and vice versa. Different training reduces the chance that generator and evaluator share the same blind spots.
Calibrate evaluators against human judgments. Have humans rate a sample of outputs. Compare evaluator scores to human scores. Calibrate the rubric until evaluator and human scores correlate strongly.
Use multiple evaluator models. Average scores from different evaluators. Different models flag different issues. Ensemble evaluation is more reliable than single-model evaluation.
Make rubrics specific. Vague rubrics produce vague scores that correlate poorly with quality. Specific rubrics with concrete examples produce consistent, useful scores.
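The calibration step can be sketched as a plain Pearson correlation between evaluator and human scores, with an illustrative acceptance threshold:

```typescript
// Pearson correlation coefficient between two equal-length score arrays.
function pearson(xs: number[], ys: number[]): number {
  const mean = (a: number[]) => a.reduce((s, v) => s + v, 0) / a.length;
  const mx = mean(xs);
  const my = mean(ys);
  let num = 0, dx = 0, dy = 0;
  for (let i = 0; i < xs.length; i++) {
    num += (xs[i] - mx) * (ys[i] - my);
    dx += (xs[i] - mx) ** 2;
    dy += (ys[i] - my) ** 2;
  }
  return num / Math.sqrt(dx * dy);
}

// An assumed acceptance gate: trust the evaluator only when it tracks
// human judgment closely. The 0.8 threshold is illustrative.
const isCalibrated = (evalScores: number[], humanScores: number[]): boolean =>
  pearson(evalScores, humanScores) >= 0.8;
```

Re-run the check after every rubric change; if correlation drops, the rubric edit made the evaluator worse at mirroring human judgment regardless of how reasonable it reads.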
The teams I've seen ship and maintain production agents reliably share a common trait: they invested heavily in evaluation infrastructure early and never stopped improving it.
With solid testing: you deploy changes with confidence instead of hope, you diagnose problems from evaluation reports instead of raw logs, and you settle quality debates with score distributions instead of opinions.
Without it: you're deploying changes and hoping. You're diagnosing problems by reading logs. You're arguing about whether things got better based on vibes.
The initial investment in evaluation infrastructure is significant. The compounding returns make it one of the highest-leverage decisions in any agent project.
Pair your testing framework with robust agent monitoring and observability to catch issues that only emerge in production patterns.
Q: How do you test AI agents?
Testing AI agents requires a layered approach: unit tests for individual tool calls and decision logic, integration tests for multi-step workflows, evaluation benchmarks for output quality, chaos tests for error recovery, and human evaluation for judgment-dependent tasks. Unlike traditional software, agent tests must account for non-deterministic outputs.
Q: What makes testing AI agents different from testing traditional software?
Agent testing differs because outputs are non-deterministic (same input can produce different valid outputs), agents make autonomous decisions that affect subsequent steps, and quality is subjective for many tasks. Effective agent testing uses evaluation criteria (did the output meet the goal?) rather than exact output matching.
Q: What is an agent evaluation framework?
An agent evaluation framework measures agent performance across multiple dimensions: task completion rate, output quality, cost efficiency, latency, error recovery success, and safety compliance. Frameworks like SWE-bench for coding or custom evaluation suites provide standardized benchmarks for comparing agent performance.