Written by Gareth Simono, Founder and CEO of Agentik {OS}. Full-stack developer and AI architect with years of experience shipping production applications across SaaS, mobile, and enterprise platforms. Gareth orchestrates 267 specialized AI agents to deliver production software 10x faster than traditional development teams.
97% accuracy sounds impressive until you ask what was measured. Most evaluation frameworks produce numbers without insight. Build ones that work.

You cannot improve what you cannot measure. Except with AI agents, most teams are measuring the wrong things.
"Our agent has 97% accuracy" is the most meaningless sentence in AI development. Accuracy at what? Measured against what baseline? Evaluated by whom? On which test set? With which edge cases excluded? Without answers to those questions, the number is not just useless. It is actively harmful because it gives teams false confidence that something is working when it might be quietly failing on the exact inputs that matter most to users.
I have watched three teams ship agents to production with high benchmark scores that fell apart in the first week. Not because the models were bad. Because the evaluation frameworks were measuring what was easy to measure, not what actually determined whether users trusted the system. Task completion rate looks great. But what the numbers hid was that half of "completed" tasks required the user to correct the agent mid-flow, and a third of completions produced outputs the user immediately discarded.
Agent quality is not a scalar. It is a multi-dimensional space of tradeoffs. Your evaluation framework needs to capture those dimensions, surface the tradeoffs clearly, and connect directly to the behaviors you care about in production.
Yes, track task completion. You need to know whether the agent finishes what it starts. But raw completion rate is the least informative metric past an 80% baseline.
What makes completion rate deceptive: agents can "complete" a task by doing the easy version of it, omitting the hard parts, or producing output that technically satisfies the task description but fails the user's actual intent.
A customer service agent that closes tickets by saying "I understand your frustration, please contact us at support@company.com" has a 100% completion rate and is completely useless.
The metrics that matter around completion:
Completion efficiency. How many steps and tokens did it take to complete the task? An agent that finishes in 3 tool calls and 800 tokens is better than one that finishes the same task in 11 tool calls and 4,200 tokens, even if both show 100% completion. Efficiency tells you whether the agent is planning effectively or flailing toward the right answer.
Autonomous completion rate. What fraction of completions required zero human intervention? Track this separately from total completion rate. An agent where 95% of completions are fully autonomous is excellent. One where 99% complete but 40% required mid-flow corrections is actually a problem masquerading as a good metric.
Time to completion. Separate from step count. A fast three-step process is better than a slow three-step process. Latency is a quality dimension, not just a performance dimension.
```typescript
interface CompletionMetrics {
  taskId: string;
  completed: boolean;
  autonomousCompletion: boolean; // No human intervention
  stepCount: number;
  tokenCount: number;
  latencyMs: number;
  humanInterventions: number; // Corrections, clarifications, overrides
  outputQuality?: number; // 1-5 if human-rated
  userDiscarded: boolean; // Did user immediately discard the output?
}
```

Look at the distribution of these metrics, not just averages. The mean step count might be fine, but if there is a long tail of tasks taking 30+ steps, you have an instability problem that averages hide.
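One way to surface that long tail is to report a high percentile alongside the mean and flag the specific tasks driving it. A minimal sketch, assuming the `CompletionMetrics` shape above; the 30-step threshold and the p95 choice are illustrative, not prescriptive:

```typescript
// Flag the instability that averages hide: report p95 alongside the mean
// and list the tasks that blew past a step budget.
function percentile(values: number[], p: number): number {
  const sorted = [...values].sort((a, b) => a - b);
  const idx = Math.min(sorted.length - 1, Math.ceil((p / 100) * sorted.length) - 1);
  return sorted[Math.max(0, idx)];
}

interface StepDistributionReport {
  meanSteps: number;
  p95Steps: number;
  longTailTaskIds: string[]; // Tasks far beyond the typical step count
}

function analyzeStepDistribution(
  metrics: { taskId: string; stepCount: number }[],
  tailThreshold = 30 // Steps beyond which a task counts as unstable (assumed budget)
): StepDistributionReport {
  const steps = metrics.map(m => m.stepCount);
  const meanSteps = steps.reduce((a, b) => a + b, 0) / steps.length;
  return {
    meanSteps,
    p95Steps: percentile(steps, 95),
    longTailTaskIds: metrics
      .filter(m => m.stepCount >= tailThreshold)
      .map(m => m.taskId),
  };
}
```

A dashboard that shows `meanSteps: 6, p95Steps: 34` tells you far more than the mean alone: most runs are fine, but a meaningful slice of them is flailing.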
Reasoning quality is the evaluation dimension I care most about, and the one almost no team measures.
An agent can produce the correct output with flawed reasoning. Pattern-matched to training examples. Got lucky on this particular input. The reasoning steps were wrong or missing, but the output happened to be right. Your evaluation marks it as a pass. You ship it with confidence.
Two weeks later, a slightly different input triggers the same flawed reasoning and produces a wrong output. You cannot understand why the behavior changed because nothing changed. The reasoning was always broken. You just did not catch it.
Chain-of-thought evaluation catches this. You evaluate not just whether the final output is correct but whether the reasoning steps leading to it are valid:
```typescript
interface ReasoningEvaluation {
  taskId: string;
  chainOfThought: ReasoningStep[];
  finalOutput: string;
  evaluation: {
    outputCorrect: boolean;
    reasoningSound: boolean; // Was the reasoning valid even if the output happened to be right?
    assumptionsValid: boolean; // Did the agent make assumptions that were true in this case?
    informationUsedCorrectly: boolean; // Did it use available context correctly?
    problemDecompositionQuality: number; // 1-5
  };
}

interface ReasoningStep {
  step: number;
  action: string;
  reasoning: string;
  toolCalled?: string;
  toolOutput?: string;
  conclusionDrawn: string;
}
```

Evaluating reasoning quality requires human review or LLM evaluation of the reasoning chain. Both are more expensive than automated output comparison. Both are worth the cost.
I use a lightweight LLM-as-judge approach for this at scale:
```typescript
import Anthropic from "@anthropic-ai/sdk";

const anthropic = new Anthropic(); // Reads ANTHROPIC_API_KEY from the environment

async function evaluateReasoning(
  task: string,
  steps: ReasoningStep[],
  output: string
): Promise<ReasoningEvaluation> {
  const response = await anthropic.messages.create({
    model: "claude-sonnet-4-5",
    max_tokens: 500,
    system: `You are evaluating AI agent reasoning quality. You will see a task, the agent's reasoning steps, and its final output.

Evaluate:
1. Is the output correct? (true/false)
2. Is the reasoning sound? (true/false) - mark false for flawed reasoning even if the output happened to be right by luck
3. Were the agent's assumptions valid? (true/false)
4. Did the agent use available information correctly? (true/false)
5. Rate problem decomposition quality 1-5

Return JSON only.`,
    messages: [
      {
        role: "user",
        content: `Task: ${task}\n\nReasoning steps:\n${JSON.stringify(steps, null, 2)}\n\nFinal output: ${output}`
      }
    ]
  });

  return JSON.parse(
    response.content[0].type === "text" ? response.content[0].text : "{}"
  );
}
```

An agent that produces correct outputs with flawed reasoning is a liability waiting to become an incident. The correct output is coincidental. The flawed reasoning will surface eventually, at the worst possible time.
A number without a reference point is not information. Your agent achieving 85% task success means nothing without knowing what the baseline is.
Every evaluation should include three baselines.
Rule-based baseline. Can a simple if-else system do this task at a meaningful success rate? If your agent achieves 85% success and a rule-based system achieves 80%, you have a 5-point improvement at significant cost and complexity. Worth it? Maybe not. If the rule-based system achieves 40%, you have a genuine breakthrough.
Previous version baseline. Every evaluation run should compare against the previous version. Catches regressions before users report them. A prompt change that improves performance on one task type while degrading performance on another shows up in baseline comparison.
Human performance baseline. Measure how well a human expert performs on the same evaluation set. This gives you both a ceiling (what is theoretically achievable) and context (the agent is operating at 60% of human expert level on this task type). Useful for communicating with stakeholders who want to understand what "good" looks like.
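Comparing a run against these baselines is easy to automate. A minimal sketch, assuming per-task-type success rates as a flat record; the `ResultsByCategory` shape and the simple averaging against the human baseline are illustrative assumptions:

```typescript
// Per-task-type success rates for one evaluation run (assumed shape).
type ResultsByCategory = Record<string, number>;

interface BaselineComparison {
  regressions: string[];  // Categories where current < previous
  improvements: string[]; // Categories where current > previous
  vsHumanPct: number;     // Current average success as a % of human baseline
}

function compareToBaselines(
  current: ResultsByCategory,
  previous: ResultsByCategory,
  human: ResultsByCategory
): BaselineComparison {
  const categories = Object.keys(current);
  const avg = (r: ResultsByCategory) =>
    categories.reduce((sum, c) => sum + (r[c] ?? 0), 0) / categories.length;
  return {
    regressions: categories.filter(c => current[c] < (previous[c] ?? 0)),
    improvements: categories.filter(c => current[c] > (previous[c] ?? 0)),
    vsHumanPct: (avg(current) / avg(human)) * 100,
  };
}
```

Running this on every evaluation pass means a prompt change that helps one task type while hurting another shows up as a named regression, not a wash in the aggregate score.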
```typescript
interface EvaluationReport {
  currentVersion: EvaluationResults;
  previousVersion: EvaluationResults;
  ruleBasedBaseline: EvaluationResults;
  humanBaseline: EvaluationResults;
  regressions: string[]; // Task types where current < previous
  improvements: string[]; // Task types where current > previous
  vsHuman: { percentage: number; gapAreas: string[] };
}
```

A good test suite is curated, not comprehensive. You want representative coverage of task types, not the maximum number of tests.
Stratified sampling. Categorize your production tasks into types (factual lookup, multi-step reasoning, tool orchestration, creative generation, refusals). Ensure your test suite has representation from each category proportional to production frequency.
Edge case inclusion. Explicitly include edge cases from known failure modes. If your agent has a history of failing on queries with ambiguous time references, include 5-10 of those. Test suites without edge cases produce artificially high scores that do not predict production behavior.
Golden set creation. For each test case, define what a correct output looks like. Not an exact string match (too brittle), but a rubric. "Correct output should contain the resolution deadline, reference the correct policy section, and not include the deprecated escalation path." This allows automated comparison against a standard.
```typescript
interface TestCase {
  id: string;
  category: TaskCategory;
  input: string;
  context?: Record<string, any>; // Injected context/tools state
  correctOutputRubric: {
    mustContain: string[]; // Required elements
    mustNotContain: string[]; // Prohibited elements
    sentiment?: "positive" | "neutral" | "negative";
    maxLength?: number;
    requiresCitation?: boolean;
  };
  isEdgeCase: boolean;
  difficulty: "easy" | "medium" | "hard";
  humanBaselineTime?: number; // Seconds a human expert takes
}
```

Aim for 100-200 test cases for a production agent. More is better if you can afford the evaluation runtime, but 100 well-chosen cases outperform 1,000 random ones.
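The string-matchable parts of the rubric can be checked automatically. A minimal sketch of what such a check could look like; the `sentiment` and `requiresCitation` fields are deliberately left out here because they need an LLM judge rather than substring matching, and all names are illustrative:

```typescript
interface OutputRubric {
  mustContain: string[];    // Required elements
  mustNotContain: string[]; // Prohibited elements
  maxLength?: number;
}

interface RubricResult {
  passed: boolean;
  missing: string[];   // Required elements not found in the output
  forbidden: string[]; // Prohibited elements found in the output
  tooLong: boolean;
}

// Case-insensitive substring checks. Real rubrics often need semantic
// matching ("references the correct policy section") via an LLM judge;
// this covers only the mechanically checkable fields.
function checkRubric(output: string, rubric: OutputRubric): RubricResult {
  const haystack = output.toLowerCase();
  const missing = rubric.mustContain.filter(s => !haystack.includes(s.toLowerCase()));
  const forbidden = rubric.mustNotContain.filter(s => haystack.includes(s.toLowerCase()));
  const tooLong = rubric.maxLength !== undefined && output.length > rubric.maxLength;
  return {
    passed: missing.length === 0 && forbidden.length === 0 && !tooLong,
    missing,
    forbidden,
    tooLong,
  };
}
```

The payoff of returning `missing` and `forbidden` rather than a bare boolean: a failing test case tells you which rubric element broke, not just that something did.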
Evaluation as a quarterly or pre-release activity is insufficient. Agent behavior drifts. Upstream data changes. User patterns shift. Problems that did not exist last month exist today.
Build continuous evaluation into the production pipeline:
```typescript
// Log every production interaction for evaluation
interface ProductionLog {
  sessionId: string;
  taskInput: string;
  agentSteps: AgentStep[];
  finalOutput: string;
  timestamp: Date;
  userId?: string;
  userFeedback?: { helpful: boolean; comment?: string };
  latencyMs: number;
  totalTokens: number;
  estimatedCostUsd: number;
}

// Sample 5% of production logs for automated evaluation
async function runContinuousEvaluation(logs: ProductionLog[]) {
  const sample = sampleLogs(logs, 0.05);
  const results = await Promise.all(
    sample.map(log => evaluateAgainstRubric(log))
  );
  const report = aggregateResults(results);

  // Alert if key metrics drop below thresholds
  if (report.autonomousCompletionRate < 0.85) {
    await alertTeam("Autonomous completion rate below threshold", report);
  }
  if (report.reasoningSoundnessRate < 0.90) {
    await alertTeam("Reasoning quality degradation detected", report);
  }
}
```

This connects directly to agent monitoring and observability, where continuous evaluation feeds into the alerting infrastructure that tells you when something is going wrong before users start complaining.
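The `sampleLogs` helper above is left undefined in the snippet. A sketch of one way to implement it, plus a stratified variant that guarantees rare task categories are never missed; the `minPerCategory` floor and the category accessor are assumptions:

```typescript
// Uniform random sample of a fraction of logs. Math.random() is fine for
// monitoring; use a seeded RNG if you need reproducible evaluation runs.
function sampleLogs<T>(logs: T[], fraction: number): T[] {
  return logs.filter(() => Math.random() < fraction);
}

// Stratified variant: take at least `minPerCategory` logs from every task
// category, so rare-but-important types always appear in the sample.
function stratifiedSample<T>(
  logs: T[],
  categoryOf: (log: T) => string,
  fraction: number,
  minPerCategory = 3 // Assumed floor; tune to your evaluation budget
): T[] {
  const byCategory = new Map<string, T[]>();
  for (const log of logs) {
    const cat = categoryOf(log);
    const bucket = byCategory.get(cat) ?? [];
    bucket.push(log);
    byCategory.set(cat, bucket);
  }
  const sampled: T[] = [];
  for (const bucket of byCategory.values()) {
    const n = Math.max(minPerCategory, Math.round(bucket.length * fraction));
    sampled.push(...bucket.slice(0, Math.min(n, bucket.length)));
  }
  return sampled;
}
```

With a plain 5% uniform sample, a category that makes up 1% of traffic shows up roughly once per 2,000 logs; the stratified version keeps it in every run.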
All technical metrics are proxies for the thing that actually matters: whether users find the agent useful.
Capture satisfaction signals at every opportunity.
Explicit feedback. Post-interaction rating (thumbs up/down or 1-5 stars). Low friction. High value. Even 20-30% response rates give you meaningful signal.
Implicit signals. Task abandonment rate. Session length after agent interaction. Return usage. Whether users copy agent outputs or retype them. These reveal satisfaction without requiring the user to rate anything.
Correction rate. How often do users edit agent outputs before using them? A high correction rate means the output is close but not quite right. A low correction rate means either the output is excellent or it is being used uncritically.
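Correction rate can be computed from logs that capture both what the agent produced and what the user actually used. A sketch using a crude word-overlap heuristic; real systems might use edit distance or embeddings, and the field names and the 0.9 threshold are illustrative assumptions:

```typescript
// Assumed log shape: the agent's output and the text the user finally used.
interface OutputUsage {
  agentOutput: string;
  finalUsedText: string;
}

// Fraction of shared words relative to the larger of the two word sets.
function wordOverlap(a: string, b: string): number {
  const wordsA = new Set(a.toLowerCase().split(/\s+/).filter(Boolean));
  const wordsB = new Set(b.toLowerCase().split(/\s+/).filter(Boolean));
  if (wordsA.size === 0 && wordsB.size === 0) return 1;
  let shared = 0;
  for (const w of wordsA) if (wordsB.has(w)) shared++;
  return shared / Math.max(wordsA.size, wordsB.size);
}

// A usage counts as "corrected" when the text the user kept differs
// materially from what the agent produced.
function correctionRate(usages: OutputUsage[], threshold = 0.9): number {
  const corrected = usages.filter(
    u => wordOverlap(u.agentOutput, u.finalUsedText) < threshold
  );
  return corrected.length / usages.length;
}
```

Trend matters more than the absolute number here: a correction rate drifting from 10% to 25% after a prompt change is a signal even if the threshold itself is rough.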
Satisfaction often reveals problems that technical metrics miss. An agent might have 90% task completion and 75% user satisfaction. The gap is a UX problem, not an AI problem. Maybe the output format is confusing. Maybe the agent asks too many clarifying questions. Maybe it is technically correct but communicates in a way that makes users feel talked down to.
User satisfaction is the metric you optimize for. Technical metrics are leading indicators that help you understand why satisfaction is where it is.
The human-in-the-loop patterns that you design will significantly affect satisfaction metrics. Getting those patterns right, knowing when to ask for help versus act autonomously, is often the difference between a tool users enjoy and one they tolerate.
All of these metrics are only useful if someone looks at them. Build a dashboard that surfaces actionable information, not just data.
The metrics worth tracking on a daily dashboard:
| Metric | Why It Matters | Alert Threshold |
|---|---|---|
| Autonomous completion rate | Core measure of reliability | < 85% |
| Avg completion steps | Efficiency signal | > 2 SD from baseline |
| Reasoning soundness rate | Quality of decisions | < 90% |
| User satisfaction (7-day) | What users actually think | < 4.0/5.0 |
| Edge case success rate | Robustness signal | < 70% |
| Regression count vs. prev version | Detects degradation | > 0 |
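The thresholds in the table can be encoded once and checked by the same continuous-evaluation job. A sketch; the step-count check is omitted because "2 SD from baseline" needs a rolling baseline rather than a fixed bound, and the metric names are illustrative:

```typescript
// One snapshot of the daily dashboard metrics (assumed shape).
interface MetricSnapshot {
  autonomousCompletionRate: number;
  reasoningSoundnessRate: number;
  userSatisfaction7d: number;
  edgeCaseSuccessRate: number;
  regressionCount: number;
}

// Thresholds from the dashboard table above.
const ALERT_THRESHOLDS: Array<{
  metric: keyof MetricSnapshot;
  below?: number; // Alert when the value drops under this
  above?: number; // Alert when the value exceeds this
  label: string;
}> = [
  { metric: "autonomousCompletionRate", below: 0.85, label: "Autonomous completion rate" },
  { metric: "reasoningSoundnessRate", below: 0.90, label: "Reasoning soundness rate" },
  { metric: "userSatisfaction7d", below: 4.0, label: "User satisfaction (7-day)" },
  { metric: "edgeCaseSuccessRate", below: 0.70, label: "Edge case success rate" },
  { metric: "regressionCount", above: 0, label: "Regression count vs. previous version" },
];

function checkThresholds(snapshot: MetricSnapshot): string[] {
  const alerts: string[] = [];
  for (const t of ALERT_THRESHOLDS) {
    const value = snapshot[t.metric];
    if (t.below !== undefined && value < t.below) alerts.push(`${t.label}: ${value} < ${t.below}`);
    if (t.above !== undefined && value > t.above) alerts.push(`${t.label}: ${value} > ${t.above}`);
  }
  return alerts;
}
```

Keeping the thresholds in data rather than scattered through `if` statements means the dashboard table and the alerting logic cannot silently drift apart.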
Evaluation is a steering wheel, not a speedometer. The goal is not to produce impressive numbers. It is to understand where the agent is failing and why, so you can fix the right things.
Q: How do you evaluate AI agent performance?
Evaluate agents across multiple dimensions: task completion rate (does it finish the job?), output quality (is the result correct and useful?), efficiency (cost and time per task), reliability (consistency across runs), and safety (does it stay within boundaries?). Use automated benchmarks plus human evaluation for subjective quality.
Q: What benchmarks exist for AI coding agents?
Key benchmarks include SWE-bench (real GitHub issue resolution), HumanEval (code generation), MBPP (Python programming), and custom project-specific evaluations. SWE-bench is the most relevant for production coding agents as it tests real-world software engineering tasks end-to-end.
Q: How often should you evaluate AI agents?
Evaluate continuously: automated quality checks on every agent output, weekly aggregate metrics review, monthly benchmark comparisons, and quarterly capability assessments. Model updates, prompt changes, and new tool additions all warrant immediate re-evaluation against your benchmark suite.