Your LLM is lying to you confidently and you have no idea how often. Automated scoring, human evaluation frameworks, and continuous monitoring fix this.

Your LLM is producing wrong answers right now. Confidently. Plausibly. Some of your users are reading those wrong answers and making decisions based on them.
How often? You probably have no idea. Neither does anyone else running AI in production without a systematic evaluation framework.
This is the most expensive blind spot in AI development. You build, you ship, you demo, you watch users hit the feature. Things seem fine. The LLM sounds good. But "sounds good" is not an evaluation methodology. It is vibes. Vibes are not a production monitoring strategy.
With traditional software, correctness is binary. The function either returns the right value or it does not. You write tests. The tests pass or fail. Straightforward.
With LLMs, there are several layers of complexity that make evaluation genuinely hard:
Outputs are not deterministic. The same input can produce different outputs across runs, especially at higher temperatures. Which output is "correct"?
Correctness is often subjective. For summarization, is a summary that covers the three most important points better than one that covers five points with less depth? It depends on the use case.
The output space is enormous. You cannot enumerate all possible good and bad outputs. Test cases only cover a fraction of what will happen in production.
Failure modes are subtle. An LLM that answers correctly 95% of the time looks excellent in spot checks. The 5% failure rate only becomes visible at scale or when you know specifically what to test.
Building an evaluation system is the engineering work that separates AI features that are trusted from AI features that are "interesting experiments."
A production-grade evaluation framework has three layers:
Layer 1: Automated scoring (runs on every response)
Layer 2: Offline evaluation (runs on batches, measures quality systematically)
Layer 3: Human review (validates the automated metrics, catches what automation misses)
None of these replaces the others. Each catches different failure modes.
Automated scoring runs on every response in production. It is fast, cheap, and catches broad quality issues.
Use a separate model to evaluate the output of your primary model. Give it the input, the output, and a rubric. Ask it to score.
import Anthropic from '@anthropic-ai/sdk';

const anthropic = new Anthropic();

async function llmJudge(
  input: string,
  output: string,
  criteria: string[]
): Promise<{ scores: Record<string, number>; reasoning: string }> {
  const criteriaText = criteria.map((c, i) => `${i + 1}. ${c}`).join('\n');
  const response = await anthropic.messages.create({
    model: 'claude-3-5-haiku-20241022', // Use a fast/cheap model for evaluation
    max_tokens: 512,
    messages: [{
      role: 'user',
      content: `You are evaluating an AI assistant's response.

User Input:
${input}

AI Response:
${output}

Score the response on each criterion from 1-5:
${criteriaText}

Return JSON only:
{
  "scores": {
    "criterion_1": <1-5>,
    "criterion_2": <1-5>,
    ...
  },
  "reasoning": "brief explanation of scores"
}`,
    }],
  });
  const text = response.content[0].type === 'text' ? response.content[0].text : '{}';
  // Strip markdown code fences in case the model wraps its JSON anyway
  return JSON.parse(text.replace(/^```(?:json)?\s*|\s*```$/g, ''));
}
// Usage for a customer support AI:
const evaluation = await llmJudge(
  userMessage,
  aiResponse,
  [
    'Accuracy: Does the response contain only factually correct information?',
    'Completeness: Does it address all aspects of the user\'s question?',
    'Tone: Is it appropriately professional and empathetic?',
    'Actionability: Does it give the user a clear next step?',
  ]
);

LLM-as-Judge is powerful but has biases. The judge model prefers responses that match its own style and tends to favor longer responses over shorter ones. Calibrate against human judgments to understand and account for these biases.
For specific quality requirements, rules are faster and cheaper than LLM judgment.
interface QualityCheck {
  name: string;
  check: (input: string, output: string) => boolean;
  severity: 'critical' | 'warning';
}

const standardChecks: QualityCheck[] = [
  {
    name: 'no_hallucinated_urls',
    check: (_, output) => {
      const urls = output.match(/https?:\/\/[^\s]+/g) || [];
      // Flag URLs that weren't in the system prompt or input.
      // isKnownDomain is an app-specific allowlist lookup.
      return urls.every(url => isKnownDomain(url));
    },
    severity: 'critical',
  },
  {
    name: 'within_length_bounds',
    check: (_, output) => output.length >= 50 && output.length <= 5000,
    severity: 'warning',
  },
  {
    name: 'no_sensitive_data_exposure',
    check: (_, output) => {
      const patterns = [
        /\b\d{4}[- ]?\d{4}[- ]?\d{4}[- ]?\d{4}\b/, // Credit card
        /\b\d{3}-\d{2}-\d{4}\b/, // SSN
        /\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b/, // Email from internal data
      ];
      return !patterns.some(p => p.test(output));
    },
    severity: 'critical',
  },
  {
    name: 'response_language_matches_input',
    check: (input, output) => {
      // detectLanguage is an app-specific language-identification helper
      const inputLang = detectLanguage(input);
      const outputLang = detectLanguage(output);
      return inputLang === outputLang;
    },
    severity: 'warning',
  },
];

function runQualityChecks(
  input: string,
  output: string
): { passed: boolean; failures: Array<{ check: string; severity: string }> } {
  const failures = standardChecks
    .filter(check => !check.check(input, output))
    .map(check => ({ check: check.name, severity: check.severity }));
  const hasCritical = failures.some(f => f.severity === 'critical');
  return { passed: !hasCritical, failures };
}

Offline evaluation measures quality systematically against a curated test set. Run it before major prompt changes, model updates, or new feature releases.
Your eval dataset is worth more than almost any other investment in your AI system. Build it carefully.
Start by logging production inputs. After you accumulate a few hundred real requests, sample them for variety across query types and difficulty.
For each sampled input, create a ground truth label. For tasks with a single correct answer, that is a reference output; for open-ended tasks, it is a set of evaluation criteria:
interface EvalCase {
  id: string;
  input: string;
  reference_output?: string; // For tasks with a clear correct answer
  evaluation_criteria: string[]; // For open-ended tasks
  expected_properties: {
    must_contain?: string[];
    must_not_contain?: string[];
    min_length?: number;
    max_length?: number;
    must_be_json?: boolean;
  };
  tags: string[]; // For filtering and analysis
}

const evalDataset: EvalCase[] = [
  {
    id: 'cs-001',
    input: 'How do I cancel my subscription?',
    reference_output: undefined, // Open-ended, use criteria
    evaluation_criteria: [
      'Provides clear cancellation steps',
      'Mentions data retention policy',
      'Offers alternative (downgrade vs cancel)',
    ],
    expected_properties: {
      must_contain: ['cancel', 'settings'],
      must_not_contain: ['sorry to see you go', 'unfortunately'],
      min_length: 100,
    },
    tags: ['customer-support', 'cancellation', 'easy'],
  },
  // ... more cases
];

async function runEvaluation(
  systemPrompt: string,
  dataset: EvalCase[]
): Promise<EvaluationReport> {
  const results = await Promise.all(
    dataset.map(async (evalCase) => {
      const output = await getModelOutput(systemPrompt, evalCase.input);
      const propertyCheck = checkProperties(output, evalCase.expected_properties);
      const llmScore = evalCase.evaluation_criteria.length > 0
        ? await llmJudge(evalCase.input, output, evalCase.evaluation_criteria)
        : null;
      return {
        caseId: evalCase.id,
        input: evalCase.input,
        output,
        propertyCheck,
        llmScore,
        tags: evalCase.tags,
        passed: propertyCheck.passed &&
          (llmScore ? Object.values(llmScore.scores).every(s => s >= 3) : true),
      };
    })
  );
  const passRate = results.filter(r => r.passed).length / results.length;
  // Pass rate per tag (a case can carry several tags, so count each separately)
  const byTag: Record<string, { passed: number; total: number }> = {};
  for (const r of results) {
    for (const tag of r.tags) {
      byTag[tag] ??= { passed: 0, total: 0 };
      byTag[tag].total += 1;
      if (r.passed) byTag[tag].passed += 1;
    }
  }
  return {
    passRate,
    totalCases: results.length,
    passed: results.filter(r => r.passed).length,
    failed: results.filter(r => !r.passed).length,
    byTag: Object.fromEntries(
      Object.entries(byTag).map(([tag, c]) => [tag, c.passed / c.total])
    ),
    failedCases: results.filter(r => !r.passed),
  };
}

Automation does not replace human judgment. It filters down to the cases that most need human attention.
Set up a review queue that surfaces responses with low automated scores, responses with negative user feedback, and a random sample of everything else.
Have at least two reviewers evaluate each case independently, then discuss disagreements. Inter-rater agreement tells you how well-defined your quality criteria are. Low agreement means the criteria are ambiguous. Clarify them.
interface HumanReviewCase {
  id: string;
  input: string;
  output: string;
  automated_score?: number;
  user_feedback?: 'thumbs_up' | 'thumbs_down';
  review_priority: 'high' | 'normal' | 'sample';
}

function selectForReview(
  responses: Array<{
    id: string;
    input: string;
    output: string;
    score: number;
    feedback?: 'thumbs_up' | 'thumbs_down';
  }>,
  targetSampleSize: number
): HumanReviewCase[] {
  const toCase = (
    r: (typeof responses)[number],
    priority: HumanReviewCase['review_priority']
  ): HumanReviewCase => ({
    id: r.id,
    input: r.input,
    output: r.output,
    automated_score: r.score,
    user_feedback: r.feedback,
    review_priority: priority,
  });
  const highPriority = responses.filter(r =>
    r.score < 3 || r.feedback === 'thumbs_down'
  );
  // shuffle is an app-specific Fisher-Yates helper. Math.max guards
  // against a negative count when high-priority cases exceed the target.
  const randomSample = shuffle(responses)
    .slice(0, Math.max(0, targetSampleSize - highPriority.length))
    .map(r => toCase(r, 'sample'));
  return [...highPriority.map(r => toCase(r, 'high')), ...randomSample];
}

Evaluation is not a one-time exercise. Quality drifts. Models update. User behavior changes. The real world surprises you.
Track these metrics over time:
| Metric | What It Tells You | Alert Threshold |
|---|---|---|
| Average LLM judge score | Overall quality trend | Drop > 0.3 points |
| Rule check failure rate | Specific quality violations | Any increase |
| User negative feedback rate | User-perceived quality | Rise above baseline |
| Response length distribution | Prompt regression signals | Significant shift |
| Latency by query type | Performance degradation | p95 > threshold |
class QualityMonitor {
  private metrics: MetricsDB;

  async recordResponse(
    requestId: string,
    input: string,
    output: string,
    modelId: string,
    latencyMs: number
  ): Promise<void> {
    // Run lightweight automated checks
    const checks = runQualityChecks(input, output);
    // Sample 10% for LLM judge (too expensive to run on all)
    const runJudge = Math.random() < 0.1;
    const judgeScore = runJudge
      ? await this.runLLMJudge(input, output)
      : null;
    await this.metrics.record({
      requestId,
      modelId,
      timestamp: new Date(),
      latencyMs,
      outputLength: output.length,
      checksPass: checks.passed,
      checkFailures: checks.failures,
      judgeScore: judgeScore?.averageScore,
    });
    // Alert on critical failures
    if (!checks.passed && checks.failures.some(f => f.severity === 'critical')) {
      await this.alert(`Critical quality failure in request ${requestId}`, checks);
    }
  }
}

If you ship AI features with no evaluation today, here is the fastest path to something defensible:
1. Log every production input and output.
2. Add rule-based quality checks on every response.
3. Run an LLM judge on a sample of responses.
4. Build an eval dataset from your logged production inputs.
5. Stand up a human review queue for low-scoring responses plus a random sample.
Five steps. The first three can be done this week. You will immediately know more about your AI's quality than you do today.
Q: How do you evaluate LLMs in production?
Evaluate production LLMs across five dimensions: accuracy (are responses correct?), consistency (same quality over time?), latency (fast enough?), cost (economically viable?), and safety (no harmful outputs?). Combine automated metrics with human evaluation and A/B testing against baseline performance.
Q: What metrics should you track for LLM quality?
Track task completion rate, factual accuracy, response relevance, latency percentiles (p50, p95, p99), cost per request, hallucination rate, user satisfaction scores, and safety violation rate. Automated evaluation catches quantity issues while human evaluation catches quality issues.
Q: How often should you evaluate LLM performance?
Evaluate continuously with automated checks on every response, daily aggregate quality reviews, weekly benchmark comparisons, and immediate re-evaluation after model updates or prompt changes. Set up alerts for quality degradation so issues are caught before they impact users at scale.