Your LLM is lying to you confidently and you have no idea how often. Automated scoring, human evaluation frameworks, and continuous monitoring fix this.

Your LLM is producing wrong answers right now. Confidently. Plausibly. Some of your users are reading those wrong answers and making decisions based on them.
How often? You probably have no idea. Neither does anyone else running AI in production without a systematic evaluation framework.
This is the most expensive blind spot in AI development. You build, you ship, you demo, you watch users hit the feature. Things seem fine. The LLM sounds good. But "sounds good" is not an evaluation methodology. It is vibes. Vibes are not a production monitoring strategy.
With traditional software, correctness is binary. The function either returns the right value or it does not. You write tests. The tests pass or fail. Straightforward.
With LLMs, there are several layers of complexity that make evaluation genuinely hard:
Outputs are not deterministic. The same input can produce different outputs across runs, especially at higher temperatures. Which output is "correct"?
Correctness is often subjective. For summarization, is a summary that covers the three most important points better than one that covers five points with less depth? It depends on the use case.
The output space is enormous. You cannot enumerate all possible good and bad outputs. Test cases only cover a fraction of what will happen in production.
Failure modes are subtle. An LLM that answers correctly 95% of the time looks excellent in spot checks. The 5% failure rate only becomes visible at scale or when you know specifically what to test.
Building an evaluation system is the engineering work that separates AI features that are trusted from AI features that are "interesting experiments."
A production-grade evaluation framework has three layers:
Layer 1: Automated scoring (runs on every response)
Layer 2: Offline evaluation (runs on batches, measures quality systematically)
Layer 3: Human review (validates the automated metrics, catches what automation misses)
None of these replaces the others. Each catches different failure modes.
Automated scoring runs on every response in production. It is fast, cheap, and catches broad quality issues.
Use a separate model to evaluate the output of your primary model. Give it the input, the output, and a rubric. Ask it to score.
import Anthropic from '@anthropic-ai/sdk';

const anthropic = new Anthropic();

async function llmJudge(
  input: string,
  output: string,
  criteria: string[]
): Promise<{ scores: Record<string, number>; reasoning: string }> {
  const criteriaText = criteria.map((c, i) => `${i + 1}. ${c}`).join('\n');
  const response = await anthropic.messages.create({
    model: 'claude-3-5-haiku-20241022', // Use a fast/cheap model for evaluation
    max_tokens: 512,
    messages: [{
      role: 'user',
      content: `You are evaluating an AI assistant's response.

User Input:
${input}

AI Response:
${output}

Score the response on each criterion from 1-5:
${criteriaText}

Return JSON only:
{
  "scores": {
    "criterion_1": <1-5>,
    "criterion_2": <1-5>,
    ...
  },
  "reasoning": "brief explanation of scores"
}`,
    }],
  });
  const text = response.content[0].type === 'text' ? response.content[0].text : '{}';
  // Strip markdown code fences in case the model wraps its JSON anyway
  return JSON.parse(text.replace(/^```(?:json)?\s*|\s*```$/g, ''));
}
// Usage for a customer support AI:
const evaluation = await llmJudge(
  userMessage,
  aiResponse,
  [
    'Accuracy: Does the response contain only factually correct information?',
    'Completeness: Does it address all aspects of the user\'s question?',
    'Tone: Is it appropriately professional and empathetic?',
    'Actionability: Does it give the user a clear next step?',
  ]
);

LLM-as-Judge is powerful but has biases. The judge model prefers responses that match its own style and tends to favor longer responses over shorter ones. Calibrate against human judgments to understand and account for these biases.
For specific quality requirements, rules are faster and cheaper than LLM judgment.
interface QualityCheck {
  name: string;
  check: (input: string, output: string) => boolean;
  severity: 'critical' | 'warning';
}

const standardChecks: QualityCheck[] = [
  {
    name: 'no_hallucinated_urls',
    check: (_, output) => {
      const urls = output.match(/https?:\/\/[^\s]+/g) || [];
      // Flag URLs that weren't in the system prompt or input.
      // isKnownDomain is an app-specific allowlist lookup.
      return urls.every(url => isKnownDomain(url));
    },
    severity: 'critical',
  },
  {
    name: 'within_length_bounds',
    check: (_, output) => output.length >= 50 && output.length <= 5000,
    severity: 'warning',
  },
  {
    name: 'no_sensitive_data_exposure',
    check: (_, output) => {
      const patterns = [
        /\b\d{4}[- ]?\d{4}[- ]?\d{4}[- ]?\d{4}\b/, // Credit card
        /\b\d{3}-\d{2}-\d{4}\b/, // SSN
        /\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b/, // Email from internal data
      ];
      return !patterns.some(p => p.test(output));
    },
    severity: 'critical',
  },
  {
    name: 'response_language_matches_input',
    check: (input, output) => {
      // detectLanguage is an app-specific language-identification helper
      const inputLang = detectLanguage(input);
      const outputLang = detectLanguage(output);
      return inputLang === outputLang;
    },
    severity: 'warning',
  },
];

function runQualityChecks(
  input: string,
  output: string
): { passed: boolean; failures: Array<{ check: string; severity: string }> } {
  const failures = standardChecks
    .filter(check => !check.check(input, output))
    .map(check => ({ check: check.name, severity: check.severity }));
  const hasCritical = failures.some(f => f.severity === 'critical');
  return { passed: !hasCritical, failures };
}

Offline evaluation measures quality systematically against a curated test set. Run it before major prompt changes, model updates, or new feature releases.
Your eval dataset is worth more than almost any other investment in your AI system. Build it carefully.
Start by logging production inputs. After you accumulate a few hundred real requests, sample them for variety across query types and difficulty.
For each sampled input, create a ground truth label. For tasks with a single correct answer, that is a reference output; for open-ended tasks, it is a set of evaluation criteria:
interface EvalCase {
  id: string;
  input: string;
  reference_output?: string; // For tasks with a clear correct answer
  evaluation_criteria: string[]; // For open-ended tasks
  expected_properties: {
    must_contain?: string[];
    must_not_contain?: string[];
    min_length?: number;
    max_length?: number;
    must_be_json?: boolean;
  };
  tags: string[]; // For filtering and analysis
}

const evalDataset: EvalCase[] = [
  {
    id: 'cs-001',
    input: 'How do I cancel my subscription?',
    reference_output: undefined, // Open-ended, use criteria
    evaluation_criteria: [
      'Provides clear cancellation steps',
      'Mentions data retention policy',
      'Offers alternative (downgrade vs cancel)',
    ],
    expected_properties: {
      must_contain: ['cancel', 'settings'],
      must_not_contain: ['sorry to see you go', 'unfortunately'],
      min_length: 100,
    },
    tags: ['customer-support', 'cancellation', 'easy'],
  },
  // ... more cases
];

async function runEvaluation(
  systemPrompt: string,
  dataset: EvalCase[]
): Promise<EvaluationReport> {
  const results = await Promise.all(
    dataset.map(async (evalCase) => {
      const output = await getModelOutput(systemPrompt, evalCase.input);
      const propertyCheck = checkProperties(output, evalCase.expected_properties);
      const llmScore = evalCase.evaluation_criteria.length > 0
        ? await llmJudge(evalCase.input, output, evalCase.evaluation_criteria)
        : null;
      return {
        caseId: evalCase.id,
        input: evalCase.input,
        output,
        propertyCheck,
        llmScore,
        tags: evalCase.tags,
        passed: propertyCheck.passed &&
          (llmScore ? Object.values(llmScore.scores).every(s => s >= 3) : true),
      };
    })
  );
  const passRate = results.filter(r => r.passed).length / results.length;
  // Pass rate per tag (a case can carry several tags, so count each separately)
  const byTag: Record<string, { passed: number; total: number }> = {};
  for (const r of results) {
    for (const tag of r.tags) {
      byTag[tag] ??= { passed: 0, total: 0 };
      byTag[tag].total += 1;
      if (r.passed) byTag[tag].passed += 1;
    }
  }
  return {
    passRate,
    totalCases: results.length,
    passed: results.filter(r => r.passed).length,
    failed: results.filter(r => !r.passed).length,
    byTag: Object.fromEntries(
      Object.entries(byTag).map(([tag, c]) => [tag, c.passed / c.total])
    ),
    failedCases: results.filter(r => !r.passed),
  };
}

Automation does not replace human judgment. It filters down to the cases that most need human attention.
Set up a review queue that surfaces responses with low automated scores, responses with negative user feedback, and a random sample of everything else.
Have at least two reviewers evaluate each case independently, then discuss disagreements. Inter-rater agreement tells you how well-defined your quality criteria are. Low agreement means the criteria are ambiguous. Clarify them.
interface HumanReviewCase {
  id: string;
  input: string;
  output: string;
  automated_score?: number;
  user_feedback?: 'thumbs_up' | 'thumbs_down';
  review_priority: 'high' | 'normal' | 'sample';
}

function selectForReview(
  responses: Array<{
    id: string;
    input: string;
    output: string;
    score: number;
    feedback?: 'thumbs_up' | 'thumbs_down';
  }>,
  targetSampleSize: number
): HumanReviewCase[] {
  const toCase = (
    r: (typeof responses)[number],
    priority: HumanReviewCase['review_priority']
  ): HumanReviewCase => ({
    id: r.id,
    input: r.input,
    output: r.output,
    automated_score: r.score,
    user_feedback: r.feedback,
    review_priority: priority,
  });
  const highPriority = responses.filter(r =>
    r.score < 3 || r.feedback === 'thumbs_down'
  );
  // shuffle is an app-specific Fisher-Yates helper. Math.max guards
  // against a negative count when high-priority cases exceed the target.
  const randomSample = shuffle(responses)
    .slice(0, Math.max(0, targetSampleSize - highPriority.length))
    .map(r => toCase(r, 'sample'));
  return [...highPriority.map(r => toCase(r, 'high')), ...randomSample];
}

Evaluation is not a one-time exercise. Quality drifts. Models update. User behavior changes. The real world surprises you.
Track these metrics over time:
| Metric | What It Tells You | Alert Threshold |
|---|---|---|
| Average LLM judge score | Overall quality trend | Drop > 0.3 points |
| Rule check failure rate | Specific quality violations | Any increase |
| User negative feedback rate | User-perceived quality | Rise above baseline |
| Response length distribution | Prompt regression signals | Significant shift |
| Latency by query type | Performance degradation | p95 > threshold |
class QualityMonitor {
  private metrics: MetricsDB;

  async recordResponse(
    requestId: string,
    input: string,
    output: string,
    modelId: string,
    latencyMs: number
  ): Promise<void> {
    // Run lightweight automated checks
    const checks = runQualityChecks(input, output);
    // Sample 10% for LLM judge (too expensive to run on all)
    const runJudge = Math.random() < 0.1;
    const judgeScore = runJudge
      ? await this.runLLMJudge(input, output)
      : null;
    await this.metrics.record({
      requestId,
      modelId,
      timestamp: new Date(),
      latencyMs,
      outputLength: output.length,
      checksPass: checks.passed,
      checkFailures: checks.failures,
      judgeScore: judgeScore?.averageScore,
    });
    // Alert on critical failures
    if (!checks.passed && checks.failures.some(f => f.severity === 'critical')) {
      await this.alert(`Critical quality failure in request ${requestId}`, checks);
    }
  }
}

If you ship AI features with no evaluation today, here is the fastest path to something defensible:
1. Log every production input and output.
2. Add rule-based quality checks on every response.
3. Run an LLM judge on a sample of responses.
4. Build an eval dataset from your logged production inputs.
5. Stand up a human review queue for low-scoring responses plus a random sample.
Five steps. The first three can be done this week. You will immediately know more about your AI's quality than you do today.
Q: How do you evaluate LLMs in production?
Evaluate production LLMs across five dimensions: accuracy (are responses correct?), consistency (same quality over time?), latency (fast enough?), cost (economically viable?), and safety (no harmful outputs?). Combine automated metrics with human evaluation and A/B testing against baseline performance.
Q: What metrics should you track for LLM quality?
Track task completion rate, factual accuracy, response relevance, latency percentiles (p50, p95, p99), cost per request, hallucination rate, user satisfaction scores, and safety violation rate. Automated evaluation catches quantity issues while human evaluation catches quality issues.
Q: How often should you evaluate LLM performance?
Evaluate continuously with automated checks on every response, daily aggregate quality reviews, weekly benchmark comparisons, and immediate re-evaluation after model updates or prompt changes. Set up alerts for quality degradation so issues are caught before they impact users at scale.