The AI Evaluation Playbook: Measuring What Matters in Production LLM Systems

May 15, 2024 · 18 min read · Definitive Guide

Most AI evaluation is done on benchmarks that do not reflect production reality. After 18 months of running LLMs in production for 16,000 users -- processing 128,000 documents across 60 specialized analyzers -- here is the evaluation playbook that actually works. It covers the four-stage evaluation hierarchy (offline, shadow, canary, production), the metric taxonomy that separates signal from noise, and the principle that changed everything: evaluation is not a step in the process -- evaluation IS the product.

Why do benchmarks fail to predict production performance?

Let me start with the number that changed how I think about evaluation. When we first deployed our document extraction pipeline, our benchmark accuracy was 96.1%. Our production accuracy was 82.3%. A 13.8 percentage point gap. That gap nearly killed the product.

The gap exists because benchmarks test a sanitized version of reality. According to a 2024 study by researchers at Stanford's Center for Research on Foundation Models, the average gap between benchmark performance and real-world deployment performance for LLM-based applications is 8-15 percentage points. Our experience was consistent with that finding. The reasons are systematic.

  • Distribution mismatch: Benchmarks use curated datasets. Production receives whatever users upload -- blurry photos, rotated scans, documents with coffee stains. According to our analysis, 23% of production documents had quality issues that no benchmark in our evaluation suite represented.
  • Temporal drift: Models change. Providers update their models without notice. We experienced 7 undocumented model updates from our AI providers in 2023 that changed extraction behavior. Benchmarks run once do not capture drift.
  • Interaction effects: In production, extraction errors compound. A wrong employer name means the wrong tax rate lookup, which means the wrong liability calculation. Benchmarks measure field-level accuracy. Production impact is pipeline-level.
  • Edge case density: Benchmarks under-represent edge cases. In production, the long tail of weird documents is 15-20% of volume. According to our golden dataset analysis, edge cases accounted for 62% of user-reported errors despite being only 18% of volume.

The solution is not better benchmarks. It is a multi-stage evaluation system that progressively moves from controlled to real-world conditions.

What is the four-stage evaluation hierarchy?

We developed a four-stage evaluation hierarchy that catches different categories of problems at each stage. Each stage is more realistic but more expensive to run. The key insight: most problems should be caught at the cheapest stage, but you need all four stages because each catches problems the others miss.

Stage 1: Offline evaluation

Offline evaluation runs against a golden dataset without any production traffic. It is cheap, fast, and repeatable. We run it on every commit that touches extraction logic.

| Component | What We Test | How We Test | Pass Criteria |
|---|---|---|---|
| Golden dataset | 380 hand-verified documents | Extract and compare against known-correct values | No regression below baseline per document type |
| Edge case suite | 85 known-difficult documents | Extract and verify specific known failure patterns | All previously-fixed failures remain fixed |
| Schema validation | Output structure correctness | Validate against JSON schema per document type | 100% schema compliance |
| Regression markers | 31 specific past bugs | Replay exact inputs that caused past failures | All 31 pass (zero regression) |

According to our incident analysis, offline evaluation catches 72% of quality regressions before they reach any user. The remaining 28% require the later stages.
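The offline gate can be sketched in a few lines. This is a minimal illustration, not the production system: it assumes extraction results and golden values are plain dicts keyed by field name, and the names `field_accuracy`, `regression_gate`, and `baseline_accuracy` are illustrative.

```python
# Minimal sketch of an offline regression gate over a golden dataset.
# Assumes extracted and golden values are flat dicts of field -> value.

def field_accuracy(extracted: dict, golden: dict) -> float:
    """Fraction of golden fields the extractor got exactly right."""
    if not golden:
        return 1.0
    correct = sum(1 for k, v in golden.items() if extracted.get(k) == v)
    return correct / len(golden)

def regression_gate(results, baseline_accuracy):
    """Flag document types whose mean accuracy falls below baseline.

    results: list of (doc_type, extracted_fields, golden_fields) tuples.
    baseline_accuracy: dict mapping doc_type -> minimum acceptable accuracy.
    Returns a dict of failing doc types; an empty dict means the gate passes.
    """
    per_type = {}
    for doc_type, extracted, golden in results:
        per_type.setdefault(doc_type, []).append(field_accuracy(extracted, golden))

    failures = {}
    for doc_type, scores in per_type.items():
        mean = sum(scores) / len(scores)
        if mean < baseline_accuracy.get(doc_type, 0.0):
            failures[doc_type] = mean
    return failures
```

Running this on every commit gives the "no regression below baseline per document type" check from the table above.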

Stage 2: Shadow evaluation

Shadow evaluation runs the new version alongside production on real traffic, but outputs are not shown to users. We compare the two versions to detect regressions at scale. According to our data, 48 hours of production traffic (800-1,200 documents) provides statistical significance for document types above 5% of volume.

Key shadow metrics: agreement rate (98% threshold), disagreement analysis (50 sampled for manual review), confidence shift detection, and latency comparison (15% regression blocks rollout).
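The agreement-rate and disagreement-sampling steps might look like the following sketch, assuming both versions emit field dicts for the same documents; the constants mirror the thresholds quoted above, and the function name is illustrative.

```python
import random

# Illustrative shadow comparison: field-level agreement between the
# production and candidate versions, plus a sample of disagreements
# for manual review. Thresholds mirror the ones quoted in the text.

AGREEMENT_THRESHOLD = 0.98
REVIEW_SAMPLE_SIZE = 50

def shadow_compare(prod_outputs, candidate_outputs, seed=0):
    """prod_outputs / candidate_outputs: parallel lists of field dicts.

    Returns (agreement_rate, sampled_disagreements, passes_threshold).
    """
    disagreements = []
    total_fields = agreeing_fields = 0
    for i, (prod, cand) in enumerate(zip(prod_outputs, candidate_outputs)):
        for field in prod.keys() | cand.keys():
            total_fields += 1
            if prod.get(field) == cand.get(field):
                agreeing_fields += 1
            else:
                disagreements.append((i, field, prod.get(field), cand.get(field)))
    agreement = agreeing_fields / total_fields if total_fields else 1.0
    rng = random.Random(seed)  # seeded so review samples are reproducible
    sample = rng.sample(disagreements, min(REVIEW_SAMPLE_SIZE, len(disagreements)))
    return agreement, sample, agreement >= AGREEMENT_THRESHOLD
```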

Stage 3: Canary evaluation

Canary evaluation exposes the new version to 5% of real traffic, ramping to 25% over 24 hours. Unlike shadow, users see outputs. We monitor user correction rate (10% increase triggers rollback), support ticket rate, and downstream error rate. According to our deployment history, canary caught 4 regressions that shadow missed -- all related to downstream system interactions.
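The rollback decision can be reduced to a small, auditable function. A hedged sketch, assuming the 10% figure is a relative increase in correction rate over the control group; the downstream-error limit and all names here are illustrative placeholders.

```python
# Illustrative canary rollback trigger, using the signals named above.

def should_rollback(canary, control,
                    correction_increase_limit=0.10,
                    max_downstream_error_rate=0.02):
    """canary/control: dicts with 'correction_rate' and 'downstream_error_rate'.

    Returns (rollback?, reason).
    """
    if control["correction_rate"] > 0:
        relative_increase = ((canary["correction_rate"] - control["correction_rate"])
                             / control["correction_rate"])
        if relative_increase > correction_increase_limit:
            return True, "user correction rate regression"
    if canary["downstream_error_rate"] > max_downstream_error_rate:
        return True, "downstream error rate above limit"
    return False, "canary healthy"
```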

Stage 4: Production monitoring

Production monitoring runs continuously on 100% of traffic. According to our analysis, most teams track only accuracy. That is deeply insufficient.

What metrics actually matter for production LLM evaluation?

We organize production metrics into three categories: quality, operational, and business. Each category has different stakeholders, different thresholds, and different response playbooks.

| Category | Metric | What It Measures | Alert Threshold | Response |
|---|---|---|---|---|
| Quality | Field-level accuracy | % of fields extracted correctly | < 93% (rolling 24h) | Page engineering team |
| Quality | Document-level accuracy | % of documents with zero errors | < 78% (rolling 24h) | Page engineering team |
| Quality | Critical field accuracy | Accuracy on high-stakes fields (SSN, dollar amounts) | < 97% (rolling 24h) | Page PM + engineering |
| Quality | Hallucination rate | % of extractions containing fabricated values | > 0.5% (rolling 24h) | Immediate investigation |
| Quality | Confidence calibration | Does stated confidence match actual accuracy? | Brier score > 0.15 | Re-calibrate thresholds |
| Operational | P50 / P95 / P99 latency | Extraction response time distribution | P95 > 8 seconds | Check provider health |
| Operational | Error rate | % of requests returning errors | > 2% (rolling 1h) | Check circuit breakers |
| Operational | Fallback activation rate | % of requests using backup provider | > 15% (rolling 1h) | Investigate primary provider |
| Operational | Cost per extraction | Average inference cost per document | > $0.12 (rolling 24h) | Check routing logic |
| Business | User correction rate | % of users editing extracted values | > 18% (rolling 7d) | Review failing doc types |
| Business | Time to completion | Avg time from upload to finalized data | > 4 minutes | UX review |
| Business | NPS on extraction | User satisfaction with extraction quality | < 40 (monthly survey) | Product review |

According to our 18 months of tracking, the single most predictive metric for user satisfaction was not field-level accuracy -- it was the user correction rate. A user who has to correct 0 fields rates the experience 4.7/5. A user who corrects 1 field rates it 4.1/5. A user who corrects 3+ fields rates it 2.8/5. The correction rate integrates both accuracy and user perception in a way that no technical metric captures alone.
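The confidence-calibration row above reduces to a standard Brier score: each prediction carries a stated confidence in [0, 1] and a boolean outcome (was the extraction actually correct?). A minimal sketch, with the 0.15 alert threshold from the table; function names are illustrative.

```python
# Brier score over (confidence, was_correct) pairs. Lower is better;
# 0.0 means perfectly confident and always right.

def brier_score(predictions):
    """predictions: list of (confidence, was_correct) pairs."""
    if not predictions:
        return 0.0
    return sum((conf - float(correct)) ** 2
               for conf, correct in predictions) / len(predictions)

def needs_recalibration(predictions, threshold=0.15):
    """True when stated confidence has drifted too far from reality."""
    return brier_score(predictions) > threshold
```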

What does "evaluation is the product" mean in practice?

This is the principle that transformed our approach. Most teams treat evaluation as quality assurance -- something that happens after the product is built, to verify it works. We inverted that. Evaluation became the product itself.

Here is what that means concretely.

  1. Every feature starts with the eval. Before writing extraction code, we define the golden dataset, pass criteria, and production metrics. If we cannot define the eval, we do not build the feature. According to our tracking, features starting with eval definitions shipped 40% faster.
  2. The golden dataset is a product artifact. Our 380-document golden dataset is maintained with production-code rigor: version control, reviews, dedicated owner. We add 12-15 documents per month.
  3. Evaluation gets sprint allocation. We dedicate 15% of sprint capacity to eval infrastructure. According to our analysis, every hour invested prevented 3.2 hours of incident response.
  4. PM reviews eval results before deployment. A 0.3% accuracy regression might be acceptable for a feature that reduces latency by 40%. Only a PM can make that tradeoff.

The paradigm shift: In traditional software, you build the feature, then test it. In AI products, you build the test, then build the feature to pass the test. The eval is the spec. If the eval is wrong, the product is wrong -- no matter how good the model is.

How do you evaluate AI systems that have no "correct answer"?

Document extraction has clear correct answers -- the W-2 box 1 either says $52,341 or it does not. But many LLM applications do not have objective ground truth. Chatbot responses, content generation, summarization -- how do you evaluate those?

We developed three evaluation approaches for subjective outputs.

Approach 1: Pairwise comparison

Ask "is output A or output B better?" instead of "is this output good?" According to a 2024 NeurIPS study, pairwise evaluation has 23% higher inter-annotator agreement than Likert-scale scoring.

Approach 2: Rubric-based evaluation

Define 4-6 binary criteria instead of holistic scores. "Does it contain factual errors?" beats "rate quality 1-5." According to our experience, this reduced evaluator disagreement from 34% to 11%.
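In code, a binary rubric is just a set of named yes/no checks with an all-or-nothing gate. A sketch under assumed names; the criteria below are illustrative, not our actual rubric.

```python
# Illustrative binary rubric: an output passes only if every
# criterion is explicitly satisfied (missing judgments count as fail).

RUBRIC = [
    "contains_no_factual_errors",
    "answers_the_question",
    "follows_required_format",
    "cites_source_document",
]

def rubric_pass(judgments: dict, required=RUBRIC) -> bool:
    """judgments: criterion name -> True/False from an evaluator."""
    return all(judgments.get(criterion, False) for criterion in required)
```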

Approach 3: LLM-as-judge with calibration

Claude, given a detailed rubric, agreed with human evaluators 87% of the time on our tasks. The key: calibrate against 200 human-judged examples first, and only use the LLM judge where agreement exceeds 85%. According to a 2024 Berkeley study, calibrated LLM judges achieve human-level reliability on structured tasks but fail on nuanced judgments.
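The calibration gate described above is simple to state precisely: measure the judge's agreement with human labels on a held-out set, and only enable it past the agreement bar. A minimal sketch with illustrative names, using the 85% threshold from the text.

```python
# Calibration gate for an LLM judge: compare its labels against
# human labels on held-out examples before trusting it.

MIN_AGREEMENT = 0.85

def judge_agreement(human_labels, judge_labels):
    """Parallel lists of labels (e.g. 'pass'/'fail'). Returns agreement rate."""
    assert len(human_labels) == len(judge_labels)
    if not human_labels:
        return 0.0
    matches = sum(h == j for h, j in zip(human_labels, judge_labels))
    return matches / len(human_labels)

def judge_enabled(human_labels, judge_labels, min_agreement=MIN_AGREEMENT):
    """True only when the judge clears the agreement bar on this task."""
    return judge_agreement(human_labels, judge_labels) >= min_agreement
```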

What are the common evaluation mistakes to avoid?

We made these mistakes. Learn from our pain.

  • Mistake 1: Measuring accuracy without stratification. An overall 94% can hide 72% on your most important document type. We discovered a 19-percentage-point gap that was invisible in aggregates.
  • Mistake 2: Ignoring confidence calibration. A model saying 95% confident but correct only 80% of the time is more dangerous than one honestly reporting 80%. We calibrate weekly via Brier score.
  • Mistake 3: Evaluating in isolation. We once improved accuracy 2 percentage points but increased latency 300%. Evaluate across all dimensions simultaneously.
  • Mistake 4: Static golden datasets. Our initial 45-document set was too small and too clean. A golden dataset needs 200+ documents to be statistically meaningful across 10+ document types.
  • Mistake 5: Not tracking eval costs. Our daily suite costs $12/day; shadow evaluation costs $400 per deployment. According to our tracking, evaluation was 8% of total inference spend -- worth budgeting explicitly.
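Mistake 1 is the cheapest to fix: report per-stratum accuracy alongside the aggregate so a weak document type cannot hide. A sketch under assumed names:

```python
# Illustrative stratified-accuracy report: the aggregate number can
# mask a failing stratum, so compute per-document-type accuracy too.

def stratified_accuracy(records):
    """records: list of (doc_type, was_correct) pairs.

    Returns (overall_accuracy, {doc_type: accuracy}).
    """
    by_type = {}
    for doc_type, correct in records:
        by_type.setdefault(doc_type, []).append(correct)
    per_type = {t: sum(v) / len(v) for t, v in by_type.items()}
    overall = sum(c for _, c in records) / len(records) if records else 0.0
    return overall, per_type
```

A healthy overall number with one stratum near zero is exactly the aggregate-masking failure described above.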

How do you build an evaluation culture on your team?

The hardest part is cultural, not technical. We changed it with three practices.

  1. "Eval first" as a PR requirement. No PR merges without an eval update. Compliance was 61% in month one, 94% by month three.
  2. Eval leaderboard. Accuracy per analyzer on a team dashboard. Analyzers on the leaderboard improved 2.1x faster than those tracked only in logs.
  3. Blameless post-mortems. Focus on "what eval would have caught this?" Every post-mortem produces at least one new test. According to our tracking, 31 of 510 tests originated from post-mortems.

The definitive summary: Evaluation is the highest-leverage investment in any AI product. A team with mediocre models and excellent evaluation will outperform a team with excellent models and mediocre evaluation -- because the first team knows what is broken and can fix it, while the second team does not even know what they do not know.

Frequently Asked Questions

How much should we invest in evaluation versus feature development?

Our ratio: 15% of sprint capacity. It paid for itself within 2 months through reduced incident response. Early-stage AI products should budget 20-25% while building foundational eval infrastructure. Mature products need 10-15%.

Can we use LLMs to evaluate LLMs? Is that circular?

Not circular if the evaluator is more capable than the evaluated model and calibrated against human judgments. LLM-as-judge is reliable for structured criteria (factual correctness, format compliance) but unreliable for subjective criteria (tone, helpfulness).

How do you handle evaluation when the underlying model updates without notice?

This happened 7 times in 2023. Our defense: daily accuracy tests that alert on regressions above 0.5 percentage points. When we detect drift, we have a 48-hour window to determine regression (roll back) or improvement (update baselines). Of the 7 undocumented updates, 3 improved accuracy and 4 caused regressions.
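The daily drift alert reduces to a comparison against the recorded baseline, in either direction (since some silent updates improve accuracy). A minimal sketch with the 0.5-point threshold from the text; names are illustrative.

```python
# Illustrative drift check: compare today's golden-set accuracy against
# the recorded baseline and flag moves larger than 0.5 percentage points.

DRIFT_THRESHOLD_PP = 0.5  # percentage points

def detect_drift(baseline_accuracy_pct, todays_accuracy_pct,
                 threshold_pp=DRIFT_THRESHOLD_PP):
    """Both accuracies in percent (e.g. 94.2). Returns (drifted?, delta_pp)."""
    delta = todays_accuracy_pct - baseline_accuracy_pct
    return abs(delta) > threshold_pp, delta
```

A positive delta starts the improvement path (update baselines); a negative one starts the regression path (roll back).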

What tools do you use for LLM evaluation?

Mostly in-house: version-controlled golden datasets, custom shadow framework, production monitoring dashboards, and a calibrated LLM-judge wrapper. Total build: 6 engineering weeks. For simpler architectures, tools like LangSmith, Braintrust, or Humanloop may work. According to a 2024 survey, 58% of production LLM teams use a mix of custom and off-the-shelf evaluation tools.

Last updated: May 15, 2024