The AI Evaluation Playbook: Measuring What Matters in Production LLM Systems

May 15, 2024 · 18 min read · Definitive Guide

Most AI evaluation is done on benchmarks that do not reflect production reality. After 18 months of running LLMs in production for 16,000 users -- processing 128,000 documents across 60 specialized analyzers -- here is the evaluation playbook that actually works. It covers the four-stage evaluation hierarchy (offline, shadow, canary, production), the metric taxonomy that separates signal from noise, and the principle that changed everything: evaluation is not a step in the process -- evaluation IS the product.

Why do benchmarks fail to predict production performance?

Let me start with the number that changed how I think about evaluation. When we first deployed our document extraction pipeline, our benchmark accuracy was 96.1%. Our production accuracy was 82.3%. A 13.8 percentage point gap. That gap nearly killed the product.

The gap exists because benchmarks test a sanitized version of reality. According to a 2024 study by researchers at Stanford's Center for Research on Foundation Models, the average gap between benchmark performance and real-world deployment performance for LLM-based applications is 8-15 percentage points. Our experience was consistent with that finding. The reasons are systematic.

  • Distribution mismatch: Benchmarks use curated datasets. Production receives whatever users upload -- blurry photos, rotated scans, documents with coffee stains. According to our analysis, 23% of production documents had quality issues that no benchmark in our evaluation suite represented.
  • Temporal drift: Models change. Providers update their models without notice. We experienced 7 undocumented model updates from our AI providers in 2023 that changed extraction behavior. Benchmarks run once do not capture drift.
  • Interaction effects: In production, extraction errors compound. A wrong employer name means the wrong tax rate lookup, which means the wrong liability calculation. Benchmarks measure field-level accuracy. Production impact is pipeline-level.
  • Edge case density: Benchmarks under-represent edge cases. In production, the long tail of weird documents is 15-20% of volume. According to our golden dataset analysis, edge cases accounted for 62% of user-reported errors despite being only 18% of volume.

The solution is not better benchmarks. It is a multi-stage evaluation system that progressively moves from controlled to real-world conditions.

What is the four-stage evaluation hierarchy?

We developed a four-stage evaluation hierarchy that catches different categories of problems at each stage. Each stage is more realistic but more expensive to run. The key insight: most problems should be caught at the cheapest stage, but you need all four stages because each catches problems the others miss.

Stage 1: Offline evaluation

Offline evaluation runs against a golden dataset without any production traffic. It is cheap, fast, and repeatable. We run it on every commit that touches extraction logic.

| Component | What We Test | How We Test | Pass Criteria |
|---|---|---|---|
| Golden dataset | 380 hand-verified documents | Extract and compare against known-correct values | No regression below baseline per document type |
| Edge case suite | 85 known-difficult documents | Extract and verify specific known failure patterns | All previously-fixed failures remain fixed |
| Schema validation | Output structure correctness | Validate against JSON schema per document type | 100% schema compliance |
| Regression markers | 31 specific past bugs | Replay exact inputs that caused past failures | All 31 pass (zero regression) |

According to our incident analysis, offline evaluation catches 72% of quality regressions before they reach any user. The remaining 28% require the later stages.
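The offline gate can be sketched in a few lines. This is a minimal illustration, not the production system: it assumes extraction results and golden values are plain dicts keyed by field name, and the names `field_accuracy`, `regression_gate`, and `baseline_accuracy` are illustrative.

```python
# Minimal sketch of an offline regression gate over a golden dataset.
# Assumes extracted and golden values are flat dicts of field -> value.

def field_accuracy(extracted: dict, golden: dict) -> float:
    """Fraction of golden fields the extractor got exactly right."""
    if not golden:
        return 1.0
    correct = sum(1 for k, v in golden.items() if extracted.get(k) == v)
    return correct / len(golden)

def regression_gate(results, baseline_accuracy):
    """Flag document types whose mean accuracy falls below baseline.

    results: list of (doc_type, extracted_fields, golden_fields) tuples.
    baseline_accuracy: dict mapping doc_type -> minimum acceptable accuracy.
    Returns a dict of failing doc types; an empty dict means the gate passes.
    """
    per_type = {}
    for doc_type, extracted, golden in results:
        per_type.setdefault(doc_type, []).append(field_accuracy(extracted, golden))

    failures = {}
    for doc_type, scores in per_type.items():
        mean = sum(scores) / len(scores)
        if mean < baseline_accuracy.get(doc_type, 0.0):
            failures[doc_type] = mean
    return failures
```

Running this on every commit gives the "no regression below baseline per document type" check from the table above.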

Stage 2: Shadow evaluation

Shadow evaluation runs the new version alongside production on real traffic, but outputs are not shown to users. We compare the two versions to detect regressions at scale. According to our data, 48 hours of production traffic (800-1,200 documents) provides statistical significance for document types above 5% of volume.

Key shadow metrics: agreement rate (98% threshold), disagreement analysis (50 sampled for manual review), confidence shift detection, and latency comparison (15% regression blocks rollout).
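The agreement-rate and disagreement-sampling steps might look like the following sketch, assuming both versions emit field dicts for the same documents; the constants mirror the thresholds quoted above, and the function name is illustrative.

```python
import random

# Illustrative shadow comparison: field-level agreement between the
# production and candidate versions, plus a sample of disagreements
# for manual review. Thresholds mirror the ones quoted in the text.

AGREEMENT_THRESHOLD = 0.98
REVIEW_SAMPLE_SIZE = 50

def shadow_compare(prod_outputs, candidate_outputs, seed=0):
    """prod_outputs / candidate_outputs: parallel lists of field dicts.

    Returns (agreement_rate, sampled_disagreements, passes_threshold).
    """
    disagreements = []
    total_fields = agreeing_fields = 0
    for i, (prod, cand) in enumerate(zip(prod_outputs, candidate_outputs)):
        for field in prod.keys() | cand.keys():
            total_fields += 1
            if prod.get(field) == cand.get(field):
                agreeing_fields += 1
            else:
                disagreements.append((i, field, prod.get(field), cand.get(field)))
    agreement = agreeing_fields / total_fields if total_fields else 1.0
    rng = random.Random(seed)  # seeded so review samples are reproducible
    sample = rng.sample(disagreements, min(REVIEW_SAMPLE_SIZE, len(disagreements)))
    return agreement, sample, agreement >= AGREEMENT_THRESHOLD
```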

Stage 3: Canary evaluation

Canary evaluation exposes the new version to 5% of real traffic, ramping to 25% over 24 hours. Unlike shadow, users see outputs. We monitor user correction rate (10% increase triggers rollback), support ticket rate, and downstream error rate. According to our deployment history, canary caught 4 regressions that shadow missed -- all related to downstream system interactions.
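The rollback decision can be reduced to a small, auditable function. A hedged sketch, assuming the 10% figure is a relative increase in correction rate over the control group; the downstream-error limit and all names here are illustrative placeholders.

```python
# Illustrative canary rollback trigger, using the signals named above.

def should_rollback(canary, control,
                    correction_increase_limit=0.10,
                    max_downstream_error_rate=0.02):
    """canary/control: dicts with 'correction_rate' and 'downstream_error_rate'.

    Returns (rollback?, reason).
    """
    if control["correction_rate"] > 0:
        relative_increase = ((canary["correction_rate"] - control["correction_rate"])
                             / control["correction_rate"])
        if relative_increase > correction_increase_limit:
            return True, "user correction rate regression"
    if canary["downstream_error_rate"] > max_downstream_error_rate:
        return True, "downstream error rate above limit"
    return False, "canary healthy"
```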

Stage 4: Production monitoring

Production monitoring runs continuously on 100% of traffic. According to our analysis, most teams track only accuracy. That is deeply insufficient.

What metrics actually matter for production LLM evaluation?

We organize production metrics into three categories: quality, operational, and business. Each category has different stakeholders, different thresholds, and different response playbooks.

| Category | Metric | What It Measures | Alert Threshold | Response |
|---|---|---|---|---|
| Quality | Field-level accuracy | % of fields extracted correctly | < 93% (rolling 24h) | Page engineering team |
| Quality | Document-level accuracy | % of documents with zero errors | < 78% (rolling 24h) | Page engineering team |
| Quality | Critical field accuracy | Accuracy on high-stakes fields (SSN, dollar amounts) | < 97% (rolling 24h) | Page PM + engineering |
| Quality | Hallucination rate | % of extractions containing fabricated values | > 0.5% (rolling 24h) | Immediate investigation |
| Quality | Confidence calibration | Does stated confidence match actual accuracy? | Brier score > 0.15 | Re-calibrate thresholds |
| Operational | P50 / P95 / P99 latency | Extraction response time distribution | P95 > 8 seconds | Check provider health |
| Operational | Error rate | % of requests returning errors | > 2% (rolling 1h) | Check circuit breakers |
| Operational | Fallback activation rate | % of requests using backup provider | > 15% (rolling 1h) | Investigate primary provider |
| Operational | Cost per extraction | Average inference cost per document | > $0.12 (rolling 24h) | Check routing logic |
| Business | User correction rate | % of users editing extracted values | > 18% (rolling 7d) | Review failing doc types |
| Business | Time to completion | Avg time from upload to finalized data | > 4 minutes | UX review |
| Business | NPS on extraction | User satisfaction with extraction quality | < 40 (monthly survey) | Product review |

According to our 18 months of tracking, the single most predictive metric for user satisfaction was not field-level accuracy -- it was the user correction rate. A user who has to correct 0 fields rates the experience 4.7/5. A user who corrects 1 field rates it 4.1/5. A user who corrects 3+ fields rates it 2.8/5. The correction rate integrates both accuracy and user perception in a way that no technical metric captures alone.
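The confidence-calibration row above reduces to a standard Brier score: each prediction carries a stated confidence in [0, 1] and a boolean outcome (was the extraction actually correct?). A minimal sketch, with the 0.15 alert threshold from the table; function names are illustrative.

```python
# Brier score over (confidence, was_correct) pairs. Lower is better;
# 0.0 means perfectly confident and always right.

def brier_score(predictions):
    """predictions: list of (confidence, was_correct) pairs."""
    if not predictions:
        return 0.0
    return sum((conf - float(correct)) ** 2
               for conf, correct in predictions) / len(predictions)

def needs_recalibration(predictions, threshold=0.15):
    """True when stated confidence has drifted too far from reality."""
    return brier_score(predictions) > threshold
```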

What does "evaluation is the product" mean in practice?

This is the principle that transformed our approach. Most teams treat evaluation as quality assurance -- something that happens after the product is built, to verify it works. We inverted that. Evaluation became the product itself.

Here is what that means concretely.

  1. Every feature starts with the eval. Before writing extraction code, we define the golden dataset, pass criteria, and production metrics. If we cannot define the eval, we do not build the feature. According to our tracking, features starting with eval definitions shipped 40% faster.
  2. The golden dataset is a product artifact. Our 380-document golden dataset is maintained with production-code rigor: version control, reviews, dedicated owner. We add 12-15 documents per month.
  3. Evaluation gets sprint allocation. We dedicate 15% of sprint capacity to eval infrastructure. According to our analysis, every hour invested prevented 3.2 hours of incident response.
  4. PM reviews eval results before deployment. A 0.3% accuracy regression might be acceptable for a feature that reduces latency by 40%. Only a PM can make that tradeoff.

The paradigm shift: In traditional software, you build the feature, then test it. In AI products, you build the test, then build the feature to pass the test. The eval is the spec. If the eval is wrong, the product is wrong -- no matter how good the model is.

How do you evaluate AI systems that have no "correct answer"?

Document extraction has clear correct answers -- the W-2 box 1 either says $52,341 or it does not. But many LLM applications do not have objective ground truth. Chatbot responses, content generation, summarization -- how do you evaluate those?

We developed three evaluation approaches for subjective outputs.

Approach 1: Pairwise comparison

Ask "is output A or output B better?" instead of "is this output good?" According to a 2024 NeurIPS study, pairwise evaluation has 23% higher inter-annotator agreement than Likert-scale scoring.

Approach 2: Rubric-based evaluation

Define 4-6 binary criteria instead of holistic scores. "Does it contain factual errors?" beats "rate quality 1-5." According to our experience, this reduced evaluator disagreement from 34% to 11%.
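In code, a binary rubric is just a set of named yes/no checks with an all-or-nothing gate. A sketch under assumed names; the criteria below are illustrative, not our actual rubric.

```python
# Illustrative binary rubric: an output passes only if every
# criterion is explicitly satisfied (missing judgments count as fail).

RUBRIC = [
    "contains_no_factual_errors",
    "answers_the_question",
    "follows_required_format",
    "cites_source_document",
]

def rubric_pass(judgments: dict, required=RUBRIC) -> bool:
    """judgments: criterion name -> True/False from an evaluator."""
    return all(judgments.get(criterion, False) for criterion in required)
```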

Approach 3: LLM-as-judge with calibration

Claude, given a detailed rubric, agreed with human evaluators 87% of the time on our tasks. The key: calibrate against 200 human-judged examples first, and only use the LLM judge where agreement exceeds 85%. According to a 2024 Berkeley study, calibrated LLM judges achieve human-level reliability on structured tasks but fail on nuanced judgments.
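The calibration gate described above is simple to state precisely: measure the judge's agreement with human labels on a held-out set, and only enable it past the agreement bar. A minimal sketch with illustrative names, using the 85% threshold from the text.

```python
# Calibration gate for an LLM judge: compare its labels against
# human labels on held-out examples before trusting it.

MIN_AGREEMENT = 0.85

def judge_agreement(human_labels, judge_labels):
    """Parallel lists of labels (e.g. 'pass'/'fail'). Returns agreement rate."""
    assert len(human_labels) == len(judge_labels)
    if not human_labels:
        return 0.0
    matches = sum(h == j for h, j in zip(human_labels, judge_labels))
    return matches / len(human_labels)

def judge_enabled(human_labels, judge_labels, min_agreement=MIN_AGREEMENT):
    """True only when the judge clears the agreement bar on this task."""
    return judge_agreement(human_labels, judge_labels) >= min_agreement
```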

What are the common evaluation mistakes to avoid?

We made these mistakes. Learn from our pain.

  • Mistake 1: Measuring accuracy without stratification. An overall 94% can hide 72% on your most important document type. We discovered a 19-percentage-point gap that was invisible in aggregates.
  • Mistake 2: Ignoring confidence calibration. A model saying 95% confident but correct only 80% of the time is more dangerous than one honestly reporting 80%. We calibrate weekly via Brier score.
  • Mistake 3: Evaluating in isolation. We once improved accuracy 2 percentage points but increased latency 300%. Evaluate across all dimensions simultaneously.
  • Mistake 4: Static golden datasets. Our initial 45-document set was too small and too clean. A golden dataset needs 200+ documents to be statistically meaningful across 10+ document types.
  • Mistake 5: Not tracking eval costs. Our daily suite costs $12/day; shadow evaluation costs $400 per deployment. According to our tracking, evaluation was 8% of total inference spend -- worth budgeting explicitly.
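Mistake 1 is the cheapest to fix: report per-stratum accuracy alongside the aggregate so a weak document type cannot hide. A sketch under assumed names:

```python
# Illustrative stratified-accuracy report: the aggregate number can
# mask a failing stratum, so compute per-document-type accuracy too.

def stratified_accuracy(records):
    """records: list of (doc_type, was_correct) pairs.

    Returns (overall_accuracy, {doc_type: accuracy}).
    """
    by_type = {}
    for doc_type, correct in records:
        by_type.setdefault(doc_type, []).append(correct)
    per_type = {t: sum(v) / len(v) for t, v in by_type.items()}
    overall = sum(c for _, c in records) / len(records) if records else 0.0
    return overall, per_type
```

A healthy overall number with one stratum near zero is exactly the aggregate-masking failure described above.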

How do you build an evaluation culture on your team?

The hardest part is cultural, not technical. We changed it with three practices.

  1. "Eval first" as a PR requirement. No PR merges without an eval update. Compliance was 61% in month one, 94% by month three.
  2. Eval leaderboard. Accuracy per analyzer on a team dashboard. Analyzers on the leaderboard improved 2.1x faster than those tracked only in logs.
  3. Blameless post-mortems. Focus on "what eval would have caught this?" Every post-mortem produces at least one new test. According to our tracking, 31 of 510 tests originated from post-mortems.

The definitive summary: Evaluation is the highest-leverage investment in any AI product. A team with mediocre models and excellent evaluation will outperform a team with excellent models and mediocre evaluation -- because the first team knows what is broken and can fix it, while the second team does not even know what they do not know.

Frequently Asked Questions

How much should we invest in evaluation versus feature development?

Our ratio: 15% of sprint capacity. It paid for itself within 2 months through reduced incident response. Early-stage AI products should budget 20-25% while building foundational eval infrastructure. Mature products need 10-15%.

Can we use LLMs to evaluate LLMs? Is that circular?

Not circular if the evaluator is more capable than the evaluated model and calibrated against human judgments. LLM-as-judge is reliable for structured criteria (factual correctness, format compliance) but unreliable for subjective criteria (tone, helpfulness).

How do you handle evaluation when the underlying model updates without notice?

This happened 7 times in 2023. Our defense: daily accuracy tests that alert on regressions above 0.5 percentage points. When we detect drift, we have a 48-hour window to determine regression (roll back) or improvement (update baselines). Of the 7 undocumented updates, 3 improved accuracy and 4 caused regressions.
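The daily drift alert reduces to a comparison against the recorded baseline, in either direction (since some silent updates improve accuracy). A minimal sketch with the 0.5-point threshold from the text; names are illustrative.

```python
# Illustrative drift check: compare today's golden-set accuracy against
# the recorded baseline and flag moves larger than 0.5 percentage points.

DRIFT_THRESHOLD_PP = 0.5  # percentage points

def detect_drift(baseline_accuracy_pct, todays_accuracy_pct,
                 threshold_pp=DRIFT_THRESHOLD_PP):
    """Both accuracies in percent (e.g. 94.2). Returns (drifted?, delta_pp)."""
    delta = todays_accuracy_pct - baseline_accuracy_pct
    return abs(delta) > threshold_pp, delta
```

A positive delta starts the improvement path (update baselines); a negative one starts the regression path (roll back).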

What tools do you use for LLM evaluation?

Mostly in-house: version-controlled golden datasets, custom shadow framework, production monitoring dashboards, and a calibrated LLM-judge wrapper. Total build: 6 engineering weeks. For simpler architectures, tools like LangSmith, Braintrust, or Humanloop may work. According to a 2024 survey, 58% of production LLM teams use a mix of custom and off-the-shelf evaluation tools.

Last updated: May 15, 2024