The 10% Accuracy Improvement That Saved Millions
September 30, 2022 · 15 min read · Case Study / Deep-Dive
A 10% accuracy improvement on 50,000 AI-processed tax returns does not sound dramatic until you calculate the downstream impact: approximately 5,000 fewer errors, each costing $200-$500 to resolve. That is $1.5-2.5 million in avoided costs per season. But the improvement did not come from a single model breakthrough. It came from 47 small, deliberate changes shipped over 9 months -- an improvement flywheel where each fix created data that enabled the next fix. Here is how we built the flywheel, which of the 47 improvements had outsized impact, and where diminishing returns set in.
Why didn't a single model improvement deliver the 10% gain?
At a national tax services company processing 50,000 returns per season with AI, we started with a return-level accuracy of 84.1%. That meant approximately 7,950 returns had at least one error after AI processing. Our target was 94%. A 10-point improvement sounds like a model architecture change or a training data breakthrough. It was neither.
According to a 2022 Google Research paper on production ML, the majority of accuracy gains in deployed systems come from data quality improvements and pipeline fixes rather than model architecture changes. Their finding: 65% of production accuracy improvements are "boring" -- better data, better validation, better edge case handling. Our experience matched that finding precisely. Of our 47 improvements, only 3 involved changing model architecture. The other 44 were data, rules, and pipeline changes.
The reason is structural. A tax return is not a single prediction. It is a pipeline of hundreds of predictions -- document classification, field extraction, value validation, calculation, cross-form reconciliation -- each feeding the next. A 0.5% improvement at any single stage compounds across the pipeline. And pipeline errors are far more varied than model errors: corrupted scans, unusual form layouts, state-specific rules, edge cases in calculation logic. No single model change addresses that variety.
How does the improvement flywheel work?
The flywheel was the system that made continuous improvement sustainable. Each improvement generated data that identified the next improvement. The cycle:
- Human reviewers catch errors in AI-processed returns as part of our 3-layer review architecture. Every error is logged with the specific field, the incorrect value, the correct value, and the reviewer's assessment of the error cause.
- Weekly error analysis clusters errors by root cause. We used a simple taxonomy: extraction error (wrong value from document), classification error (wrong document type), calculation error (wrong arithmetic or logic), validation gap (missing cross-field check), and edge case (unusual input the system was not designed for).
- Prioritization by impact ranks each error cluster by volume (how many returns affected) times severity (cost per error). A common low-severity error ranks higher than a rare high-severity one if the total impact is greater.
- Fix ships within 2 weeks through our 3-gate compliance process. The tight ship cycle is critical -- if fixes take months, the error data is stale by the time the fix deploys.
- Post-fix measurement confirms the improvement and often reveals the next error cluster. Fixing the top error exposes the second error, which was previously masked or overshadowed.
We ran this cycle 38 times over 9 months. Some cycles produced a single fix. Others produced 2-3 related fixes. The average cycle time was 10.4 days from error identification to deployed fix.
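The prioritization step of the cycle above can be sketched in a few lines. This is an illustrative reduction, not our production tooling: the error record fields and the dollar figures are hypothetical, but the ranking logic (cluster by root cause, rank by volume times severity) is the one described.

```python
# Hypothetical reviewer-logged error records; field names and costs are illustrative.
errors = [
    {"cause": "extraction", "cost": 340},
    {"cause": "extraction", "cost": 340},
    {"cause": "calculation", "cost": 500},
    {"cause": "validation_gap", "cost": 200},
]

def prioritize(errors):
    """Rank error clusters by total impact: volume x severity, not either alone."""
    clusters = {}
    for e in errors:
        c = clusters.setdefault(e["cause"], {"volume": 0, "total_cost": 0})
        c["volume"] += 1
        c["total_cost"] += e["cost"]
    # A common low-severity cluster can outrank a rare high-severity one.
    return sorted(clusters.items(), key=lambda kv: kv[1]["total_cost"], reverse=True)

ranked = prioritize(errors)
```

Here the two $340 extraction errors ($680 total) outrank the single $500 calculation error, which is exactly the "volume times severity" behavior the prioritization step calls for.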
Which improvements had outsized impact?
Of the 47 improvements, the distribution of impact was steeply skewed. The top 10 improvements delivered 71% of the total accuracy gain. The bottom 20 delivered only 8%. This Pareto pattern is consistent with what a 2021 Berkeley AI Research study found across production ML systems: roughly 70% of accuracy gains come from 20% of interventions.
| Rank | Improvement | Category | Accuracy Impact | Returns Affected |
|---|---|---|---|---|
| 1 | 1099-MISC box mapping correction | Extraction | +1.3 pts | ~4,200 |
| 2 | W-2 multi-employer disambiguation | Extraction | +1.1 pts | ~3,800 |
| 3 | State withholding cross-validation | Validation | +0.9 pts | ~2,900 |
| 4 | Schedule C expense categorization retrain | Model update | +0.8 pts | ~2,100 |
| 5 | Dependent eligibility age calculation fix | Calculation | +0.7 pts | ~1,800 |
| 6 | Scan quality detection and re-routing | Pipeline | +0.6 pts | ~3,100 |
| 7 | EITC phase-out calculation precision | Calculation | +0.5 pts | ~1,400 |
| 8 | Multi-state filing detection | Classification | +0.5 pts | ~900 |
| 9 | 1099-INT decimal extraction fix | Extraction | +0.4 pts | ~2,600 |
| 10 | Schedule A medical expense threshold | Validation | +0.3 pts | ~1,100 |
The top-ranked improvement -- 1099-MISC box mapping -- illustrates why pipeline fixes matter more than model upgrades. The 1099-MISC form has 18 boxes, each with different tax treatment. Our extraction model correctly identified the values but occasionally mapped them to the wrong box. Box 7 (nonemployee compensation) and Box 3 (other income) have different tax implications but visually similar positions on certain form layouts. A simple coordinate-based remapping rule, not a model retrain, fixed 1,300 errors per season at zero compute cost.
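A coordinate-based remapping rule of that kind might look like the sketch below. The box regions and coordinates are invented for illustration; the point is only the mechanism: assign each extracted value to the box whose known region contains it, rather than trusting the model's box label on ambiguous layouts.

```python
# Hypothetical box regions for one 1099-MISC layout variant, in page
# coordinates: (x_min, y_min, x_max, y_max). Real values would come from
# a layout template library.
BOX_REGIONS = {
    "box_3_other_income": (310, 400, 460, 430),
    "box_7_nonemployee_comp": (310, 440, 460, 470),
}

def remap_box(x, y):
    """Assign an extracted value to the box whose region contains its position."""
    for box, (x0, y0, x1, y1) in BOX_REGIONS.items():
        if x0 <= x <= x1 and y0 <= y <= y1:
            return box
    return None  # outside every known region: route to human review

box = remap_box(350, 455)  # lands inside the Box 7 region
```

Because the rule is deterministic and layout-specific, it ships through compliance quickly and adds no inference cost, which is why it beat a model retrain for this error cluster.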
What does the diminishing returns curve look like?
The 47 improvements were shipped roughly evenly across 9 months: 5-6 per month. But their impact was not evenly distributed. The first 3 months of improvements delivered 6.3 of the 10 accuracy points. The next 3 months delivered 2.8 points. The final 3 months delivered 0.9 points.
| Period | Improvements Shipped | Accuracy Gain | Cumulative Accuracy | Engineering Hours Invested | Cost per Point |
|---|---|---|---|---|---|
| Months 1-3 | 16 | +6.3 pts | 90.4% | 1,280 hrs | 203 hrs/pt |
| Months 4-6 | 17 | +2.8 pts | 93.2% | 1,360 hrs | 486 hrs/pt |
| Months 7-9 | 14 | +0.9 pts | 94.1% | 1,120 hrs | 1,244 hrs/pt |
The cost per accuracy point increased 6x from months 1-3 to months 7-9. This is the diminishing returns curve in practice. The easy errors -- high volume, clear root cause, straightforward fix -- were resolved first. The remaining errors were rarer, more complex, and required more investigation to diagnose and more engineering to fix.
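The cost-per-point column in the table above is just hours divided by points gained, and the 6x figure falls out directly; the numbers below are taken from the table.

```python
# (engineering hours, accuracy points gained) per period, from the table above
periods = {
    "months_1_3": (1280, 6.3),
    "months_4_6": (1360, 2.8),
    "months_7_9": (1120, 0.9),
}
cost_per_point = {p: hours / pts for p, (hours, pts) in periods.items()}

# months 1-3: ~203 hrs/pt; months 7-9: ~1,244 hrs/pt -- roughly a 6x increase
increase = cost_per_point["months_7_9"] / cost_per_point["months_1_3"]
```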
According to a 2022 analysis by Weights & Biases across their customer base, the average cost-per-accuracy-point increases 3-8x after the initial improvement wave. Our 6x increase was squarely within that range. The decision of when to stop investing is strategic, not technical: at what point does the cost of the next accuracy point exceed the cost of the errors it prevents?
When should you stop optimizing accuracy?
For us, the answer was 94.1%. Here is the economic logic:
At 94.1% return-level accuracy, approximately 2,950 returns out of 50,000 had at least one error after AI processing. Our human review architecture caught approximately 89% of those before filing, leaving roughly 325 errors reaching taxpayers. At an average resolution cost of $340 per error (combining direct cost and customer service time), that was approximately $110,000 in annual error cost.
To reach 95.1% (one more point), our projected engineering investment was 1,400 hours at a blended rate of $165/hour: $231,000. The accuracy point would prevent approximately 500 errors, but with 89% caught by human review, only 55 additional errors would be prevented from reaching taxpayers. At $340 each: $18,700 in avoided costs. The investment would not pay back for 12 years.
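The break-even logic in the two paragraphs above reduces to a few lines of arithmetic; all inputs come from the text.

```python
# Inputs from the text
review_catch_rate = 0.89        # share of AI errors caught by human review
errors_per_point = 500          # returns fixed by +1.0 accuracy pt on 50,000 returns
cost_per_error = 340            # avg resolution cost, dollars
eng_hours = 1400                # projected effort for the next point
blended_rate = 165              # dollars per engineering hour

# Errors that would actually reach taxpayers, and the cost avoided per season
reaching_taxpayers = errors_per_point * (1 - review_catch_rate)   # ~55
avoided_cost = reaching_taxpayers * cost_per_error                # ~$18,700
investment = eng_hours * blended_rate                             # $231,000

payback_years = investment / avoided_cost                         # ~12 seasons
```

Note how the 89% human-review catch rate does most of the work here: review already absorbs the bulk of the marginal errors, so the marginal accuracy point prevents far fewer taxpayer-facing errors than the raw count suggests.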
We redirected that engineering capacity to the processing efficiency initiative, which had a 7-month payback period. The decision was not "stop improving accuracy." It was "invest where the next dollar generates the highest return."
How do you build the improvement flywheel into your team structure?
The flywheel requires a specific team structure to operate. It cannot be a side project. Here is how we staffed it:
- Error analysis analyst (1 FTE): Dedicated to weekly error clustering and root cause analysis. This role examined every human-caught error, categorized it, and maintained the error taxonomy. Without a dedicated analyst, error review becomes ad-hoc and the flywheel stalls.
- ML engineer rotation (1.5 FTE equivalent): Two ML engineers alternated 3-week improvement sprints. One worked on the current top-priority improvement while the other investigated the next. The overlap prevented gaps between the flywheel cycles.
- Validation engineer (1 FTE): Built and maintained the 340+ automated validation rules (Layer 1 of our review architecture). Many improvements were validation rules, not model changes. This engineer shipped 14 of the 47 improvements.
- PM allocation (0.3 FTE -- me): Prioritization, stakeholder communication, and decision-making on when to pursue a fix versus accept the error rate. Approximately 12 hours per week dedicated to the improvement flywheel.
Total: approximately 3.8 FTEs dedicated to continuous accuracy improvement. At a blended cost of $165/hour, the annual investment was approximately $1.25 million. The annual return: $1.5-2.5 million in avoided error costs plus the downstream impact of higher customer satisfaction and reduced agent burden from the 10,000-agent rollout. According to McKinsey, the average ROI on dedicated ML improvement teams in production systems is 2.3x in the first year, rising to 3.5x by year two as the flywheel matures.
What did we learn about the nature of AI accuracy?
The 9-month journey from 84% to 94% taught me three things about accuracy in production AI systems that I did not learn from any textbook or research paper:
First, accuracy is a distribution, not a number. Our 94.1% average masked enormous variation: 98.7% on simple W-2-only returns, 87.3% on complex multi-form returns. The average was meaningless for planning. We segmented accuracy by return complexity and made different investment decisions for each segment.
Second, the last 2% is worth more than the first 8%. Going from 84% to 92% moved us from "not ready for production" to "viable with human review." Going from 92% to 94% moved us from "viable" to "economically advantaged." The marginal value of each accuracy point increased as we approached the threshold where AI + human review was cheaper than human-only processing.
Third, accuracy improvements compound in unexpected ways. When we improved extraction accuracy, human reviewers spent less time on obvious errors and more time on subtle judgment calls. This improved the quality of their corrections, which improved our training data, which improved the next model update. The metrics we tracked showed this compounding effect clearly: human override quality (the percentage of overrides that were correct) improved from 89% to 94% as the AI handled more routine corrections autonomously.
The Core Insight: A 10% accuracy improvement is not a moonshot. It is a grind -- 47 small fixes, each measured, each shipped through compliance, each generating the data for the next fix. The flywheel metaphor is exact: the first rotations are heavy. By month 6, the momentum carries itself. The PM's job is not to find the single brilliant fix. It is to build the system that finds, prioritizes, and ships 47 of them.
Frequently Asked Questions
How do you convince leadership to fund incremental improvements instead of a model rewrite?
Show the math. Our 47 incremental improvements cost $1.25 million and delivered 10 accuracy points in 9 months. A model architecture rewrite was estimated at $2.8 million over 12 months with a projected improvement of 6-8 points and significant deployment risk. Incremental improvement delivered more, faster, cheaper, and with lower risk. The data makes the argument.
What is the minimum dataset size for the improvement flywheel to work?
You need enough volume for weekly error analysis to produce statistically significant clusters. For us, that was approximately 500 returns per week. Below that threshold, weekly error clusters were too small to distinguish signal from noise. At lower volumes, run the flywheel on a monthly cadence instead of weekly.
Does the flywheel work for non-ML systems?
Yes. The pattern -- measure errors, cluster by root cause, prioritize by impact, fix, remeasure -- works for any system that processes data at scale. We applied the same framework to our customer service ticket resolution system, which had no ML components. It improved first-contact resolution from 61% to 73% over 6 months using the same flywheel mechanics.
How do you prevent the flywheel from introducing regressions?
Every improvement goes through our compliance gates, including an automated regression test suite. We maintained a "golden set" of 2,000 previously-correct returns. Every improvement was tested against this set before deployment. If accuracy on the golden set decreased by more than 0.1 points, the improvement was blocked until the regression was resolved. We hit this guardrail 6 times out of 47 improvements. Each time, the investigation revealed an unintended side effect that would have been worse than the improvement it accompanied.
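The golden-set guardrail reduces to a simple deployment check. The sketch below mirrors the 0.1-point tolerance described; the baseline figure and function names are hypothetical, and a real gate would of course run the candidate system over the 2,000-return set to produce `candidate_accuracy`.

```python
GOLDEN_SET_BASELINE = 0.989   # hypothetical accuracy on the golden set before the fix
MAX_REGRESSION_PTS = 0.1      # block deployment if accuracy drops more than this

def passes_regression_gate(candidate_accuracy, baseline=GOLDEN_SET_BASELINE):
    """Return True if a candidate improvement may deploy."""
    drop_in_points = (baseline - candidate_accuracy) * 100
    return drop_in_points <= MAX_REGRESSION_PTS

# A fix costing 0.05 pts on the golden set ships; one costing 0.2 pts is blocked.
ships = passes_regression_gate(0.9885)
blocked = passes_regression_gate(0.987)
```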
Last updated: September 30, 2022