My Year-End Reflection: What Enterprise AI Got Right and Wrong in 2022
December 20, 2022 · 14 min read · Reflection / Analysis
2022 was the year enterprise AI went from pilot to production. But inside a 6,000-location operation, the real story was not the technology breakthroughs that made headlines. It was data quality, integration complexity, and change management. ChatGPT launched on November 30 and changed the conversation overnight. Here is what actually happened in the trenches versus what the industry predicted, and what it means for 2023.
What did enterprise AI actually look like in 2022?
I spent 2022 managing AI systems across a national tax services company with 6,000 franchise locations. From that vantage point, the gap between what conference keynotes promised and what we actually shipped was enormous. According to Gartner's 2022 survey, 54% of AI projects moved from pilot to production in 2022, up from 35% in 2021. That matched what I saw. We moved three systems from pilot to production. But that stat hides the painful truth: "production" meant wildly different things depending on who you asked.
For our leadership, production meant "the model is running." For the operations teams at 6,000 locations, production meant "I have to trust this thing with a client's tax return." For the IT team, production meant "another system to monitor at 2 AM." For me as a product manager, production meant all three of those definitions had to be simultaneously true.
According to McKinsey's 2022 State of AI report, organizations that reported the highest AI adoption also reported the highest number of AI-related risks. That tracked with our experience exactly. The more we deployed, the more edge cases we discovered.
Which predictions about AI came true in 2022?
At the start of 2022, I wrote down five predictions about how AI would evolve in our operation. Looking back, two were right, two were wrong, and one was right for the wrong reasons.
| Prediction | Outcome | Why |
|---|---|---|
| Document classification accuracy would exceed 95% | Correct | Reached 96.8% by October. Better training data, not better models, made the difference. |
| We would reduce manual data entry by 60% | Wrong | Achieved only 38%. Edge cases in handwritten documents and poor-quality scans were harder than expected. |
| Location managers would resist AI-driven workflow changes | Correct | 43% of locations delayed adoption by more than 6 weeks. Change management was the bottleneck, not technology. |
| We would need dedicated ML ops engineers | Wrong | Cloud-managed ML services matured enough that existing DevOps staff could handle monitoring. We hired zero dedicated ML ops engineers. |
| AI would save us money in year one | Right, for the wrong reason | Saved $2.1M, but not from automation efficiency. Savings came from error reduction and fewer client complaints, not from reduced headcount as originally projected. |
The pattern across all five predictions: I consistently overestimated what the technology would do and underestimated the human factors. According to a 2022 MIT Sloan study, 78% of failed AI projects fail due to organizational rather than technical challenges. I would have dismissed that stat in January. By December, I had lived it.
What was the gap between AI demos and AI deployments?
The most important lesson of 2022 was understanding why demos deceive. A demo is a controlled environment with clean data, happy-path workflows, and an audience that wants to be impressed. A deployment is an uncontrolled environment with messy data, exception-heavy workflows, and users who just want to get their job done.
Here is a concrete example. In March, we demoed a document extraction system to the executive team. We fed it 50 perfectly scanned W-2s. Accuracy: 99.2%. Audience reaction: standing ovation. Two months later, we deployed it across 200 pilot locations. Accuracy in the first week: 71.3%. What happened?
- Image quality variance: Demo used high-resolution scans. Production received phone photos taken under fluorescent lighting, at angles, sometimes with fingers partially obscuring fields. 23% of submitted images fell below our quality threshold.
- Form variety: Demo used the standard W-2 layout. Production encountered 47 different W-2 variations from different payroll providers, state-specific addenda, and forms with non-standard fonts.
- User behavior: Demo assumed users would submit one document at a time. Production users submitted multi-page PDFs with W-2s, 1099s, and grocery receipts mixed together.
- Edge cases at scale: At 50 documents, you see zero edge cases. At 50,000 documents, you see every edge case. We catalogued 312 distinct failure modes in the first season.
We eventually reached 94.1% accuracy in production by the end of tax season. But the journey from 71.3% to 94.1% consumed more engineering hours than building the original system. According to Google's ML engineering guidelines published in 2022, teams should expect to spend 40-60% of total project effort on production hardening. We spent closer to 65%.
How did data quality shape our AI strategy?
If I could summarize 2022 in one sentence: data quality was the real product. We spent more time on data pipelines, validation rules, and input standardization than on model architecture. According to Andrew Ng's data-centric AI framework, improving data quality yields 2-3x more accuracy improvement than improving model architecture for most enterprise applications. Our experience confirmed this ratio almost exactly.
We implemented a data quality scoring system in Q2 that rated every incoming document on a 0-100 scale across four dimensions:
- Image clarity: Resolution, contrast, skew, and occlusion
- Structural integrity: Whether required fields were present and readable
- Consistency: Whether extracted values passed cross-field validation (e.g., federal withholding should not exceed gross wages)
- Completeness: Whether all expected fields contained values
Documents scoring below 60 were automatically routed to manual review. Documents between 60 and 80 received AI processing with mandatory human spot-checks. Documents above 80 received fully automated processing with statistical sampling. This tiered approach reduced our error rate by 41% compared to applying the same AI model uniformly to all inputs. The key insight: it is often cheaper to improve your data than to improve your model.
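The tiered routing described above can be sketched in a few lines. The dimension names track the four scoring dimensions, but the equal weighting and the example values are illustrative assumptions, not our production configuration:

```python
from dataclasses import dataclass

@dataclass
class QualityScore:
    clarity: float        # 0-100: resolution, contrast, skew, occlusion
    structure: float      # 0-100: required fields present and readable
    consistency: float    # 0-100: cross-field validation checks passed
    completeness: float   # 0-100: expected fields contain values

    def overall(self) -> float:
        # Equal weighting is an assumption; real weights would be tuned.
        return (self.clarity + self.structure
                + self.consistency + self.completeness) / 4

def route(score: QualityScore) -> str:
    """Route a document to a processing tier by overall quality score."""
    overall = score.overall()
    if overall < 60:
        return "manual_review"          # too risky to automate
    if overall <= 80:
        return "ai_with_spot_check"     # AI output, mandatory human spot-check
    return "fully_automated"            # statistical sampling only

# Example: a phone photo with sound content but mediocre image clarity
doc = QualityScore(clarity=55, structure=85, consistency=90, completeness=80)
print(route(doc))  # -> ai_with_spot_check (overall 77.5)
```

The point of the tiering is that the routing rule, not the model, is what controls the error rate the client actually experiences.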
What did ChatGPT change in November 2022?
On November 30, OpenAI released ChatGPT. Within five days, it reached one million users. By mid-December, it was the most-discussed technology topic in every meeting I attended. But its impact on our actual enterprise operations in those first three weeks was precisely zero.
That is not a criticism of ChatGPT. It is an observation about the lag between technology availability and enterprise adoption. According to Everett Rogers' diffusion of innovations theory, the average time from technology awareness to enterprise adoption is 18-24 months. ChatGPT was three weeks old. We were not going to rebuild our production systems around it by January.
What ChatGPT did change immediately was the conversation. Three shifts happened in the last three weeks of December:
- Executive expectations accelerated overnight. My VP, who had never asked about AI capabilities unprompted, sent me four articles in one week asking "can we do this?" The answer to all four was "not yet, and not for the reasons you think."
- Staff anxiety spiked. Our customer service team saw ChatGPT demos and immediately asked whether their jobs were at risk. We spent more time on internal communication in December than in the previous six months combined.
- Vendor pitches tripled. Every SaaS vendor in our stack suddenly had an "AI-powered" version of their product. Most were wrappers around the OpenAI API with minimal fine-tuning. According to CB Insights, AI-related startup funding in Q4 2022 surged 37% over Q3, driven almost entirely by the ChatGPT catalyst.
The real enterprise impact of ChatGPT will unfold in 2023 and beyond. But the December 2022 experience taught me something important: in enterprise AI, the technology is often ahead of the organization's ability to absorb it. Managing that gap is the actual job of an AI product manager.
What were the biggest enterprise AI mistakes of 2022?
Looking across my own experience and conversations with peers managing AI at other organizations, five mistakes recurred consistently:
- Treating accuracy as a single number. "Our model is 95% accurate" is a meaningless statement without specifying accurate at what, measured how, on what data, and at what confidence threshold. We learned to report accuracy across 12 different dimensions. [LINK:post-10]
- Underinvesting in change management. We allocated 5% of our AI budget to training and change management. It should have been 20%. The technology worked. The people resisted. According to Prosci's 2022 benchmarking data, projects with excellent change management were 6x more likely to meet objectives.
- Optimizing for the wrong metric. We spent Q1 optimizing for processing speed. Our users cared about accuracy and transparency. Speed was table stakes. Optimizing for it was a three-month detour.
- Ignoring the feedback loop. Our initial system had no mechanism for users to report errors back to the model. Without feedback loops, models degrade. We lost 2.3 percentage points of accuracy before we built the correction pipeline. [LINK:post-8]
- Confusing "AI-first" with "AI-only." The best systems we shipped in 2022 were hybrid: AI handled the high-volume, pattern-matching work, and humans handled exceptions, edge cases, and quality assurance. Pure AI systems consistently underperformed hybrid ones.
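The first mistake above, treating accuracy as a single number, is cheap to avoid: slice accuracy by whatever dimensions matter before reporting it. A minimal sketch, with hypothetical slice names standing in for dimensions like document type or image quality tier:

```python
from collections import defaultdict

def sliced_accuracy(records):
    """Report accuracy per slice instead of a single aggregate number.

    Each record is (slice_name, predicted, actual). Slices might be
    document type, image quality tier, or payroll provider.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for slice_name, predicted, actual in records:
        totals[slice_name] += 1
        hits[slice_name] += int(predicted == actual)
    return {s: hits[s] / totals[s] for s in totals}

# A headline "90% accurate" can hide a completely failing slice:
records = [("clean_scan", "W-2", "W-2")] * 9 + [("phone_photo", "W-2", "1099")]
print(sliced_accuracy(records))  # -> {'clean_scan': 1.0, 'phone_photo': 0.0}
```

The aggregate here is 90%, yet one slice is at zero. That is the failure mode a single-number report conceals.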
What does this mean for enterprise AI in 2023?
Based on 12 months of operating AI systems at enterprise scale, here are my predictions for 2023. I will revisit these in December to see how they hold up.
- Data quality will become a formal discipline. In 2022, data quality was everyone's problem and nobody's job. In 2023, I expect to see dedicated data quality roles and tools become standard in AI teams. Gartner predicts that by 2025, organizations investing in data quality will outperform competitors by 70% in revenue. The organizations that start in 2023 will have a head start.
- LLMs will supplement but not replace existing ML pipelines. ChatGPT is impressive. But enterprise ML systems built on structured data, classification models, and rule engines will not be ripped out and replaced. LLMs will add a new layer, especially for unstructured text processing and natural language interfaces. [LINK:post-19]
- Change management will get a seat at the AI table. The organizations that struggled with AI adoption in 2022 struggled because of people, not technology. I expect 2023 budgets to reflect this reality.
- The "AI demo to production" gap will widen before it narrows. As LLMs make demos even more impressive, the gap between what is shown in a boardroom and what ships in production will grow. Managing expectations will become a core competency. [LINK:post-17]
2022 was the year enterprise AI grew up. Not because the technology matured, but because the organizations deploying it finally started treating it like an operational capability rather than a science experiment. The real work was never about the algorithms. It was about the data, the people, and the processes that surrounded them.
Frequently Asked Questions
What was the biggest surprise in enterprise AI in 2022?
The biggest surprise was how much of the work was non-technical. Across all three AI systems we moved to production, approximately 60% of total effort went to change management, data quality, and integration work. Only 40% was actual model development and engineering. Most AI product managers I spoke with reported similar ratios. The technology was rarely the bottleneck.
How should enterprise teams prepare for the impact of ChatGPT and LLMs?
First, audit your existing AI systems and identify where unstructured text processing is a bottleneck. That is where LLMs will have the most immediate impact. Second, invest in prompt engineering and evaluation frameworks now, before you need them. Third, do not panic-buy AI solutions from vendors who just bolted an API call onto their existing product. According to Forrester's Q4 2022 analysis, 68% of "AI-enhanced" enterprise products added no measurable value over their non-AI predecessors.
What metrics should AI product managers track beyond accuracy?
We tracked 14 metrics by year end, but the five that mattered most were: user trust score (survey-based), time-to-correction (how long it took to fix an AI error), coverage rate (percentage of cases the AI could handle without human intervention), feedback loop velocity (how quickly user corrections improved the model), and total cost of ownership including human review costs. Accuracy alone is a vanity metric.
Is it worth moving from pilot to production if accuracy is below 95%?
It depends entirely on what the AI is doing and what the cost of errors is. For our document classification system, 96.8% accuracy in production was acceptable because the second-tier human review caught the remaining errors cheaply. For tax calculation, anything below 99.5% was unacceptable because errors were costly. The threshold should be determined by the cost of errors multiplied by the error rate, compared against the cost of the human-only alternative. Sometimes 88% accuracy with a human review layer beats 99% accuracy from humans alone on speed and total cost.
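That break-even logic is easy to sanity-check with a back-of-the-envelope model. Every dollar figure and rate below is invented for illustration; the structure of the comparison is the point:

```python
def expected_cost(volume, error_rate, cost_per_error, cost_per_doc):
    """Total pipeline cost: per-document processing cost plus the
    expected cost of the errors that slip through."""
    return volume * (cost_per_doc + error_rate * cost_per_error)

VOLUME = 100_000          # documents per season (illustrative)
COST_PER_ERROR = 50.0     # rework plus client impact (assumed)

# 88%-accurate AI with a cheap review layer that catches 95% of its errors
ai_hybrid = expected_cost(VOLUME, error_rate=0.12 * 0.05,
                          cost_per_error=COST_PER_ERROR, cost_per_doc=0.40)

# 99%-accurate human-only processing, far more expensive per document
human_only = expected_cost(VOLUME, error_rate=0.01,
                           cost_per_error=COST_PER_ERROR, cost_per_doc=2.50)

print(f"AI + review: ${ai_hybrid:,.0f}")   # -> AI + review: $70,000
print(f"Human only:  ${human_only:,.0f}")  # -> Human only:  $300,000
```

Under these assumed numbers the "less accurate" hybrid pipeline wins by a wide margin, which is exactly why the raw accuracy number alone cannot answer the question.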
Last updated: December 20, 2022