The AI PM's Year in Review: From Hype to Production in 2023


December 15, 2023 · 14 min read · Year in Review

2023 was the year AI went from demos to deployments. GPT-4 launched in March. Claude arrived. Every startup pivoted to AI. But the gap between demo and production was enormous -- and most AI products still failed. Here is what actually worked versus what was hype, from someone who shipped AI features to 16,000 users at a YC-backed tax-tech startup. Plus honest predictions for 2024, and the things I got completely wrong this year.

What actually happened in AI product management in 2023?

In January 2023, ChatGPT had just hit 100 million users in two months. Every board deck had "AI strategy" on slide three. Every PM was asked "what is our AI play?" And most of us had no idea how to answer that question in a way that would survive contact with production.

I spent 2023 as a product manager at a YC-backed tax-tech startup, shipping AI features to 16,000 users. According to a McKinsey survey from August 2023, 79% of respondents had some exposure to generative AI, but only 22% of organizations had adopted it in at least one business function. That gap -- between exposure and adoption -- was the defining story of the year.

What was real versus what was hype?

The hardest part of being an AI PM in 2023 was separating signal from noise. Every week brought a new model, a new framework, a new "this changes everything" announcement. Here is my honest assessment of what delivered and what did not.

  • "GPT-4 will replace your entire workflow" -- Hype. GPT-4 was a step change in capability, but replacing workflows required months of prompt engineering, guardrails, and evaluation pipelines. We spent 4 months making it production-ready for a single use case.
  • "AI will handle document extraction end-to-end" -- Partially real. Multi-modal AI handled 60-70% of documents well; the remaining 30-40% needed fallback pipelines. [LINK:post-29]
  • "RAG solves the hallucination problem" -- Hype. RAG reduced hallucinations by roughly 40% in our tests, but introduced new failure modes: retrieval misses, context window overflows, and stale embeddings. It was a tool, not a solution.
  • "Smaller models can replace GPT-4 for most tasks" -- Real. By Q4, we ran 80% of our inference on smaller models, reserving GPT-4 and Claude for complex cases. This cut costs 30x. [LINK:post-27]
  • "AI agents will automate entire workflows" -- Hype. Agents were impressive in demos; in production, they failed unpredictably. According to a 2023 LangChain survey, only 8% of agent-based projects made it to production.
  • "Prompt engineering is a real discipline" -- Real. Prompt engineering was the most underrated skill of 2023. Our best prompts went through 50+ iterations, and the difference between a naive prompt and an optimized one was 30+ percentage points in accuracy.
  • "Fine-tuning is dead, prompting is all you need" -- Hype. Fine-tuning was expensive but necessary for specialized domains. Our domain-specific prompts plateaued at 82% accuracy; fine-tuning pushed it to 94%.
  • "Vector databases are essential for every AI app" -- Partially real. Vector DBs were genuinely useful for search and retrieval, but many teams adopted them prematurely for problems that simple keyword search would have solved.

What were the biggest lessons from shipping AI to 16,000 users?

Shipping AI to real users taught me things that no benchmark or demo could. Here are the five most important lessons from the year.

Lesson 1: Evaluation is the product

The single biggest mistake we made early in 2023 was treating evaluation as an afterthought. We would build a feature, eyeball a few outputs, and ship it. By March, we had a 23% error rate on document extraction that users were catching before we did. According to a 2023 survey by Weights & Biases, 67% of ML teams said evaluation was their biggest bottleneck. We rebuilt our evaluation pipeline from scratch -- offline benchmarks, shadow testing, canary deployments, production monitoring -- and it became the single most valuable investment of the year. [LINK:post-30]
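To make "evaluation is the product" concrete, here is a minimal sketch of the offline-benchmark stage: score extraction output against a golden labeled set and report field-level accuracy. The field names, the golden examples, and the stubbed `extract_fields` are hypothetical stand-ins for a real model-backed extractor, not our actual pipeline.

```python
# Minimal offline-benchmark sketch for a document-extraction feature.
# extract_fields() is a placeholder for the real model call.

def extract_fields(document: str) -> dict:
    # Stub: a real implementation would call the extraction model here.
    return {"employer": "Acme Corp", "wages": "52000.00"}

def field_accuracy(golden: list[dict]) -> float:
    """Fraction of individual fields extracted correctly across the golden set."""
    correct = total = 0
    for example in golden:
        predicted = extract_fields(example["document"])
        for field, expected in example["labels"].items():
            total += 1
            if predicted.get(field) == expected:
                correct += 1
    return correct / total if total else 0.0

golden_set = [
    {"document": "W-2 text ...",
     "labels": {"employer": "Acme Corp", "wages": "52000.00"}},
    {"document": "W-2 text ...",
     "labels": {"employer": "Beta LLC", "wages": "71250.00"}},
]

print(f"field accuracy: {field_accuracy(golden_set):.1%}")
```

The point of the harness is that it runs on every change: a prompt tweak, a model swap, or a new provider gets the same scorecard before anything ships.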

Lesson 2: Users do not care about the model

We spent weeks debating GPT-4 versus Claude versus open-source models. Users never once asked which model we used. They cared about three things: Was the output correct? Was it fast? Did it feel trustworthy? According to our user research, trust signals (showing the source document, highlighting extracted fields, letting users correct errors) increased adoption 34% more than any model upgrade did.

Lesson 3: The cost curve bends if you architect for it

Our initial AI pipeline cost $2.30 per document. At 16,000 users with an average of 8 documents each, that was $294,400 in inference costs for a single tax season. By Q4, we had architected a cascade system that cut costs to $0.08 per document -- a 30x reduction -- without measurable quality loss. The key insight: most documents do not need GPT-4. A cheap model handles 80% of cases; you only escalate the hard ones. [LINK:post-27]
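The cascade idea can be sketched in a few lines: run every document through a cheap model first, and escalate only low-confidence results to the expensive model. The costs, the confidence heuristic, and the function names below are illustrative assumptions, not our actual values or code.

```python
# Cascade routing sketch: cheap model first, escalate low-confidence docs.
# Per-document costs and the confidence rule are made up for illustration.

CHEAP_COST = 0.02      # per document, cheap model
EXPENSIVE_COST = 2.30  # per document, frontier model

def cheap_extract(doc: str) -> tuple[dict, float]:
    # Stub: returns (fields, confidence in [0, 1]). A real version would
    # use model log-probs or a validator to estimate confidence.
    confidence = 0.95 if "W-2" in doc else 0.40
    return {"type": "W-2"}, confidence

def expensive_extract(doc: str) -> dict:
    # Stub for the frontier-model fallback.
    return {"type": "unknown-form"}

def extract_with_cascade(docs: list[str], threshold: float = 0.8):
    total_cost = 0.0
    results = []
    for doc in docs:
        fields, confidence = cheap_extract(doc)
        total_cost += CHEAP_COST
        if confidence < threshold:
            fields = expensive_extract(doc)  # escalate only the hard cases
            total_cost += EXPENSIVE_COST
        results.append(fields)
    return results, total_cost

docs = ["W-2 for Alice", "W-2 for Bob", "handwritten schedule C", "W-2 for Carol"]
results, cost = extract_with_cascade(docs)
print(f"processed {len(results)} docs for ${cost:.2f}")
```

The economics follow directly: if the cheap model confidently handles most documents, the blended per-document cost approaches the cheap rate, with the expensive model amortized over the minority of escalations.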

Lesson 4: Latency tolerance is context-dependent

According to Google's research on user patience, the median tolerance for page loads is 3 seconds. For AI features, we found it was dramatically different depending on the perceived complexity of the task. Users waited 12 seconds for document analysis without complaint because they understood it was "thinking." But they abandoned a chatbot response after 4 seconds. The insight: set expectations explicitly. A progress bar that says "Analyzing page 3 of 7" buys you 10 extra seconds of patience.
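Setting expectations explicitly is mostly a UX pattern, but the mechanics are simple enough to sketch: report progress per unit of visible work rather than showing a spinner. Everything here (`analyze_page` as a stand-in, the message format) is a hypothetical illustration of the pattern, not our implementation.

```python
# Progress-messaging sketch: emit "Analyzing page i of N" per page so the
# user understands the system is working, not hung. The analysis step is
# a placeholder.

def analyze_document(pages: list[str], report=print) -> list[str]:
    results = []
    total = len(pages)
    for i, page in enumerate(pages, start=1):
        report(f"Analyzing page {i} of {total}")  # set expectations explicitly
        results.append(page.upper())              # placeholder for real analysis
    return results

analyze_document(["page one text", "page two text", "page three text"])
```

The `report` callback makes the same loop testable offline and wirable to a real progress bar in the UI.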

Lesson 5: The PM's job changed more than anyone expected

In 2022, my job was writing specs, running sprints, and prioritizing backlogs. In 2023, I spent 40% of my time on prompt engineering, evaluation design, and cost optimization. According to a late-2023 Reforge survey, 61% of PMs reported that AI features required fundamentally different planning approaches. The traditional spec -- "given X input, produce Y output" -- does not work when the output is probabilistic. I had to learn to think in distributions, not deterministic outcomes.

What did I get wrong this year?

Intellectual honesty matters. Here are the predictions I made in January 2023 and how they played out.

  1. "Fine-tuning will become obsolete by Q3." Wrong. Fine-tuning remained essential for specialized domains. The models got better at zero-shot, but production accuracy requirements in tax preparation demanded fine-tuned models for our highest-stakes use cases.
  2. "We will ship an AI agent for end-to-end tax preparation by Q4." Wrong. We shipped an AI-assisted workflow, not an autonomous agent. The reliability bar for autonomous tax preparation was too high. According to our testing, the best agent-based approach had a 12% critical error rate -- unacceptable for financial documents.
  3. "Open-source models will close the gap with GPT-4 by summer." Partially right. Llama 2 and Mistral made impressive progress, but for our specific use case (structured extraction from financial documents), the gap remained significant through year-end. Open-source caught up on general tasks but lagged on specialized ones.
  4. "RAG will solve our knowledge management problem." Wrong. RAG solved retrieval. It did not solve the harder problems: keeping embeddings fresh, handling contradictory information in source documents, and managing context windows. We ended up building a custom knowledge layer on top of RAG.

What were the key metrics from our AI deployment?

Numbers matter more than narratives. Here is what our AI deployment actually achieved in 2023.

All figures compare January 2023 with December 2023:

  • Document extraction accuracy: 77% → 94.2% (+17.2 pp)
  • Cost per document: $2.30 → $0.08 (-96.5%)
  • Median processing latency: 18 seconds → 3.2 seconds (-82.2%)
  • User satisfaction (AI features): 3.2 / 5 → 4.4 / 5 (+37.5%)
  • Automated test coverage: 43 tests → 510 tests (+1,086%)
  • AI providers in production: 1 → 4 (+300%)
  • Specialized analyzers: 8 → 60 (+650%)

The extraction platform evolution -- from 1 provider and 8 analyzers to 4 providers and 60 analyzers -- was the most important architectural decision of the year. [LINK:post-28]

What are my predictions for AI product management in 2024?

Given my track record above, take these with appropriate skepticism.

  1. Multi-modal will go mainstream. Claude 3's vision capabilities will move document processing from text-extraction to true document understanding. According to IDC's forecast, 40% of enterprise AI applications will use multi-modal inputs by end of 2024. [LINK:post-29]
  2. Cost optimization will become a core PM skill. In 2024, with AI budgets under scrutiny, PMs who can architect efficient inference pipelines will have a massive advantage. Cascade patterns and model routing will become standard. [LINK:post-27]
  3. Evaluation will become a product category. According to a16z's AI infrastructure report, evaluation tooling was the fastest-growing segment of the AI stack in Q4 2023. [LINK:post-30]
  4. The AI PM role will formalize. According to LinkedIn data, "AI Product Manager" job postings grew 47% in 2023. Distinct competencies around prompt engineering, evaluation design, and AI-specific user research will define the role.
  5. The winners will be multi-model, not single-model. The teams that build provider-agnostic architectures will have flexibility advantages as the model landscape shifts.

What advice would I give to PMs entering AI in 2024?

If you are a PM thinking about moving into AI product management, here is what I wish someone had told me in January 2023.

  • Learn to read model cards and benchmarks. You do not need to train models. You need to evaluate them. Understand what MMLU, HumanEval, and domain-specific benchmarks mean for your use case.
  • Build evaluation before building features. Start every AI feature with: "How will we know this is working?"
  • Design for the failure case first. The user sees a confidently wrong answer, not a 500 error. Design your UX around surfacing uncertainty and enabling correction.
  • Get comfortable with probabilistic thinking. Your feature will work 94.2% of the time. Your job is deciding if that is good enough and what happens during the other 5.8%.
  • Understand the cost structure. A feature that costs $2.30 per session has fundamentally different unit economics than one at $0.08. Learn to estimate costs before scoping features.
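The last bullet, estimating costs before scoping, reduces to simple token arithmetic. The token counts and per-1k-token prices below are illustrative assumptions, not any provider's actual rates; the exercise is the point, not the numbers.

```python
# Back-of-envelope inference cost estimator for scoping an AI feature.
# Prices and token counts are illustrative assumptions.

def estimate_cost_per_session(input_tokens: int, output_tokens: int,
                              price_in_per_1k: float,
                              price_out_per_1k: float) -> float:
    """Estimated inference cost for one user session, in dollars."""
    return (input_tokens / 1000) * price_in_per_1k \
         + (output_tokens / 1000) * price_out_per_1k

# A document-heavy session: ~30k input tokens, ~2k output tokens.
per_session = estimate_cost_per_session(30_000, 2_000,
                                        price_in_per_1k=0.03,
                                        price_out_per_1k=0.06)
fleet = per_session * 16_000  # if every user runs one such session
print(f"${per_session:.2f} per session, ${fleet:,.0f} across 16,000 users")
```

Running this kind of estimate during scoping is what surfaces the difference between a $0.08 feature and a $2.30 feature before engineering starts.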

The one-sentence summary of 2023: AI went from "wow, look at this demo" to "okay, how do we make this actually work at scale?" -- and that transition was harder, slower, and more rewarding than anyone expected.

Frequently Asked Questions

What skills do AI product managers need that traditional PMs do not?

Three critical additions: prompt engineering (40% of the job), evaluation design (building test suites for probabilistic systems), and cost modeling (inference costs directly impact unit economics). According to a 2023 Reforge survey, 73% of successful AI PMs had hands-on experience with at least two of these.

Is AI product management a fad or a permanent role?

Permanent, but evolving. The same question was asked about "mobile PM" in 2012. Today, every PM thinks about mobile. By 2026, every PM will think about AI. The specialized "AI PM" role exists because we are in the transition period. The skills will become table stakes.

How do you handle stakeholders who want to use AI for everything?

I use what I call the "10x test." If AI does not make this feature at least 10x better (faster, cheaper, or more capable), it does not justify the complexity. In 2023, I said no to 14 AI feature proposals that passed the "cool demo" test but failed the "10x better in production" test. The ones that shipped are the ones in this portfolio.

What was the biggest surprise of your AI PM experience in 2023?

How much time I spent on non-AI problems. Data quality, integration testing, user education, error messaging -- these "boring" problems consumed 60% of my time and had more impact than any model upgrade. The AI is the easy part. The product around the AI is the hard part.

Last updated: December 15, 2023