16,000 Users, 4 AI Systems: A Retrospective
December 8, 2024 · 16 min read · Product Retrospective
Over 18 months at a YC-backed tax-tech startup, I built and managed four AI systems that processed 128,000 documents for 16,000 users: a document intake pipeline, a classification engine, an expert matching system, and a conversational assistant. Here is the honest retrospective -- what worked, what failed, what I would rebuild from scratch, and the architectural lessons I carry forward. The biggest lesson: multi-model architectures beat single-model bets every time.
What were the four AI systems and what did each one do?
Building one AI system is hard. Building four that need to work together is a different category of problem. Each of our four systems had a distinct job, a distinct architecture, and a distinct set of failure modes. They shared data, depended on each other's outputs, and were coupled in ways I did not fully understand until things broke.
| System | Purpose | Primary Model | Volume (Season) | Accuracy |
|---|---|---|---|---|
| Document Intake | OCR, extraction, validation of tax documents | GPT-4 Vision + Claude | 128,000 documents | 94.2% |
| Classification | Categorize documents into 14 tax types | Claude (Haiku for speed) | 128,000 classifications | 96.8% |
| Expert Matching | Assign users to the right tax specialist | Custom scoring + Claude review | 16,000 assignments | 91.3% first-match |
| Chat Assistant | Answer user questions about their taxes | Claude (Sonnet) | 25,500 conversations | 4.5/5 satisfaction |
The numbers look clean in a table. The reality was messier. Each system went through 3-5 major iterations. The document intake pipeline was rebuilt twice. The chat assistant nearly shipped with a hallucination rate that would have been legally dangerous. The expert matching system worked perfectly in testing and failed its first real-world deployment because we had not accounted for expert availability changing mid-day.
How did the document intake pipeline work?
The document intake pipeline was the most complex of the four systems. Users uploaded photos and PDFs of their tax documents -- W-2s, 1099s, K-1s, mortgage statements, anything the IRS might care about. The pipeline had to handle blurry photos, rotated pages, multi-page PDFs, and documents in formats we had never seen before.
We used a multi-modal architecture. GPT-4 Vision handled the initial image understanding -- detecting what type of document was in the image, identifying key regions (employer name, wages, withholding amounts), and flagging quality issues (blurry, cut off, wrong document). Claude handled the structured extraction -- taking the regions identified by GPT-4 Vision and producing typed JSON with validated field values.
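The two-stage split can be sketched as a small pipeline. This is a minimal illustration, not our production code: the function names, `Region` shape, and stubbed model calls are all placeholders for what were GPT-4 Vision and Claude API calls with real prompts and schemas.

```python
from dataclasses import dataclass

@dataclass
class Region:
    name: str          # e.g. "wages", "federal_withholding"
    bbox: tuple        # (x, y, w, h) in page coordinates
    quality_flags: list  # e.g. ["blurry", "cut_off"]

def detect_regions(image_bytes: bytes) -> list:
    """Stage 1 (vision model): detect the document type, key regions,
    and quality issues. Stubbed here; in production this wrapped a
    GPT-4 Vision call."""
    return [Region("wages", (120, 340, 200, 30), []),
            Region("federal_withholding", (120, 400, 200, 30), [])]

def extract_fields(image_bytes: bytes, regions: list) -> dict:
    """Stage 2 (text model): produce typed JSON from the detected
    regions. Stubbed here; in production this wrapped a Claude call
    constrained by a per-document-type extraction schema."""
    return {r.name: {"value": 0.0, "confidence": 0.0} for r in regions}

def intake(image_bytes: bytes) -> dict:
    """Full pipeline: visual understanding, then structured extraction."""
    regions = detect_regions(image_bytes)
    return extract_fields(image_bytes, regions)
```

The point of the split is that each stage can be swapped, tested, and priced independently: the vision stage never emits final values, and the extraction stage never interprets pixels.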
According to a 2024 study by Gartner on document AI pipelines, multi-model architectures outperform single-model approaches by 12-18 percentage points on complex document types (forms with mixed layouts, handwritten annotations, multi-page structures). Our results were consistent with that finding: when we tested a single-model approach early in development, accuracy was 81%. The multi-model approach reached 94.2%.
What worked in the document intake pipeline?
- Multi-modal splitting: Using GPT-4 Vision for visual understanding and Claude for structured extraction was the right architectural decision. Each model played to its strengths. GPT-4 Vision was better at understanding spatial layouts and detecting document types from visual patterns. Claude was better at producing reliable structured JSON and following extraction schemas.
- Confidence scoring: Every extraction included a confidence score per field. Fields below 0.85 confidence were flagged for human review. This caught 89% of extraction errors before they reached users. According to our post-season analysis, the false positive rate (fields flagged that were actually correct) was 11%, which we considered acceptable. [LINK:post-20]
- Progressive enhancement: We started with the simplest document type (W-2, which has a standardized layout) and added document types one at a time. Each new document type got its own extraction schema, its own test suite, and its own accuracy threshold before being promoted to production.
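The per-field confidence triage described above reduces to a few lines. A minimal sketch, assuming extractions arrive as a field-to-`{value, confidence}` mapping (the 0.85 threshold is from the post; the data shape is illustrative):

```python
CONFIDENCE_THRESHOLD = 0.85  # fields below this are flagged for human review

def triage_fields(extraction: dict) -> tuple:
    """Split an extraction into auto-accepted fields and fields
    routed to a human reviewer, based on per-field confidence."""
    accepted, flagged = {}, {}
    for field, data in extraction.items():
        if data["confidence"] >= CONFIDENCE_THRESHOLD:
            accepted[field] = data
        else:
            flagged[field] = data
    return accepted, flagged

# Usage: the low-confidence withholding field goes to review,
# the high-confidence wages field passes through automatically.
extraction = {
    "wages": {"value": 54000.00, "confidence": 0.97},
    "federal_withholding": {"value": 6100.00, "confidence": 0.71},
}
accepted, flagged = triage_fields(extraction)
```

Keeping the threshold as a single named constant also makes it cheap to tune per document type, which matters once accuracy varies as much as it did across our 14 types.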
What would I rebuild in the document intake pipeline?
- The validation layer came too late. We added field-level validation (e.g., "federal withholding cannot exceed total wages") after the pipeline was in production. It should have been there from day one. Those validation rules caught 6% of extractions that had passed the confidence threshold but contained logically impossible values. According to our error analysis, logical validation would have prevented 340 user-facing errors in the first month alone.
- We underinvested in preprocessing. Image quality issues (rotation, blur, lighting) accounted for 43% of extraction failures. We spent months optimizing the AI models when we should have spent weeks on image preprocessing -- auto-rotation, contrast enhancement, noise reduction. A 2024 benchmark by NIST on document AI found that preprocessing improvements yield 2-3x the accuracy gain per engineering hour compared to model improvements for document types with variable image quality.
- The fallback path was an afterthought. When the AI pipeline failed, users saw a generic error message and had to re-upload. We should have built a manual entry fallback from the start -- a form where users could type the values themselves. By the time we added it, we had lost 8% of users who gave up after failed uploads.
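The validation layer we added late looks roughly like this. A sketch only: the rule set and field names are illustrative, with the "withholding cannot exceed wages" rule taken from the example above.

```python
def validate_w2(fields: dict) -> list:
    """Field-level sanity rules applied after extraction, regardless of
    confidence score. Returns human-readable violations (empty = valid)."""
    errors = []
    wages = fields.get("wages", 0.0)
    withholding = fields.get("federal_withholding", 0.0)
    if wages < 0 or withholding < 0:
        errors.append("amounts cannot be negative")
    if withholding > wages:
        errors.append("federal withholding cannot exceed total wages")
    return errors
```

Rules like these are cheap, deterministic, and catch exactly the failure mode confidence scores miss: an extraction the model is sure about that is nonetheless logically impossible.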
How did classification and expert matching interact?
Classification and expert matching were tightly coupled. Classification determined the document type and complexity category. Expert matching used the complexity category to find the right specialist. If classification was wrong, expert matching was wrong too, and the user ended up with the wrong specialist for their situation.
Classification used Claude (Haiku) for speed -- it needed to return a result in under 2 seconds so users saw immediate feedback after uploading a document. We tested 14 document types and achieved 96.8% accuracy overall. But accuracy was not uniform across types.
| Document Type | Classification Accuracy | Volume (% of total) | Impact of Misclassification |
|---|---|---|---|
| W-2 | 99.1% | 34% | Low (simple routing) |
| 1099-NEC | 97.4% | 18% | Medium (self-employment routing) |
| 1099-INT / 1099-DIV | 95.8% | 12% | Low (investment routing) |
| K-1 | 93.2% | 4% | High (partnership tax specialist) |
| Foreign tax documents | 87.6% | 3% | Very High (international tax specialist) |
| Other (10 types) | 96.1% avg | 29% | Varies |
The 87.6% accuracy on foreign tax documents was our biggest pain point. These documents had non-standard layouts, multiple languages, and formats that varied by country. Misclassification meant a user with international tax needs was routed to a domestic specialist who could not help them. According to our post-season user research, 23% of users who were initially misrouted reported a negative overall experience, compared to 4% of users who were correctly routed on the first attempt. [LINK:post-20]
Expert matching used a weighted scoring algorithm that combined expertise fit (50% weight), availability (30%), and historical performance (20%). The system made 15,000 assignments autonomously and escalated 1,000 to human review. First-match accuracy was 91.3%, meaning 91.3% of users stayed with their initially assigned expert through completion.
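The scoring algorithm is simple to sketch. The 50/30/20 weights are from the post; the 0.6 escalation threshold and the assumption that each component arrives pre-normalized to [0, 1] are illustrative, not our production values.

```python
WEIGHTS = {"expertise_fit": 0.50, "availability": 0.30, "performance": 0.20}

def score_expert(components: dict) -> float:
    """Weighted match score in [0, 1]; each component is assumed to be
    pre-normalized to [0, 1] by upstream logic (not shown)."""
    return sum(WEIGHTS[k] * components[k] for k in WEIGHTS)

def assign(experts, threshold=0.6):
    """Pick the best-scoring expert, or return None to escalate to a
    human reviewer when no candidate clears the threshold."""
    best = max(experts, key=lambda e: score_expert(e["components"]))
    if score_expert(best["components"]) < threshold:
        return None  # escalate to human review
    return best
```

The `None` path is what produced the 1,000 human-reviewed assignments: the system declines to guess rather than shipping a weak match.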
What did the chat assistant teach me about conversational AI in regulated domains?
The chat assistant was the system I was most proud of and most scared of. It answered user questions about their tax situations using the context from their uploaded documents. It handled 25,500 conversations over the season with a 4.5 out of 5 satisfaction score.
The fear was hallucination. A chatbot that hallucinates restaurant recommendations is annoying. A chatbot that hallucinates tax advice is potentially illegal. According to a 2024 analysis by the IRS Taxpayer Advocate Service, incorrect tax advice -- even from AI -- can still create penalties for the taxpayer. We could not afford a chatbot that made up numbers.
Our architecture used three layers of hallucination prevention:
- Grounded responses only: The assistant could only reference data that existed in the user's uploaded documents. If a user asked about a deduction that was not supported by their documents, the assistant said "I don't see documentation for that deduction in your uploaded files" rather than guessing.
- Citation requirements: Every factual claim in the assistant's response was required to cite the source document and field. "Based on your W-2 from [employer], your federal withholding was $X." If the model could not produce a citation, it was trained to say so.
- Hard guardrails: The assistant could never recommend a specific filing position, estimate a refund amount, or advise whether to itemize vs. take the standard deduction. Those questions were routed to human experts. According to a 2024 survey by the American Institute of CPAs, 91% of tax professionals believe AI should assist but not replace professional judgment on filing positions. Our guardrails reflected that consensus.
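The citation layer can be enforced as a post-generation check. This is a sketch under stated assumptions: the `[doc:...]` marker syntax is invented for illustration (our production format differed), but the logic, reject any draft whose citations do not all resolve to the user's uploaded documents, is the layer described above.

```python
import re

# Illustrative citation marker, e.g. "[doc:w2-2023]" -- not the production syntax.
CITATION_PATTERN = re.compile(r"\[doc:(?P<doc_id>[\w-]+)\]")

def check_grounding(response: str, uploaded_doc_ids: set) -> bool:
    """Accept a drafted response only if it contains at least one
    citation and every cited document was actually uploaded by the user.
    Rejected drafts are regenerated or routed to the refusal path."""
    cited = {m.group("doc_id") for m in CITATION_PATTERN.finditer(response)}
    return bool(cited) and cited <= uploaded_doc_ids
```

A draft with zero citations fails by design: an uncited factual answer is treated as a potential hallucination even if it happens to be correct.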
The lesson I carry forward: In regulated domains, the most important feature of your AI system is what it refuses to do. We spent as much time designing the refusal paths as the response paths. That investment paid off: zero legal or compliance incidents across 25,500 conversations.
What would I rebuild if I started over today?
If I started these four systems from scratch today, three things would change fundamentally.
Change 1: Unified extraction pipeline with shared schemas. We built each system's data contracts independently. The document intake pipeline produced one JSON schema. The classification engine consumed a different schema. The expert matching system used yet another format. Every boundary was a translation layer, and every translation layer was a bug surface. Today I would design shared schemas that all four systems consume and produce, with a contract auditor that catches schema mismatches before deployment. [LINK:post-38]
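A contract auditor can be as simple as comparing declared schemas at every system boundary. The schemas and system names below are illustrative placeholders, not our real contracts; the point is that mismatches fail a check before deployment instead of surfacing as runtime bugs.

```python
# Each system declares the keys it emits and the keys it expects upstream.
EMITS = {"intake": {"doc_id", "doc_type", "fields"},
         "classification": {"doc_id", "doc_type", "complexity"}}
EXPECTS = {"classification": {"doc_id", "fields"},
           "expert_matching": {"doc_id", "complexity"}}

def audit(pipeline: list) -> list:
    """Return (producer, consumer, missing_keys) for every boundary where
    a downstream system expects keys its upstream does not emit."""
    problems = []
    for producer, consumer in zip(pipeline, pipeline[1:]):
        missing = EXPECTS[consumer] - EMITS[producer]
        if missing:
            problems.append((producer, consumer, missing))
    return problems
```

Run in CI, an empty result means every translation layer's inputs are actually produced by the system before it.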
Change 2: Cost-aware model selection from day one. We used GPT-4 for everything in the first version. Our AI processing costs peaked at $0.43 per user per session. After optimizing -- using Haiku for classification, Sonnet for chat, GPT-4 Vision only for the visual understanding layer -- we brought it down to $0.14. A 67% reduction. According to a 2024 analysis by a16z, the median AI processing cost for B2C applications is $0.08-$0.15 per user interaction. We were above the median for months before optimizing. Today I would start with a cost-aware model selection framework: use the cheapest model that meets the accuracy threshold for each task.
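The selection framework reduces to "cheapest model that clears the bar." A minimal sketch, where the candidate table's per-call costs and accuracies are made-up placeholders rather than the post's real numbers:

```python
# (model, cost per call in USD, measured accuracy) -- placeholder values.
CANDIDATES = {
    "classification": [("haiku", 0.001, 0.968), ("sonnet", 0.008, 0.972)],
    "extraction":     [("haiku", 0.001, 0.880), ("sonnet", 0.008, 0.940)],
}

def pick_model(task: str, accuracy_floor: float) -> str:
    """Return the cheapest candidate that meets the task's accuracy
    threshold; fail loudly if no model qualifies."""
    viable = [c for c in CANDIDATES[task] if c[2] >= accuracy_floor]
    if not viable:
        raise ValueError(f"no model meets {accuracy_floor:.0%} for {task}")
    return min(viable, key=lambda c: c[1])[0]
```

The discipline this enforces is setting the accuracy floor per task first, then letting cost decide, rather than defaulting every task to the most capable model.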
Change 3: Observability as a first-class concern. We added logging and monitoring incrementally. By the time we had good observability, we had already missed weeks of production data that would have helped us diagnose early issues faster. According to a 2024 survey by Datadog, teams that implement AI observability before production launch resolve production incidents 2.8x faster than teams that add observability retroactively. Today I would set up structured logging, confidence score tracking, latency monitoring, and cost tracking before writing the first line of model interaction code.
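"Before the first line of model interaction code" can mean something as small as a structured log record per model call. A sketch with illustrative field names (the real schema would also carry request IDs, prompt versions, and so on):

```python
import json
import time
import uuid

def log_model_call(task: str, model: str, latency_ms: float,
                   confidence: float, cost_usd: float) -> str:
    """Emit one structured JSON log line per model interaction, covering
    the four signals named above: task context, confidence, latency, cost."""
    record = {
        "event": "model_call",
        "id": str(uuid.uuid4()),
        "ts": time.time(),
        "task": task,
        "model": model,
        "latency_ms": latency_ms,
        "confidence": confidence,
        "cost_usd": cost_usd,
    }
    line = json.dumps(record)
    print(line)  # stand-in for the real log sink
    return line
```

Because every call produces the same record shape from day one, confidence drift, latency spikes, and cost creep are all queryable the moment they start, not weeks later.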
What are the architectural lessons from running 4 AI systems in production?
After 18 months, 128,000 documents, and 16,000 users, these are the lessons that generalize beyond tax-tech:
Lesson 1: Multi-model beats single-model. No single model is best at everything. The architecture that works is: fast, cheap models for high-volume, low-complexity tasks (classification). Powerful models for high-complexity, low-volume tasks (extraction of unusual documents). Specialized models for domain-specific capabilities (vision for document understanding). [LINK:post-37]
Lesson 2: Confidence scores are your most important output. More important than the answer itself. A wrong answer with a low confidence score is manageable -- the system routes it to a human. A wrong answer with a high confidence score is a disaster -- the system trusts it and acts on it. Calibrating confidence was harder than building the models.
Lesson 3: Humans are the fallback, not the bottleneck. Every AI system needs a human escalation path. But the goal is not to minimize human involvement -- it is to maximize the quality of human involvement. Humans should handle the cases that AI genuinely cannot, not the cases where AI is slightly uncertain. According to our metrics, improving escalation precision -- cutting the share of escalations that turned out not to need a human from 48% to 17% -- made our human reviewers 3.2x more effective, because they focused on genuinely hard cases.
Lesson 4: Ship the fallback first. Before building the AI-powered path, build the manual path. Users can always type their data, select their document type, and request a human match. Once the manual path works, layer AI on top. This gives you a reliable fallback and a baseline to measure AI improvement against.
Lesson 5: Your AI system is only as good as your data pipeline. We spent 60% of our engineering time on data pipeline work -- ingestion, validation, normalization, storage -- and 40% on model work. The teams that spend 80% on models and 20% on data pipelines ship impressive demos and unreliable products.
Frequently Asked Questions
What was the total cost of running 4 AI systems for 16,000 users?
After optimization, our AI processing cost was approximately $0.14 per user per session, totaling roughly $2,240 per month during peak season. This included API calls to GPT-4 Vision, Claude (Haiku, Sonnet, and Opus for different tasks), vector storage for the chat assistant's retrieval augmented generation, and logging infrastructure. The cost was less than 5% of our total operating expenses, which is consistent with the 2024 a16z finding that AI costs for B2C applications average 3-8% of total operating costs.
How did you handle model updates and version changes?
Model updates were our second-biggest source of production incidents (after data quality issues). We maintained a shadow testing pipeline that ran new model versions against 500 representative documents and compared outputs to our baseline. No model update went to production without passing shadow testing with less than 2% accuracy regression. According to a 2024 survey by MLOps Community, 67% of AI teams have experienced a production incident caused by an unvalidated model update. We had one before implementing shadow testing. We had zero after.
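The shadow-testing gate itself is a small comparison. A sketch, assuming both model versions have already been run over the same representative document set against known ground truth (the 2% regression ceiling is from our process; the data shape is illustrative):

```python
MAX_REGRESSION = 0.02  # block deploys that lose 2+ points of accuracy

def shadow_test(baseline_outputs: list, candidate_outputs: list,
                ground_truth: list) -> bool:
    """Compare a candidate model version against the current baseline on
    the same representative set; pass only if accuracy regresses by less
    than MAX_REGRESSION."""
    def accuracy(outputs):
        return sum(o == t for o, t in zip(outputs, ground_truth)) / len(ground_truth)
    return accuracy(baseline_outputs) - accuracy(candidate_outputs) < MAX_REGRESSION
```

Note the gate is one-sided: a candidate that improves accuracy always passes, so the check never blocks a genuinely better model.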
What was the hardest system to build and why?
The chat assistant, without question. Not because of the technology -- Claude handles conversational AI well -- but because of the guardrails. Designing what the assistant should refuse to answer, how it should handle edge cases, and how to verify it was not hallucinating required more product thinking than any of the other systems. The hallucination prevention architecture alone went through five iterations before it was production-ready.
How do you decide when to use AI versus traditional software?
My rule of thumb: if the task requires judgment, adaptation, or handling of unstructured input, use AI. If the task is deterministic and high-volume, use traditional software. Document classification requires judgment (is this a W-2 or a 1099?) -- use AI. Calculating federal tax based on classified income -- use traditional software (a formula). The boundary is where structured meets unstructured.
Would you use the same models today?
The specific model choices would change because the landscape has evolved, but the multi-model architecture would remain the same. Today I would use a more cost-efficient model for classification (the inference cost gap between 2023 and 2025 models is roughly 5-10x for equivalent capability), and I would lean more heavily on multi-modal models for the intake pipeline since their vision capabilities have improved dramatically. The architectural principle -- right model for the right task -- is durable even as specific models change.
Published December 8, 2024. Based on the author's experience building and managing 4 AI systems at a YC-backed tax-tech startup serving 16,000 users across two filing seasons.