Building an AI Extraction Platform: 4 Providers, 60 Analyzers, 510 Tests

February 22, 2024 · 15 min read · Platform Architecture Case Study

We built a document extraction platform that uses 4 AI providers, 60 specialized analyzers, and 510 automated tests to extract structured data from financial documents. The system processes documents for 16,000 users with 94.2% accuracy. Here is the architecture, why we did not just use one model for everything, and how we built the test pyramid that keeps the whole thing reliable.

Why did we need a multi-provider extraction platform?

The obvious approach to document extraction is simple: take the best available AI model, send it every document, parse the response. We started there. It worked. Then it broke.

In March 2023, our single-provider pipeline processed 4,200 documents in one week. On a Tuesday at 2:47 PM, the provider had a 47-minute outage. During peak tax season. 312 users were mid-upload with no fallback. Our error rate spiked to 100% for those users. According to the provider's own status page, they had 14 degraded-performance incidents that month. We were betting our entire product on a single point of failure.

But the reliability problem was only the second-most-important reason to go multi-provider. The first was quality. According to our internal benchmarks, no single model was best at every document type. GPT-4 excelled at complex K-1 partnership schedules. Claude was better at long-form documents with dense text. Gemini Flash was fastest for simple structured forms. A 2024 Stanford CRFM study confirmed what we saw: model performance varies by 8-15 percentage points across task categories, even among frontier models.

We needed a platform, not a pipe.

What does the multi-provider architecture look like?

The platform has four layers: ingestion, routing, extraction, and validation. Each layer is independently testable and independently deployable.

Layer 1: Ingestion

Documents arrive as PDFs, photos, or scans. The ingestion layer normalizes them into a standard format: cleaned text (via OCR if needed), document images (for multi-modal processing), and metadata (page count, resolution, detected document type). According to our telemetry, 34% of uploaded documents require OCR preprocessing before any AI model can process them.

Layer 2: Routing

The router is a lightweight classifier that examines the document metadata and assigns it to the optimal extraction pathway. It answers two questions: what type of document is this, and which provider-analyzer combination should process it?

| Document Type | Primary Provider | Fallback Provider | Analyzer Count | Accuracy |
| --- | --- | --- | --- | --- |
| W-2 (wage statement) | Rule-based + small model | Claude | 8 | 97.1% |
| 1099-NEC (freelance) | Gemini Flash | GPT-4 | 5 | 95.3% |
| 1099-B (investments) | GPT-4 | Claude | 7 | 93.8% |
| K-1 (partnerships) | GPT-4 | Claude + consensus | 12 | 91.2% |
| 1098 (mortgage) | Small model | Gemini Flash | 4 | 96.7% |
| Schedule C (business) | Claude | GPT-4 | 9 | 92.4% |
| Foreign tax forms | GPT-4 + multi-modal | Claude + multi-modal | 6 | 88.9% |
| All other types | Cascade (auto-routed) | GPT-4 | 9 | 90.5% |
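The routing table above can be sketched as a simple lookup with a default. This is an illustrative sketch, not the production schema: the `Route` dataclass, provider identifiers, and the `route` function are hypothetical names.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Route:
    primary: str      # provider to try first
    fallback: str     # provider used when the primary fails or is degraded
    analyzers: int    # how many specialized analyzers this document type triggers

# Three illustrative rows from the routing table above.
ROUTING_TABLE = {
    "W-2":      Route("rules+small-model", "claude", 8),
    "1099-NEC": Route("gemini-flash", "gpt-4", 5),
    "K-1":      Route("gpt-4", "claude+consensus", 12),
}
DEFAULT_ROUTE = Route("cascade", "gpt-4", 9)  # the "all other types" row

def route(doc_type: str) -> Route:
    """Answer the router's two questions: which provider pair, which analyzers."""
    return ROUTING_TABLE.get(doc_type, DEFAULT_ROUTE)
```

Keeping routing as data rather than code is what makes the monthly re-routing updates a configuration change rather than a deployment.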

The routing decisions were not static. We updated them monthly based on accuracy and cost data. When Claude 3 launched with improved document understanding, we re-routed 23% of extraction traffic to it within two weeks. [LINK:post-27]

Layer 3: Extraction

This is where the 60 analyzers live. Each analyzer is a specialized extraction module tuned for a specific document type and field category. A single document might trigger 3-8 analyzers running in parallel.

Why 60 analyzers instead of one general-purpose extractor? Because specialization wins. According to our A/B testing, a specialized W-2 analyzer outperformed a general "extract all fields" prompt by 11 percentage points (97.1% versus 86.3%). The general prompt was good at most things. The specialized analyzer was excellent at one thing. At production scale, the difference between "good" and "excellent" is the difference between 2,200 errors per season and 460 errors per season.
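The fan-out described above (one document, several specialized analyzers in parallel) can be sketched as follows. The analyzer functions and field names here are hypothetical stand-ins, not the real modules.

```python
from concurrent.futures import ThreadPoolExecutor

# Each analyzer extracts one narrow field category; in production each would
# carry its own specialized prompt. These are illustrative stubs.
def wages_analyzer(doc: dict) -> dict:
    return {"box_1_wages": doc.get("box1")}

def withholding_analyzer(doc: dict) -> dict:
    return {"box_2_fed_withholding": doc.get("box2")}

def employer_analyzer(doc: dict) -> dict:
    return {"employer_ein": doc.get("ein")}

W2_ANALYZERS = [wages_analyzer, withholding_analyzer, employer_analyzer]

def extract(doc: dict) -> dict:
    """Run every analyzer for the document type in parallel and merge fields."""
    merged: dict = {}
    with ThreadPoolExecutor() as pool:
        for result in pool.map(lambda analyzer: analyzer(doc), W2_ANALYZERS):
            merged.update(result)
    return merged
```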

Layer 4: Validation

Every extraction result passes through validation before reaching the user. Validation includes three checks:

  1. Schema validation: Does the output match the expected structure? Are required fields present? Are values within expected ranges? (A W-2 box 1 value of $4,500,000 for an individual filer triggers a review.)
  2. Cross-field validation: Do the extracted fields make sense together? Federal withholding should be less than total wages. State tax paid should correspond to a real state. According to our analysis, cross-field validation catches 34% of extraction errors that schema validation misses.
  3. Confidence scoring: Each extraction includes a confidence score. Results below the threshold route to human review. We set thresholds per document type, ranging from 0.82 for simple forms to 0.92 for complex ones.
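The three checks above can be sketched in one pass. This is a minimal illustration assuming a W-2-like record; the field names, the $4.5M range check, and the truncated state list are placeholders, not the production rules.

```python
US_STATES = {"AL", "CA", "NY", "TX", "WA"}  # truncated for the sketch

def validate(extraction: dict, confidence: float, threshold: float = 0.90):
    """Return (needs_review, issues) for one extraction result."""
    issues = []
    # 1. Schema validation: required fields present, values in plausible ranges.
    for field in ("wages", "federal_withholding", "state"):
        if field not in extraction:
            issues.append(f"missing field: {field}")
    if extraction.get("wages", 0) > 4_500_000:
        issues.append("wages outside expected range for an individual filer")
    # 2. Cross-field validation: the fields must make sense together.
    if extraction.get("federal_withholding", 0) >= extraction.get("wages", float("inf")):
        issues.append("federal withholding not less than total wages")
    if extraction.get("state") not in US_STATES:
        issues.append("state does not correspond to a real state")
    # 3. Confidence scoring: below-threshold results route to human review.
    needs_review = bool(issues) or confidence < threshold
    return needs_review, issues
```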

Why does specialization beat general-purpose AI for document extraction?

This is the question I get asked most. "Why not just use GPT-4 for everything? It is good at everything." Three reasons.

Reason 1: Domain-specific prompts dramatically outperform generic ones. A prompt that says "extract all financial data" gets 86% accuracy. A prompt specifying exact box numbers, codes, and field meanings gets 97%. According to a 2024 Google DeepMind study, task-specific prompts outperform general prompts by 8-22 percentage points across extraction tasks.

Reason 2: Specialized models have tighter failure distributions. A general extractor fails unpredictably. A specialized analyzer fails in known, testable ways. Our W-2 analyzer has exactly 4 known failure modes, all with tests. A general extractor has unknown failure modes that surface in production.

Reason 3: Cost efficiency. Specialized analyzers need smaller context windows and simpler prompts. Our specialized W-2 prompt is 340 tokens. A general extraction prompt is 2,400 tokens. At scale, that difference is $180,000 per season. [LINK:post-27]

How do you build a test pyramid for AI extraction systems?

Testing AI systems is fundamentally different from testing deterministic software. You cannot write a test that says "given input X, assert output equals Y" because the output is probabilistic. We developed a test pyramid with 510 tests across four levels.

| Test Level | Test Count | What It Tests | Run Frequency | Avg Duration |
| --- | --- | --- | --- | --- |
| Unit tests (deterministic) | 285 | Parsers, validators, schema, routing logic | Every commit | 12 seconds |
| Integration tests (mocked AI) | 120 | Pipeline flow, error handling, fallback logic | Every PR | 45 seconds |
| Accuracy tests (live AI) | 80 | Extraction quality against golden dataset | Daily | 8 minutes |
| Regression tests (production replay) | 25 | Known edge cases and past failures | Weekly | 22 minutes |

The golden dataset

The accuracy tests run against a golden dataset of 380 documents with hand-verified correct extractions. Every time a user reports an extraction error, we add that document to the golden dataset after correcting it. The dataset grows over time, and our test coverage grows with it. According to our tracking, the golden dataset grew from 45 documents at launch to 380 documents over 14 months.

Snapshot testing for AI outputs

For the 80 accuracy tests, we do not assert exact equality. We assert structural equivalence within tolerance bands. If the golden answer for Box 1 is "$52,341.00" and the model extracts "$52,341", that is a pass (formatting difference). If it extracts "$52,431", that is a fail (digit transposition). We define per-field tolerance rules: monetary values must match within $0.01, dates must match exactly, names must match after normalization.
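The tolerance rules above can be sketched as a per-field-type comparison. The field-type names and normalization details are illustrative, assuming simple string inputs.

```python
import re

def normalize_money(value: str) -> float:
    """Strip currency formatting ($, commas) so '$52,341.00' and '$52,341' compare equal."""
    return float(re.sub(r"[^\d.-]", "", value))

def fields_match(field_type: str, golden: str, extracted: str) -> bool:
    if field_type == "money":
        # Monetary values must match within $0.01.
        return abs(normalize_money(golden) - normalize_money(extracted)) <= 0.01
    if field_type == "date":
        return golden == extracted  # dates must match exactly
    if field_type == "name":
        # Names match after normalization (case and whitespace here).
        return golden.casefold().split() == extracted.casefold().split()
    return golden == extracted
```

This is what turns a probabilistic output into a deterministic pass/fail: the formatting difference passes, the digit transposition fails.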

What are the reliability patterns that keep the platform running?

A multi-provider platform is only reliable if it handles failures gracefully. We implemented five reliability patterns.

  1. Circuit breakers per provider: If a provider's error rate exceeds 5% in a 5-minute window, the circuit breaker trips and routes all traffic to the fallback provider. According to our incident logs, circuit breakers activated 23 times in 2023, preventing an estimated 4,100 user-facing errors.
  2. Automatic retries with escalation: A failed extraction retries once with the same provider. If it fails again, it escalates to the fallback provider. If both fail, it routes to human review. We never show the user an error if a human can resolve it within our SLA.
  3. Provider health monitoring: We ping each provider every 30 seconds with a standardized test document. Latency spikes above 2x the baseline trigger proactive traffic rebalancing before users are affected.
  4. Graceful degradation: If all AI providers are down (it happened once, for 8 minutes), the platform falls back to rule-based extraction only. This handles 35% of document types with full accuracy and queues the rest for processing when providers recover.
  5. Idempotent processing: Every extraction is idempotent. If a document is processed twice (due to retry logic), the results are identical. This eliminates an entire class of consistency bugs. According to our analysis, idempotency prevented 1,200 duplicate extraction events in the first month after deployment.
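The first pattern above, the per-provider circuit breaker, can be sketched as follows. This is a simplified illustration: it uses a count-based rolling window rather than the 5-minute time window described in the text, and the class and function names are hypothetical.

```python
from collections import deque

class CircuitBreaker:
    """Trips when the error rate in a rolling window exceeds the threshold."""

    def __init__(self, error_threshold: float = 0.05, window: int = 100):
        self.outcomes = deque(maxlen=window)  # True = success, False = error
        self.error_threshold = error_threshold

    def record(self, success: bool) -> None:
        self.outcomes.append(success)

    @property
    def tripped(self) -> bool:
        if len(self.outcomes) < self.outcomes.maxlen:
            return False  # not enough data to judge yet
        errors = self.outcomes.count(False)
        return errors / len(self.outcomes) > self.error_threshold

def pick_provider(breaker: CircuitBreaker, primary: str, fallback: str) -> str:
    """Route to the fallback provider whenever the primary's breaker is tripped."""
    return fallback if breaker.tripped else primary
```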

What were the key architectural decisions and their tradeoffs?

Every architecture involves tradeoffs. Here are the three most consequential decisions we made and what we traded for them.

  • Decision: 60 specialized analyzers instead of 1 general extractor. Tradeoff: Higher maintenance burden. Each analyzer needs its own prompt, its own tests, its own monitoring. But 94.2% accuracy instead of 86.3% accuracy. For financial documents where errors have legal consequences, we chose accuracy over maintainability.
  • Decision: 4 providers instead of 1. Tradeoff: Integration complexity. Each provider has different APIs, rate limits, pricing models, and failure modes. We spent 3 weeks building the abstraction layer. But no single-provider outage has ever translated into a user-facing incident. [LINK:post-26]
  • Decision: 510 tests with a daily accuracy suite. Tradeoff: Slow feedback loop for accuracy changes. A commit that changes a prompt takes 8 minutes to validate against the golden dataset. Engineers initially pushed back. But according to our incident tracking, the test suite caught 31 regressions before they reached production -- regressions that would have affected an estimated 2,400 users. [LINK:post-30]

The principle: In AI systems, the architecture IS the product quality. You cannot prompt-engineer your way out of a bad architecture. A well-architected system with mediocre prompts will outperform a poorly-architected system with brilliant prompts -- because the architecture determines how failures propagate, how costs scale, and how quality is maintained over time.

Frequently Asked Questions

How do you manage 60 analyzers without drowning in maintenance?

Three strategies. First, analyzers share a common framework -- same input/output schema, same validation layer, same monitoring hooks. Only the prompt and extraction logic vary. Second, we group analyzers into families (W-2 family, 1099 family, etc.) so updates to shared logic propagate across the family. Third, we prioritize by impact: the top 10 analyzers by volume get weekly attention; the bottom 20 get monthly reviews. According to our time tracking, maintenance averages 6 hours per week for the entire platform.

How do you decide when to add a new analyzer versus extending an existing one?

We use a simple rule: if the new document type shares more than 70% of its fields with an existing analyzer, we extend. If it shares less than 70%, we create a new one. In practice, the decision is usually clear. A 1099-INT and a 1099-DIV share enough structure to be handled by related analyzers with different field mappings. A K-1 and a W-2 share almost nothing.
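The 70% rule above amounts to a field-overlap calculation. A minimal sketch, with hypothetical field sets standing in for the real analyzer schemas:

```python
def field_overlap(existing: set, candidate: set) -> float:
    """Fraction of the candidate document type's fields already covered."""
    if not candidate:
        return 0.0
    return len(existing & candidate) / len(candidate)

def decide(existing: set, candidate: set, cutoff: float = 0.70) -> str:
    """Extend the existing analyzer above the cutoff; otherwise create a new one."""
    return "extend" if field_overlap(existing, candidate) > cutoff else "new analyzer"

# Illustrative field sets (not the real schemas).
FIELDS_1099_INT = {"payer", "payer_tin", "recipient_tin", "interest_income", "fed_tax_withheld"}
FIELDS_1099_DIV = {"payer", "payer_tin", "recipient_tin", "ordinary_dividends", "fed_tax_withheld"}
FIELDS_W2 = {"wages", "fed_withholding", "employer_ein"}
FIELDS_K1 = {"partner_share", "guaranteed_payments", "partnership_ein"}
```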

What happens when a new AI provider launches? How hard is it to add?

Adding a new provider takes approximately 2 weeks: 1 week for the API integration and abstraction layer, 1 week for benchmarking against our golden dataset. The abstraction layer means the rest of the platform does not know or care which provider is behind a given extraction. We added our fourth provider in 9 working days.
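The abstraction layer described above can be sketched as a single interface that every provider adapts to. The `Provider` base class and `StubProvider` are hypothetical names for illustration; a real adapter would wrap the vendor SDK behind `extract`.

```python
from abc import ABC, abstractmethod

class Provider(ABC):
    """The one interface the rest of the platform codes against."""

    @abstractmethod
    def extract(self, document: str, prompt: str) -> dict: ...

class StubProvider(Provider):
    """Stand-in adapter so the sketch runs without a real API key."""

    def __init__(self, name: str):
        self.name = name

    def extract(self, document: str, prompt: str) -> dict:
        return {"provider": self.name, "fields": {}}

def run_extraction(provider: Provider, document: str, prompt: str) -> dict:
    # The rest of the platform neither knows nor cares which provider this is.
    return provider.extract(document, prompt)
```

Adding a fifth provider would mean writing one more adapter class and benchmarking it, with no changes to routing, validation, or the analyzers.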

Is 510 tests overkill for an extraction platform?

No. The 510 tests protect 16,000 users processing an average of 8 documents each -- 128,000 extraction events per season. At 94.2% accuracy, that means roughly 7,400 extractions have some error. Without the test suite, that number would be significantly higher. According to our estimate, the test suite prevents approximately 3,800 additional errors per season. That is 3,800 users who do not have to manually correct AI mistakes.

Last updated: February 22, 2024