MCP Server Migration: Replacing an In-House Tax Engine
How a tax-tech app replaced a brittle in-house calculation engine with an MCP server. Benchmarks (2.2s), a silent percentage bug, PII anonymization.
Why We Replaced Our In-House Tax Calculation Engine With an MCP Server
April 11, 2026 · 12 min read · MCP Servers
Last Updated: 2026-04-11
We replaced a brittle in-house tax calculation engine with a third-party service exposed via Model Context Protocol (MCP). The migration cut maintenance burden by approximately 80%, reduced calculation errors on edge-case filings, and -- critically -- let developers call the tax engine directly from Claude Code during development. The tradeoff: a 2.2-second latency floor for federal plus state calculations, and a silent percentage-versus-fraction storage bug that nearly corrupted thousands of records before we caught it in benchmarking.
What was wrong with the in-house tax calculation engine?
The application is a tax-tech platform that processes returns for individual filers. For the first year, tax calculations ran on a local engine embedded in the codebase -- a few thousand lines of TypeScript implementing federal and state tax logic. It worked. Barely.
Three problems compounded over time:
- Tax law changes every year. The IRS updates brackets, thresholds, credit phase-outs, and form layouts annually. Our 2024 brackets were hardcoded. When 2025 brackets arrived, we needed a developer to manually update 40+ constants, test against known-good returns, and hope nothing was missed. The IRS inflation adjustments for TY2025 changed 60+ parameters. Each one was a potential regression.
- State tax logic is a combinatorial nightmare. Federal tax is one set of rules. Add California, New York, Texas (no income tax, but franchise tax edge cases), and you are maintaining parallel rule sets. Our engine covered 3 states. The third-party service covers all 50 plus DC.
- Tight coupling killed iteration speed. The calculation engine was imported directly into API routes. Changing a deduction formula meant touching the same files as the API layer. A bug in the standard deduction calculation caused a 4-hour outage because the fix required redeploying the entire application.
By month 14, the in-house engine had accumulated 23 known inaccuracies on edge cases (AMT, SALT cap carryovers, education credits with phase-outs) and we were spending roughly 15 hours per week maintaining it instead of building product features.
Why MCP instead of a REST API or SDK?
When we decided to externalize tax calculations, we evaluated four integration patterns. The comparison shaped our architecture for the next 18 months. I covered why MCP matters as a platform shift in a previous post -- here is the specific decision matrix for this migration.
| Dimension | In-House Engine | REST API | SDK / npm Package | MCP Server |
|---|---|---|---|---|
| Integration effort | Already done (but brittle) | Medium (HTTP client, auth, retries) | Low (import and call) | Medium (MCP client, transport config) |
| Tax law updates | Manual, 15+ hrs/week | Automatic (provider maintains) | Automatic (version bump) | Automatic (provider maintains) |
| State coverage | 3 states | All 50 + DC | Depends on vendor | All 50 + DC |
| Developer experience | Call function in code | Postman / curl | Call function in code | Call from Claude Code terminal |
| Latency (fed + state) | ~50ms (local) | ~1-3s (network) | ~50ms (local) | ~2.2s (network + protocol) |
| Schema exploration | Read source code | Read API docs | TypeScript types | Interactive (tool discovery built in) |
| Testability from CLI | Write test script | curl commands | Write test script | Natural language from terminal |
| Best for | Prototypes, simple tax logic | Mature APIs with good docs | Well-maintained packages | AI-augmented dev workflows |
The deciding factor was developer experience. With an MCP server, a developer can open Claude Code and say "calculate federal tax for a single filer with $85,000 W-2 income and $12,000 in itemized deductions" and get a result in 3 seconds. No test script. No Postman setup. No authentication dance. The MCP server exposes 10 tools including schema exploration (tax_namespaces, tax_namespace_schema), validation (check_tax), and calculation (calculate_tax). During development, this cut our "time to verify a tax scenario" from 8-12 minutes to under 30 seconds.
According to the Model Context Protocol specification, MCP provides a standardized way for AI applications to connect to external data sources and tools. For our use case, this meant the same protocol that powers development-time exploration also powers the production calculation path -- one integration, two workflows.
How did we handle PII anonymization before external calls?
Tax data is some of the most sensitive information a person has. Social Security Numbers, dates of birth, employer details, addresses -- all of it flows through a tax calculation. Sending raw PII to any external service, MCP or otherwise, was a non-starter: we would not do it.
Our anonymization layer sits between the application and the MCP client. Before any data leaves our server:
- SSNs are replaced with a canonical dummy value (`078-05-1120` -- a well-known IRS test SSN that cannot belong to a real person)
- Names become "Taxpayer User", "Spouse User", "Dependent User"
- Addresses become "123 Main St" -- but the real state and ZIP are preserved because state tax jurisdiction depends on them
- Employer names become "Payer 1", "Payer 2", etc.
- Dates of birth are stripped entirely (the calculation engine does not need them for most computations)
What does go over the wire: wages, interest, dividends, capital gains, deduction amounts, filing status, and number of dependents. The financial data necessary for accurate calculation, but nothing that identifies who the taxpayer is.
Design principle: Anonymize at the boundary, not at the source. The application stores real PII (encrypted at rest with AES-256-GCM). The anonymization happens in the MCP client wrapper -- a single function that transforms the full return into a calculation-safe payload. One boundary, one place to audit, one place to test.
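As a sketch, that single boundary function looks roughly like this. The field names and types here are illustrative, not our actual schema:

```typescript
// Illustrative sketch of the anonymization boundary -- hypothetical
// field names, not our production schema.
interface TaxReturn {
  ssn: string;
  firstName: string;
  lastName: string;
  street: string;
  state: string;          // preserved: tax jurisdiction depends on it
  zip: string;            // preserved: tax jurisdiction depends on it
  dateOfBirth?: string;   // stripped before any external call
  employers: { name: string; wages: number }[];
  filingStatus: string;
  dependents: number;
}

// The single boundary: everything sent to the MCP server passes through here.
function toCalculationPayload(ret: TaxReturn) {
  return {
    ssn: '078-05-1120',                // canonical IRS test SSN
    firstName: 'Taxpayer',
    lastName: 'User',
    street: '123 Main St',
    state: ret.state,                  // real state kept for jurisdiction
    zip: ret.zip,                      // real ZIP kept for jurisdiction
    // dateOfBirth deliberately omitted
    employers: ret.employers.map((e, i) => ({
      name: `Payer ${i + 1}`,          // employer name replaced
      wages: e.wages,                  // financial data passes through
    })),
    filingStatus: ret.filingStatus,
    dependents: ret.dependents,
  };
}
```

Because it is one pure function, it is also trivially testable: assert that the output never contains a real SSN, name, or date of birth, and that the financial fields are untouched.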
We verified this by auditing every field the MCP server receives. The tax calculation service's API documentation confirmed that SSN, name, and full address are not used in any calculation logic -- they exist in the schema for PDF generation only, which we handle separately. This verification took 2 days and was worth every hour.
What was the percentage-versus-fraction bug?
This is the bug that justified our entire benchmarking process. It was silent, it was subtle, and it would have corrupted every tax rate we stored.
The MCP server returns effective tax rates as percentages. A 22.5% effective rate comes back as 22.5. Our application stored tax rates as fractions -- the same 22.5% rate was expected as 0.225 in the database. Every dashboard component rendered rates with `${rate.toFixed(1)}%`.
Without the convention mismatch, a user with a 27.2% effective tax rate would see "27.2%" on their dashboard. With the mismatch, they would see "0.3%" -- because `(0.272).toFixed(1)` yields "0.3". Or worse: if we stored the percentage (22.5) and the dashboard appended a percent sign, users would see "22.5%" -- which happens to look correct, until the rate is used in a downstream calculation that expects a fraction and multiplies income by 22.5 instead of 0.225.
```javascript
// THE BUG: MCP returns a percentage, the app expects a fraction
const income = 85_000;
const mcp_rate = 22.5;          // from MCP server (percentage form)
const stored_rate = mcp_rate;   // stored as-is -- WRONG
// Downstream calculation uses it as a fraction
const estimated_tax = income * stored_rate;
// $85,000 * 22.5 = $1,912,500 (should be $19,125)
```

```javascript
// THE FIX: explicit convention -- always store percentage form
const stored_rate = 22.5; // percentage form
// Dashboard: `${stored_rate.toFixed(1)}%` → "22.5%" ✓
// Downstream: income * (stored_rate / 100) → $19,125 ✓

// Validation guard: valid range is (0.1, 50]
// Anything outside → null (flags for human review)
const isValid = stored_rate > 0.1 && stored_rate <= 50;
```
We caught this during benchmarking -- comparing 18 test returns against known-correct CPA-prepared drafts. Two returns showed tax rates of "0.3%" and "0.2%" on the dashboard. The root cause was a 4-line conversion that was never written because the in-house engine and the application shared the same convention (fractions). The MCP server used a different convention (percentages), and nobody checked.
The fix was two-fold: (1) standardize on percentage form everywhere -- store 22.5, display as "22.5%", divide by 100 before arithmetic; (2) add a validation guard that rejects any rate outside the range 0.1 to 50 and nulls the field for human review. The guard has caught 3 bad values in 4 months of production operation.
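The guard can be sketched as a small normalizer at the MCP boundary -- a sketch of the pattern, not our exact production code (`normalizeEffectiveRate` is a name invented for this example):

```typescript
// Normalize an effective tax rate from the MCP response.
// Convention: rates are stored in percentage form (22.5, not 0.225).
// Valid range is (0.1, 50]; anything outside -- a tiny fraction like
// 0.05, an accidental 225, a non-number -- comes back null so the
// field can be flagged for human review instead of silently stored.
function normalizeEffectiveRate(raw: unknown): number | null {
  if (typeof raw !== 'number' || !Number.isFinite(raw)) return null;
  if (raw <= 0.1 || raw > 50) return null;
  return raw;
}
```

Returning `null` rather than throwing keeps the calculation pipeline flowing: a bad rate nulls one dashboard field and opens a review task, instead of failing the whole return.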
Lesson: Every system boundary introduces a convention mismatch risk. Percentages vs. fractions. Cents vs. dollars. Milliseconds vs. seconds. UTC vs. local time. When integrating an external service, document the convention for every numeric field on day one. Not day two. Day one.
What do the latency benchmarks actually show?
We built a dedicated benchmarking script that calls the MCP server with a canonical single-filer W-2 return (California state) and measures each phase independently. Here are the results from 3 consecutive runs:
| Phase | Run 1 | Run 2 | Run 3 | Average |
|---|---|---|---|---|
| MCP StreamableHTTP connect | 703ms | 510ms | 661ms | 625ms |
| Federal calculate_tax | 874ms | 881ms | 931ms | 895ms |
| State (CA) calculate_tax | 1,226ms | 1,338ms | 1,364ms | 1,309ms |
| Fed + state total | 2,100ms | 2,219ms | 2,295ms | ~2.2s |
| Total incl. handshake | 2,803ms | 2,730ms | 2,956ms | ~2.8s |
For complex returns -- married filing jointly with multiple W-2s, 1099-B capital gains, and itemized schedules -- we measured 4-7 seconds. The full user-facing flow (button click to numbers on screen, including database writes) lands at 3-8 seconds depending on return complexity.
Is 2.2 seconds acceptable? For a tax application, yes. Users expect financial services to take a moment. A loading spinner with "Calculating your federal and state taxes..." is a natural UX pattern that users tolerate in the 3-8 second range. We tested this with 12 users and none reported the wait as problematic. Banking apps, brokerage dashboards, and insurance quote tools all operate in similar latency ranges.
The critical guardrail: every external call is wrapped in a Promise.race timeout pattern. If the MCP server does not respond within 15 seconds, the call fails gracefully and the application returns a 503 with a user-friendly message. In 4 months of production, we have hit the timeout exactly twice -- both during the provider's scheduled maintenance window.
```typescript
// Timeout guard on every MCP call. Promise.race rejects when the
// timeout wins, so the failure path must be a catch block -- a
// falsy-result check would never fire.
try {
  const result = await Promise.race([
    mcpClient.callTool('calculate_tax', payload),
    new Promise((_, reject) =>
      setTimeout(() => reject(new Error('MCP timeout after 15s')), 15_000)
    )
  ]);
  // ...persist result and respond normally
} catch (err) {
  // If MCP fails or times out, return 503 -- never show stale/partial data
  return NextResponse.json(
    { error: 'Tax calculation temporarily unavailable' },
    { status: 503 }
  );
}
```
What stayed local versus what moved to MCP?
Not everything should be external. We split the tax pipeline into three stages, each with a different integration pattern:
| Stage | Where it runs | Why |
|---|---|---|
| Document extraction (W-2s, 1099s, K-1s) | Local (Azure Content Understanding) | Raw PII in documents. Extraction must happen in our environment. Azure CU runs within our Azure subscription with data residency controls. 110 fields extracted per 1040, 16/16 accuracy on brokerage 1099-Bs in testing. |
| Tax calculation (federal + state) | MCP server (external) | Tax law changes annually. Maintaining calculation logic in-house was unsustainable. MCP server covers all 50 states. PII is anonymized before the call. ~2.2s latency is acceptable. |
| Tax filing (e-file submission) | REST API (external) | Filing is a regulated, credential-gated process. Direct API integration with an authorized e-file provider. PII is required here (IRS needs real SSNs) but transmitted over mTLS. |
The extraction stage deserves a note. We evaluated two Azure services for document extraction: Document Intelligence (DI) and Content Understanding (CU). CU won decisively -- it extracts 110 fields versus DI's 78 for a 1040, and DI has a documented failure mode where it misreads form titles as field values (we observed DI writing the literal string "1040" as the AGI value on two separate returns). The multi-provider extraction architecture is covered in detail here.
How does MCP change the development workflow?
This is the underappreciated advantage. Before MCP, testing a tax scenario required:
- Write a test fixture with all required fields (filing status, income, deductions, credits)
- Import the local engine
- Run the test
- Parse the output manually
- Compare against IRS worksheets or known-good values
Time: 8-12 minutes per scenario. After MCP, the workflow is:
- Open Claude Code in the terminal
- Say "calculate federal tax for a single filer with $92,000 W-2 income, $4,200 in student loan interest, and California residency"
- Claude Code calls the MCP server's `calculate_tax` tool directly
- Review the structured response with line-by-line breakdown
Time: under 30 seconds. The same MCP server that powers production is accessible from the development terminal. Schema discovery is built into the protocol -- tax_namespace_schema returns the full input schema for any jurisdiction, so a developer can explore what fields exist without reading documentation.
```text
# From Claude Code terminal:
# "What fields does the California state return need?"
# → Claude calls tax_namespace_schema for CA
# → Returns structured schema with every required/optional field
# "Calculate tax for this scenario..."
# → Claude calls calculate_tax with the scenario
# → Returns itemized breakdown: AGI, taxable income,
#   tax before credits, credits, total tax, effective rate
```
According to a 2024 analysis by Anthropic, MCP's tool discovery mechanism lets AI assistants understand available capabilities without hardcoded integrations. For us, this meant onboarding a new developer to the tax calculation system went from "read 2,000 lines of engine code" to "ask Claude Code to explore the schema and run a test calculation." Onboarding time for the tax engine dropped from roughly 2 days to 3 hours.
What are the validation gotchas when integrating an MCP server?
Beyond the percentage-fraction bug, we hit 6 validation issues during the migration. Each one was a silent failure -- the MCP server accepted the input but returned wrong results or unhelpful errors:
- Timestamp format: The server rejects timestamps with milliseconds. `2026-01-15T00:00:00.000Z` fails silently; `2026-01-15T00:00:00Z` works. We strip `.000Z` and replace it with `Z` in the client wrapper.
- Zero-wage W-2 rows: W-2s with $0 wages (common in multi-state filings where one state has zero allocation) cause the server to return a validation error. We filter them before sending.
- Employer ZIP/state mismatch: If the employer's ZIP code does not match the employer's state, the server silently uses the ZIP's state. We added a state-ZIP lookup table to catch mismatches before they reach the server.
- UUID format: The server requires valid v4 UUIDs where the third group starts with `4`. UUIDs generated by some libraries do not satisfy this constraint. We switched to a compliant generator.
- Currency as whole-dollar integers: All monetary values must be whole-dollar integers, not floats. `85000`, not `85000.00`. Passing floats caused rounding discrepancies of $1-2 on some returns.
- Required boolean fields: Schedule B requires explicit `foreign_accounts_input` and `foreign_trust_input` booleans. Omitting them (even when the filer has no foreign accounts) triggers a missing-field error rather than defaulting to false.
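Several of these checks ended up as one pre-flight sanitizer in the client wrapper. A condensed sketch, with illustrative field names rather than our production schema:

```typescript
// Condensed sketch of the pre-flight sanitizer that runs before any
// payload reaches the MCP server. Field names are hypothetical.
interface W2Row { employer: string; wages: number; state: string }

function sanitizePayload(payload: {
  timestamp: string;
  w2s: W2Row[];
  amounts: Record<string, number>;
}) {
  return {
    // Strip milliseconds: "2026-01-15T00:00:00.000Z" → "2026-01-15T00:00:00Z"
    timestamp: payload.timestamp.replace(/\.\d{3}Z$/, 'Z'),
    // Drop zero-wage W-2 rows, which the server rejects outright
    w2s: payload.w2s.filter((w) => w.wages > 0),
    // Coerce all monetary values to whole-dollar integers
    amounts: Object.fromEntries(
      Object.entries(payload.amounts).map(([k, v]) => [k, Math.round(v)])
    ),
  };
}
```

The ZIP/state lookup and UUID checks live in the same wrapper but are omitted here for brevity.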
Each of these took 1-4 hours to diagnose. The total validation debugging cost was approximately 20 developer-hours. This is typical for any external service integration -- we saw similar edge cases when integrating multi-modal AI for document processing. The lesson: budget 2-3 weeks for integration hardening on any external service, not just MCP.
What were the measurable results after migration?
We tracked 5 metrics across the 4 months since migration:
| Metric | Before (In-House Engine) | After (MCP Server) | Change |
|---|---|---|---|
| Weekly maintenance hours | ~15 hours | ~2 hours | -87% |
| State coverage | 3 states | 50 states + DC | +1,600% |
| Known calculation inaccuracies | 23 edge cases | 4 edge cases | -83% |
| Calculation latency (fed + state) | ~50ms | ~2.2s | +4,300% (acceptable tradeoff) |
| Developer scenario testing time | 8-12 min | ~30 sec | -96% |
The latency increase is real and worth acknowledging. Going from 50ms to 2.2 seconds is a 44x slowdown. But the in-house engine's 50ms speed was meaningless if it returned the wrong answer on 23 edge cases. Speed without accuracy is not speed -- it is fast failure.
What are the lessons for any MCP migration?
After 4 months in production, here are the patterns that generalize beyond tax calculations:
- Benchmark before committing. We ran 18 test returns through both the old engine and the MCP server before switching. The benchmarking caught the percentage-fraction bug, 2 timestamp format issues, and 1 UUID validation failure. Without benchmarking, all three would have hit production.
- Establish numeric conventions on day one. Percentages vs. fractions. Cents vs. dollars. Milliseconds vs. seconds. Document the convention for every numeric field in the integration contract. Put a validation guard on each one. The cost of a convention mismatch in production is orders of magnitude higher than the cost of documenting it up front.
- Anonymize PII at a single boundary. Do not scatter anonymization logic across routes and services. Build one transformation function that sits between your app and the external service. Audit it quarterly. Make it the only path data can take to the outside.
- Wrap every external call in a timeout. The MCP server is fast (2.2s average), but external services have bad days. A 15-second timeout with a graceful 503 is better than a hung request that consumes a serverless function slot for 60 seconds. Promise.race is the right pattern for this.
- Separate extraction from calculation from filing. Each stage has different PII requirements, different latency budgets, and different failure modes. Bundling them into one system means one failure mode takes down everything. Splitting them means you can swap any individual stage without touching the others -- which is exactly what we did.
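The timeout pattern in the list above generalizes into a small reusable helper. `withTimeout` is a name invented for this sketch, not a library function:

```typescript
// Generic timeout wrapper: settles with the wrapped promise, or rejects
// if it takes longer than `ms` milliseconds. The timer is always cleared,
// so it never keeps a serverless function (or the Node event loop) alive
// after the race is decided.
function withTimeout<T>(
  promise: Promise<T>,
  ms: number,
  label = 'call'
): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(
      () => reject(new Error(`${label} timed out after ${ms}ms`)),
      ms
    );
  });
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}
```

In a route handler this replaces the inline `Promise.race`, e.g. `await withTimeout(mcpClient.callTool('calculate_tax', payload), 15_000, 'calculate_tax')`, with the catch block returning the 503 as before.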
When should you NOT use an MCP server?
MCP is not universally better. Do not use it when:
- Latency is critical. If your calculation must complete in under 100ms (real-time bidding, game physics, streaming transforms), the network overhead of MCP makes it unsuitable. Keep computation local.
- Data cannot leave your environment. If regulatory requirements prohibit any external data transmission -- even anonymized -- MCP to an external server is off the table. Consider running the MCP server within your own infrastructure.
- The logic is simple and stable. If your calculation is 50 lines of code and changes once a year, the integration overhead of MCP is not justified. A local function is simpler, faster, and has fewer failure modes.
- You do not use AI-assisted development. The developer experience advantage of MCP (natural-language tool invocation from the terminal) is only valuable if your team uses AI coding assistants. If they do not, MCP offers no advantage over a well-documented REST API.
Frequently Asked Questions
Is MCP the same as a REST API?
No. MCP (Model Context Protocol) is a standardized protocol for connecting AI models to external tools and data sources. Unlike REST APIs, MCP includes built-in tool discovery (the client can ask what tools are available), schema exploration (the client can inspect input/output schemas), and is designed to be invoked by AI assistants natively. A REST API requires hardcoded endpoint knowledge; an MCP server is self-describing.
How do you handle MCP server downtime in production?
Every MCP call is wrapped in a Promise.race timeout (15 seconds). If the server is down or slow, the route returns a 503 with a user-friendly message. We do not fall back to the old in-house engine -- it was decommissioned. Instead, the user sees "Tax calculation temporarily unavailable, please try again in a few minutes." In 4 months, this has triggered twice, both during scheduled maintenance.
Does PII anonymization affect calculation accuracy?
No. Tax calculations depend on financial data (income, deductions, credits) and filing metadata (status, state, number of dependents). They do not depend on the taxpayer's name, SSN, or street address. We verified this by running identical financial scenarios with real versus anonymized PII through the MCP server -- results matched on every field across 18 test returns.
What is the cost difference between in-house and MCP?
The MCP server charges per calculation call. At our volume (approximately 200-400 calculations per month during tax season), the direct API cost is modest -- roughly comparable to the compute cost of running the in-house engine. The real savings are in developer time: 13 fewer maintenance hours per week multiplied by 20 weeks of tax season is 260 hours reclaimed for product development.
Can you run an MCP server locally for development?
Yes, and many teams do. The MCP specification supports multiple transports including stdio (local process) and HTTP (remote). You could run a local MCP server for development and point to a remote one in production. In our case, we use the remote server for both, because the development-time latency (2.8s including handshake) is fast enough and we want to test against the real calculation engine, not a local approximation.
Dinesh Challa is an AI Product Manager building production software with Claude Code. Follow him on LinkedIn.
Published April 11, 2026. Part of a series on MCP servers in production, covering real-world migration decisions, benchmarks, and lessons from building a tax-tech platform.