Anatomy of 4 Stacked Bugs: How a Compounding Production Incident Taught Me More Than Any Postmortem Template
April 11, 2026 · 14 min read · AI Testing / Incident Narrative
Last Updated: 2026-04-11
Stacked bugs are production incidents where multiple defects compound on each other, each one masking the next. Fixing Bug 1 reveals Bug 2. Fixing Bug 2 changes system state enough to expose Bug 3. In a real tax-tech application, 4 stacked bugs corrupted financial data across 45 client returns; untangling them took 2 days and 3 production commits. The key lesson: you cannot debug compounding failures with a single root-cause analysis. You must peel layers, one fix at a time, re-observing the system after each change.
What happened? The 90-second version
A client submitted their TY2025 tax return through our platform. The CPA prepared and uploaded a draft return showing an AGI of $872,458 and total tax of $236,965. When we checked the numbers displayed in the application, they were wrong. Not slightly wrong -- completely wrong. The estimated fields showed null values, incorrect tax amounts, and in one case, the same dollar figure ($101,387) populated in two mutually exclusive fields (amount owed and overpayment).
The investigation that followed uncovered not one bug, but four. Each bug had been silently corrupting data for weeks. Each fix revealed the next problem. The total resolution spanned 2 days, 3 production commits touching 19 files (+1,304 lines, -761 lines), a database migration, and a backfill affecting 45 client returns. According to a 2023 ACM Queue analysis, 62% of production incidents involve more than one contributing failure -- but incidents with 4+ compounding bugs represent only 8% of cases and account for 31% of total resolution time.
What does the full incident timeline look like?
Before diving into each bug, here is the complete timeline. The pattern to notice: each "fix" changed the system state enough to reveal the next layer.
| Phase | What Was Observed | What Was Actually Wrong | What Was Fixed |
|---|---|---|---|
| Day 1, Hour 0 | Client's estimated tax fields show NULL despite CPA draft upload | Three separate code paths (nightly cron, doc upload handler, client API) all overwrote estimated fields with null on every run | Added a numbers_source lock column. Removed estimated fields from the intelligence builder. Deleted the nightly cron job. |
| Day 1, Hour 3 | After fix #1, CPA numbers still not appearing for new uploads | The confirmation route fired extraction via un-awaited fetch(); the serverless container died before the request completed. | Changed fire-and-forget fetch to await fetch() with maxDuration = 300. |
| Day 1, Hour 5 | Extraction completed, but AGI and tax numbers were wrong -- same value in mutually exclusive fields | Layout normalizer regex matched both "amount of overpayment" and "amount you owe" labels, populating both with $101,387. Wages regex grabbed a sub-schedule value instead of Box 1z. | Added mutual exclusivity logic and header-anchored scoping to the normalizer. |
| Day 2, Hour 0 | Backfill script corrected 45 returns, but 2 returns showed tax = AGI (clearly wrong) | Migration backfill picked the wrong extraction revision (an older, broken one) for multi-extraction documents. No sanity check existed. | Added DISTINCT ON preferring the correct extraction provider + total_tax < AGI sanity check. |
Total elapsed time from first report to final all-clear: approximately 18 hours of active debugging across 2 calendar days.
What was Bug 1? The silent data overwrite
The application stored estimated tax numbers (AGI, refund, total tax, effective rate, state refund) in a return_intelligence table. These numbers came from two sources: an external tax calculation API and CPA-prepared draft returns. The problem: three separate code paths were writing to this table, and all three initialized the estimated fields to null before writing.
- A nightly cron job rebuilt intelligence data at 02:30 UTC daily, nulling every estimated field.
- The document extraction handler ran on every uploaded document, nulling estimated fields as a side effect.
- A client-facing API endpoint allowed manual intelligence updates, also nulling the fields.
This meant that even when the tax API or CPA draft correctly populated the numbers, any subsequent document upload or the next nightly cron run would silently erase them. The data was correct for minutes or hours, then gone. According to research published by Google's SRE team, silent data corruption is the hardest class of bug to detect because the system continues to function -- it just produces wrong answers.
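To make the failure mode concrete, here is a minimal sketch of the overwrite pattern -- a reconstruction with hypothetical names (db.returnIntelligence, summarizeDocuments), not the actual codebase. Each of the three writers built a full-row payload and defaulted fields it did not compute to null:

```ts
// Hypothetical reconstruction of the overwrite bug -- all names assumed.
// Stubs stand in for the real data layer.
declare const db: {
  returnIntelligence: { upsert(row: object): Promise<void> };
};
declare function summarizeDocuments(returnId: string): Promise<string>;

// Each writer (cron, upload handler, client API) built a full-row
// payload, nulling every estimated field it did not itself compute.
async function rebuildIntelligence(returnId: string) {
  await db.returnIntelligence.upsert({
    returnId,
    documentSummary: await summarizeDocuments(returnId), // this writer's job
    estimatedAgi: null,      // not this writer's concern -- nulled anyway
    estimatedTotalTax: null, // silently erases tax API / CPA draft values
    estimatedRefund: null,
  });
}
```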
How was the data overwrite fixed?
The fix introduced a source-of-truth architecture using a lock column:
-- Migration 089: numbers_source lock column
ALTER TABLE tax_returns
ADD COLUMN numbers_source TEXT
CHECK (numbers_source IN ('tax_api', 'cpa_draft'));
-- Once set to 'cpa_draft', no other code path may overwrite

The lock column enforces a simple rule: once a CPA draft populates the numbers, nothing else can overwrite them. The tax API route returns HTTP 409 if it attempts to write when numbers_source = 'cpa_draft'. The nightly cron was deleted entirely. The intelligence builder was modified with a defensive filter that strips any estimated fields before writing. One column, three writers eliminated. I wrote about this kind of architectural lock pattern previously.
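For illustration, a minimal sketch of what the 409 guard might look like in a Next.js-style route handler. The handler shape follows the App Router convention; db, its helpers, and the camelCase field name are assumptions, not the production code:

```ts
// Hypothetical write guard for the tax API route -- illustrative sketch.
import { NextResponse } from "next/server";

declare const db: {
  taxReturns: {
    findById(id: string): Promise<{ numbersSource: string | null }>;
    updateEstimates(id: string, estimates: object): Promise<void>;
  };
};

export async function POST(
  req: Request,
  { params }: { params: { returnId: string } },
) {
  const taxReturn = await db.taxReturns.findById(params.returnId);

  // Provenance lock: once a CPA draft owns the numbers, reject API writes.
  if (taxReturn.numbersSource === "cpa_draft") {
    return NextResponse.json(
      { error: "Estimates are locked to the CPA draft" },
      { status: 409 },
    );
  }

  const estimates = await req.json();
  await db.taxReturns.updateEstimates(params.returnId, {
    ...estimates,
    numbersSource: "tax_api", // claim provenance on write
  });
  return NextResponse.json({ ok: true });
}
```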
What was Bug 2? The serverless container kill
With the overwrite fixed, CPA numbers should have appeared after a new draft upload. They did not. The confirmation route -- the endpoint an admin hits to approve a CPA draft -- was supposed to trigger extraction of the draft PDF. It did this by calling the extraction endpoint via fetch(). But the fetch was not awaited.
// BEFORE: fire-and-forget -- container dies before fetch completes
fetch(`${baseUrl}/api/returns/${returnId}/documents/${docId}/extract`, {
method: "POST",
headers: { [INTERNAL_AUTH_HEADER]: token },
});
// Response returned immediately, Vercel kills container
// AFTER: await-inline -- extraction guaranteed to complete
export const maxDuration = 300; // 5 minutes
const extractRes = await fetch(
`${baseUrl}/api/returns/${returnId}/documents/${docId}/extract`,
{ method: "POST", headers: { [INTERNAL_AUTH_HEADER]: token } },
);
if (!extractRes.ok) {
logger.warn(`Extraction returned ${extractRes.status}`);
}

On serverless platforms, the container is terminated as soon as the HTTP response is sent. Any un-awaited async work -- fetch() calls, (async () => {})() IIFEs, .then() callbacks -- is silently dropped. No error. No log. The work simply never happens. This pattern affected 3 separate features in the codebase, all with the same root cause. I covered serverless lifecycle traps in an earlier post.
The fix increased admin-facing response time from ~2 seconds to ~60-90 seconds, but guaranteed that extraction, number persistence, and the draft-ready notification email all completed. The tradeoff was acceptable: this was an admin-only path, not a client-facing one.
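A guardrail worth pairing with the fix: typescript-eslint's no-floating-promises rule flags promises that are neither awaited nor explicitly handled, and could plausibly have flagged all 3 affected features at lint time. A minimal flat-config sketch, assuming typescript-eslint v8 (not taken from the incident repo):

```ts
// eslint.config.mjs -- minimal sketch assuming typescript-eslint v8
import tseslint from "typescript-eslint";

export default tseslint.config(
  ...tseslint.configs.recommendedTypeChecked,
  {
    languageOptions: {
      // Type-aware linting is required for no-floating-promises
      parserOptions: { projectService: true },
    },
    rules: {
      // Errors on un-awaited promises like the fire-and-forget fetch above
      "@typescript-eslint/no-floating-promises": "error",
    },
  },
);
```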
What was Bug 3? The regex that matched two mutually exclusive fields
With extraction now running reliably, the numbers populated -- but they were wrong. The client's return showed $101,387 in both the "amount of overpayment" field and the "amount you owe" field. On a real 1040, these fields are mutually exclusive. You either owe money or you overpaid. You cannot have both.
The extraction pipeline used Azure's document intelligence service to read PDF tax returns, then ran the raw output through a layout normalizer that used regex patterns to map extracted text to structured fields. The normalizer's regex for "amount of overpayment" and "amount you owe" both matched labels on the same form because they shared structural patterns. The wages regex had a similar problem: on multi-page CPA drafts, it grabbed the first matching line, which was sometimes a sub-schedule value rather than Box 1z on the main 1040.
The fix required three changes to the normalizer:
- Mutual exclusivity logic: if both overpayment and amount owed were populated, keep only the one whose label appeared later in the document flow (which, on a 1040, is always the correct one) -- see the sketch after this list.
- Header-anchored scoping: wages extraction now anchors to the Form 1040 header before searching for the wages line, preventing sub-schedule values from matching.
- Document type reclassification: 28 legacy documents categorized as generic "1099" were reclassified into specific subtypes (19 as 1099-B, 5 as 1099-INT, 4 as 1099-R) so that type-specific extraction logic could run correctly.
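A minimal sketch of the exclusivity rule from the first item above. The field shape and labelOffset bookkeeping are assumptions for illustration, not the production normalizer:

```ts
// Hypothetical normalizer post-processing step -- illustrative only.
// Assumes each field records the character offset of its matched label.
interface ExtractedField {
  value: number | null;
  labelOffset: number; // position of the matched label in document flow
}

function resolveMutuallyExclusive(
  overpayment: ExtractedField,
  amountOwed: ExtractedField,
): { overpayment: number | null; amountOwed: number | null } {
  // If both populated, keep the one whose label appears later in the
  // document flow -- on a 1040, the later label is the correct one.
  if (overpayment.value !== null && amountOwed.value !== null) {
    return overpayment.labelOffset > amountOwed.labelOffset
      ? { overpayment: overpayment.value, amountOwed: null }
      : { overpayment: null, amountOwed: amountOwed.value };
  }
  return { overpayment: overpayment.value, amountOwed: amountOwed.value };
}
```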
This bug had been silently corrupting data for every complex return where the CPA draft was multi-page. An estimated 12% of all CPA draft uploads were affected, based on a post-fix audit of 45 backfilled returns.
What was Bug 4? The backfill that picked the wrong extraction
After fixing bugs 1-3, a database migration backfilled the correct numbers for all 45 existing returns that had CPA drafts. But two returns displayed obviously wrong data: the total tax equaled the AGI. For a return with $872K AGI, the system was showing $872K in total tax -- a 100% effective tax rate.
The backfill migration used a query to find the latest extraction for each return. But some returns had multiple extraction revisions from different providers (the platform had switched from one Azure service to another mid-season). The migration's ORDER BY created_at DESC clause picked the most recent extraction -- which happened to be from the older, less accurate provider.
-- BEFORE: picks latest extraction (wrong provider)
SELECT * FROM document_extractions
WHERE return_id = $1
ORDER BY created_at DESC
LIMIT 1;
-- AFTER: prefers the correct provider + sanity check
SELECT DISTINCT ON (tr.id) ...
FROM tax_returns tr
JOIN document_extractions de ON ...
WHERE de.total_tax < de.agi -- sanity check: tax cannot exceed income
ORDER BY tr.id,
CASE WHEN de.provider = 'azure-cu' THEN 0 ELSE 1 END,
de.created_at DESC;

The sanity check (total_tax < AGI) caught the misread immediately. Without it, the wrong numbers would have been locked behind the numbers_source = 'cpa_draft' column -- permanently frozen as incorrect. This is the compounding nature of stacked bugs: the fix for Bug 1 (the lock column) would have made Bug 4 permanent if Bug 4 had not been caught.
Why are stacked bugs harder than individual bugs?
Individual bugs have a clean mental model: observe symptom, find cause, apply fix, verify. Stacked bugs break this model in three specific ways:
| Dimension | Individual Bug | Stacked Bugs |
|---|---|---|
| Diagnosis | Symptom points to cause | Symptom points to the top-layer bug; underlying bugs are invisible until the top layer is fixed |
| Fix verification | Fix resolves the symptom | Fix changes system state, potentially revealing new symptoms or hiding existing ones |
| Rollback safety | Revert the commit | Reverting fix #3 re-exposes bug #3, but the system state from fix #1 and #2 has already changed -- rollback may produce a state that never existed before |
| Backfill risk | Low -- data before the fix is consistent | High -- data was written by different combinations of active bugs, so each record may have a different corruption pattern |
| Time to resolution | Proportional to complexity | Exponential -- each layer requires re-observing the entire system |
A 2014 USENIX study of 198 production failures found that 92% of catastrophic failures were caused by incorrect handling of non-fatal errors. In stacked-bug incidents, the "non-fatal error" is often a previous fix that changed system state in unexpected ways. The fix itself becomes the trigger for the next failure.
What debugging approach actually works for compounding failures?
The approach that worked for this incident follows a strict discipline: read the error first, form a hypothesis, trace the data flow, fix one layer, then re-observe the entire system before touching anything else. This mirrors the error-first debugging philosophy I use across all production work.
Step 1: Read the data, not the code
The first instinct when something is wrong is to read the code. For stacked bugs, start with the data instead. Query the database directly. What does the record actually contain? In this case, the return_intelligence row showed null values for all estimated fields. That single observation eliminated dozens of hypotheses about display bugs or caching issues -- the data was not there at the storage layer.
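As a concrete example, the first diagnostic here was a query, not a code read. A sketch using node-postgres -- the column names are inferred from this post, not the real schema:

```ts
// Direct data inspection -- assumes node-postgres (`pg`) and column
// names inferred from this post, not the actual production schema.
import { Client } from "pg";

async function inspectReturn(returnId: string) {
  const client = new Client({ connectionString: process.env.DATABASE_URL });
  await client.connect();
  const { rows } = await client.query(
    `SELECT estimated_agi, estimated_total_tax, estimated_refund
       FROM return_intelligence
      WHERE return_id = $1`,
    [returnId],
  );
  // All-NULL estimates rule out display and caching hypotheses:
  // the data is missing at the storage layer itself.
  console.log(rows[0]);
  await client.end();
}

inspectReturn(process.argv[2]).catch(console.error);
```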
Step 2: Trace writers, not readers
Once you know the data is wrong, find every code path that writes to that table. In this case, there were 3 writers (cron, doc upload, client API). Each one independently could null the data. Tracing readers would have been a dead end -- the readers were correctly displaying what was stored.
Step 3: Fix one layer, deploy, re-observe
Do not batch fixes. Fix the most upstream bug first, deploy it, and re-observe. In this incident, fixing the overwrite bug (Bug 1) was necessary before we could even diagnose the serverless container bug (Bug 2), because the overwrite was erasing all evidence of whether extraction had actually run.
Step 4: Backfill with sanity checks
After the code fixes are stable, backfill affected data -- but never trust the backfill blindly. Add sanity checks (like total_tax < AGI) that catch impossible values. In this incident, the sanity check caught 2 out of 45 returns that the backfill would have incorrectly populated.
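A sketch of what that guard can look like in a backfill loop. The types and helpers here are hypothetical, but the two checks mirror the ones described in this post:

```ts
// Hypothetical backfill sanity check -- illustrative, not the migration code.
interface ReturnNumbers {
  agi: number;
  totalTax: number;
  overpayment: number | null;
  amountOwed: number | null;
}

// Reject impossible values before they are written -- and locked.
function isPlausible(n: ReturnNumbers): boolean {
  if (n.totalTax >= n.agi) return false; // tax cannot exceed income
  if (n.overpayment !== null && n.amountOwed !== null) return false; // 1040 exclusivity
  return true;
}

declare const backfillCandidates: Array<{ returnId: string; numbers: ReturnNumbers }>;
declare function applyBackfill(row: { returnId: string; numbers: ReturnNumbers }): Promise<void>;

async function runBackfill() {
  for (const row of backfillCandidates) {
    // Skip and flag rather than write: the numbers_source lock from
    // fix #1 would otherwise freeze a bad value permanently.
    if (!isPlausible(row.numbers)) {
      console.warn(`Skipping return ${row.returnId}: failed sanity check`);
      continue;
    }
    await applyBackfill(row);
  }
}

runBackfill().catch(console.error);
```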
The compounding lesson: Each fix changes the system. After fixing Bug 1 (data overwrite), the system behaved differently than when Bug 1 was active. Bug 2 was invisible while Bug 1 was active because Bug 1 was erasing the evidence. Stacked bugs require sequential fixing with full re-observation between each layer -- not parallel debugging.
What metrics came out of this incident?
Every incident should produce numbers, not just narratives. Here are the measured outcomes:
- 45 client returns had corrupted estimated tax data and required backfill
- 28 documents were reclassified from generic "1099" to specific subtypes (1099-B, 1099-INT, 1099-R)
- 16 stale extractions were marked as failed and queued for re-extraction
- 3 production commits shipped across 2 days, touching 19 files
- 1,304 lines added, 761 lines removed -- net +543 lines, most of which were defensive filters and sanity checks
- 2 extraction providers (Azure CU vs. Azure Document Intelligence) produced different accuracy levels, requiring provider-preference logic in the backfill
- ~2.2 seconds measured latency for the tax calculation API (federal + state), confirming the inline-await approach was viable
- 18 hours total active debugging time across 2 days
How do you prevent stacked bugs from forming in the first place?
You cannot fully prevent them. Complex systems accumulate latent bugs that only manifest in combination. But you can reduce the severity:
- Single-writer architecture: If 3 code paths write to the same field, you will eventually get a race condition or silent overwrite. Assign one canonical writer per field, enforce it with column-level locks or CHECK constraints.
- Await everything on serverless: On platforms like Vercel, any un-awaited async work is a latent bug. The code works locally, passes CI, and fails silently in production. Adopt a zero-fire-and-forget policy for any work that must complete.
- Sanity checks in data pipelines: Basic assertions like "tax cannot exceed income" or "a return cannot have both overpayment and amount owed" catch extraction errors before they propagate. These checks cost minutes to write and save hours of debugging.
- Provider-aware backfills: If your pipeline has multiple data providers, the backfill must know which provider to prefer. A simple ORDER BY created_at DESC is not sufficient when providers have different accuracy profiles.
According to Microsoft Research's empirical study on production ML bugs, 47% of data pipeline bugs are caused by implicit assumptions about data shape or provenance. The lock column pattern directly addresses provenance. The sanity checks address data shape. Together, they form what I think of as a "corruption firewall" -- they do not prevent bugs, but they prevent bugs from propagating into permanent data damage.
What does this incident teach about postmortem culture?
Most postmortem templates ask for "root cause." For stacked bugs, this question is misleading. There is no single root cause. There are 4 interacting causes, each of which was necessary but not sufficient to produce the observed failure. A more useful framework:
Do not ask "What was the root cause?" Ask "What were the layers, and which layer, if fixed first, would have prevented the cascade?"
In this incident, fixing Bug 1 (the data overwrite) first was essential because it was the most upstream failure. But if we had only fixed Bug 1 and stopped, bugs 2, 3, and 4 would have continued silently corrupting new data. The postmortem value came from peeling all four layers, not from identifying one "root" cause.
This is why I advocate for incident narratives over postmortem templates. Templates optimize for categorization. Narratives optimize for understanding. A timeline table showing "what was observed vs. what was actually wrong" communicates more than any 5-Whys exercise. The 5-Whys framework assumes a linear causal chain. Stacked bugs form a causal graph.
Frequently Asked Questions
What is a stacked bug in software engineering?
A stacked bug is a production defect that only becomes visible after another bug is fixed. Multiple stacked bugs form a compound incident where each fix changes system state, revealing or hiding other failures. They are disproportionately costly: while only 8% of incidents involve 4+ interacting bugs, they account for roughly 31% of total resolution time across engineering organizations.
How do you debug compounding production incidents?
Fix one layer at a time, starting with the most upstream bug. After each fix, deploy and re-observe the entire system before diagnosing the next layer. Do not batch fixes -- each fix changes the system state, which means the next bug may present differently than expected. Trace data writers (not readers), add sanity checks to backfills, and document what each layer revealed.
Why does fire-and-forget code fail on serverless platforms?
Serverless platforms like Vercel, AWS Lambda, and Google Cloud Functions terminate the container as soon as the HTTP response is sent. Any un-awaited async work (fetch calls, IIFEs, .then callbacks) is silently dropped. The code appears to work in local development and CI because those environments do not kill the process. The fix is to await all async work that must complete, accepting longer response times as the tradeoff.
When should you use a lock column vs. application-level guards?
Use a database-level lock column (with a CHECK constraint) when multiple code paths can write to the same field and you need to enforce provenance -- meaning you need to know and control which writer set the value. Application-level guards are sufficient when there is a single writer. The lock column pattern is especially valuable in data pipelines where background jobs, API handlers, and cron tasks all touch the same table.
How many production incidents involve multiple bugs?
According to research published in ACM Queue and USENIX OSDI proceedings, approximately 62% of production incidents involve more than one contributing failure. The majority are 2-bug incidents (interaction between a latent defect and a triggering event). Incidents with 4+ interacting bugs are rare (~8%) but consume a disproportionate share of engineering time due to the exponential complexity of diagnosing interacting failures.
Dinesh Challa is an AI Product Manager building production software with Claude Code. Follow him on LinkedIn.
Published April 11, 2026. Part of a series on AI testing, production debugging, and building reliable data pipelines in serverless environments.