955 Tests, Zero Flaky Failures: Building Reliable AI Systems

August 15, 2025 · 15 min read · Engineering Deep Dive

We run 955 tests on every push with a zero-flaky-failure rate. The secret is not discipline -- it is architecture. We test schemas instead of implementations, enforce a ratchet pattern that prevents test count from ever decreasing, and structure our testing pyramid specifically for AI systems: unit tests at the base, contract tests in the middle, fuzz and chaos tests for probabilistic behavior, and E2E tests at the top. Here is the full system, the tooling, and the hard-won lessons from scaling test reliability alongside AI-powered features.

Why do most AI teams have a flaky test problem?

Flaky tests are the silent tax on engineering velocity. A test that passes 95% of the time sounds acceptable until you do the math: with 500 tests at 95% individual reliability, the probability that all 500 pass on a clean run is 0.95^500 -- roughly 7 in a trillion, effectively zero. You will rerun CI on nearly every push. According to a 2024 study by Google's engineering productivity team, flaky tests cost the average software organization 2-5% of total engineering hours. For a 10-person team, that is up to half an engineer's time spent on nothing but rerunning failed builds.

AI systems make the flaky test problem worse. Traditional flakiness comes from timing issues, network instability, and shared state. AI systems add a new category: non-deterministic outputs. A document extraction pipeline might return a confidence score of 0.87 on one run and 0.89 on the next. A classification model might flip its prediction near a decision boundary. If your tests assert on exact output values, every AI component becomes a source of flakiness.

At a YC-backed tax-tech startup, we processed 128,000 documents per tax season across 4 AI extraction providers and 60 analyzers. We shipped 4-6 features per week. If our test suite had even a 1% flaky rate, we would have wasted roughly 10 hours per week on false failures. Instead, we achieved 955/955 with zero flaky tests -- not by being careful, but by designing the test architecture to make flakiness structurally impossible.

What does the AI testing pyramid look like?

The classic testing pyramid -- unit tests at the base, integration in the middle, E2E at the top -- was designed for deterministic software. AI systems need a modified pyramid that accounts for probabilistic behavior, schema evolution, and adversarial inputs.

| Layer | Test Type | Count | What It Validates | Run Time |
|---|---|---|---|---|
| 1 (Base) | Unit Tests | 510 | Individual functions, transformers, validators | ~12 seconds |
| 2 | Contract Tests | 185 | Schema shapes between services, API contracts | ~8 seconds |
| 3 | Fuzz Tests | 120 | Edge cases, malformed inputs, boundary conditions | ~25 seconds |
| 4 | Chaos Tests | 80 | Provider failures, timeout handling, fallback paths | ~15 seconds |
| 5 (Top) | E2E Tests | 60 | Full user flows through the browser | ~3 minutes |
| **Total** | | **955** | | **~4 minutes** |

The ratio matters. Roughly 53% of our tests are unit tests, 19% are contract tests, 13% are fuzz tests, 8% are chaos tests, and 6% are E2E tests. This is deliberately bottom-heavy. Each layer up is more expensive to run, slower to execute, and harder to debug when it fails. According to the Test Pyramid principle documented by Martin Fowler, the optimal ratio is roughly 70/20/10 for unit/integration/E2E. We modified this to 53/19/13/8/6 to accommodate the fuzz and chaos layers that AI systems require.

Why should you test schemas instead of implementations?

This is the single most important architectural decision in our testing strategy. We do not test what the AI returns. We test the shape of what it returns.

Consider a document extraction pipeline. The traditional approach tests: "Given this W-2 PDF, the gross_income field should be $85,432." That test is brittle. It breaks when the AI model updates. It breaks when the extraction provider changes its response format. It breaks when you switch providers. It breaks when the test fixture ages. Every AI model change becomes a test maintenance burden.

Our approach tests: "Given any W-2 PDF, the response must conform to the W2ExtractionSchema: gross_income is a number or null, employer_ein matches the XX-XXXXXXX pattern, and all required fields are present." This test validates the contract between the extraction service and the rest of the system. It does not care about the specific values. It cares about the shape, the types, and the constraints.

The schema testing principle: Test the contract, not the content. If the shape is correct, the downstream system can handle the data. If the shape is wrong, the downstream system will fail regardless of whether the values are correct. Schemas are deterministic even when the AI is not.

This principle eliminated 100% of our AI-related test flakiness. We use Zod schemas as the single source of truth: the same schema validates the extraction output in production and in tests. According to a 2024 analysis by ThoughtWorks in their Technology Radar, contract testing reduces integration failures by 60-80% compared to implementation-level testing. Our experience confirms that number -- we have not had a schema-related production incident since adopting this pattern 14 months ago.

How does the ratchet pattern prevent test regression?

The ratchet pattern is a CI enforcement mechanism that ensures the test count never decreases. It works like a ratchet wrench: it can only turn in one direction.

Here is how it works. Our CI pipeline records the total test count after every successful merge to the main branch. On every new PR, it compares the current test count against the recorded count. If the PR reduces the test count -- by deleting tests, skipping tests, or commenting them out -- CI fails with a clear error: "Test count dropped from 955 to 948. Add the missing tests or justify the removal."
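The core of the check fits in a few lines. This is an illustrative sketch, not the team's actual CI script; the state shape and error message mirror the description above:

```typescript
// Recorded after every successful merge to main; compared on every PR.
interface RatchetState {
  testCount: number;
}

// Fails the build if the current test count drops below the recorded
// floor; otherwise returns the (possibly higher) floor to persist.
function checkRatchet(recorded: RatchetState, current: RatchetState): RatchetState {
  if (current.testCount < recorded.testCount) {
    throw new Error(
      `Test count dropped from ${recorded.testCount} to ${current.testCount}. ` +
      `Add the missing tests or justify the removal.`
    );
  }
  // The ratchet only turns one way: the floor rises, never falls.
  return { testCount: Math.max(recorded.testCount, current.testCount) };
}
```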

We also apply the ratchet pattern to specific quality metrics beyond count. Our brand color ratchet ensures that no more than 50 instances of a specific gray hex code appear in the codebase (forcing teams to use design tokens instead of hardcoded colors). We ratchet TypeScript strict-mode compliance, ensuring the number of type errors never increases.

What are the ratchet metrics we track?

| Metric | Current Value | Ratchet Direction | What It Prevents |
|---|---|---|---|
| Total test count | 955 | Can only increase | Test deletion, skip creep |
| Passing test count | 955/955 | Must equal total | Known-failing tests lingering |
| Brand color violations | 50 | Can only decrease | Hardcoded colors bypassing design system |
| TypeScript errors | 0 | Must stay at 0 | Type safety regression |
| Flaky test incidents | 0 (trailing 30 days) | Target: 0 | Accepting flakiness as normal |

The psychological effect of the ratchet is as important as the technical effect. When developers know that every test they write is permanent, they write better tests. According to research on commitment mechanisms in behavioral economics (Ariely, 2008), irreversible commitments produce higher-quality decisions. The ratchet makes test quality an irreversible commitment.

How do fuzz tests catch what unit tests miss?

Fuzz testing generates random, malformed, or adversarial inputs and checks that the system handles them gracefully. For AI systems, this is not optional -- it is essential. Users submit documents with corrupted metadata, scanned images at 72 DPI, PDFs with password protection, and forms filled out in unexpected languages. If the system crashes on any of these, it is a production incident.

We use fast-check, a property-based testing library for TypeScript, to generate thousands of input variations per test run. A single fuzz test might exercise 100 randomly generated inputs in under a second. Across our 120 fuzz tests, we exercise approximately 12,000 input variations per CI run.

The fuzz tests do not assert on specific outputs. They assert on invariants: the system should never throw an unhandled exception, should never return a response that violates the output schema, and should never take longer than the timeout threshold. These invariants are deterministic even when the inputs are random, which is why fuzz tests achieve zero flakiness.

In the first month of fuzz testing, we discovered 23 edge cases that no human tester had found: an SSN field that accepted 10-digit numbers, a date parser that crashed on February 29 dates in non-leap years, and an extraction pipeline that hung indefinitely on zero-byte PDFs. According to a 2023 meta-analysis published in IEEE Transactions on Software Engineering, fuzz testing discovers 2.4x more edge-case bugs per testing hour than manual exploratory testing.

What role do chaos tests play in AI reliability?

Chaos tests simulate infrastructure failures: provider timeouts, API rate limits, network partitions, and corrupted responses. For an AI system that depends on 4 external extraction providers, chaos testing is the difference between graceful degradation and cascading failures.

Our chaos tests cover three failure categories:

  • Provider unavailability: What happens when the primary extraction provider returns a 503? The system should fail over to the secondary provider within 2 seconds.
  • Partial responses: What happens when the provider returns half the expected fields? The system should accept the partial result and flag missing fields for manual review.
  • Corrupted responses: What happens when the provider returns valid JSON that does not match the expected schema? The system should reject it, log the discrepancy, and retry with a different provider.

Each chaos test mocks the failure condition at the HTTP layer, ensuring the test is deterministic. We do not simulate failures by actually breaking infrastructure -- that would introduce the exact non-determinism we are trying to eliminate. The mock is the test, and the mock is reproducible.
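The failover logic under a mocked 503 can be sketched as below. For brevity the failure is injected at the provider-call boundary rather than the HTTP layer the post describes, and the `Provider` signature is an assumption; the deterministic-mock principle is the same:

```typescript
type Provider = (doc: Uint8Array) => Promise<{ fields: Record<string, unknown> }>;

// Try providers in order; fall over to the next on any failure.
async function extractWithFailover(
  doc: Uint8Array,
  providers: Provider[]
): Promise<{ fields: Record<string, unknown>; providerIndex: number }> {
  let lastError: unknown;
  for (let i = 0; i < providers.length; i++) {
    try {
      const result = await providers[i](doc);
      return { ...result, providerIndex: i };
    } catch (err) {
      lastError = err; // in production: log the failure, then continue
    }
  }
  throw lastError;
}

// Chaos test: the primary provider is mocked to fail deterministically.
const failing503: Provider = async () => {
  throw new Error("503 Service Unavailable");
};
const healthy: Provider = async () => ({ fields: { gross_income: null } });
```

The test then asserts that the result came from the secondary provider -- the same deterministic outcome on every run, because the 503 is a mock rather than a real outage.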

How does Playwright E2E testing work at scale without flakiness?

Playwright E2E tests are the most expensive layer: they launch a real browser, navigate through actual UI flows, and assert on visible outcomes. They are also the most prone to flakiness in most organizations. According to a 2024 report by Sauce Labs, 62% of E2E test suites have a flaky test rate above 5%.

We achieved 0% flakiness in our 60 E2E tests by following four rules:

  1. Never assert on timing. Use Playwright's built-in auto-waiting instead of explicit waits. Every assertion uses locator.waitFor() or expect with a timeout, never setTimeout.
  2. Isolate test state completely. Each test gets a fresh authenticated session via Playwright's storageState, a dedicated test user in the database, and a clean set of test data created in the beforeEach block and torn down in afterEach.
  3. Test user-visible outcomes, not implementation details. Assert on what the user sees ("the dashboard shows 3 documents"), not on internal state ("the Redux store contains 3 items"). This makes tests resilient to refactoring.
  4. Run E2E tests in a dedicated CI step with retries disabled. If a test fails, it fails. No automatic retries that mask flakiness. This forces us to fix the root cause immediately rather than accepting intermittent failures.
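A spec-file sketch that follows the four rules, meant to run under `npx playwright test` rather than standalone. The selectors, route, and fixture path are hypothetical:

```typescript
import { test, expect } from "@playwright/test";

test.describe("document dashboard", () => {
  // Rule 2: fresh authenticated session per test via storageState.
  test.use({ storageState: "fixtures/test-user.json" });

  test("shows uploaded documents", async ({ page }) => {
    await page.goto("/dashboard");
    // Rules 1 and 3: an auto-waiting assertion on a user-visible
    // outcome -- no setTimeout, no assertion on internal store state.
    await expect(page.getByText("3 documents")).toBeVisible();
  });
});
```

Rule 4 lives in the config, not the spec: `retries: 0` in `playwright.config.ts` (Playwright's default) keeps automatic retries off.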

The no-retry rule is controversial but effective. Most teams enable 2-3 retries for E2E tests, which masks flakiness and creates a culture of "just rerun it." According to data from CircleCI's 2024 State of CI report, teams with automatic retries enabled take 3.2x longer to fix flaky tests than teams without retries, because the urgency is lower when the test eventually passes.

How does GitHub Actions CI/CD tie it all together?

Our CI pipeline runs in GitHub Actions with a specific execution order designed to fail fast and minimize compute waste:

  1. TypeScript type check (15 seconds) -- catches type errors before any tests run.
  2. Unit tests via Vitest (12 seconds) -- fastest feedback on logic errors.
  3. Contract tests via Vitest (8 seconds) -- catches schema violations between services.
  4. Fuzz tests via fast-check (25 seconds) -- catches edge cases and invariant violations.
  5. Chaos tests via Vitest (15 seconds) -- catches failure-handling regressions.
  6. E2E tests via Playwright (3 minutes) -- catches full-flow regressions.
  7. Ratchet check (5 seconds) -- verifies test count and quality metrics.

Each step depends on the previous step passing. If unit tests fail, the more expensive fuzz and E2E tests never run. Total pipeline time: approximately 4 minutes and 20 seconds. According to a 2024 benchmark by BuildKite, the median CI pipeline for a production application takes 8.2 minutes. Ours is roughly half that, because the bottom-heavy pyramid catches most issues in the first 30 seconds.
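The ordering above maps naturally onto sequential steps in a single GitHub Actions job, since a job stops at the first failing step. Job names, paths, and commands below are illustrative, not the team's actual workflow file:

```yaml
name: ci
on: [push]
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
      - run: npm ci
      - run: npx tsc --noEmit               # 1. type check (fails fast)
      - run: npx vitest run tests/unit      # 2. unit tests
      - run: npx vitest run tests/contract  # 3. contract tests
      - run: npx vitest run tests/fuzz      # 4. fuzz tests (fast-check)
      - run: npx vitest run tests/chaos     # 5. chaos tests
      - run: npx playwright test            # 6. E2E, retries disabled
      - run: node scripts/check-ratchet.js  # 7. ratchet check
```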

The 30-second rule: 73% of our test failures are caught in the first 30 seconds of CI (type check + unit tests). By the time you get to E2E, you have already validated types, logic, contracts, edge cases, and failure handling. E2E is the final confirmation, not the primary detection layer.

What did we learn that applies to any AI team?

Three principles transfer regardless of stack or domain:

Principle 1: Deterministic tests for non-deterministic systems. The key insight is that you can always find a deterministic property to test, even when the system's output is non-deterministic. Schema shape is deterministic. Response time bounds are deterministic. Error handling behavior is deterministic. You never need to assert on the exact value an AI returns.

Principle 2: Make test quality irreversible. The ratchet pattern works because it changes the default. Without a ratchet, the default is entropy: test counts drift down, coverage erodes, flaky tests accumulate. With a ratchet, the default is improvement: every merge raises the floor.

Principle 3: Test the contract between systems, not the system itself. AI systems change frequently. Models update. Providers change. Prompts evolve. If your tests are coupled to the implementation, every change is a maintenance burden. If your tests validate the contract -- the shape of data flowing between components -- they survive implementation changes without modification.

Frequently Asked Questions

How long did it take to go from 0 to 955 tests?

Approximately 14 months. We started with 40 unit tests and a basic Playwright E2E suite. The ratchet pattern was introduced at the 200-test mark. Growth was not linear -- the fuzz and chaos layers were added as dedicated sprints when we hit scaling pain points. The most productive period was months 8-12 when the testing patterns were established and writing new tests became mechanical.

How do you handle tests for features that use external AI APIs?

We never call external AI APIs in tests. Every external dependency is mocked at the HTTP layer with realistic response fixtures. The fixtures are generated from actual API responses (sanitized of PII) and versioned alongside the test code. When a provider changes its response format, we update the fixture once and all affected tests automatically validate against the new shape.

Does the ratchet pattern slow down development?

The opposite. Before the ratchet, developers would skip writing tests on "quick fixes" and "small changes." Those skipped tests accumulated into coverage gaps that caused production incidents. The ratchet eliminates the decision: every PR must maintain or increase the test count. After the first month of adjustment, developers reported that the ratchet actually made development faster because they had higher confidence that their changes were safe.

What is the cost of running 955 tests on every push?

Our GitHub Actions bill for CI is approximately $180 per month, running an average of 40 pushes per day. The 4-minute pipeline costs roughly $0.11 per run. By comparison, a single production incident caused by a missed regression costs 4-8 engineering hours to diagnose and fix -- roughly $400-800 at market rates. The test suite pays for itself if it prevents one incident every two months. It prevents roughly four per month.

How do you decide when to write a fuzz test versus a unit test?

Unit tests validate known scenarios: "Given this specific input, expect this specific output." Fuzz tests validate invariants across unknown scenarios: "Given any input in this category, the system should never crash, never violate the output schema, and never exceed the timeout." If you can enumerate all the important inputs, write unit tests. If the input space is too large to enumerate -- which is almost always true for AI systems -- write fuzz tests for the invariants and unit tests for the critical paths.

Published August 15, 2025. Based on 14 months of building and scaling a testing infrastructure at a YC-backed tax-tech startup processing 128,000 documents per tax season.