Running a 15-Group Nightly QA System at 2 AM: What It Caught That CI Missed
April 11, 2026 · 12 min read · AI Testing
Last Updated: 2026-04-11
A nightly QA system running 35 test files across 15 groups at 2 AM via GitHub Actions catches integration failures that CI pipelines structurally cannot: stale third-party API contracts, data consistency drift, extraction model accuracy regression, and cross-feature regressions that only surface with production-shaped data. Cost is roughly $0.50 per night. In the first month, the system caught three production-blocking issues before any client saw them.
Why does CI give you a false sense of safety?
Every push to our codebase triggers a CI pipeline: TypeScript type checks, 955 unit tests, linting, and build verification. The pipeline passes in under 4 minutes. The green checkmark feels reassuring. It is also incomplete. I wrote about building those 955 tests previously -- they are essential, but they test code in isolation. They mock external services. They use synthetic data. They run in a hermetic environment that bears little resemblance to production.
The gap became obvious after three incidents in two weeks. A Stripe webhook handler silently stopped processing events because Stripe renamed a field in their payload. A database migration passed validation but violated constraints when real application data hit the new schema. An extraction model drifted in accuracy after the provider shipped an update we never consented to. CI passed on all three. Each was a production-blocking bug discovered by users, not by our test suite.
According to Datadog's 2025 State of Software Delivery report, 68% of production incidents originate from integration failures rather than code-level bugs. Our experience pointed the same way. The problem is not that CI is broken. The problem is that CI is structurally unable to test the boundaries between your code and everything it depends on.
What does a 15-group nightly QA system look like?
The system is a single GitHub Actions workflow file that runs at 2 AM UTC every night. It executes 35 test files organized into 15 groups, each targeting a different failure class. Results are written to a database table with retry logic and run URLs. If any group fails, a daily summary email goes out at 8 AM via Resend.
How are the 15 test groups organized?
Each group addresses a specific category of failure that unit tests and CI pipelines cannot catch. The groups are not arbitrary -- they map to the failure modes we actually experienced in production:
- Payment webhook consistency -- verifies Stripe webhook payloads still match our handler expectations against a live test-mode Stripe account
- Extraction pipeline accuracy -- runs 50 known documents through the extraction pipeline and measures accuracy against ground truth, flagging any drift beyond 2%
- Email delivery verification -- sends a test email via Resend and confirms delivery via the Resend API within 30 seconds
- Auth flow validation -- runs the full signup, login, password reset, and session refresh flows against staging
- Database constraint validation -- inserts production-shaped data through the application layer and verifies all constraints hold
- API response schema validation -- calls every public API route and validates response shapes against Zod schemas using a schema introspection library
- Third-party service health -- pings external dependencies (payment processor, email provider, document storage, AI providers) and validates API keys are live
- Form-API contract auditing -- 25 guard tests that verify frontend form field names match API route expectations, catching mismatches introduced by independent frontend and backend changes
- Cross-feature regression -- tests workflows that span multiple features (upload a document, extract data, generate a draft, send an email) as a single end-to-end path
- Fuzz testing -- 50 property-based tests using fast-check that generate random but structurally valid inputs, catching edge cases that handwritten test data misses (a minimal example follows this list)
- Chaos testing -- 12 fault injection tests that simulate Supabase, Stripe, Redis, and Resend being unavailable, verifying graceful degradation
- Security surface scanning -- DAST-style probes against staging endpoints checking for open redirects, missing auth, and header misconfigurations
- Accessibility auditing -- automated accessibility checks against 6 critical user-facing pages
- Performance regression -- Lighthouse CI on 6 URLs, alerting if any score drops more than 5 points from the previous run
- Data consistency validation -- queries the database for orphaned records, referential integrity violations, and state machine inconsistencies that can accumulate over time
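To make the fuzz group concrete, here is a minimal property-based test in the fast-check style. The parseCurrency function and its import path are illustrative stand-ins, not the actual pipeline code:

```typescript
import { test, expect } from 'vitest';
import fc from 'fast-check';
import { parseCurrency } from '@/lib/extraction/parse'; // hypothetical import path

// Property: any structurally valid currency string must round-trip through the
// parser to the number it encodes -- across thousands of generated cases
test('parseCurrency handles any well-formed currency string', () => {
  fc.assert(
    fc.property(
      fc.record({
        dollars: fc.integer({ min: 0, max: 9_999_999 }),
        cents: fc.integer({ min: 0, max: 99 }),
        useCommas: fc.boolean(),
      }),
      ({ dollars, cents, useCommas }) => {
        // Build inputs like "$1,234.56" or "$1234.56"
        const base = useCommas ? dollars.toLocaleString('en-US') : String(dollars);
        const raw = `$${base}.${String(cents).padStart(2, '0')}`;
        expect(parseCurrency(raw)).toBeCloseTo(dollars + cents / 100, 2);
      }
    )
  );
});
```

When a property fails, fast-check shrinks the input to a minimal reproducing case, which is what makes a randomly generated 2 AM failure actionable the next morning.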
What does the workflow file look like?
```yaml
# .github/workflows/nightly-qa.yml
name: Nightly QA

on:
  schedule:
    - cron: '0 2 * * *'   # 2 AM UTC every night
  workflow_dispatch: {}    # manual trigger for debugging

jobs:
  nightly-qa:
    runs-on: ubuntu-latest
    timeout-minutes: 30
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
          cache: 'npm'
      - run: npm ci

      # Form-API contract audits (25 tests)
      - name: Form-API contracts
        run: npx vitest run tests/unit/form-api-contracts.test.ts

      # Fuzz tests (50 property-based tests)
      - name: Fuzz testing
        run: npx vitest run tests/fuzz/

      # Chaos tests (12 fault injection tests)
      - name: Chaos testing
        run: npx vitest run tests/chaos/

      # Remaining groups: E2E + integration suites (Playwright)
      - name: Integration and E2E suites
        run: npx playwright test --project=nightly
        env:
          VERCEL_AUTOMATION_BYPASS_SECRET: ${{ secrets.BYPASS_SECRET }}

      # Write results to the database
      - name: Record results
        if: always()
        run: npx tsx scripts/write-test-results.ts
        env:
          SUPABASE_URL: ${{ secrets.SUPABASE_URL }}
          SUPABASE_SERVICE_KEY: ${{ secrets.SUPABASE_SERVICE_KEY }}
```

The write-test-results.ts script parses test output, maps each test to its group, and writes results to a test_results table with columns for suite name, pass/fail, duration, error message, and a link back to the GitHub Actions run. A separate scheduled job queries this table at 8 AM and sends a summary email if any group failed.
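For reference, here is a condensed sketch of what a script like write-test-results.ts can look like. The parsing of raw test output is elided, and the intermediate file name and column names are assumptions based on the description above, not the actual implementation:

```typescript
// scripts/write-test-results.ts -- condensed sketch, not the full implementation
import { readFileSync } from 'node:fs';
import { createClient } from '@supabase/supabase-js';

const supabase = createClient(process.env.SUPABASE_URL!, process.env.SUPABASE_SERVICE_KEY!);

// Assumed shape of one parsed result; the real parser maps vitest/Playwright
// output to this structure and tags each test with its group
interface TestResult {
  suite: string;             // e.g. 'payment-webhook-consistency'
  test_name: string;
  passed: boolean;
  duration_ms: number;
  error_message: string | null;
  run_url: string;           // link back to the GitHub Actions run
}

async function main() {
  const results: TestResult[] = JSON.parse(
    readFileSync('test-output/results.json', 'utf8') // assumed intermediate file
  );

  // Retry the insert once: a transient network failure at 2 AM should not
  // silently drop a night's worth of results
  for (let attempt = 0; attempt < 2; attempt++) {
    const { error } = await supabase.from('test_results').insert(results);
    if (!error) return;
    if (attempt === 1) throw error;
  }
}

main().catch((err) => {
  console.error(err);
  process.exit(1);
});
```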
What did the nightly QA catch that CI missed?
In the first 30 days of running the nightly system, it caught three production-blocking issues. Each would otherwise have reached clients and been discovered through normal support channels instead.
Issue 1: A Stripe webhook field rename
Stripe renamed a nested field in their invoice.payment_succeeded webhook payload from payment_method_details.card.last4 to payment_method.card.last4. The change was documented in Stripe's changelog but not flagged as a breaking change because the old field was still present -- just empty.
Our CI tests mocked the Stripe webhook payload using our own fixture files. The fixtures still had the old field name. Every test passed. In production, the webhook handler processed the event, found an empty string where it expected a card number, and silently stored "unknown" as the payment method. No error was thrown. No log was written. The nightly QA caught it because Group 1 (payment webhook consistency) sends a real test-mode Stripe event and validates every field in the response.
```typescript
// The test that caught it
test('invoice.payment_succeeded contains valid card metadata', async () => {
  // Trigger a real Stripe test-mode payment
  const invoice = await stripe.invoices.create({
    customer: TEST_CUSTOMER_ID,
    auto_advance: true,
  });
  // ... payment completion logic ...

  // Validate the webhook payload we received
  const event = await waitForWebhook('invoice.payment_succeeded', 15000);

  // Read from the same path the production handler reads from
  const card = event.data.object.payment_method_details?.card;

  // This assertion failed: the old path now arrived empty because Stripe
  // had moved the value to payment_method.card.last4
  expect(card?.last4).toMatch(/^\d{4}$/);
});
```

Time to fix: 20 minutes. Time it would have taken to discover through client reports: days to weeks, because the failure was silent.
Issue 2: A database constraint violated by real data
A migration added a unique constraint on (user_id, tax_year, document_type). The migration passed because the existing data in the migration test database was clean. But in production, 3 users had legitimately uploaded two W-2 forms for the same tax year (two employers). The constraint would have blocked their uploads with a cryptic 500 error.
Group 5 (database constraint validation) caught it because it inserts production-shaped data through the application layer, including edge cases like multiple documents of the same type. The test generated a user with two W-2s for the same year and hit the constraint violation before any real user did.
Fix: changed the constraint to (user_id, tax_year, document_type, employer_ein) -- still enforcing uniqueness but at the correct granularity. Measuring extraction accuracy at scale taught us that edge cases in tax data are the norm, not the exception.
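A simplified sketch of the kind of check Group 5 runs. The helpers (createTestUser, uploadDocument), their import path, and the field names are illustrative; the point is that the data goes through the application layer rather than raw SQL:

```typescript
import { test, expect } from 'vitest';
// Hypothetical application-layer test helpers -- not the real module path
import { createTestUser, uploadDocument } from '@/tests/helpers/factories';

test('a user can store two W-2s for the same tax year (multi-employer)', async () => {
  const user = await createTestUser();

  // Same user, same tax year, same document type, different employer EINs --
  // exactly the data shape that violated the original unique constraint
  const first = await uploadDocument(user.id, {
    taxYear: 2025,
    documentType: 'w2',
    employerEin: '12-3456789', // illustrative EIN
  });
  const second = await uploadDocument(user.id, {
    taxYear: 2025,
    documentType: 'w2',
    employerEin: '98-7654321', // illustrative EIN
  });

  // Before the constraint fix, the second insert failed at the database layer
  expect(first.status).toBe('stored');
  expect(second.status).toBe('stored');
});
```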
Issue 3: Extraction model accuracy drift
Our document extraction provider shipped a model update on a Tuesday. By Wednesday night, Group 2 (extraction pipeline accuracy) flagged that accuracy on handwritten documents had dropped from 96.2% to 89.7% -- a 6.5 percentage point regression. The provider's release notes described the update as "improved performance on typed documents," with no mention of handwritten regression.
The nightly test runs 50 ground-truth documents through the extraction pipeline and compares results field by field. The alert threshold is a 2-point drift; the 6.5-point drop triggered an immediate alert. We pinned the model version, contacted the provider, and had a fix deployed before the next business day.
Without the nightly system, we would have discovered this through gradually increasing client complaints over days or weeks. The accuracy drift was small enough per-document that no single user would have reported it as a bug. It would have manifested as a vague increase in "the numbers look wrong" support tickets.
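A condensed sketch of how a drift check like Group 2's can be structured. The extraction helper, the fixture format, and the baseline field are assumptions, not the production implementation:

```typescript
import { test, expect } from 'vitest';
import { extractDocument } from '@/lib/extraction'; // hypothetical import path
import groundTruth from './fixtures/ground-truth.json'; // 50 documents with known-correct fields

const DRIFT_THRESHOLD = 0.02; // flag anything more than 2 points below baseline

test('extraction accuracy stays within 2 points of the recorded baseline', async () => {
  let correct = 0;
  let total = 0;

  for (const doc of groundTruth.documents) {
    const extracted = await extractDocument(doc.path);
    for (const [field, expected] of Object.entries(doc.fields)) {
      total += 1;
      if (extracted[field] === expected) correct += 1;
    }
  }

  const accuracy = correct / total;
  expect(accuracy).toBeGreaterThanOrEqual(groundTruth.baselineAccuracy - DRIFT_THRESHOLD);
}, 10 * 60 * 1000); // generous timeout: 50 real extraction calls take a while
```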
Pattern: All three issues share a common trait -- they were invisible to unit tests because unit tests mock the external world. The nightly QA system tests the seams between your code and everything outside it: payment processors, databases with real data shapes, AI model providers. These seams are where production breaks.
How does CI compare to nightly QA, synthetic monitoring, and manual QA?
| Dimension | CI Pipeline | Nightly QA | Synthetic Monitoring | Manual QA |
|---|---|---|---|---|
| When it runs | Every push | Scheduled (e.g., 2 AM) | Continuous (every 5-15 min) | Before releases |
| What it tests | Code correctness in isolation | Integration health with real dependencies | Availability and latency | User experience and edge cases |
| External services | Mocked | Real (test mode) | Real (production) | Real (staging) |
| Data shape | Synthetic fixtures | Production-shaped, generated | Minimal probes | Manual test cases |
| Feedback speed | 2-5 minutes | Next morning | Minutes | Hours to days |
| Cost | ~$30-100/month (CI minutes) | ~$15/month ($0.50/night) | $50-500/month (Datadog, etc.) | $5,000+/month (headcount) |
| Catches | Type errors, logic bugs, regressions in code | API contract drift, data issues, model regression | Downtime, latency spikes | UX bugs, visual issues, workflow gaps |
| Best for | Fast feedback on code changes | Catching slow-burn integration rot | Alerting on acute outages | Validating subjective quality |
The key insight: these four approaches are complementary, not competing. CI catches code bugs instantly. Nightly QA catches integration rot overnight. Synthetic monitoring catches outages in minutes. Manual QA catches what automation cannot perceive. Running only CI is like locking your front door but leaving every window open.
What is the staging health check pattern?
Alongside the nightly QA system, we run a simpler but equally valuable pattern: service health endpoints that validate external connections are alive. The pattern emerged after a missing API key on a staging deployment silently broke the payment flow for 48 hours.
```typescript
// app/api/health/stripe/route.ts
import Stripe from 'stripe';
import { NextResponse } from 'next/server';

export async function GET(request: Request) {
  // Protected by CRON_SECRET to prevent external probing
  const authHeader = request.headers.get('authorization');
  if (authHeader !== `Bearer ${process.env.CRON_SECRET}`) {
    return NextResponse.json({ error: 'Unauthorized' }, { status: 401 });
  }

  try {
    const stripe = new Stripe(process.env.STRIPE_SECRET_KEY!);
    const balance = await stripe.balance.retrieve();
    return NextResponse.json({
      status: 'healthy',
      provider: 'stripe',
      timestamp: new Date().toISOString(),
      // Confirm we got a real response, not a cached/stale one
      available: balance.available.length > 0,
    });
  } catch (error) {
    return NextResponse.json({
      status: 'unhealthy',
      provider: 'stripe',
      error: error instanceof Error ? error.message : 'Unknown error',
    }, { status: 503 });
  }
}
```

A GitHub Actions workflow pings this endpoint every 6 hours and sends an email via Resend on any non-200 response. The entire workflow is 30 lines of YAML. The cost is negligible. The value is that we know within 6 hours if any external dependency becomes unreachable -- before any user encounters the failure.
We follow the same pattern for every critical integration: /api/health/stripe, /api/health/email, /api/health/storage. Each endpoint makes a real API call to the provider (not just a DNS check) and validates the response shape. GitHub's documentation on scheduled workflows notes that cron triggers can be delayed during periods of high load, which is acceptable for health checks running on a 6-hour interval.
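As a reference point, here is a minimal sketch of the script such a workflow could run. The endpoint list, environment variable names, and email addresses are illustrative; the notification uses Resend's standard emails.send call:

```typescript
// scripts/check-health.ts -- illustrative sketch, run by a scheduled workflow
import { Resend } from 'resend';

const ENDPOINTS = ['stripe', 'email', 'storage']; // maps to /api/health/<name>
const BASE_URL = process.env.STAGING_URL!;        // assumed environment variable

async function main() {
  const failures: string[] = [];

  for (const name of ENDPOINTS) {
    try {
      const res = await fetch(`${BASE_URL}/api/health/${name}`, {
        headers: { authorization: `Bearer ${process.env.CRON_SECRET}` },
      });
      if (!res.ok) failures.push(`${name}: HTTP ${res.status}`);
    } catch (err) {
      failures.push(`${name}: ${err instanceof Error ? err.message : 'request failed'}`);
    }
  }

  if (failures.length > 0) {
    const resend = new Resend(process.env.RESEND_API_KEY);
    await resend.emails.send({
      from: 'qa@example.com',           // illustrative sender
      to: 'engineering@example.com',    // illustrative recipient
      subject: `Health check failed: ${failures.length} endpoint(s)`,
      text: failures.join('\n'),
    });
    process.exit(1); // fail the workflow run so it shows red in Actions
  }
}

main().catch((err) => {
  console.error(err);
  process.exit(1);
});
```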
What does it cost and what is the ROI?
The nightly QA system runs on GitHub Actions' standard Ubuntu runners. The full 15-group suite completes in 18-22 minutes. At GitHub's billing rate of $0.008 per minute for Linux runners, the nightly cost is approximately $0.14-$0.18. Adding the 6-hourly health checks (4 runs per day, ~2 minutes each), total daily compute cost is roughly $0.22. Monthly compute cost: approximately $6.60.
In the first 30 days, the system caught 3 production-blocking issues:
- Stripe field rename -- would have caused incorrect payment records for every client billed during the gap. Estimated impact: 15-40 affected transactions, potential chargeback risk.
- Database constraint violation -- would have blocked 3 known users (and any future multi-employer filers) from uploading documents. Estimated support cost: 2-4 hours of debugging per affected user.
- Extraction accuracy drift -- would have degraded accuracy for all handwritten document uploads. At our volume, approximately 30% of documents are handwritten. Estimated impact: 200+ documents processed at 89.7% instead of 96.2% accuracy.
Conservative estimate: preventing these three issues saved 20-40 hours of incident response, support, and remediation time. At an engineering cost of $100-150/hour, that is $2,000-$6,000 in saved incident cost against $6.60 in compute. The ROI is not a close call.
How should you design your own nightly QA system?
What should you test first?
Start with your external dependencies. Every mock in your CI pipeline is a lie you are telling yourself about the outside world. List every service you mock in CI tests -- payment processors, email providers, AI model APIs, storage services, authentication providers. Write one integration test per service that makes a real API call and validates the response. That is your first nightly group.
How should you organize test groups?
Group tests by failure domain, not by feature. "Stripe tests" is a better group than "billing feature tests" because when Stripe changes something, you want to know which Stripe-specific tests failed, not which features broke. Feature-level impact analysis comes after you know the root cause.
How should you handle test data?
Use production-shaped data, not production data. We maintain a set of 50 ground-truth documents with known correct extraction results. We generate synthetic users with realistic edge cases (multiple employers, amended returns, non-standard filing statuses). The data is complex enough to catch real bugs but contains no PII. Property-based testing with fast-check fills the gaps by generating thousands of structurally valid but unexpected inputs.
How should you handle failures?
Every test result gets written to a database with a link to the GitHub Actions run. The daily summary email includes: which groups failed, which specific tests failed, the error message, and a direct link to the run logs. We do not alert on every failure in real time -- that leads to alert fatigue. A daily digest at 8 AM gives the team a clean morning review ritual. If something is urgent enough to need real-time alerting, it belongs in synthetic monitoring, not nightly QA.
Design philosophy: CI catches code bugs. Nightly QA catches integration bugs. They are complementary, not competing. Do not try to make your nightly QA fast -- it runs while you sleep. Optimize for coverage and realism instead of speed.
What are the common mistakes when building nightly QA?
- Testing too much in CI, too little at night. If a test requires mocking an external service, it is a candidate for nightly QA where you can test against the real service. Move it.
- Alerting on every failure immediately. Nightly QA will have flaky tests, especially network-dependent ones. Batch failures into a daily digest. Investigate patterns, not individual failures.
- Not pinning test environments. If your nightly tests run against staging and someone deploys to staging at 1:55 AM, your tests run against a mid-deploy state. Pin the environment or use a dedicated test environment.
- Skipping the database write. If test results only exist in GitHub Actions logs, nobody will look at trends over time. Write results to a database. Query for regression patterns weekly.
- Ignoring the cost ceiling. Nightly QA should cost under $1/night for most teams. If your suite exceeds 30 minutes, you are either testing too much (move some tests to weekly) or running tests sequentially that could run in parallel.
What is the long-term value of nightly QA?
Beyond catching individual bugs, the nightly QA system produces a longitudinal dataset. After 30 days, we can answer questions that no other system can: Which external dependencies are most unstable? (Answer: the extraction model provider, with 4 accuracy fluctuations in 30 days.) Which test groups fail most often? (Answer: email delivery, due to Resend rate limiting on the test account.) Which failures correlate? (Answer: Stripe webhook failures always precede database constraint failures by exactly one day, because the webhook handler creates the records that hit the constraints.)
This data shapes architectural decisions. The extraction provider's instability led us to add model version pinning. The email delivery flakiness led us to add retry logic with exponential backoff. The Stripe-database correlation led us to add a pre-insert validation step in the webhook handler.
After 90 days of nightly QA data, you will know more about your system's integration health than most teams learn in a year of production incidents. That knowledge compounds. Each bug you catch at 2 AM is a bug your users never see at 2 PM.
Frequently Asked Questions
How is nightly QA different from integration tests in CI?
CI integration tests typically mock external services and use synthetic data. Nightly QA tests against real (test-mode) external services with production-shaped data. CI tells you if your code is correct. Nightly QA tells you if your code still works correctly with everything it depends on. Both are necessary; neither is sufficient alone.
What GitHub Actions runner should nightly QA use?
Standard Ubuntu runners are sufficient for most nightly QA suites. Our 35-file, 15-group suite completes in under 22 minutes on ubuntu-latest. Larger runners are only justified if your suite exceeds 30 minutes and parallelization is not an option. At $0.008/minute for Linux runners, cost is rarely the bottleneck -- suite design is.
How do you prevent nightly QA tests from becoming flaky?
Three rules: add explicit timeouts on every network call (we use 15 seconds), retry transient failures once before marking as failed, and separate true failures from infrastructure noise in the daily digest. If a test fails more than 3 times in 7 days on infrastructure issues alone, rewrite it or move it to a weekly cadence.
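As one concrete example of the first two rules, a small wrapper along these lines (illustrative, not our exact helper) bounds every network call at 15 seconds and retries once before letting a test fail:

```typescript
// Illustrative helper: 15-second timeout on every network call, one retry
// for transient failures before a test is allowed to fail
async function withTimeoutAndRetry<T>(fn: () => Promise<T>, timeoutMs = 15_000): Promise<T> {
  const attempt = () =>
    Promise.race([
      fn(),
      new Promise<never>((_, reject) =>
        setTimeout(() => reject(new Error(`timed out after ${timeoutMs}ms`)), timeoutMs)
      ),
    ]);

  try {
    return await attempt();
  } catch {
    // One retry absorbs transient network noise; a second failure is a real failure
    return await attempt();
  }
}

// Usage inside a nightly test:
// const event = await withTimeoutAndRetry(() => waitForWebhook('invoice.payment_succeeded'));
```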
Should nightly QA run against staging or production?
Run write-heavy tests (database inserts, document uploads) against staging. Run read-only smoke tests (health checks, schema validation) against production. Never run destructive tests against production. Our split is roughly 80% staging, 20% production smoke.
What is the minimum viable nightly QA system?
One GitHub Actions workflow with a cron trigger, one test file per external dependency that makes a real API call and validates the response, and an email notification on failure. You can build this in under 2 hours. Start with your payment processor and your most critical third-party API. Expand from there based on what breaks.
Dinesh Challa is an AI Product Manager building production software with Claude Code. Follow him on LinkedIn.
Published April 11, 2026. Part of a series on AI testing and quality systems in production AI applications.