How I Built an Autonomous QA System That Tests My App While I Sleep
An autonomous QA system is a set of AI-powered tests that run on a schedule — nightly or on every pull request — without human intervention. It tests your app's forms, APIs, security headers, accessibility, and performance, then emails you a pass/fail report by morning. This guide shows how to build one for under $15/month using Claude Code, Playwright, and GitHub Actions.
Last updated: March 31, 2026
Last week, a user reported they couldn't save their profile. The bug? A form component accepted 9 digits for a field, but the API validated it as exactly 4. Both sides worked perfectly in isolation. The gap between them was invisible — until a real user hit it.
That bug made me ask: how do modern teams catch these before users do? According to a 2025 Forrester Wave report, 72.8% of engineering leaders now list autonomous testing as a top priority. So I built an autonomous QA system of my own. Here's exactly how.
What Is a Form-API Contract Mismatch?
A form-API contract mismatch occurs when a frontend form sends data in a format that the backend API rejects — even though both sides pass their own tests independently. In our case, the form collected 9 digits and sent them in a field the API validated as exactly 4 digits. The API returned a 400 error. The user saw "Failed to save." Neither unit test caught it.
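The mismatch is easy to see in a stripped-down sketch (plain TypeScript regex validators standing in for the real form component and Zod schema; the field and variable names are illustrative):

```typescript
// Frontend: the form accepts any 9-digit value.
const formAccepts = (input: string): boolean => /^\d{9}$/.test(input);

// Backend: the API field only allows exactly 4 digits.
const apiAccepts = (input: string): boolean => /^\d{4}$/.test(input);

// A value the form happily submits...
const userInput = "123456789";
formAccepts(userInput); // true — the form's own tests pass
apiAccepts(userInput);  // false — the API returns 400
```

Both validators are "correct" in isolation; the contract between them is what no unit test ever exercised.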
According to Meta's engineering blog, these integration-layer bugs account for a significant percentage of production incidents that slip past traditional test suites.
How Does the 5-Layer Architecture Work?
The system has five layers, each catching a different class of bug:
| Layer | What It Catches | When It Runs | Cost |
|---|---|---|---|
| Shared Schema Primitives | Form-API format mismatches | Build time | $0 |
| Contract Auditor | Enum gaps, missing fields, validation drift | Every PR | $0 |
| Nightly E2E on Staging | Broken forms, slow APIs, UI regressions | 2 AM UTC nightly | $0 |
| Prod Smoke Tests | Pages down, env var issues, expired keys | 2 AM UTC nightly | $0 |
| Fuzz + Chaos Tests | Edge-case inputs, service outages | Every PR | $0 |
Layer 1: How Do Shared Schema Primitives Prevent Bugs?
The root cause of our bug was that the form and API defined validation rules independently. The fix: create a single source-of-truth file that both import from. A shared helper function routes input to the correct API field based on its length. The mismatch becomes structurally impossible.
This pattern eliminated an entire class of bugs with about 30 lines of code.
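A minimal sketch of the pattern in plain TypeScript (the field names and the 4-vs-9-digit rules are illustrative; the real implementation defines the rules as Zod schemas that both the form and the API import):

```typescript
// schema-primitives.ts — the single source of truth both sides import.
const FIELD_RULES = {
  shortCode: /^\d{4}$/, // API field that takes exactly 4 digits
  longCode: /^\d{9}$/,  // API field that takes exactly 9 digits
} as const;

type FieldName = keyof typeof FIELD_RULES;

// Shared helper: route raw input to the correct API field by format.
// The form calls this before building its payload, so a 9-digit value
// can never land in the 4-digit field.
function routeDigits(input: string): { field: FieldName; value: string } {
  for (const field of Object.keys(FIELD_RULES) as FieldName[]) {
    if (FIELD_RULES[field].test(input)) return { field, value: input };
  }
  throw new Error(`No API field accepts "${input}"`);
}
```

Because both sides validate against FIELD_RULES, one side drifting without the other becomes a build- or test-time failure rather than a production 400.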
Layer 2: What Does the Contract Auditor Check?
A Vitest guard test (25 assertions) that statically reads form component source code and compares it against API Zod schemas. It runs in under 1 second and checks:
- Does every form that handles formatted fields use the shared routing helper?
- Do hardcoded enum values in forms match the schema's enum constraints?
- Does the form's mutation payload include all required schema fields?
- Are there raw type casts that bypass validation?
It runs as part of npm test, gating every PR automatically with zero extra infrastructure.
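One of those assertions, reduced to plain TypeScript (the real suite is 25 Vitest assertions against files read from disk; the inline source snippet and helper name here are hypothetical):

```typescript
// Static checks over form component source. In the real auditor these run
// as Vitest assertions and compare against the API's Zod schemas.
const formSource = `
  import { routeDigits } from "../schema-primitives";
  const payload = routeDigits(values.code);
`;

// Check 1: forms handling formatted fields must use the shared routing helper.
const usesSharedHelper = /routeDigits\(/.test(formSource);

// Check 2: no raw casts that bypass validation.
const hasRawCast = /as any/.test(formSource);

if (!usesSharedHelper) throw new Error("form bypasses the shared routing helper");
if (hasRawCast) throw new Error("form contains a validation-bypassing cast");
```

Because the checks are plain string and regex matching over source files, they run in well under a second with no browser or server involved.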
Layer 3: What Do the Nightly Staging Tests Cover?
Every night at 2 AM UTC, a GitHub Actions workflow seeds a test user, logs into staging, and runs 11 E2E specs covering every form. Each test:
- Fills the form using accessibility-first selectors (getByLabel())
- Clicks Save and intercepts the API response
- Asserts HTTP 200 (not 400 validation error)
- Asserts API response time under 3 seconds (catches backend regressions)
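The per-test assertion logic, pulled out as a plain function (the real specs run inside Playwright and intercept the response with page.waitForResponse(); the 200 and 3-second thresholds come from the list above):

```typescript
// What each E2E spec asserts after intercepting the save request.
interface ApiResult {
  status: number;     // HTTP status of the intercepted response
  durationMs: number; // time from request to response
}

function assertSaveSucceeded({ status, durationMs }: ApiResult): string[] {
  const failures: string[] = [];
  // A 400 here means validation drift between form and API.
  if (status !== 200) failures.push(`expected 200, got ${status}`);
  // A slow response means a backend regression, even if the save worked.
  if (durationMs >= 3000) failures.push(`API took ${durationMs}ms (budget 3000ms)`);
  return failures;
}
```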
The key architectural decision: 5 grouped sequential runners instead of 15 parallel ones. This keeps GitHub Actions at ~900 min/month (within the 2,000-min free tier) and prevents race conditions on shared test data.
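The schedule and grouping might look like this in the workflow file (a hedged sketch: job names, group names, spec paths, and secrets are hypothetical, and running groups one at a time is one way to read "grouped sequential"):

```yaml
# .github/workflows/nightly-e2e.yml
on:
  schedule:
    - cron: "0 2 * * *"   # 2 AM UTC nightly
jobs:
  e2e:
    runs-on: ubuntu-latest
    strategy:
      max-parallel: 1      # run groups one at a time: no races on shared test data
      matrix:
        group: [auth, profile, billing, content, admin]  # 5 groups, not 15 parallel jobs
    steps:
      - uses: actions/checkout@v4
      - run: npm ci
      - run: npx playwright test tests/e2e/${{ matrix.group }}
        env:
          STAGING_URL: ${{ secrets.STAGING_URL }}
```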
Layer 4: Why Test Production If You Already Test Staging?
Staging and production can diverge. Different environment variables, different API keys, different migration state. Prod smoke tests are read-only:
- 12 public pages load with HTTP 200
- 15 authenticated pages load without errors
- API health endpoints respond (database, cache, payments, email)
- 6 security headers present on every page
- CORS rejects wildcard origins
- Webhook endpoints reject invalid signatures (400, not 500)
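The header check, reduced to plain TypeScript (the article doesn't name the six headers, so the list below is my assumption of a typical set; the real tests run in Playwright against live pages):

```typescript
// Six security headers commonly required on every page (assumed list).
const REQUIRED_HEADERS = [
  "strict-transport-security",
  "content-security-policy",
  "x-content-type-options",
  "x-frame-options",
  "referrer-policy",
  "permissions-policy",
];

// Given response headers (lower-cased names), report which are missing.
function missingSecurityHeaders(headers: Record<string, string>): string[] {
  return REQUIRED_HEADERS.filter((h) => !(h in headers));
}
```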
Layer 5: What Are Fuzz and Chaos Tests?
Fuzz testing uses fast-check to generate thousands of random payloads for every API schema. 50 fuzz tests run in under a second. If the schema says 4 digits, fast-check tries 0, 3, 5 digits, emoji, Unicode, null, and undefined.
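fast-check generates the inputs for real; a hand-rolled sketch of the same property (the 4-digit rule comes from the article, the validator and generator are illustrative):

```typescript
// The schema under test: exactly 4 digits.
const isValid = (input: unknown): boolean =>
  typeof input === "string" && /^\d{4}$/.test(input);

// Crude generator standing in for fast-check's arbitraries.
function randomPayload(seed: number): unknown {
  const pool: unknown[] = ["", "123", "12345", "1234", "😀😀😀😀", null, undefined, "abcd"];
  return pool[seed % pool.length];
}

// Property: the validator never throws, whatever it is fed,
// and only the exact 4-digit string passes.
for (let i = 0; i < 1000; i++) {
  const payload = randomPayload(i);
  const result = isValid(payload); // must not throw
  if (result && payload !== "1234") throw new Error(`false positive: ${String(payload)}`);
}
```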
Chaos testing mocks each external service at the module boundary and verifies the app degrades gracefully — JSON errors, not crashes. 12 chaos tests verify a database outage doesn't crash your checkout flow.
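A sketch of the chaos pattern in plain TypeScript (the real tests mock modules at the boundary with the test runner's mocking tools; the handler and service shapes here are hypothetical):

```typescript
// The external service boundary — in the real tests this module is mocked.
type Db = { getCart: (userId: string) => { items: string[] } };

// Handler that must degrade gracefully when the database is down:
// a structured JSON error, never an unhandled crash.
function checkoutHandler(db: Db, userId: string): { status: number; body: object } {
  try {
    const cart = db.getCart(userId);
    return { status: 200, body: { items: cart.items } };
  } catch {
    return { status: 503, body: { error: "checkout temporarily unavailable" } };
  }
}

// Chaos: a db whose every call throws, simulating an outage.
const downDb: Db = {
  getCart: () => { throw new Error("connection refused"); },
};
```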
How Does the AI Layer Work?
Auto Test Generation on PRs
Every PR gets test suggestions as comments via anthropic/claude-code-action in GitHub Actions. Cost: ~$0.05-0.15 per PR, or $3-8 total at 50 PRs/month.
What Are QA Agent Slash Commands?
Inspired by OpenObserve's "Council of Sub Agents" pattern (which grew their suite from 380 to 700+ tests and cut flaky tests by 85%), I created 4 bounded AI agents:
| Command | Role | What It Does |
|---|---|---|
| /qa-analyze | Analyst | Maps workflows, extracts selectors, identifies edge cases |
| /qa-generate | Engineer | Writes Playwright specs from analyst output |
| /qa-sentinel | Sentinel | Audits tests for anti-patterns |
| /qa-heal | Healer | Auto-fixes failing tests, up to 5 iterations |
Key insight: bounded agents with clear roles work far better than one super agent.
What Does the Daily Email Look Like?
- Green days: Subject: "All passing" — one line, move on
- Red days: Subject includes failure names with clickable links to the run
Passing suites collapse to one line. Trend arrows show if errors are increasing.
What Does It Cost?
| Item | Monthly Cost |
|---|---|
| GitHub Actions (~900 min/month) | $0 (free tier) |
| Claude API (PR test gen + review) | $3-13 |
| All testing libraries | $0 (open source) |
| Daily email | $0 (free tier) |
| Total | $3-13/month |
Results After Week 1
- 112 guard tests on every PR (25 contract + 50 fuzz + 12 chaos + 25 existing)
- 20 E2E specs nightly across staging and production
- 6 security tests for headers, CORS, and webhook signatures
- Accessibility auditing on 6 pages (WCAG 2.1, ratchet pattern)
- Performance budgets via Lighthouse CI (LCP < 2.5s, CLS < 0.1)
What Would I Do Differently?
Start with the contract auditor. It took 30 minutes to build and catches the highest-impact bugs. E2E tests are 10x more effort.
Group runners from day one. Fifteen parallel runners blew through the GitHub Actions free tier instantly. Five grouped runners are the sweet spot.
Use the ratchet pattern. Don't fix all accessibility violations at once. Set a ceiling, prevent regression, tighten over time.
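The ratchet reduces to one comparison (a sketch; the ceiling value is illustrative):

```typescript
// Current known violation count — the "ceiling". New code may not exceed it;
// when the real count drops, lower the ceiling to lock in the win.
const VIOLATION_CEILING = 14;

function checkRatchet(currentViolations: number): void {
  if (currentViolations > VIOLATION_CEILING) {
    throw new Error(
      `Accessibility regressed: ${currentViolations} violations (ceiling ${VIOLATION_CEILING})`,
    );
  }
  // Passing with room to spare? Time to tighten VIOLATION_CEILING.
}
```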
Frequently Asked Questions
How long does it take to build?
The minimum viable version (shared schemas + contract auditor + nightly smoke) takes 2-3 hours. The full system takes 2-3 focused sessions.
Does this work for any tech stack?
The principles are universal. The implementation uses Zod, Playwright, and Vitest, but the contract auditor pattern works with any schema library.
Is the AI test generation accurate?
About 70-80% of Claude's test suggestions are directly usable. At $0.05-0.15 per PR, the ROI is strong even at 50% accuracy.
How do you handle flaky tests?
The /qa-heal command auto-diagnoses and fixes failing tests. The /qa-sentinel audits for anti-patterns. OpenObserve reduced flaky tests by 85% with this approach.
How do you handle test data?
A seed script creates a test user with known state before each nightly run. Staging tests write real data using fake credentials. Production tests are strictly read-only.
The whole thing costs less than a coffee per month. Start with: shared schemas + contract auditor + nightly smoke tests.