Prompt Engineering Is Product Design: A PM's Framework
By Dinesh · September 22, 2023 · 13 min read
Last updated: September 2023
Prompt engineering is not an engineering task. It is product design. The prompt is the interface between your user's intent and the model's capability, and getting that interface right determines whether your AI product feels magical or broken. At an AI-first tax platform, I managed prompt design across 25,500 AI-assisted interactions, treating each prompt as a product specification with version control, quality metrics, and A/B testing. Here is the framework that made it work.
The industry treats prompt engineering as a technical skill sitting somewhere between software engineering and data science. That framing is wrong, and it leads to the wrong outcomes. When engineers own prompts, they optimize for technical metrics like token count and response latency. When PMs own prompts, they optimize for the metric that actually matters: did the user get what they needed? A 2023 survey by Retool found that 67% of companies building with LLMs had no formal process for managing prompts. Most were ad hoc, undocumented, and untested. That is not an engineering failure. It is a product management failure.
What Is the Prompt-as-Product Framework?
The framework treats every prompt as a product feature with the same rigor you would apply to a UI component, an API endpoint, or a pricing page. It has five pillars:
- Specification: Every prompt has a written spec defining its purpose, inputs, expected output format, success criteria, and failure modes.
- Version control: Every prompt change is tracked with a version number, a changelog, and a rollback plan.
- Quality measurement: Every prompt is measured on accuracy, completeness, tone, and user satisfaction, not just "does it work."
- Testing: Every prompt change is A/B tested against the current version before full deployment.
- Ownership: The PM owns the prompt spec. Engineering owns the integration. Neither can change the other's domain unilaterally.
This is not overhead. It is the minimum viable process for a product component that directly shapes 100% of your AI user experience. If you would not ship a UI redesign without a spec and testing, you should not ship a prompt change without them either.
How Do You Write a Prompt Specification?
A prompt spec is not a prompt. It is the document that defines what the prompt should achieve, how it should behave, and how you will know if it is working. We used a template with seven fields:
| Field | Description | Example |
|---|---|---|
| Purpose | What user need does this prompt serve? | Answer tax deduction eligibility questions |
| Inputs | What context does the prompt receive? | User filing type, income range, state, question text |
| Output Format | What should the response look like? | Direct answer + confidence tier + one supporting citation |
| Tone | How should it sound? | Professional, direct, no hedging on known facts |
| Boundaries | What should the prompt refuse to do? | No specific dollar amount advice, no legal opinions |
| Success Criteria | How do you measure quality? | >90% accuracy on known Q&A pairs, >4.5/5 user rating |
| Failure Modes | Known ways the prompt can go wrong | Hallucinated IRS rules, overly cautious refusal to answer |
We maintained 14 active prompt specs across the platform, each mapped to a distinct user-facing feature. The spec was the contract between PM and engineering. I defined the "what" and "why." Engineering defined the "how," which included the actual prompt text, model selection, temperature settings, and retrieval-augmented generation (RAG) configuration.
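A spec like this can live as structured data next to the prompt text, so it is diffable and reviewable in the same PR. Here is a minimal sketch of the seven-field template as a Python dataclass; the field names mirror the table above, but the class and example values are illustrative, not the actual production schema:

```python
from dataclasses import dataclass

@dataclass
class PromptSpec:
    """One spec per user-facing feature. The PM owns this file;
    engineering owns the prompt text it governs."""
    purpose: str                        # what user need this serves
    inputs: list[str]                   # context the prompt receives
    output_format: str                  # what the response should look like
    tone: str                           # how it should sound
    boundaries: list[str]               # what it must refuse to do
    success_criteria: dict[str, float]  # metric name -> minimum threshold
    failure_modes: list[str]            # known ways it can go wrong

# Example spec matching the table above (values illustrative)
deduction_qa = PromptSpec(
    purpose="Answer tax deduction eligibility questions",
    inputs=["filing_type", "income_range", "state", "question_text"],
    output_format="Direct answer + confidence tier + one supporting citation",
    tone="Professional, direct, no hedging on known facts",
    boundaries=["No specific dollar amount advice", "No legal opinions"],
    success_criteria={"accuracy": 0.90, "user_rating": 4.5},
    failure_modes=["Hallucinated IRS rules", "Overly cautious refusals"],
)
```

Keeping the spec machine-readable also lets the eval pipeline read its success criteria directly instead of duplicating thresholds in a separate config.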
How Does Version Control for Prompts Work?
Prompts are code that runs on someone else's computer (the model provider's). But they still need version control, and git is not enough because git tracks text changes, not behavioral changes. A single-word change to a prompt can shift output quality dramatically.
We implemented a three-layer version control system:
- Text versioning (git): The prompt text itself lived in the repository, tracked like any other source file. Every change had a commit, a PR, and a reviewer. We used semantic versioning: major version for behavior changes, minor for output format changes, patch for wording refinements.
- Behavior versioning (eval suite): Each prompt had an evaluation suite of 30-50 test cases. A "behavior version" changed only when eval results changed. This caught cases where a text change had no behavioral impact (and vice versa, where a model update changed behavior without any prompt change).
- Performance versioning (production metrics): We tracked real-world performance metrics per prompt version: accuracy, user satisfaction, escalation rate, and response time. This let us correlate text changes with outcome changes across thousands of real interactions.
The behavior versioning layer was the breakthrough. It solved the problem of "we changed the prompt and nothing seemed different" versus "we didn't change the prompt but something is different." Both happen regularly with LLMs, and without eval-based versioning, you are flying blind.
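The core of behavior versioning can be sketched in a few lines: fingerprint the eval suite's scores and bump the behavior version only when that fingerprint changes. This is an illustrative sketch, not the actual implementation; rounding the scores before hashing is one simple way to absorb run-to-run noise from non-deterministic outputs:

```python
import hashlib
import json

def eval_fingerprint(eval_results: dict[str, float]) -> str:
    """Hash the eval suite's per-metric scores, rounded so that
    small sampling noise does not register as a behavior change."""
    rounded = {k: round(v, 2) for k, v in eval_results.items()}
    payload = json.dumps(rounded, sort_keys=True).encode()
    return hashlib.sha256(payload).hexdigest()[:12]

def next_behavior_version(current: int,
                          old_results: dict[str, float],
                          new_results: dict[str, float]) -> int:
    """Bump the behavior version only when eval outcomes actually shift.
    A text-only change that leaves evals unchanged keeps the same version."""
    if eval_fingerprint(old_results) != eval_fingerprint(new_results):
        return current + 1
    return current
```

Run this after every prompt edit and every model update: it distinguishes "text changed, behavior did not" from "text unchanged, behavior drifted," which is exactly the pair of cases git alone cannot separate.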
How Do You Measure Prompt Quality?
The quality of a prompt is the quality of its output, measured against the spec's success criteria. We used a four-dimension quality framework:
| Dimension | Metric | Target | Measurement Method |
|---|---|---|---|
| Accuracy | % of factually correct answers | >92% | Expert review of random sample (weekly) |
| Completeness | % of questions fully answered (no follow-up needed) | >85% | User follow-up rate (inverse) |
| Tone Adherence | % of responses matching spec tone | >95% | Automated classifier + spot checks |
| Safety | % of responses that stay within boundaries | >99.5% | Boundary violation detector (automated) |
We sampled 200 interactions per week per prompt for quality review. This was expensive in human time, roughly 6 hours per week for the review team. But it was the only way to catch quality degradation before users noticed. Automated metrics caught format and tone issues. Only human review caught factual accuracy issues reliably.
One critical finding: prompt quality degraded by 8-12% when the underlying model received an update, even without any prompt change. This happened three times during our first year. Each time, our behavior versioning detected the shift within 24 hours because the eval suite ran nightly. Without that monitoring, we would have discovered the degradation through user complaints days or weeks later.
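The nightly check described above reduces to comparing each quality dimension against a recorded baseline and flagging drops beyond a tolerance. A minimal sketch, assuming per-dimension scores in the 0-1 range and a simple fixed drop threshold (the function and values are illustrative):

```python
def detect_regression(baseline: dict[str, float],
                      nightly: dict[str, float],
                      max_drop: float = 0.02) -> list[str]:
    """Return the quality dimensions whose nightly eval score fell
    more than max_drop below the recorded baseline."""
    return [dim for dim, base in baseline.items()
            if nightly.get(dim, 0.0) < base - max_drop]

# Example: accuracy slips after a silent model update
flags = detect_regression(
    baseline={"accuracy": 0.93, "tone": 0.96, "safety": 0.997},
    nightly={"accuracy": 0.85, "tone": 0.96, "safety": 0.996},
)
# flags == ["accuracy"]  (safety dipped, but within tolerance)
```

Anything in `flags` goes to human review rather than triggering an automatic rollback; a flagged dimension sometimes reflects a shift in the eval inputs, not the prompt.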
How Do You A/B Test Prompts?
A/B testing prompts is harder than A/B testing UI changes because the output is non-deterministic. The same prompt with the same input produces different output each time (unless temperature is zero, which usually degrades quality). Here is the process we used:
- Define the hypothesis: "Changing the system prompt to include explicit output formatting instructions will increase completeness from 82% to 88% without reducing accuracy."
- Set up variants: Control (current prompt) and treatment (modified prompt), each receiving 50% of traffic for the specific feature.
- Run for sufficient volume: We required a minimum of 500 interactions per variant before evaluating results. At our volume, this typically took 5-7 days.
- Measure all four dimensions: Not just the target metric. A prompt change that improves completeness but degrades accuracy is a net negative.
- Human review of edge cases: Automated metrics cannot catch everything. We manually reviewed the bottom 10% of responses from each variant to understand failure mode differences.
- Ship or kill: If the treatment improved the target metric without degrading others, ship. If results were ambiguous, extend the test. We never shipped ambiguous results.
Over 10 months, we ran 27 prompt A/B tests. Fifteen shipped (56%), eight were killed, and four required iteration before shipping. The winning changes were often surprisingly small: adding a single constraint line, reordering instructions, or changing one word in the tone directive. The biggest single improvement came from adding "If you are not confident in your answer, say so explicitly" to a system prompt. That one sentence improved user satisfaction by 0.4 points and became a standard inclusion in every prompt spec. This directly connected to our broader trust design work, which I detail in designing for 4.7/5 satisfaction.
Who Should Own Prompt Engineering in an Organization?
This is the most contentious question in AI product teams. My answer, based on shipping real product: the PM owns the spec, engineering owns the implementation, and both own the outcome.
Here is how responsibilities split in practice:
- PM owns: prompt spec (purpose, tone, boundaries, success criteria), A/B test hypotheses, quality review cadence, escalation policy when quality drops.
- Engineering owns: prompt text, model selection, RAG pipeline, temperature and parameter tuning, eval suite implementation, deployment and rollback.
- Shared ownership: quality metrics review (weekly), A/B test analysis, failure mode investigation, model update impact assessment.
This split works because it mirrors how product and engineering collaborate on every other feature. The PM does not write the code for a checkout flow, but they define what the checkout flow should do, how to measure it, and what "good" looks like. Prompts should work the same way.
The anti-pattern I see most often: an engineer writes a prompt, it "works," and nobody ever revisits it. No spec, no metrics, no tests. Six months later, a model update degrades it silently, and the team discovers the problem when a user complains on Twitter. A 2023 analysis by Weights & Biases found that 41% of production prompts had never been formally tested. That number should be zero.
What Are the Most Common Prompt Design Mistakes PMs Make?
After reviewing hundreds of prompts across our platform and advising other teams, these are the five mistakes I see most often:
- Optimizing for the demo, not the distribution: A prompt that works beautifully on 10 cherry-picked examples may fail on the long tail of real user inputs. Test against the full distribution, not a curated set.
- Conflating length with quality: Longer prompts are not better prompts. Every instruction competes for the model's attention. We found that our best-performing prompts were 40-60% shorter than our first drafts.
- Ignoring failure modes: Every prompt will fail on some inputs. The spec should define what happens when it fails, not just what happens when it succeeds. See how we built trust in AI decisions for our approach to graceful failure.
- No boundary testing: If you have not tested what your prompt does when a user asks something outside its scope, you do not know what your product does in that situation. And users will find those edges faster than you expect.
- Treating prompts as static: A prompt is not "done" when it ships. It needs the same monitoring, iteration, and deprecation lifecycle as any other product feature.
The prompt is the product. If you manage it like a configuration file, you will get configuration-file-quality results. If you manage it like a core product feature, with specs, testing, metrics, and ownership, you will get results that justify the AI investment your company is making. The framework above is not theoretical. It powered 25,500 interactions at a real company with real users and real regulatory constraints. It works. The question is not whether your prompts need this rigor. The question is how much longer you can afford to operate without it.
Frequently Asked Questions
How do you handle prompt changes when the model provider updates the underlying model?
Every model update triggers an automatic eval suite run across all active prompts. If any prompt's eval scores drop by more than 2% on any dimension, it flags for human review. We had three model updates during our first year, and two of them required prompt adjustments. The third was benign. Without the automated eval pipeline, we would not have caught the regressions for days.
Is this framework overkill for a small team or early-stage startup?
Start with the spec and one metric. You can skip version control and A/B testing until you have enough volume to measure statistically. But the spec, even a lightweight one, is non-negotiable from day one. If you cannot write down what the prompt should do, you cannot tell if it is doing it. A spec takes 30 minutes. Debugging a bad prompt in production takes days.
Can you use GPT-4 or Claude to evaluate other prompts automatically?
Yes, with caveats. We used LLM-as-judge for tone adherence and format compliance, where it performed at 90%+ agreement with human reviewers. For factual accuracy, LLM-as-judge was unreliable, with only about 72% agreement with human reviewers in our domain. Use it for what it is good at, but do not eliminate human review for accuracy-critical prompts.
How do you get engineering buy-in for PM-owned prompt specs?
Frame it as reducing their burden, not increasing it. Engineers do not want to be responsible for whether a tax answer is correct. They want to focus on the infrastructure: latency, reliability, cost. The spec gives them a clear definition of "done" and protects them from scope ambiguity. In our experience, engineers were relieved when prompt quality ownership moved to product.