The Agentic AI Product Framework: Designing Systems That Act, Not Just Respond
January 15, 2025 · 17 min read · Product Framework
Agentic AI is the shift from systems that respond to prompts to systems that take real actions in the world. After building four AI systems at a YC-backed tax-tech startup -- including agents that processed 128,000 documents and matched 16,000 users with specialists autonomously -- I developed a framework for agentic product design. The core idea: every agent needs a position on the autonomy spectrum, from suggest to draft to act to decide, and getting that position wrong is the most expensive mistake in AI product management.
What is the difference between conversational AI and agentic AI?
Conversational AI answers questions. Agentic AI completes tasks. That single-sentence distinction drives a fundamentally different product architecture, user experience, and risk model.
A conversational system receives a prompt, generates a response, and waits for the next prompt. An agentic system receives a goal, decomposes it into sub-tasks, selects tools, executes actions, evaluates results, and iterates until the goal is met. According to a 2024 survey by Gartner, 67% of enterprise AI investments will shift from conversational to agentic architectures by 2027. The economic logic is straightforward: a chatbot answers a question in 3 seconds but requires a human to act on the answer. An agent completes the full task in 30 seconds with no human in the loop.
At a YC-backed tax-tech startup, I experienced this transition firsthand. We started with a conversational AI assistant -- users asked questions about their tax documents and received answers. Useful, but it only addressed 18% of user sessions. The other 82% were users who wanted the system to do something: classify a document, match them with an expert, generate a filing checklist. When we rebuilt around agentic patterns, task completion rates jumped from 18% to 73% in the first quarter. [LINK:post-36]
But agentic AI introduces a category of risk that conversational AI avoids entirely: the system takes actions that are difficult or impossible to undo. A chatbot that gives wrong advice is bad. An agent that files the wrong form is catastrophic. That asymmetry demands a framework.
What is the autonomy spectrum and why does it matter?
The autonomy spectrum is a four-level framework for deciding how much independence an AI agent should have for any given action. I developed it after watching three agentic products fail because they gave their agents too much autonomy too fast.
| Level | Mode | Agent Behavior | Human Role | Error Cost |
|---|---|---|---|---|
| 1 | Suggest | Recommends an action, waits for approval | Decides and executes | Low (human catches errors) |
| 2 | Draft | Produces a complete output, waits for review | Reviews and approves | Medium (review may miss details) |
| 3 | Act | Executes the action, notifies the human after | Monitors and can override | High (action already taken) |
| 4 | Decide | Executes without notification for routine cases | Reviews exceptions only | Very High (no immediate oversight) |
The mistake most teams make is treating autonomy level as a global setting. It is not. It is a per-action setting. A single agent might operate at Level 4 for document classification (low-stakes, easily reversible), Level 2 for tax form preparation (high-stakes, needs human review), and Level 1 for filing with the IRS (irreversible, regulated). According to a 2024 study published in the ACM Conference on Computer-Supported Cooperative Work, agentic systems that use variable autonomy levels per action type see 41% fewer critical errors than systems with uniform autonomy settings.
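The per-action principle can be made concrete in code. A minimal sketch (action names are hypothetical, chosen to mirror the examples above):

```python
from enum import IntEnum

class Autonomy(IntEnum):
    SUGGEST = 1  # recommends an action, human decides and executes
    DRAFT = 2    # produces a complete output, human reviews and approves
    ACT = 3      # executes the action, then notifies the human
    DECIDE = 4   # executes routine cases silently; humans review exceptions

# Autonomy is a per-action setting, not a global one. A single agent
# mixes levels depending on stakes and reversibility.
AUTONOMY_BY_ACTION = {
    "classify_document": Autonomy.DECIDE,  # low-stakes, easily reversible
    "prepare_tax_form": Autonomy.DRAFT,    # high-stakes, needs human review
    "file_with_irs": Autonomy.SUGGEST,     # irreversible, regulated
}

def requires_human_approval(action: str) -> bool:
    """True if a human must approve before the action runs (Levels 1-2)."""
    return AUTONOMY_BY_ACTION[action] <= Autonomy.DRAFT
```

The key design property is that the lookup happens per action, so tightening one action's level never touches the others.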
How do you decide which autonomy level to assign?
I use a 2x2 matrix with two axes: reversibility and consequence severity.
- Reversible + Low consequence: Level 4 (Decide). Example: categorizing an uploaded document into a folder. If wrong, re-categorize. No harm done.
- Reversible + High consequence: Level 3 (Act). Example: sending a notification to a user about a missing document. The message is sent, but you can follow up with a correction.
- Irreversible + Low consequence: Level 2 (Draft). Example: generating a summary of extracted data for internal use. No user impact, but the data persists in the system.
- Irreversible + High consequence: Level 1 (Suggest). Example: determining whether a user qualifies for a specific tax deduction. Wrong answer has financial and legal implications.
At scale, this matrix saved us repeatedly. When we onboarded our document classification agent, it operated at Level 4 for 12 of 14 document types and Level 2 for the remaining 2 (foreign tax documents and amended returns). Those two types had a 23% higher error rate in testing, so they required human review. According to our internal metrics, this variable approach caught 94% of classification errors while keeping 86% of documents fully automated.
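The 2x2 matrix reduces to a small decision function. A sketch, directly encoding the four quadrants above:

```python
def autonomy_level(reversible: bool, high_consequence: bool) -> int:
    """Map the reversibility/consequence 2x2 matrix to an autonomy level (1-4)."""
    if reversible and not high_consequence:
        return 4  # Decide: e.g., categorizing a document into a folder
    if reversible and high_consequence:
        return 3  # Act: e.g., sending a correctable notification
    if not reversible and not high_consequence:
        return 2  # Draft: e.g., generating a persistent internal summary
    return 1      # Suggest: e.g., a deduction eligibility determination
```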
How does tool use architecture work in agentic systems?
An agent without tools is just a chatbot with ambition. Tool use -- the ability for an AI model to call external functions, APIs, and services -- is what transforms a language model into an agent. The architecture has three layers.
Layer 1: Tool definition
Every tool an agent can use needs a structured definition: what it does, what inputs it requires, what outputs it produces, and what side effects it has. According to research published by Anthropic in 2024, models with well-structured tool definitions show 34% higher task completion rates than models with loosely defined tools. The definition is not just documentation -- it is a contract that the model uses to decide when and how to invoke the tool.
At our startup, we defined 23 tools across our four AI systems. Each definition included explicit preconditions (what must be true before the tool is called), postconditions (what will be true after), and failure modes (what happens when the tool fails). This level of rigor was painful to build but essential for reliability. [LINK:post-38]
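A tool definition structured as a contract might look like the following sketch (the field names and the example tool are illustrative, not the startup's actual schema):

```python
from dataclasses import dataclass, field

@dataclass
class ToolDefinition:
    """A tool contract the model uses to decide when and how to invoke the tool."""
    name: str
    description: str        # what it does, in model-readable language
    input_schema: dict      # required inputs (JSON-schema style)
    side_effects: str       # what changes in the world when it runs
    preconditions: list[str] = field(default_factory=list)   # must hold before
    postconditions: list[str] = field(default_factory=list)  # will hold after
    failure_modes: list[str] = field(default_factory=list)   # behavior on failure

# Hypothetical example in the spirit of the document classification agent:
classify_doc = ToolDefinition(
    name="classify_document",
    description="Assign an uploaded document to one of 14 tax document types.",
    input_schema={"document_id": "string"},
    side_effects="Writes the predicted type and confidence to the document record.",
    preconditions=["document has been uploaded and OCR has completed"],
    postconditions=["document record carries a type and a confidence score"],
    failure_modes=["OCR text too short: returns type=UNKNOWN, confidence=0.0"],
)
```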
Layer 2: Tool selection
Given a user goal, the agent needs to decide which tools to use, in what order. This is where function calling comes in. The model receives the user's goal and the set of available tools, then generates a structured function call. The critical design decision is whether tool selection is single-step (model picks one tool) or multi-step (model plans a sequence of tool calls).
Multi-step planning is more powerful but introduces a compounding error problem. If each step has a 95% success rate, a 5-step plan has a 77% success rate. A 10-step plan drops to 60%. We solved this with checkpoint validation: after every 2-3 tool calls, the agent evaluates whether the intermediate state is consistent with the goal. If not, it replans from the current state rather than continuing a failing plan.
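The compounding math and the checkpoint pattern can both be sketched briefly (the callables passed to the loop are hypothetical stand-ins for a real planner and executor):

```python
def plan_success_rate(step_success: float, steps: int) -> float:
    """Independent steps compound: P(plan succeeds) = p ** n."""
    return step_success ** steps

# plan_success_rate(0.95, 5) is about 0.77; plan_success_rate(0.95, 10) about 0.60.

def run_with_checkpoints(plan, execute, state_is_consistent, replan, every=3):
    """Checkpoint validation: after every `every` tool calls, verify the
    intermediate state; if inconsistent, replan from the current state
    instead of continuing a failing plan."""
    steps = list(plan)
    i = 0
    while i < len(steps):
        execute(steps[i])
        i += 1
        if i % every == 0 and not state_is_consistent():
            steps = steps[:i] + list(replan())  # keep done steps, replace the rest
    return steps
```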
Layer 3: Tool execution and observation
After the model generates a function call, the system executes it and returns the result to the model. The model then decides whether the goal is met, whether to call another tool, or whether to ask the user for help. This observe-act loop is the core runtime of any agentic system.
Design principle: Every tool execution should return enough context for the model to evaluate success or failure. Returning "OK" is not enough. Returning "Document classified as W-2, confidence 0.94, matched fields: employer_name, wages, federal_withholding" gives the model the information it needs to decide next steps.
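That principle translates into a result type with structure, not a bare status string. A sketch (field names are illustrative):

```python
from dataclasses import dataclass

@dataclass
class ToolResult:
    """A tool result rich enough for the model to judge success or failure."""
    ok: bool
    summary: str   # what happened, in model-readable language
    data: dict     # structured payload for the next step

# Instead of returning "OK", return the context the model needs:
result = ToolResult(
    ok=True,
    summary="Document classified as W-2, confidence 0.94",
    data={
        "doc_type": "W-2",
        "confidence": 0.94,
        "matched_fields": ["employer_name", "wages", "federal_withholding"],
    },
)
```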
What are the guardrails that keep agentic systems safe?
Guardrails are constraints that prevent an agent from taking actions outside its intended scope. Without them, agentic systems are dangerous. According to a 2024 NIST report on AI safety, 78% of agentic AI incidents in production environments were caused by agents taking actions outside their intended scope, not by the actions themselves being executed incorrectly.
I use five categories of guardrails:
- Scope boundaries: Explicit definitions of what the agent can and cannot do. Our document classification agent could classify documents and request re-uploads. It could not delete documents, modify extracted data, or contact users directly. These boundaries were enforced at the tool level -- the agent literally did not have access to tools outside its scope.
- Rate limits: Maximum actions per time window. Our expert matching agent could assign up to 50 users per hour. If it attempted more, the system queued the overflow for human review. This caught a runaway loop early in deployment where a bug caused the agent to reassign the same user repeatedly.
- Confidence thresholds: Minimum confidence scores required for autonomous action. Below the threshold, the agent escalated to a human. We set different thresholds per autonomy level: Level 4 actions required 0.85 confidence, Level 3 required 0.90, and Level 2 outputs included the confidence score for the human reviewer. [LINK:post-20]
- Rollback mechanisms: Every Level 3 and Level 4 action had an automated rollback path. If the system detected anomalous outcomes within 5 minutes of an action (e.g., user immediately re-uploading a document that was just classified), it triggered a rollback and escalated to human review.
- Audit trails: Every agent action was logged with the full decision context: what the agent observed, what options it considered, why it chose the action it did, and what the outcome was. This was not just for compliance -- it was the primary debugging tool when things went wrong.
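Two of these guardrails lend themselves to compact sketches: a sliding-window rate limit that queues overflow for human review, and per-level confidence thresholds. The class and function names are hypothetical; the thresholds are the ones stated above.

```python
import time
from collections import deque

class RateLimitGuardrail:
    """Sliding-window rate limit: actions over the limit go to human review."""
    def __init__(self, max_actions: int, window_seconds: float):
        self.max_actions = max_actions
        self.window = window_seconds
        self.timestamps: deque = deque()
        self.review_queue: list = []

    def allow(self, action, now=None) -> bool:
        now = time.monotonic() if now is None else now
        # Drop timestamps that have fallen outside the window.
        while self.timestamps and now - self.timestamps[0] > self.window:
            self.timestamps.popleft()
        if len(self.timestamps) >= self.max_actions:
            self.review_queue.append(action)  # overflow queued for humans
            return False
        self.timestamps.append(now)
        return True

# Confidence floors per autonomy level, as described in the text.
# Levels 1-2 never act autonomously, so they have no floor here.
MIN_CONFIDENCE = {4: 0.85, 3: 0.90}

def may_act_autonomously(level: int, confidence: float) -> bool:
    return confidence >= MIN_CONFIDENCE.get(level, float("inf"))
```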
How do you measure success for agentic AI products?
Traditional product metrics do not capture what matters in agentic systems. Task completion rate is necessary but insufficient. You also need to measure autonomy rate (percentage of tasks completed without human intervention), escalation quality (how often human escalations were actually necessary), and recovery rate (how often the system self-corrected after an error).
| Metric | What It Measures | Our Target | Our Actual |
|---|---|---|---|
| Task Completion Rate | % of goals fully achieved | 85% | 89% |
| Autonomy Rate | % completed without human help | 70% | 74% |
| Escalation Precision | % of escalations that needed humans | 80% | 83% |
| Mean Time to Task | Average seconds from goal to completion | 45s | 38s |
| Recovery Rate | % of errors self-corrected by agent | 60% | 67% |
| User Trust Score | Post-task survey (1-5 scale) | 4.2 | 4.5 |
The metric that surprised me most was Escalation Precision. Early on, our agents escalated too aggressively -- only 52% of escalations actually required a human. That created a hidden cost: human reviewers became desensitized to escalations because most were unnecessary, and they started rubber-stamping reviews. We tuned confidence thresholds upward gradually, improving precision from 52% to 83% over four months.
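The three non-standard metrics are simple ratios, which makes them easy to instrument. A sketch of the definitions used above:

```python
def autonomy_rate(no_human: int, completed: int) -> float:
    """Fraction of completed tasks that needed no human intervention."""
    return no_human / completed

def escalation_precision(needed_human: int, escalated: int) -> float:
    """Fraction of escalations that actually required a human.
    Low values desensitize reviewers, as described above."""
    return needed_human / escalated

def recovery_rate(self_corrected: int, errors: int) -> float:
    """Fraction of errors the agent corrected on its own."""
    return self_corrected / errors
```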
What are the most common mistakes in agentic AI product design?
After building agentic systems and studying dozens of others, I see the same five mistakes repeatedly.
Mistake 1: Starting at Level 4 autonomy. Teams see the potential of autonomous agents and skip the progressive trust-building that safe deployment requires. Start at Level 1 for everything. Promote to Level 2 after 100 successful executions. Level 3 after 1,000. Level 4 after 10,000 with less than 0.1% error rate. This sounds slow. It is. That is the point.
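The promotion schedule above can be encoded directly, so autonomy upgrades become a mechanical check rather than a judgment call. A sketch using the thresholds just stated:

```python
# (target level, successes required at current level, max error rate or None)
PROMOTION_RULES = [
    (2, 100, None),      # Level 1 -> 2 after 100 successful executions
    (3, 1_000, None),    # Level 2 -> 3 after 1,000
    (4, 10_000, 0.001),  # Level 3 -> 4 after 10,000 with < 0.1% error rate
]

def next_level(current: int, successes_at_level: int, error_rate: float) -> int:
    """Promote one autonomy level at a time per the rules above; never demote."""
    for target, min_successes, max_error in PROMOTION_RULES:
        if (target == current + 1 and successes_at_level >= min_successes
                and (max_error is None or error_rate < max_error)):
            return target
    return current
```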
Mistake 2: No rollback path. According to a 2024 post-mortem analysis of 150 AI product incidents published by the Partnership on AI, 43% of incidents were made worse by the inability to undo the agent's action. Before building the forward path, build the backward path.
Mistake 3: Treating tool use as an API call. A function call from an AI model is not the same as an API call from application code. The model might hallucinate parameter values, call tools in the wrong order, or retry failed calls with slightly different (and wrong) parameters. Every tool needs input validation that is independent of the model's output.
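Model-independent input validation can be as simple as checking model-generated arguments against the tool's schema before anything executes. A minimal sketch (the schema format here is a simplified stand-in for a real JSON-schema validator):

```python
def validate_tool_args(schema: dict, args: dict) -> list:
    """Validate model-generated arguments independently of the model.
    Returns a list of errors; empty means the call may proceed."""
    errors = []
    for name, expected_type in schema.items():
        if name not in args:
            errors.append(f"missing required argument: {name}")
        elif not isinstance(args[name], expected_type):
            errors.append(f"{name}: expected {expected_type.__name__}, "
                          f"got {type(args[name]).__name__}")
    for name in args:
        if name not in schema:
            errors.append(f"unexpected argument: {name}")  # likely hallucinated
    return errors

# A wrong type, a missing parameter, and a hallucinated parameter are all
# rejected before any execution happens:
schema = {"document_id": str, "max_pages": int}
errs = validate_tool_args(schema, {"document_id": 42, "urgent": True})
```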
Mistake 4: Ignoring the observation loop. Many teams implement act but not observe. The agent calls a tool and moves on without evaluating the result. This is equivalent to writing code without checking return values. Our agent architecture required explicit observation steps after every tool call -- the model had to generate a natural-language assessment of whether the tool call achieved its intended effect.
Mistake 5: Optimizing for speed over trust. Users do not want the fastest agent. They want the most trustworthy agent. In our user research, 71% of users preferred an agent that took 60 seconds and explained its actions over one that took 10 seconds silently. Transparency is not a feature -- it is the product. [LINK:post-39]
Frequently Asked Questions
What tools and frameworks are available for building agentic AI products today?
The ecosystem is evolving rapidly. For tool use and function calling, major model providers (including the teams behind GPT-4 and Claude) offer native function calling APIs. For orchestration, frameworks like LangChain, LlamaIndex, and CrewAI provide agent scaffolding. For tool interoperability, the Model Context Protocol (MCP) is emerging as a standard that allows agents to connect to external tools through a unified interface. [LINK:post-40] The choice of framework matters less than the design principles: clear tool definitions, variable autonomy levels, and robust guardrails.
How do you handle agentic AI in regulated industries?
Regulated industries require two additional layers: an explainability layer that can produce human-readable justifications for every agent action, and a compliance boundary that hard-blocks certain actions regardless of the model's output. In tax preparation, for example, our agent could never modify a submitted form -- that action was simply not available at the tool level. According to a 2024 Deloitte survey, 89% of regulated enterprises require human-in-the-loop for any AI action that affects a customer's financial or legal status.
What is the difference between function calling and tool use?
Function calling is the mechanism -- the model generates a structured JSON object that specifies which function to call and with what arguments. Tool use is the broader concept -- the agent's ability to interact with external systems, databases, APIs, and services. Function calling is how tool use is implemented. Think of function calling as the wire protocol and tool use as the application layer.
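As a minimal illustration of the "wire protocol" half (the exact field names vary by provider; this shape is illustrative):

```python
import json

# A function call is a structured object naming the function and its arguments.
# The model emits this; the application executes it and returns an observation.
function_call = {
    "name": "classify_document",
    "arguments": {"document_id": "doc_8841"},
}

# On the wire it is just JSON; tool use is the surrounding execute-observe loop.
payload = json.dumps(function_call)
```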
How many tools should an agent have access to?
Fewer than you think. According to internal testing at our startup, agent task completion peaked at 8-12 tools and degraded beyond 20. Each additional tool increases the decision space and the probability of selecting the wrong tool. Our best-performing agent had exactly 7 tools. If you need more functionality, consider a hierarchical architecture where a planning agent delegates to specialized sub-agents, each with a focused tool set.
Is agentic AI replacing traditional software?
Not replacing -- augmenting. Agentic AI is best for tasks that require judgment, adaptation, and multi-step reasoning. Traditional software is better for tasks that are deterministic, high-volume, and need guaranteed correctness. The optimal architecture uses agents for the decision layer and traditional software for the execution layer. An agent decides which tax form to prepare; traditional software renders the PDF.
Published January 15, 2025. Based on the author's experience building agentic AI systems at a YC-backed startup serving 16,000 users.