15,000 Automated Assignments: Building Trust in AI Decision-Making

By Dinesh · August 15, 2023 · 12 min read

Last updated: August 2023

Users trust AI decisions more when they see the reasoning behind them, not just the outcome. At an AI-first tax platform, we automated 15,000 expert-to-client assignments and achieved higher user confidence in the AI's choices than in human-made assignments, by designing transparency into every decision. The key was not explainability for its own sake but showing exactly the right amount of reasoning at the right moment.

Expert matching is a high-stakes problem. Match a client with the wrong professional and you lose both the client and the professional's time. In a tax context, a mismatch can mean missed deductions, delayed filings, or compliance risk. The manual process took 12 minutes per assignment and the operations team was the bottleneck for every new client. We needed automation, but we needed it to be trusted.

Why Is Transparency Critical for AI Decision-Making?

Research from the Stanford HAI lab in 2023 found that 78% of users reject AI recommendations when they cannot see the reasoning, even when those recommendations are more accurate than human alternatives. Transparency is not about satisfying curiosity. It is a prerequisite for adoption.

We saw this play out in our own data. In the first version of our automated matching system, we showed users only the result: "You've been matched with Expert A." The acceptance rate was 61%. Nearly 4 in 10 users either requested a different expert or contacted support to ask why they had been assigned to this particular person.

When we added a single line of explanation ("Matched based on your filing type and state residency"), acceptance jumped to 79%. When we added the full transparency pattern described below, it reached 94%. Same algorithm. Same accuracy. The only variable was how much reasoning we exposed.

How Does the Explainability Spectrum Work?

Not all decisions need the same depth of explanation. Explaining too little erodes trust. Explaining too much overwhelms users and paradoxically reduces trust because complexity feels like obfuscation. We developed an explainability spectrum with four levels, deployed based on the stakes of the decision.

| Level | What the User Sees | When to Use It | Example |
|---|---|---|---|
| Result Only | The decision with no explanation | Trivial, reversible choices | Document sort order |
| Reason Summary | One sentence explaining the primary factor | Low-stakes, semi-reversible | "Sorted by deadline urgency" |
| Factor Breakdown | Top 3 factors with relative weights shown | Medium-stakes, important choices | Expert matching with factor cards |
| Full Audit Trail | Complete decision log with all inputs, weights, and alternatives considered | High-stakes, irreversible, or compliance-required | Final filing review assignment |

The expert assignment system used Level 3 (Factor Breakdown) as its default, because matching a client with a professional is a medium-to-high-stakes decision that affects the entire service experience. The full audit trail was available to administrators and to any user who requested it.
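
As a sketch, picking the default depth for a decision can be a simple policy function over the decision's stakes. The enum and function names here are illustrative, not from our actual codebase:

```python
from enum import IntEnum

class ExplanationLevel(IntEnum):
    RESULT_ONLY = 1       # the decision with no explanation
    REASON_SUMMARY = 2    # one sentence on the primary factor
    FACTOR_BREAKDOWN = 3  # top 3 factors in plain language
    FULL_AUDIT_TRAIL = 4  # complete decision log with all inputs

def pick_level(stakes: str, reversible: bool, compliance_required: bool) -> ExplanationLevel:
    """Map a decision's properties to a default explanation depth.

    Illustrative policy following the table above: high-stakes, irreversible,
    or compliance-required decisions get the full audit trail.
    """
    if compliance_required or stakes == "high" or not reversible:
        return ExplanationLevel.FULL_AUDIT_TRAIL
    if stakes == "medium":
        return ExplanationLevel.FACTOR_BREAKDOWN
    if stakes == "low":
        return ExplanationLevel.REASON_SUMMARY
    return ExplanationLevel.RESULT_ONLY
```

The point of encoding the policy in one place is that the choice of explanation depth becomes reviewable and testable, rather than an ad hoc decision per feature.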

What Were the Matching Factors and How Did We Show Them?

The matching algorithm scored candidates across six dimensions. Each score contributed to a weighted composite. But showing users a table of six scores with decimal weights is not transparency; it is a spreadsheet. We had to translate algorithmic output into human understanding.

The six matching dimensions were:

  1. Specialization match: Does the expert's specialization cover the client's filing type? (e.g., self-employment, international, investment income)
  2. State expertise: Is the expert licensed and experienced in the client's state of residence?
  3. Complexity alignment: Does the expert's typical case complexity match the client's estimated complexity?
  4. Availability: Does the expert have capacity within the client's timeline?
  5. Language preference: Can the expert communicate in the client's preferred language?
  6. Historical performance: What is the expert's satisfaction rating from similar clients?

For the user-facing explanation, we collapsed these into a "match card" that showed the top three reasons in plain language. For example: "This expert was selected because they specialize in self-employment taxes, are licensed in California, and have a 4.8/5 rating from clients with similar filings." No weights. No scores. Just the reasons that mattered most.
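
A minimal sketch of this collapse step: score the six dimensions into a weighted composite, then surface only the top three contributors as plain-language phrases. The weights and phrasing below are hypothetical; the article does not publish the real values:

```python
# Hypothetical weights -- illustrative only, not the production values.
WEIGHTS = {
    "specialization": 0.30,
    "state": 0.20,
    "complexity": 0.15,
    "availability": 0.15,
    "language": 0.10,
    "history": 0.10,
}

def composite_score(scores):
    """Weighted composite over the six matching dimensions (scores in [0, 1])."""
    return sum(WEIGHTS[k] * scores[k] for k in WEIGHTS)

def match_card_reasons(scores, phrases, top_n=3):
    """Collapse dimension scores into the top-N plain-language reasons.

    Ranks by weighted contribution, so the card reflects what actually
    drove the match -- no raw weights or scores are shown to the user.
    """
    ranked = sorted(WEIGHTS, key=lambda k: WEIGHTS[k] * scores[k], reverse=True)
    return [phrases[k] for k in ranked[:top_n]]
```

For example, a candidate scoring highest on specialization, state licensure, and availability would yield a card listing exactly those three phrases, in that order.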

This design choice was informed by a finding from our user research: 89% of users said they wanted to know "why this expert" but only 7% wanted to know "what was the exact scoring methodology." We designed for the 89%.

How Did We Build Trust Metrics Into the System?

Trust is not binary. You cannot ask users "Do you trust this?" and get a useful answer. We needed to measure trust through behavior, not just surveys. We tracked five trust signals:

  1. Acceptance rate: Did the user accept the AI's assignment without requesting a change?
  2. Time-to-accept: How long between seeing the assignment and proceeding? Shorter times indicate higher trust.
  3. Explanation depth: Did the user click to see more reasoning? Moderate clicking is healthy; excessive clicking suggests anxiety.
  4. Override rate: How often did users manually request a different expert?
  5. Post-assignment satisfaction: After working with the assigned expert, did the user rate the match positively?

| Trust Metric | Before Transparency | After Transparency | Change |
|---|---|---|---|
| Acceptance rate | 61% | 94% | +33 pts |
| Median time-to-accept | 4.2 minutes | 47 seconds | -81% |
| Override rate | 28% | 4.5% | -84% |
| Post-assignment satisfaction | 3.9/5 | 4.6/5 | +0.7 pts |

The time-to-accept metric was the most revealing. A drop from 4.2 minutes to 47 seconds meant users were no longer hesitating, second-guessing, or searching for ways to override the system. They read the match card, understood the reasoning, and proceeded. That behavioral change told us more than any survey could.
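
Instrumenting these five signals is mostly a matter of logging per-assignment outcomes and aggregating them. A minimal sketch, with field and function names of my own invention:

```python
import statistics
from dataclasses import dataclass
from typing import Optional

@dataclass
class AssignmentOutcome:
    accepted_first: bool          # accepted without requesting a change
    seconds_to_accept: float      # time from seeing the assignment to proceeding
    explanation_clicks: int       # how many times the user drilled into reasoning
    overrode: bool                # manually requested a different expert
    satisfaction: Optional[float] # post-assignment rating (1-5), if given

def trust_metrics(outcomes):
    """Aggregate the five behavioral trust signals over a batch of assignments."""
    n = len(outcomes)
    rated = [o.satisfaction for o in outcomes if o.satisfaction is not None]
    return {
        "acceptance_rate": sum(o.accepted_first for o in outcomes) / n,
        "median_time_to_accept": statistics.median(o.seconds_to_accept for o in outcomes),
        "avg_explanation_clicks": sum(o.explanation_clicks for o in outcomes) / n,
        "override_rate": sum(o.overrode for o in outcomes) / n,
        "avg_satisfaction": sum(rated) / len(rated) if rated else float("nan"),
    }
```

Using the median rather than the mean for time-to-accept matters: a handful of users who walk away mid-flow would otherwise dominate the average.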

What Happens When Users Disagree With the AI?

Transparency without agency is just a lecture. Users needed the ability to override the AI, and we needed to design the override process so it would improve the system rather than undermine it.

We built a three-step override flow:

  1. Acknowledge the concern: When a user clicked "Request different expert," the system asked a single question: "What's the most important thing to change?" with options like "Different specialization," "Different availability," or "Prefer someone with more experience."
  2. Re-rank and explain: The system generated a new recommendation using the user's feedback as an additional constraint and showed the updated match card. 68% of overrides were resolved at this step: the user accepted the second recommendation.
  3. Manual selection: If the user still was not satisfied, they could browse available experts with the AI's ranking visible but not binding. Only 1.4% of all assignments reached this step.

Every override fed back into the model as a training signal. Users who overrode for "different specialization" told us our specialization taxonomy was too coarse. We refined it from 8 categories to 23, and the override rate dropped by another 2 percentage points. The override flow was a product improvement engine, not just a safety valve.
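
The re-rank step (step 2) can be sketched as treating the user's stated concern as a hard constraint on one dimension, then re-sorting what remains. This is an illustration of the idea, not our production ranker; the threshold and fallback behavior are assumptions:

```python
def rerank_with_feedback(candidates, scores, feedback_dim, min_score=0.8):
    """Re-rank after an override request.

    candidates: list of expert ids.
    scores: id -> {dimension: score in [0, 1]}.
    feedback_dim: the dimension the user asked to change (e.g. "specialization").

    The user's concern becomes a hard filter: a candidate must score highly on
    that dimension to stay in the pool. The survivors are then re-sorted by
    total score. If the filter eliminates everyone, fall back to the full pool
    rather than returning nothing.
    """
    eligible = [c for c in candidates if scores[c].get(feedback_dim, 0.0) >= min_score]
    pool = eligible or candidates
    return sorted(pool, key=lambda c: sum(scores[c].values()), reverse=True)
```

Because the constraint is hard rather than a weight tweak, the second recommendation visibly responds to what the user asked for, which is likely why most overrides resolved at this step.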

How Do You Scale Transparent AI Decisions?

The objection I hear most often from engineering teams is that generating explanations is expensive. It adds latency, requires additional model calls, and complicates the architecture. Here is how we kept it practical:

  • Pre-compute explanations: Match cards were generated at assignment time, not on demand. The cost was marginal because the factors were already computed during scoring.
  • Template explanations: We did not generate free-text explanations with an LLM. We built 45 explanation templates that filled in dynamic values (expert name, specialization, rating). This eliminated latency and hallucination risk.
  • Lazy-load depth: The full audit trail was only generated if a user or admin requested it, which happened in fewer than 3% of cases. We never pre-computed what nobody would read.
  • Cache aggressively: For the same client profile, the matching explanation was identical until inputs changed. We cached explanations with a 24-hour TTL.
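
The template-plus-cache pattern is simple to implement. A sketch under stated assumptions: the two templates and the cache shape below are illustrative stand-ins for the 45 templates and whatever cache backend the real system used:

```python
import time

# Hypothetical templates keyed by primary factor (the real system had ~45).
TEMPLATES = {
    "specialization": "This expert was selected because they specialize in {specialization}.",
    "state": "This expert was selected because they are licensed in {state}.",
}

TTL_SECONDS = 24 * 60 * 60  # 24-hour TTL, as described above
_cache = {}                  # (factor, values) -> (timestamp, rendered text)

def render_explanation(primary_factor, **values):
    """Fill a pre-written template; no LLM call, so no latency or hallucination risk.

    Identical inputs hit the cache until the TTL expires, mirroring the
    "same client profile, same explanation" property of the matching inputs.
    """
    key = (primary_factor, tuple(sorted(values.items())))
    now = time.monotonic()
    hit = _cache.get(key)
    if hit is not None and now - hit[0] < TTL_SECONDS:
        return hit[1]
    text = TEMPLATES[primary_factor].format(**values)
    _cache[key] = (now, text)
    return text
```

String formatting against a fixed template is microseconds of work, which is how the explanation layer stays well under the 50ms latency budget mentioned above.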

The result: transparency added less than 50ms of latency to the assignment flow. Users perceived no delay. The engineering cost was a one-time investment of about two weeks of development, and the ongoing maintenance cost was lower than the support cost of unexplained assignments.

What Did Transparent AI Teach Us About Our Own Product?

The most unexpected benefit of building transparent AI decisions was that the explanations revealed product gaps. When the match card said "Matched because they specialize in international taxes," and the client override reason was "I need someone who handles FBAR, not just foreign income," we learned that our "international taxes" category was too broad.

Transparency created a feedback loop: the AI explained itself, users corrected it, and the corrections mapped directly to product improvements. Over 15,000 assignments, we identified 11 taxonomy refinements, 3 missing matching factors, and 2 UX improvements, all from analyzing explanation-override pairs. For how this transparency approach connected to our overall satisfaction framework, see how we designed for 4.7/5 satisfaction.

The broader lesson applies to any AI system making decisions on behalf of users: if you cannot explain the decision clearly, the decision model is probably too opaque to debug, too fragile to maintain, and too risky to trust. Transparency is not just a user experience feature. It is an engineering discipline. If your explanation does not make sense to a user, your algorithm probably has a flaw you have not found yet.

For more on how we validated what users actually valued in these AI interactions, see 25,000 AI interactions later: what users actually want.

Frequently Asked Questions

Does showing AI reasoning slow down the user experience?

Not if you design it with progressive disclosure. The match card (Level 3 explanation) added zero perceived latency because it was pre-computed during the scoring step. The full audit trail (Level 4) took 1-2 seconds to generate but was only loaded on demand. Most users never requested it. The key is to pre-compute the default explanation tier and lazy-load everything deeper.

How do you prevent users from gaming the override system?

We tracked override patterns at the user level. Fewer than 0.5% of users overrode more than twice per assignment cycle. For those users, we added a brief human review step where an operations team member validated the override reason. This added a small delay but prevented systematic gaming without penalizing legitimate preference changes.

Can this approach work with LLM-generated decisions where reasoning is less deterministic?

Yes, but you need to separate the explanation from the generation. Do not ask the LLM to explain itself. Instead, log the inputs, context, and prompt that produced the output, then generate a structured explanation from those artifacts. This gives you deterministic explanations even for non-deterministic outputs. We used this pattern for our prompt engineering work, which I discuss in prompt engineering as product design.
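
A minimal sketch of that artifact-logging pattern, with hypothetical field names: record everything that produced the output at generation time, then derive the explanation from the record rather than from the model:

```python
import hashlib
import json
from dataclasses import dataclass, asdict

@dataclass
class DecisionRecord:
    """Artifacts captured at generation time: enough to explain the output
    later without asking the model to explain itself."""
    inputs: dict             # the user-facing inputs that fed the decision
    context: dict            # retrieved or derived context
    prompt_template_id: str  # which prompt produced the output
    model_version: str       # which model produced the output
    output: str              # the decision itself

def log_decision(rec):
    """Persist the record; here a content hash stands in for a real log write."""
    blob = json.dumps(asdict(rec), sort_keys=True)
    return hashlib.sha256(blob.encode()).hexdigest()

def explain_from_record(rec):
    """Build a deterministic, structured explanation from the logged artifacts.

    The same record always yields the same explanation, even though the
    LLM output it describes was non-deterministic.
    """
    return {
        "decision": rec.output,
        "based_on": sorted(rec.inputs),
        "prompt_template": rec.prompt_template_id,
        "model_version": rec.model_version,
    }
```

The content hash also doubles as an audit reference: two decisions with the same hash provably came from identical inputs, context, prompt, and output.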

What is the ROI of building trust transparency into an AI system?

For us, the direct ROI was a 33-point increase in acceptance rate, which eliminated approximately 5,800 manual reassignment requests per year. At 12 minutes per reassignment, that recovered roughly 1,160 hours of operations time annually. The indirect ROI was higher client satisfaction, which drove retention. We estimated the full impact at roughly 15x the engineering investment within the first year.