The Expert Matching Problem: When AI Confidence Meets Human Expertise
May 18, 2023 · 15 min read · System Design Case Study
At a YC-backed tax-tech startup, we needed to match 16,000 users with the right tax expert -- automatically. The AI had confidence scores from document analysis. The experts had specialties, availability windows, and track records. The users had anxiety about their taxes and zero patience for being bounced between specialists. We designed a matching system that handled more than 15,000 assignments automatically with a 4.7 out of 5 satisfaction score. Here is the algorithm, the escalation ladder, and the critical design decisions about when AI should defer to humans.
Why is expert matching one of the hardest problems in AI-assisted services?
Expert matching sounds like a routing problem. User comes in, system assigns an expert, done. In practice, it is a multi-dimensional optimization problem with conflicting constraints, incomplete information, and high emotional stakes.
At a YC-backed tax-tech startup, we had 16,000 users who needed tax preparation assistance across a single season. We had a pool of 47 tax experts with varying specializations, availability, and capacity. The naive approach -- round-robin assignment -- would have been simple but disastrous. According to a 2023 Zendesk benchmark study, customer satisfaction drops 22% when users are matched with agents who lack the relevant expertise for their specific issue. In tax preparation, where mistakes carry financial consequences, that satisfaction drop compounds into churn, refund requests, and legal liability.
The problem had three dimensions that made it genuinely hard:
- Information asymmetry: At the time of matching, we knew what documents the user had uploaded and what the AI had extracted. We did not know what questions the user would ask, what life changes they had experienced, or what their real anxiety was about. According to our post-season analysis, 34% of users had needs that were not predictable from their documents alone.
- Capacity constraints: Our best experts were not infinitely scalable. The expert with the deepest knowledge of international tax situations could handle 12 cases per day. We had 40+ international cases per day at peak. Matching everyone to the "best" expert created bottlenecks. A 2023 Harvard Business School study on service matching found that optimal matching reduces wait times by 37% compared to best-expert-first algorithms.
- Emotional context: Tax preparation is not like ordering food. Users are anxious. A mismatch does not just waste time; it erodes trust. According to our user research, 68% of users who were re-assigned to a different expert after initial matching rated their overall experience below 4 out of 5, regardless of the final outcome. First-match quality was critical. [LINK:post-18]
How did we design the matching algorithm?
The matching algorithm computed a score for every possible user-expert pair and assigned users to the highest-scoring available expert. The core formula combined three weighted factors:
Match Score = (Expertise Fit x 0.50) + (Availability Score x 0.30) + (Historical Performance x 0.20)
Each factor was itself a composite score. Here is how we computed them:
Factor 1: Expertise Fit (50% weight)
Expertise fit measured how well an expert's specializations matched the user's case profile. The case profile was generated automatically from document analysis. We classified every user into one or more of 14 tax complexity categories:
| Complexity Category | Trigger Signals | Expert Pool Size | % of Users |
|---|---|---|---|
| W-2 only (simple) | Single W-2, standard deduction likely | 47 (all experts) | 31% |
| Multi-income | Multiple W-2s or W-2 + 1099 | 38 | 22% |
| Self-employment | 1099-NEC, Schedule C indicators | 29 | 17% |
| Investment income | 1099-DIV, 1099-B, K-1 | 21 | 11% |
| Rental property | 1099-MISC (rent), property documents | 14 | 6% |
| International | Foreign income indicators, FBAR signals | 8 | 4% |
| Multi-state | W-2s from multiple states | 31 | 5% |
| Life events | Marriage, home purchase, child signals | 35 | 4% |
Each expert had a proficiency score (0-1) for each category, based on their credentials, experience, and historical accuracy. The expertise fit score was the dot product of the user's complexity vector and the expert's proficiency vector, normalized to 0-1. According to a 2023 Operations Research paper on skill-based routing, weighted proficiency matching outperforms categorical matching (yes/no expertise flags) by 18% in outcome quality because it captures degrees of expertise rather than binary capabilities.
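The dot-product computation described above can be sketched in a few lines. This is a minimal illustration, not the production code: the variable names and the example vectors (showing 4 of the 14 categories) are hypothetical, but the normalization to 0-1 follows the description.

```python
def expertise_fit(user_complexity, expert_proficiency):
    """Dot product of the user's complexity vector and the expert's
    proficiency vector, normalized so that a perfectly proficient
    expert (1.0 in every relevant category) scores exactly 1.0."""
    raw = sum(u * p for u, p in zip(user_complexity, expert_proficiency))
    max_possible = sum(user_complexity)  # score against an ideal expert
    return raw / max_possible if max_possible else 0.0

# A user flagged strongly as self-employment, weakly as multi-income
# (only 4 of the 14 categories shown for brevity):
user = [0.0, 0.3, 1.0, 0.0]
specialist = [0.9, 0.8, 0.95, 0.5]   # deep Schedule C proficiency
generalist = [0.9, 0.6, 0.40, 0.3]

print(round(expertise_fit(user, specialist), 3))  # 0.915
print(round(expertise_fit(user, generalist), 3))  # 0.446
```

Because the score is weighted rather than binary, a generalist with partial proficiency still receives partial credit, which is exactly the property the skill-based-routing literature cited above attributes the 18% quality gain to.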
Factor 2: Availability Score (30% weight)
Availability was not just "is the expert free right now." It was a forward-looking score that factored in three elements:
- Current queue depth: How many cases the expert currently had in progress. We set a maximum of 15 concurrent cases per expert based on our analysis that quality degraded above 12 cases. The score decreased linearly from 1.0 (0 cases) to 0.0 (15 cases).
- Estimated completion velocity: Based on the expert's historical average case duration for the relevant complexity category. An expert who typically resolves multi-income cases in 45 minutes scored higher than one who takes 90 minutes, because their queue would clear faster.
- Schedule overlap: Whether the expert's working hours overlapped with the user's timezone and preferred contact times. According to our data, matching timezone preferences reduced first-response time by 41% and increased overall satisfaction by 0.3 points.
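The three elements above can be combined as follows. This is a sketch under stated assumptions: the 15-case cap and the linear queue decay come from the text, but the velocity and overlap normalizations and the equal blend are illustrative (the production blend was tuned from outcome data).

```python
def availability_score(queue_depth, avg_case_minutes,
                       category_median_minutes, timezone_overlap_hours):
    # Queue component: linear from 1.0 at 0 cases to 0.0 at the 15-case cap.
    queue = max(0.0, 1.0 - queue_depth / 15)

    # Velocity component: faster-than-median experts clear queues sooner.
    velocity = min(1.0, category_median_minutes / avg_case_minutes)

    # Overlap component: fraction of an 8-hour day shared with the user
    # (the 8-hour denominator is an assumption for illustration).
    overlap = min(1.0, timezone_overlap_hours / 8)

    # Equal blend shown here; the real weighting was tuned empirically.
    return (queue + velocity + overlap) / 3

# An expert with 6 cases queued, faster than the category median,
# fully overlapping the user's hours:
print(round(availability_score(6, 45, 60, 8), 3))  # 0.867
```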
Factor 3: Historical Performance (20% weight)
Historical performance measured the expert's track record on similar cases. We computed it from three signals:
- Satisfaction scores: Average rating from users with similar complexity profiles. We weighted recent ratings higher using exponential decay (half-life of 30 days).
- Accuracy rate: Percentage of returns filed by this expert that did not result in IRS notices or amendments within 12 months. Our best experts had 99.6% accuracy; the median was 98.1%.
- Resolution time: How quickly the expert resolved cases relative to their complexity category's median. Faster-than-median experts scored higher, but only if their accuracy remained above 97%. Speed without accuracy was penalized.
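Putting the pieces together: the decay weighting of satisfaction ratings and the final weighted blend can be sketched as below. The 30-day half-life and the 0.50/0.30/0.20 weights are from the text; the function names and the rating tuple format are illustrative.

```python
def decayed_satisfaction(ratings, half_life_days=30.0):
    """ratings: list of (score_out_of_5, age_in_days) tuples.
    Recent ratings count more, via exponential decay with a 30-day half-life."""
    weights = [0.5 ** (age / half_life_days) for _, age in ratings]
    total = sum(weights)
    weighted = sum(score * w for (score, _), w in zip(ratings, weights))
    return weighted / total if total else 0.0

def match_score(expertise_fit, availability, historical_performance):
    """The top-level formula: 50% expertise, 30% availability, 20% history."""
    return (0.50 * expertise_fit
            + 0.30 * availability
            + 0.20 * historical_performance)

# A 5-star rating today and a 3-star rating 30 days ago: the older
# rating carries exactly half the weight, so the average leans toward 5.
print(round(decayed_satisfaction([(5.0, 0.0), (3.0, 30.0)]), 2))  # 4.33
print(round(match_score(0.9, 0.7, 0.8), 2))                        # 0.82
```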
When should the AI override its own matching decision?
The matching algorithm produced the initial assignment, but we built four override conditions where the system would pause, reconsider, or escalate:
- Confidence gap override: If the top two match scores were within 0.05 of each other, the system flagged the assignment for human routing review. This occurred on 7.3% of assignments. In those cases, a routing coordinator could see both options and apply judgment that the algorithm could not capture -- like knowing that Expert A was having a difficult week or that Expert B had just completed training on a relevant topic.
- Complexity escalation override: If the document analysis confidence was below 0.70 (meaning the AI was uncertain about the case's complexity profile), the system assigned the user to a senior expert regardless of the match score. According to our analysis, low-confidence document profiles correlated with complex cases 73% of the time. Routing uncertain cases to senior experts preemptively avoided 340 mid-case reassignments over the season.
- Capacity saturation override: When an expert's queue exceeded 80% capacity, the system applied a 0.3x multiplier to their availability score, effectively deprioritizing them for new assignments until their queue cleared. This prevented our best experts from being overwhelmed. A 2023 study by Wharton's Operations Management group found that expert overload reduces decision quality by 19% for every 10% over optimal capacity. [LINK:post-10]
- User preference override: Returning users who had a previous positive experience with a specific expert were automatically routed back to that expert if available. This override applied to 11% of assignments and had the highest satisfaction scores of any routing path (4.9 out of 5).
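The override logic above amounts to a short decision function run after base scoring. The sketch below uses the thresholds quoted in the text (0.05 gap, 0.70 document confidence, 15-case cap); the function signature, field names, and ordering of checks are illustrative assumptions. The capacity saturation override is not shown here because it acts earlier, as a 0.3x multiplier on the availability score once a queue passes 80% of capacity.

```python
def apply_overrides(ranked, doc_confidence, expert_queues, returning_expert=None):
    """ranked: list of (expert_id, match_score), best first.
    Returns an (action, payload) tuple describing the routing decision."""
    # User preference: route returning users back to a known-good expert
    # if that expert has queue headroom.
    if returning_expert is not None and expert_queues.get(returning_expert, 15) < 15:
        return ("assign", returning_expert)

    # Complexity escalation: uncertain document profiles go to a senior
    # expert regardless of match score.
    if doc_confidence < 0.70:
        return ("senior", None)

    # Confidence gap: near-ties between the top two candidates go to a
    # human routing coordinator for review.
    if len(ranked) >= 2 and ranked[0][1] - ranked[1][1] < 0.05:
        return ("review", ranked[:2])

    return ("assign", ranked[0][0])

print(apply_overrides([("A", 0.82), ("B", 0.80)], 0.90, {}))  # near-tie -> review
print(apply_overrides([("A", 0.82)], 0.50, {}))               # uncertain -> senior
```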
How did the escalation ladder work?
Not every case could be resolved by the first-assigned expert. We designed a four-tier escalation ladder that balanced speed with appropriate expertise:
| Tier | Trigger | Response Time SLA | Who Handles | % of Cases |
|---|---|---|---|---|
| Tier 1: Standard | Initial assignment | 4 hours for first contact | Matched expert | 82% |
| Tier 2: Specialist | Expert identifies complexity beyond their proficiency | 2 hours for handoff | Category specialist | 12% |
| Tier 3: Senior Review | Specialist flags unusual situation or high-value case | 1 hour for review | Senior tax professional | 5% |
| Tier 4: Emergency | Filing deadline risk, legal complexity, or client distress | 30 minutes | Lead expert + operations manager | 1% |
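The tier triggers in the table reduce to a simple precedence check: the most severe condition wins. A minimal sketch, with flag names invented for illustration:

```python
def escalation_tier(deadline_risk=False, legal_complexity=False,
                    client_distress=False, unusual_or_high_value=False,
                    beyond_proficiency=False):
    """Map case conditions to the four-tier ladder; severity wins ties."""
    if deadline_risk or legal_complexity or client_distress:
        return 4  # Emergency: lead expert + ops manager, 30-minute SLA
    if unusual_or_high_value:
        return 3  # Senior review: 1-hour SLA
    if beyond_proficiency:
        return 2  # Category specialist: 2-hour handoff SLA
    return 1      # Standard: matched expert, 4-hour first-contact SLA
```

A case that is both beyond the expert's proficiency and at filing-deadline risk goes straight to Tier 4; the ladder never routes a severe case to a lower tier just because a milder trigger also fired.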
The critical design decision was making escalation frictionless for the expert. In many systems, escalating a case feels like admitting failure. We deliberately designed the opposite culture. Experts who escalated appropriately were scored higher in their performance reviews than experts who held onto cases beyond their depth. According to a 2023 Deloitte study on professional services firms, organizations that reward appropriate escalation see 28% fewer client complaints than those that penalize it.
The escalation also preserved context. When a case moved from Tier 1 to Tier 2, the receiving specialist got a complete dossier: all documents, the AI's analysis, the Tier 1 expert's notes, and the specific reason for escalation. According to our measurements, this context transfer reduced Tier 2 resolution time by 34% compared to starting fresh. Users largely reported that the transition felt seamless: escalated cases averaged a mean satisfaction score of 4.3 out of 5, versus 4.8 for non-escalated cases. [LINK:post-19]
What did the satisfaction metrics reveal?
Over the season, we processed 15,247 automated assignments. Here are the results broken down by key dimensions:
| Metric | Result | Industry Benchmark |
|---|---|---|
| Overall satisfaction score | 4.7 / 5.0 | 4.1 / 5.0 (Zendesk 2023) |
| First-match resolution rate | 82% | 68% (industry average) |
| Average time to first expert contact | 2.3 hours | 6.8 hours (industry average) |
| Reassignment rate | 8.4% | 18% (industry average) |
| Expert utilization rate | 78% | 62% (industry average) |
| User-reported "right expert" rate | 91% | 74% (industry average) |
The most revealing metric was the breakdown by complexity category. Simple W-2-only cases scored 4.9 out of 5 -- nearly perfect, because almost any expert could handle them well and the AI's document analysis was highly confident. International tax cases scored 4.2 out of 5 -- still above industry benchmarks, but the lower score reflected the inherent difficulty of matching limited specialist supply to high-complexity demand.
What did we learn about designing AI-human handoff systems?
After 15,000 assignments, five principles emerged that I believe apply to any AI system that hands off to human experts:
- Confidence scores are inputs, not decisions. The AI's confidence score is one factor among many. An expert's current workload, their recent performance on similar cases, and even time-of-day effects matter. Systems that route purely on AI confidence consistently underperform multi-factor approaches. According to a 2023 ACM paper on human-AI teaming, multi-factor routing outperforms confidence-only routing by 15-23% in outcome quality.
- Make escalation a feature, not a failure. The moment you penalize an expert for escalating, you incentivize them to struggle with cases beyond their capability. Our explicit "escalation is quality" policy meant experts flagged complexity early, reducing the total resolution time by 27% compared to our pre-policy baseline.
- Preserve context across handoffs. Every handoff that loses context is a handoff that degrades the user experience. We invested significant engineering effort in building comprehensive context transfer so that a Tier 2 specialist could understand a case in 90 seconds instead of starting from scratch. [LINK:post-17]
- Optimize for first-match quality over speed. We could have assigned users to experts faster by using simpler matching. We deliberately added 200ms of computation time to run the full matching algorithm because the improvement in first-match quality was worth far more than the speed gain. A wrong-but-fast match cost us an average of 3.2 hours in reassignment overhead. A right-but-slower match cost us 200ms.
- Build feedback loops from day one. Every assignment generated satisfaction data. Every escalation generated complexity data. Every resolution generated accuracy data. This data fed back into the matching algorithm weekly. By week eight, the algorithm was meaningfully better than week one because it had learned from 8,000 real assignments. Systems without feedback loops do not learn, and the gap between learning and non-learning systems compounds every week.
Frequently Asked Questions
How do you cold-start an expert matching system with no historical data?
We launched with expert self-reported proficiency scores and credentials as the initial expertise ratings. For the first 500 assignments, we applied a 0.5x weight to historical performance (since we had almost none) and a 0.75x weight to expertise fit (since it was based on self-reporting rather than proven performance). By assignment 500, we had enough outcome data to recalibrate. We found that self-reported proficiency correlated only 0.62 with actual performance -- useful for cold start but not reliable long-term. According to a 2022 meta-analysis in the Journal of Applied Psychology, self-assessed expertise correlates with measured expertise at r=0.29 for general tasks, but rises to r=0.54 in narrow professional domains. Our 0.62 correlation was likely inflated because our experts had strong self-awareness about their specific tax specializations.
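The cold-start damping described above can be sketched as a reweighting of the base 0.50/0.30/0.20 formula. The 0.5x and 0.75x multipliers and the 500-assignment threshold are from the text; renormalizing the damped weights back to a sum of 1 is an illustrative assumption.

```python
def cold_start_weights(assignments_completed):
    """Return factor weights; damp untrusted signals before ~500 assignments."""
    base = {"expertise_fit": 0.50,
            "availability": 0.30,
            "historical_performance": 0.20}
    if assignments_completed < 500:
        base["expertise_fit"] *= 0.75          # self-reported, not yet proven
        base["historical_performance"] *= 0.5  # almost no outcome data yet
    total = sum(base.values())
    return {k: v / total for k, v in base.items()}  # renormalize to sum to 1
```

During cold start, availability effectively carries more of the decision, which is a reasonable default: it is the one factor that is directly observable from day one.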
How do you prevent the algorithm from creating a "rich get richer" problem where the best experts get all the complex cases?
This was a real risk. Without controls, the algorithm would route every complex case to the same 8 international specialists, burning them out while less-specialized experts sat idle. We implemented three controls: a capacity saturation multiplier that deprioritized overloaded experts, a "stretch assignment" policy that deliberately assigned slightly-above-comfort-level cases to developing experts (with senior oversight), and a weekly rebalancing review where the operations team manually adjusted routing weights. The stretch assignments were particularly effective. Over the season, 6 experts earned new proficiency ratings in categories where they started as novices, expanding our effective specialist pool by 14%.
What happens when no available expert matches the user's needs?
This happened on 2.1% of assignments, almost exclusively for international and multi-state cases during peak hours. We implemented a three-tier fallback: first, offer the user a callback window when a specialist would be available (accepted by 67% of users). Second, assign to the highest-scoring available expert with a note flagging the expertise gap, and schedule a specialist review within 24 hours. Third, for the most complex cases, bring in an external consultant from our partner network. The last option was expensive ($150-300 per case) but preserved quality for cases our internal pool could not handle.
How did user anxiety factor into the matching algorithm?
We could not directly measure anxiety, but we proxied it through behavioral signals: number of documents uploaded (more documents often correlated with more complex and anxiety-inducing situations), time spent on the upload page (longer times suggested hesitation), and whether the user had started and abandoned a previous session. Users with high anxiety proxy scores were matched to experts with the highest satisfaction ratings, even if a different expert was a marginally better expertise fit. The rationale: for anxious users, bedside manner matters more than marginal technical proficiency. This approach increased satisfaction scores by 0.4 points for high-anxiety users compared to pure expertise-based matching.
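The anxiety proxy and the resulting routing tweak can be sketched as below. The behavioral signals are the ones described above, but every threshold, weight, and name here is a hypothetical illustration, not the production calibration.

```python
def anxiety_proxy(doc_count, upload_seconds, abandoned_session):
    """Combine behavioral signals into a 0 (calm) to 1 (high anxiety) score."""
    score = 0.0
    score += min(1.0, doc_count / 10) * 0.4        # many docs -> complex, stressful
    score += min(1.0, upload_seconds / 600) * 0.3  # long hesitation on upload page
    score += 0.3 if abandoned_session else 0.0     # started and walked away before
    return score

def pick_expert(candidates, anxiety, threshold=0.6):
    """candidates: list of (expert_id, expertise_fit, satisfaction_out_of_5).
    High-anxiety users get the best bedside manner, not the best fit."""
    if anxiety >= threshold:
        return max(candidates, key=lambda c: c[2])[0]  # highest satisfaction
    return max(candidates, key=lambda c: c[1])[0]      # highest expertise fit

# Expert A is the marginally better fit; Expert B has the better ratings.
candidates = [("A", 0.95, 4.4), ("B", 0.90, 4.9)]
print(pick_expert(candidates, anxiety=0.8))  # B
print(pick_expert(candidates, anxiety=0.2))  # A
```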
Last updated: May 18, 2023