25,000 AI Interactions Later: What Users Actually Want From AI

By Dinesh · November 18, 2023 · 13 min read

Last updated: November 2023

After analyzing 25,500 AI-assisted interactions at an AI-first tax platform, we found that users want five specific things from AI products, and that three of our initial assumptions about user needs were wrong. The biggest insight was simple and widely ignored: users want the AI to answer their actual question, not demonstrate how smart it is. Here is the full breakdown of what the data showed, the expectation gap we discovered, and how it reshaped our product.

Most AI product teams build based on what they think users want, which is usually some combination of "faster," "smarter," and "more personalized." Those are not wrong, but they are abstract to the point of uselessness. A 2023 McKinsey survey found that 72% of AI products miss their adoption targets, and the most cited reason was "the product didn't solve the problem users actually had." We almost made the same mistake. The data saved us.

How Did We Analyze 25,000 AI Interactions?

We did not just look at satisfaction scores. We coded every interaction across five dimensions using a combination of automated classification and human review of a 3,000-interaction stratified sample:

  1. Intent classification: What was the user actually trying to accomplish? (Not what they typed, but what they needed.)
  2. Resolution status: Did the interaction resolve the user's need? Fully, partially, or not at all?
  3. Effort level: How many messages did it take to reach resolution? One message is ideal; more than three signals friction.
  4. Trust behavior: Did the user accept the AI's answer, verify it elsewhere, or reject it?
  5. Follow-up pattern: Did the user come back with a related question within 24 hours? (Indicates incomplete resolution.)
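
To make the coding concrete, here is a minimal sketch of what a schema like this might look like in Python. The field names, enums, and the 0.8 confidence threshold are illustrative, not our production values:

```python
from dataclasses import dataclass
from enum import Enum

class Resolution(Enum):
    FULL = "full"
    PARTIAL = "partial"
    NONE = "none"

class TrustBehavior(Enum):
    ACCEPTED = "accepted"   # user acted on the answer
    VERIFIED = "verified"   # user checked it elsewhere first
    REJECTED = "rejected"   # user dismissed the answer

@dataclass
class CodedInteraction:
    """One interaction coded across the five dimensions above."""
    interaction_id: str
    intent: str                # what the user needed, not what they typed
    resolution: Resolution
    message_count: int         # more than three signals friction
    trust: TrustBehavior
    followup_within_24h: bool  # a related question soon after suggests incomplete resolution

def needs_human_review(coded: CodedInteraction, classifier_confidence: float) -> bool:
    """Route low-confidence or high-friction interactions to the human-reviewed sample."""
    return classifier_confidence < 0.8 or coded.message_count > 3
```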

This analysis took three weeks and involved two product team members reviewing coded samples daily. It was the most labor-intensive research effort of the year and the most valuable. The findings directly informed four product changes that collectively moved satisfaction from 4.3 to 4.7 out of 5.

What Are the 5 Things Users Actually Want From AI?

1. Direct answers to their specific question

This was the number one finding by a wide margin. In 78% of interactions, the user asked a specific question and wanted a specific answer. Not a tutorial. Not a list of caveats. Not a comprehensive overview of the topic. Just the answer.

Example: A user asks "Can I deduct my home office if I'm W-2?" The right answer is one sentence: "Generally no, the home office deduction is available to self-employed taxpayers, not W-2 employees, under current tax law." The wrong answer is a 500-word explanation of home office deduction history, exceptions, and related deductions. Users rated the short, direct answer 0.8 points higher in satisfaction than the comprehensive one.

We called this the "just answer my question" insight, and it drove a major prompt redesign. See prompt engineering as product design for how we implemented it.
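
For illustration, the core of that redesign can be expressed as a small set of system-prompt constraints. This is a simplified sketch, not our production prompt wording:

```python
# A simplified sketch of the "just answer my question" constraint.
# The wording is illustrative; the production prompt differs.
DIRECT_ANSWER_INSTRUCTIONS = """\
Answer the user's specific question first, in one to three sentences.
If the question has a clear yes/no answer, lead with it.
Do not add background, history, or a tutorial unless the user asks.
Include at most one caveat, and only if omitting it would mislead.
"""
```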

2. Honesty about what the AI does not know

The second most valued behavior was the AI admitting uncertainty. Users rated interactions where the AI said "I'm not confident about this" an average of 4.4/5, compared to 3.1/5 for interactions where the AI gave a wrong answer confidently. Users can forgive ignorance. They cannot forgive false confidence.

This finding confirmed the trust design work we had been doing, which is fully described in designing for 4.7/5 satisfaction. Honesty was not a nice-to-have. It was the second most important driver of user satisfaction.

3. A clear path when the AI cannot help

When the AI could not answer a question, users did not want to be abandoned. 67% of users who hit a dead end (no answer, no alternative, no escalation) gave the interaction a 1 or 2 out of 5. But when the same "I can't answer this" was paired with a handoff to a human expert, the rating jumped to 4.1/5. The resolution path mattered more than the resolution itself.
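
The pattern is simple to implement. A sketch, where escalate_to_expert() stands in for whatever handoff mechanism you have (ours routed to a licensed professional):

```python
from dataclasses import dataclass

@dataclass
class Ticket:
    id: str

def escalate_to_expert(question: str) -> Ticket:
    """Stand-in for a real handoff: create a case for a human expert."""
    return Ticket(id="TX-1042")  # placeholder

def respond_when_unanswerable(question: str) -> str:
    """Never leave a dead end: pair 'I can't answer this' with a specific next step."""
    ticket = escalate_to_expert(question)
    return (
        "I can't give you a confident answer on this. "
        f"I've passed your question to a tax professional (case {ticket.id}); "
        "you'll hear back within one business day."
    )
```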

4. Consistency across interactions

Users who asked the same question twice and got different answers rated both interactions poorly, even if one answer was better. Inconsistency reduced trust scores by 1.3 points on average. Users wanted to know that the AI's personality, tone, and knowledge were stable. They wanted a reliable advisor, not a slot machine.

This was one of the hardest problems to solve because LLMs are inherently non-deterministic. We addressed it through prompt design (explicit tone and format constraints), temperature management (lower temperature for factual answers, slightly higher for exploratory guidance), and response validation (a lightweight classifier that flagged responses inconsistent with previous answers to the same user).
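
A sketch of the second and third pieces. The temperature values and the 0.75 similarity threshold are illustrative, and this simplified check compares the new answer against earlier answers the same user received for the same question:

```python
import numpy as np

def temperature_for(question_type: str) -> float:
    """Lower temperature for factual answers, slightly higher for exploratory guidance."""
    return {"factual": 0.1, "guidance": 0.5}.get(question_type, 0.3)

def _cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def flag_inconsistent(new_answer_emb: np.ndarray,
                      prior_answer_embs: list[np.ndarray],
                      threshold: float = 0.75) -> bool:
    """Flag a response that has drifted from earlier answers to the same question.

    prior_answer_embs holds embeddings of previous answers this user received
    for the same (or near-identical) question.
    """
    if not prior_answer_embs:
        return False
    return max(_cosine(new_answer_emb, e) for e in prior_answer_embs) < threshold
```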

5. Respect for their time

Users strongly preferred shorter responses. The ideal response length was 2-4 sentences for factual questions and 1-2 paragraphs for guidance questions. Responses longer than 3 paragraphs had a satisfaction penalty of 0.5 points regardless of accuracy. Users read the first paragraph. They skimmed the second. They ignored the third.

| What Users Want | % of Interactions Where This Mattered | Satisfaction Impact When Present |
| --- | --- | --- |
| Direct specific answer | 78% | +0.8 points |
| Honest uncertainty | 22% (when AI was uncertain) | +1.3 points vs. false confidence |
| Clear escalation path | 11% (when AI couldn't resolve) | +2.1 points vs. dead end |
| Consistency | 100% (latent expectation) | -1.3 points when violated |
| Respect for time (brevity) | 100% (latent preference) | -0.5 points when responses too long |

What Were the 3 Things We Thought Users Wanted but Didn't?

1. Personalization at the greeting level

We spent two weeks building a personalized greeting system. "Hi Sarah, based on your W-2 from last year, here's what I can help with today." Users were indifferent. Personalized greetings had zero measurable impact on satisfaction, engagement, or retention. Users did not come to the AI to be greeted. They came to get answers. The greeting was friction, not delight.

We removed the personalized greetings and replaced them with a simple "How can I help?" The saved screen space and reduced cognitive load were more valuable than the personality. Lesson: personalization that does not serve the task is decoration.

2. Proactive suggestions

We built a feature where the AI would proactively suggest topics: "Did you know you might qualify for the earned income tax credit?" Users actively disliked this. Proactive suggestions had a -0.3-point impact on satisfaction when they appeared mid-conversation. Users interpreted them as the AI changing the subject rather than answering their question.

The exception: proactive suggestions at the end of a resolved interaction were slightly positive (+0.1). Once the user's question was answered, a gentle "You might also want to know about..." was acceptable. But only after the primary need was fully resolved.
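
In code, the rule we landed on is a one-line gate, sketched here with hypothetical names:

```python
def maybe_append_suggestion(answer: str, resolved: bool, suggestion: str | None) -> str:
    """Proactive suggestions only after the primary need is fully resolved."""
    if resolved and suggestion:
        return f"{answer}\n\nYou might also want to know about {suggestion}."
    return answer
```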

3. Conversational tone

Our early prompts gave the AI a warm, conversational personality. "Great question! Let me walk you through this." Users in a tax context did not want warmth. They wanted competence. When we switched to a direct, professional tone, satisfaction increased by 0.4 points. The AI that sounded like a knowledgeable professional outperformed the AI that sounded like a friendly assistant.

This is domain-dependent. In a consumer entertainment or creativity product, warmth probably helps. In a professional services context where users are anxious about money and compliance, warmth reads as frivolous. Know your domain.

| What We Assumed Users Wanted | Actual Impact on Satisfaction | What We Did Instead |
| --- | --- | --- |
| Personalized greetings | +0.0 (no impact) | Simple "How can I help?" prompt |
| Proactive suggestions mid-conversation | -0.3 (negative) | Suggestions only after resolution |
| Warm conversational tone | -0.4 vs. professional tone | Direct, competent, professional tone |

What Is the Expectation Gap and How Do You Close It?

The expectation gap is the difference between what users expect from an AI product before using it and what they actually experience. Across our user base, we measured this gap through pre-use surveys (what do you expect?) and post-use surveys (what did you experience?).

The three largest expectation gaps:

  1. Speed expectation: Users expected near-instant responses (under 2 seconds). Our average was 3.8 seconds. This gap closed after the first 3-4 interactions as users calibrated. Users who used the product 5+ times rated speed satisfaction 0.6 points higher than first-time users, despite the same response time.
  2. Perfection expectation: 44% of first-time users expected the AI to be right 100% of the time. After 10 interactions, that expectation dropped to "right most of the time, honest when it's not." The calibrated expectation was actually better for satisfaction because it aligned with reality.
  3. Scope expectation: Users expected the AI to handle any tax question, including ones that require a professional license to answer. Setting boundaries clearly ("I can help with general tax information, but for specific filing advice, I'll connect you with a licensed professional") reduced this gap and increased trust. Unbounded AI felt less trustworthy than bounded AI.
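
Measuring the gap is straightforward once both surveys use the same dimensions. A minimal sketch (the scores below are illustrative, not our survey data):

```python
def expectation_gap(pre: dict[str, float], post: dict[str, float]) -> dict[str, float]:
    """Per-dimension gap: expected (pre-use survey) minus experienced (post-use survey).
    Positive values mean the product under-delivered on that dimension."""
    return {dim: pre[dim] - post[dim] for dim in pre}

gaps = expectation_gap(
    pre={"speed": 4.8, "perfection": 4.9, "scope": 4.6},
    post={"speed": 4.1, "perfection": 4.4, "scope": 4.0},
)
# gaps ≈ {"speed": 0.7, "perfection": 0.5, "scope": 0.6}
```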

The practical implication: your onboarding should calibrate expectations, not inflate them. If your marketing says "AI-powered tax expert" and the product says "I can't answer that," you have created a satisfaction problem before the user types their first question.

How Did These Findings Change the Product?

We made four changes based on the 25K interaction analysis, and each one moved the needle:

  1. Response length limits: We added a hard constraint to prompts capping factual answers at 3 sentences and guidance answers at 2 paragraphs. Satisfaction increased 0.3 points.
  2. Tone redesign: Professional, direct, zero filler. Satisfaction increased 0.4 points.
  3. Escalation redesign: Every "I can't answer this" included a specific next step (human expert, document upload, or external resource). Dead-end rate dropped from 11% to 2.3%.
  4. Consistency monitoring: A lightweight classifier flagged responses that contradicted previous answers. Inconsistency rate dropped from 7% to 1.8%.
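
As an illustration of change #1, the length limit can also be backstopped outside the prompt. This sketch uses deliberately naive sentence and paragraph splitting; a production version would be more careful:

```python
def enforce_length(answer: str, question_type: str) -> str:
    """Backstop the prompt-level limits: 3 sentences for factual answers,
    2 paragraphs for guidance answers."""
    if question_type == "factual":
        sentences = [s for s in answer.split(". ") if s]
        if len(sentences) > 3:
            return ". ".join(sentences[:3]).rstrip(".") + "."
    else:
        paragraphs = [p for p in answer.split("\n\n") if p.strip()]
        if len(paragraphs) > 2:
            return "\n\n".join(paragraphs[:2])
    return answer
```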

Combined, these changes moved overall satisfaction from 4.3 to 4.7. None involved changing the underlying model. None required more computing power, more data, or more engineering headcount. They required understanding what users actually wanted, which required looking at the data instead of assuming.

The broader lesson for anyone building AI products: your users are telling you what they want in every interaction. The question is whether you have the instrumentation to hear them and the discipline to act on what they are saying rather than what you wish they were saying. For how we built the trust layer that supported these findings, see designing for 4.7/5 satisfaction. For how the transparency patterns worked in our assignment system, see building trust in AI decisions. And for the YC context that shaped our iteration speed, see the YC playbook in regulated industries.

Frequently Asked Questions

How do you collect this interaction data without violating user privacy?

All interaction analysis was conducted on anonymized, aggregated data. Individual interactions were reviewed only by authorized team members under our data handling policy. We classified interaction types and satisfaction patterns at a cohort level, not an individual level. For the human-reviewed sample of 3,000 interactions, all personally identifiable information was stripped before review.

Are these findings specific to tax products or do they generalize?

The "just answer my question" finding, the preference for honesty about uncertainty, and the preference for brevity generalize broadly. I have seen similar patterns in AI products across healthcare, legal, and financial services. The finding about conversational tone is domain-specific: in professional contexts, directness wins; in consumer contexts, warmth may still add value. Always validate with your own users.

How many interactions do you need before you can draw reliable conclusions?

For broad patterns (like the preference for direct answers), we saw stable signal after about 5,000 interactions. For more nuanced findings (like the satisfaction impact of specific tone changes), we needed 10,000-15,000 interactions to reach statistical significance. The 25,500 interactions gave us high confidence across all findings. If you have fewer than 5,000, focus on qualitative research through direct user interviews to supplement the quantitative data.

What tools did you use for interaction analysis?

We built a lightweight internal tool that combined automated intent classification (using embeddings and a simple classifier) with a human review interface for the sampled subset. The automated classification handled volume; the human review handled nuance. Total engineering investment was about two weeks for the tooling, plus three weeks for the analysis itself. For most teams, a spreadsheet and a tagging system would work for the first pass.
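
For reference, "embeddings and a simple classifier" can be as small as the sketch below. Here embed() is a stand-in for whatever embedding model you use, and the training examples are invented for illustration:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def embed(texts: list[str]) -> np.ndarray:
    """Stand-in: replace with calls to a real embedding model."""
    rng = np.random.default_rng(0)  # placeholder vectors only
    return rng.normal(size=(len(texts), 384))

train_texts = [
    "can I deduct my home office",
    "when is the filing deadline",
    "is my side gig income taxable",
]
train_labels = ["deductions", "deadlines", "income"]

clf = LogisticRegression(max_iter=1000)
clf.fit(embed(train_texts), train_labels)

# With real embeddings, this routes new questions to intent buckets:
print(clf.predict(embed(["is my home office deductible"])))
```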