Should AI Tell You When It's Guessing?

Role: Principal Investigator · XandY, 2026 · Survey N = 215 · Experiment N = 778
Survey Methodology · Experimental Design · Behavioral Measures · Thematic Analysis · Stated vs. Revealed Preference · Logistic Regression · AI / LLM UX · Uncertainty Communication
Executive Summary

When a user asks for information, should a chat-based AI tool provide a “confidence indicator” that shows how certain it is? To investigate this, I first ran a national survey measuring what people say they would want/do with such a feature. Then I conducted a randomized experiment to test how people actually respond to it.

Four key findings came out of this series of studies:

  1. Most people say they would love a confidence indicator, and nearly all say they would externally verify low-confidence information. In Study 1, participants rated a confidence indicator as highly useful, and 83% of participants placed themselves at the maximum point of the scale of likelihood (“very likely”) to externally verify information that was flagged as low-confidence. However, Study 2’s controlled behavioral experiment shows this is a dramatic overestimate (only 30% clicked to verify flagged low-confidence information).
  2. Flagging inaccurate info with a low-confidence indicator does increase verification rates and knowledge accuracy. In conditions where the LLM gave incorrect information, the confidence indicator raised the rate of external verification (clicking to double-check the claims with an official source) by 13 percentage points, from 17.3% to 30.2%. Users who clicked to verify also scored much higher on the knowledge test.
  3. Despite its objective benefits, seeing a low-confidence flag significantly lowers users’ ratings of the tool. Study 1’s qualitative data overwhelmingly showed that confidence indicators are thought to be useful, and Study 2’s quantitative experiment showed they significantly improve knowledge accuracy (via verification behavior). But in the experiment, participants who saw an indicator flagging low confidence still ended up with significantly lower satisfaction, lower accuracy perceptions, and lower reported likelihood to use this AI tool again. So, flagging low-confidence answers benefits verification behavior and belief accuracy, but it also carries costs in users’ subjective experience.
  4. This feature has strong potential value, but needs a strategic rollout. There may be ways to capture the benefits of this feature while mitigating the costs. If users are introduced to this feature with the right framing, they may be able to see low-confidence flags as a good thing that helps protect them rather than a frustrating sign of the tool’s limits.

Why This Question Matters

Most LLM tools respond in a confident tone regardless of whether they are pulling from solid sources or making aggressive inferences from scant evidence. Users usually can’t tell which is happening under the hood. As these tools move deeper into everyday work, that difference has increasingly high stakes.

One common proposal is to have the tool provide a confidence rating next to each response, such as a score from 1 (low confidence) to 10 (high confidence) that depends on the level of direct, clear, and verifiable evidence in support of its response.

That sounds like a simple feature to ship, but it’s actually pretty unclear how users would respond. In my dissertation research and beyond, I’ve published several peer-reviewed studies on how people interpret and respond to uncertainty. Disclosing uncertainty can do some strange things: some types prompt positive responses like increased trust and credibility, while other types prompt negative responses like increased skepticism and backlash (for a short summary, see How to Communicate Uncertainty Without Losing Credibility).

But it’s pretty important to get this right. This feature could change how hundreds of millions of people read and respond to AI output. If done right, it could simultaneously prompt more judicious and responsible user behaviors, while also building stronger trust in AI overall! So let’s dive in.

A Note on Self-Report vs Actual Behavior

If we ask people whether they want a confidence indicator (and why) and whether they would use it, we can get some interesting information. But we need to keep in mind that it’s cheap and easy for survey participants to say “yes I’d want that” and “yes I’d definitely adjust my decisions based on it.” This is why it’s important to observe what users actually do.

So, I tested both. Here, Study 1 is a survey that captures stated demand for a “confidence indicator” feature, how people rank the feature against other things an assistant could add, and how they say they would act if they were given an answer that was flagged as “low-confidence.” Then, Study 2 is a controlled experiment with a behavioral component. It measures whether people actually decide to verify answers when they are flagged as “low confidence” and also how the use of a confidence indicator affects both the accuracy of beliefs and the subjective ratings of the tool.

As we go through, note the contrast between the two studies. The survey tells us what users think they want or do, while the experiment gives us a look at what actually happens.

Study 1 · Survey: What People Say They Would Want/Do

The first study was a short survey run on Prolific. After some data cleaning and quality screening, it covers 215 respondents, 86% of whom use AI tools at least once per week. The survey first measured how a confidence indicator ranks in perceived usefulness compared to other potential features, then asked why it would or would not be useful, and finally asked how likely respondents would be to externally verify information that was flagged as “low confidence.”

Most people say confidence indicators would be very useful

When asked directly how useful a confidence rating would be on a five-point scale from “not at all” to “extremely useful,” the answers skewed strongly positive. Two thirds of participants rated it at the top end (40% “very useful” and 26% “extremely useful”). Another 27% called it “moderately useful.” And only 8% in total picked “only a little” or “not at all useful.” These are overwhelmingly positive responses to a feature that they haven’t even seen or tried yet.

But it’s just too easy for a participant to rank features as maximally useful! So I decided to force them to make tough choices by ranking five possible additions to an AI tool against each other. This is often a good strategy for mitigating demand effects or acquiescence bias.

Chart: How people ranked five possible features (1 = most valued, 5 = least). Each panel is one feature; the counts show how many of 214 people placed it at each rank, listed from rank 1 to rank 5.
  Source citations: 63, 58, 42, 26, 25
  Confidence indicator: 53, 67, 39, 38, 17
  Memory of prior chats: 60, 35, 51, 34, 34
  Faster responses: 34, 31, 46, 42, 61
  Suggested follow-up questions: 4, 23, 36, 74, 77
Figure 1. Forced-choice ranking of five possible AI features, one small panel per feature, showing how many of 214 respondents placed each at rank 1 through 5. The confidence indicator is highlighted.

In the forced-rank measure, the confidence indicator landed near the top. Figure 1 shows that its rank distribution is front-loaded, with most people placing it first or second and only a handful ranking it last. It tied with source citations as the feature with the highest % of respondents (56%) ranking it in their top 2 places. This is still strong positive feedback about this proposed feature.
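The top-2 arithmetic behind that tie can be checked directly from the counts printed on the Figure 1 panels. The sketch below is only an illustration: the dictionary and the helper names (`top_two_share`, `last_place_share`) are mine, not part of the original analysis. With these counts, source citations come out at about 56.5% and the confidence indicator at about 56.1%, which is consistent with the "tied at roughly 56%" reading.

```python
# Rank counts as printed on the Figure 1 panels (rank 1 through rank 5).
rank_counts = {
    "Source citations": [63, 58, 42, 26, 25],
    "Confidence indicator": [53, 67, 39, 38, 17],
    "Memory of prior chats": [60, 35, 51, 34, 34],
    "Faster responses": [34, 31, 46, 42, 61],
    "Suggested follow-up questions": [4, 23, 36, 74, 77],
}


def top_two_share(counts):
    """Share of respondents who placed the feature at rank 1 or rank 2."""
    return (counts[0] + counts[1]) / sum(counts)


def last_place_share(counts):
    """Share of respondents who placed the feature at the bottom rank."""
    return counts[-1] / sum(counts)


for feature, counts in rank_counts.items():
    print(
        f"{feature:31s} "
        f"top-2: {100 * top_two_share(counts):5.1f}%   "
        f"last: {100 * last_place_share(counts):5.1f}%"
    )
```

Top-2 share is only one way to summarize a forced ranking; it is used here because it is the statistic the text reports.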

Most people say they would externally verify flagged responses

I also presented participants with a specific scenario in which an AI answer is labeled “2/10 - Low Confidence.” I then asked them how likely they would be to verify it against an external, official source before relying on it.

Chart: 92% said they would verify a low-confidence answer. Very likely to verify: 83%; Somewhat likely: 9%; Unsure: 2%; Somewhat unlikely: 2%; Very unlikely: 4%.
Figure 2. Stated likelihood of verifying an answer labeled “2/10 - Low Confidence” before relying on it.

Figure 2 shows that the response was nearly unanimous, as 92% said they would be likely to verify (83% “very likely”; 9% “somewhat likely”). Taken at face value, the self-report data says that a low-confidence flag is useful and effective. It appears that if you just warn people about tenuous inferences, they will find it useful and 83% of them will be “very likely” to verify with an external source. But keep those optimistic numbers in mind for when we move on to Study 2.

The reasons why

Next, I gave an open-ended question to people who said this feature would be useful. This question asked them to explain in their own words why they think a confidence indicator would be useful. Using these qualitative responses, I conducted a thematic analysis and coded them into seven main themes, letting each response carry more than one theme. The treemap in Figure 3 sizes each of these categories by how many responses expressed it.

Figure 3. Themes in the 197 “why useful” open-ended responses. Each tile is sized by how many responses expressed that theme; because a response can carry more than one, the shares sum to more than 100%.

Two things stand out from the qualitative coding. The single most common reason, present in nearly half (48%) of responses, is triage. People want the rating as a signal for when to give further scrutiny to an answer. The second largest cluster is related but inverted. In 26% of responses, people said they would want a confidence indicator precisely because it would let them trust the AI more and check less. Both of these leading themes come from the same root. We’ve known for a long time in cognitive psych that our brains are efficiency-maximizing prediction machines. We don’t mind working hard if it’s needed, but we’d also love to conserve energy when possible. Applied here, a confidence indicator is a cue that tells us when further analysis is needed versus when we can relax and just trust what we’re seeing.

It’s also worth noting that a meaningful minority (11%) were openly skeptical even while rating the feature useful. They pointed out that an AI assessing its own confidence is still just the AI making fallible judgments.

Study 1 leaves us with a clear picture. People (say they) want the feature, rank it as highly useful, and almost without exception (say they) would take action if they received a low-confidence warning. The obvious next question is whether any of that survives contact with a live decision environment.

Study 2 · Experiment: What People Actually Do

Study 2 was a 2×2 between-subjects experiment with 970 participants recruited on Prolific (778 after data cleaning). Everyone was paid a baseline amount for their time, but to incentivize careful information gathering they were told they would earn a cash bonus for each correct answer on a knowledge quiz at the end of the study.

Participants were presented with a simulated conversation thread between them (the user) and an LLM interface analogous to ChatGPT, Claude, etc. The conversation showed the user asking a simple factual question and the LLM providing a response. To guard against bias from any specific topic, I randomized the topics so that each participant saw information about either vitamin D and sleep, remote work and productivity, or earthquake prediction.

Figure 4. An example stimulus (inaccurate condition): the simulated conversation, and the confidence assessment, which gave an overall 5/10 and flagged the false dosage-and-timing claim at 2/10 with a recommendation to verify.

The main experimental manipulation occurred by randomizing participants into one of four conditions that were formed by crossing two factors:

Factor 1: Response Accuracy

Half of participants saw a fully accurate AI response. The other half saw a response where one of two key claims was objectively wrong in a way that is crucial to the topic but still sounds plausible.

Factor 2: Confidence Indicator

Half of participants saw a confidence indicator next to the LLM’s response, while the other half saw only the response. The confidence indicators gave an overall score and a breakdown of the confidence of subcomponents of the response. In the “accurate + indicator” condition, the response was rated “9/10” while in the “inaccurate + indicator” condition the response was rated “5/10” and was flagged for verification (see example above in Figure 4).

Figure 5. Example authoritative secondary source: participants in the Vitamin D condition could open an NIH fact sheet to check the answer.

The Verification Decision

After reading the response, each participant chose what to do next. They could either go straight to the knowledge quiz (to try to earn their bonus cash), or they could first open a secondary source to check the LLM’s claims. They were told that checking the external source would likely cost them about two extra minutes. That tradeoff mirrors the real decision people make whenever they consider fact-checking an AI answer.

That choice—verify or not—was the primary outcome measure. It is a real behavior (not a survey response) that is made under competing motives of costs and benefits, and it answers the question product teams care about: does the indicator actually get people to check the AI’s work?

As shown in the example in Figure 5, the secondary source was written so that anyone in an inaccurate condition who checked would find the correct information from an authoritative entity. This also gave them an advantage on the quiz, so the decision to verify would have real downstream consequences for their cash payout.

After the quiz, participants rated their satisfaction, willingness to reuse the AI, perceived accuracy, and how useful a confidence feature would be in general. The full study design is visualized in Figure 6.

Experimental design: 2 × 2 between-subjects with stimulus sampling (3 topics)

  1. Prolific sample: N = 970 recruited (778 after quality filter).
  2. Demographics and AI usage: age, gender, income, education, AI tool use.
  3. Task instructions: told to learn from the AI response to prepare for a quiz. Cash bonus awarded for each correct answer.
  4. Randomization into one of four conditions:
     Condition 1: Accurate response, no indicator (n = 214)
     Condition 2: Accurate response + confidence indicator (n = 179)
     Condition 3: Inaccurate response, no indicator (n = 196)
     Condition 4: Inaccurate response + confidence indicator (n = 189)
  5. Stimulus sampling, with topic assigned within each condition:
     Vitamin D & Sleep: whether vitamin D supplementation improves sleep quality
     Remote Work & Productivity: whether remote workers are more or less productive than in-office workers
     Earthquake Prediction: whether scientists can reliably predict earthquakes in advance
  6. Participant reads the AI response: 2 factual claims per topic (1 flipped in inaccurate conditions).
  7. Primary DV, the verification decision: "Proceed to quiz" vs. "Check a secondary source first" (~2 min cost). 79% of participants did not verify; 21% verified and read the secondary source to fact-check the initial response.
  8. Secondary DV, the comprehension quiz: 2 questions per topic (cash bonus for correct answers).
  9. Self-report DVs, the attitudinal measures: satisfaction, willingness to reuse, perceived accuracy, feature usefulness (all 1-5 Likert).
Figure 6. The full Study 2 design: a 2×2 of response accuracy and confidence indicator, with topic rotated, the verification decision as the primary outcome, then the quiz and self-report ratings.
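The assignment step in this design can be sketched in a few lines. This is an illustration only: it assumes simple, independent randomization of each factor and of the topic, which the write-up does not confirm (the survey platform may have used a balanced or blocked scheme instead). The names `assign`, `ACCURACY`, `INDICATOR`, and `TOPICS` and the seed are hypothetical.

```python
import random
from collections import Counter

ACCURACY = ("accurate", "inaccurate")
INDICATOR = ("no indicator", "indicator")
TOPICS = (
    "vitamin D and sleep",
    "remote work and productivity",
    "earthquake prediction",
)


def assign(rng):
    """Independently draw one level of each factor, plus a topic (assumed scheme)."""
    return rng.choice(ACCURACY), rng.choice(INDICATOR), rng.choice(TOPICS)


rng = random.Random(2026)  # fixed seed so the illustration is reproducible
assignments = [assign(rng) for _ in range(970)]  # 970 = recruited sample size

# Tally the four accuracy-by-indicator cells, ignoring topic.
cell_sizes = Counter((accuracy, indicator) for accuracy, indicator, _ in assignments)
for cell, n in sorted(cell_sizes.items()):
    print(cell, n)
```

Under simple randomization the four cells will not be exactly equal in size, which is at least compatible with the unequal post-cleaning cell sizes reported in Figure 6.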

Finding #1. A low-confidence flag drove people to verify.

Across all conditions, about 21% of participants chose to fact-check the response by clicking into the secondary source. The chart in Figure 7 shows the verification rates in each pair of conditions (accurate responses and inaccurate responses). The left pair isolates the effect of adding a 9/10 flag to a correct answer, and the right pair isolates the effect of adding a 5/10 flag to an incorrect answer.

Chart: % who chose to verify (whiskers = 95% CI; overall 21.2%). Accurate response: 16.8% with no flag, 21.2% with 9/10 shown; 9/10 flag: +4.4 pts (n.s.). Inaccurate response: 17.3% with no flag, 30.2% with 5/10 shown; 5/10 flag: +12.8 pts (p = .003).
Figure 7. Verification rate by condition, with 95% confidence intervals. The score was calibrated: 9/10 on the accurate answer, 5/10 on the inaccurate one with subcomponents of 8/10 and 2/10.

In terms of verification rates, the low-confidence flag did exactly what it is supposed to do. It raised verification by 13 points (from 17.3% to 30.2%), an effect that is both practically meaningful and statistically significant (p = .003). As you might expect, the high-confidence flag did much less for verification, because users are less motivated to verify an answer that is already rated high confidence. The high-confidence indicator nudged verification up about 4 points, although this difference is not statistically significant (p = .27).
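For readers who want to sanity-check those two simple effects, here is a minimal sketch using a pooled two-proportion z-test. Two caveats: the z-test is swapped in for illustration and is not necessarily the test used in the study (logistic regression is listed among the methods), and the verifier counts are assumptions, back-calculated from the rounded percentages and the cell sizes in Figure 6. The function name is hypothetical.

```python
import math


def two_proportion_z_test(x1, n1, x2, n2):
    """Pooled two-sided z-test for the difference between two proportions."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p2 - p1) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail area
    return p2 - p1, z, p_value


# ASSUMED counts of verifiers, reconstructed from rounded percentages:
# inaccurate response: 34/196 (17.3%) with no flag vs. 57/189 (30.2%) with the 5/10 flag
# accurate response:   36/214 (16.8%) with no flag vs. 38/179 (21.2%) with the 9/10 flag
low_diff, low_z, low_p = two_proportion_z_test(34, 196, 57, 189)
high_diff, high_z, high_p = two_proportion_z_test(36, 214, 38, 179)

print(f"5/10 flag: {100 * low_diff:+.1f} pts, z = {low_z:.2f}, p = {low_p:.3f}")
print(f"9/10 flag: {100 * high_diff:+.1f} pts, z = {high_z:.2f}, p = {high_p:.3f}")
```

With these assumed counts the test lands close to the reported values (roughly p = .003 for the low-confidence flag and p = .27 for the high-confidence flag), though small differences from the original analysis are possible.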

It’s also interesting to see that in conditions with no confidence indicators, the inaccurate info had the same verification rates as the accurate info (17.3% versus 16.8%). Basically, when left to themselves, people did not catch the error. The extra verification appeared only when the 5/10 low-confidence flag was added.

One honest limit. The formal test of whether the low-confidence flag lifts verification more than the 9/10 flag, the indicator-by-accuracy interaction, does not reach significance (p = .22), because verification also drifted upward under the high-confidence flag and the study cannot cleanly separate the two simple effects. But the headline does not rest on that contrast. The low-confidence flag’s effect against its own no-flag baseline is large and significant on its own.
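The interaction test can be illustrated with a small worked calculation. In a logistic regression that includes both factors and their interaction (a saturated model for a 2×2 design), the interaction coefficient equals the difference between the two log odds ratios, and its Wald standard error is the square root of the summed reciprocal cell counts. The exact model specification is not given in this write-up, so treat this as an assumption, and note that the cell counts below are reconstructed from rounded percentages rather than taken from the raw data.

```python
import math

# ASSUMED (verified, did not verify) counts per cell, reconstructed from
# the rounded verification rates and the cell sizes shown in Figure 6.
cells = {
    ("accurate", "no flag"): (36, 178),
    ("accurate", "flag"): (38, 141),
    ("inaccurate", "no flag"): (34, 162),
    ("inaccurate", "flag"): (57, 132),
}


def log_odds(verified, skipped):
    """Log odds of verifying within one cell."""
    return math.log(verified / skipped)


# Simple effect of the flag within each accuracy level (log odds ratios).
flag_effect_accurate = (
    log_odds(*cells[("accurate", "flag")]) - log_odds(*cells[("accurate", "no flag")])
)
flag_effect_inaccurate = (
    log_odds(*cells[("inaccurate", "flag")]) - log_odds(*cells[("inaccurate", "no flag")])
)

# Interaction = how much larger the flag effect is when the answer is wrong.
interaction = flag_effect_inaccurate - flag_effect_accurate
se = math.sqrt(sum(1 / v + 1 / s for v, s in cells.values()))
z = interaction / se
p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided Wald test

print(f"interaction (log odds) = {interaction:.3f}, SE = {se:.3f}, "
      f"z = {z:.2f}, p = {p_value:.2f}")
```

Under these assumed counts the interaction p-value comes out near the reported .22, which shows how two simple effects can differ in significance without their difference being significant.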

Key Finding

The low-confidence flag effectively boosted verification behavior. In contrast, the same wrong answer with no flag caused no more verification than a correct answer, which is evidence that the flag itself produced the behavior. Overall, the flag did precisely what it should: it pushed people to fact-check specifically when there was an accuracy concern.

Finding #2. When people did verify, it boosted their knowledge scores.

Verifying paid off in the knowledge quiz scores. People who clicked to verify with a secondary source scored higher on the comprehension quiz (M = 1.78 out of 2) than people who did not (M = 1.28, p < .001). As one would expect, this gap was even more dramatic in the inaccurate conditions. Non-verifiers who saw wrong information averaged 0.61 out of 2 on the knowledge test, while verifiers in those same conditions averaged 1.71 out of 2.

Chart: mean quiz score (0–2); lighter bar = skipped verification, darker bar = clicked to verify. All conditions: 1.27 (skipped), 1.8 (verified). Inaccurate conditions only: 0.63 (skipped), 1.74 (verified).
Figure 8. Mean quiz score for people who verified versus people who didn't, with 95% confidence intervals. Checking helped overall, and helped a lot in the conditions where the AI was wrong.

Finding #3. The low-confidence flag harmed subjective ratings.

Even though the low-confidence indicator prompted people to verify and thereby increased their knowledge accuracy, it produced less positive attitudinal ratings of the tool overall. Figure 9 shows that users in the “inaccurate + indicator” condition (i.e., those who saw a 5/10 low-confidence flag) reported significantly lower satisfaction, willingness to reuse the tool, and perceived accuracy. Basically, people rated the tool more highly when they simply saw inaccurate information without a warning flag.

Chart: mean rating (1–5) by condition, listed in the order accurate, no indicator; accurate + indicator; inaccurate, no indicator; inaccurate + indicator.
  Satisfaction: 4.21, 4.2, 3.94, 3.41
  Willingness to reuse: 3.78, 3.7, 3.65, 2.97
  Perceived accuracy: 3.89, 3.88, 3.69, 3.17
Figure 9. Self-reported attitudes by condition, with 95% confidence intervals.

Finding #4. Users still say confidence indicators would be useful, regardless of which condition they were in.

Lastly, I ended Study 2 with the same measure used in Study 1 for assessing the perceived usefulness of a confidence indicator feature. The purpose was to see how useful people think confidence indicators would be right after a scenario in which they either were or were not given one.

Chart: "How useful would this feature be?" (1 to 5). Accurate, no indicator: 4.09; Accurate + indicator: 4.09; Inaccurate, no indicator: 3.94; Inaccurate + indicator: 3.97. Rated high in every condition (all 3.9 to 4.1).
Figure 10. Rated usefulness of a confidence feature, with 95% confidence intervals. It was rated high in every condition, including by people who had just experienced the indicator.

Across every condition, participants rated a confidence indicator feature as very useful (means between 3.9 and 4.1 out of 5), and no condition stood out as meaningfully different. People who had just used the indicator rated it about the same as people who never saw it.

Interestingly, those in the “inaccurate + indicator” condition who had rated the tool more negatively gave about the same usefulness ratings as everyone else. One way to interpret this is that people get frustrated when a tool gives them a low-accuracy answer even if it’s flagged, but overall they like the idea of confidence indicators in theory.

Main Takeaways

There are some important practical implications to unpack here.

Takeaway 1

Users overstate their likelihood to verify and their demand for confidence indicators. While 92% of Study 1 survey participants said they would be likely to verify a flagged low-confidence response, only 30% actually did when given the chance in the behavioral experiment.

Takeaway 2

A low-confidence flag can move verification behavior and knowledge accuracy. The 5/10 low-confidence flag significantly raised how often people verified the response, from 17.3% to 30.2%. Verification behavior also resulted in dramatically more accurate answers on the knowledge test.

Verification is a high-value action. People who checked the secondary source scored much better on the quiz, especially on wrong answers. Checking helps a lot; the bottleneck is getting people to do it, and a low-confidence flag moved a meaningful share of them, the first lever in this study that did.

Takeaway 3

Despite objective benefits, users get frustrated when inaccuracy is flagged. Even though it helped them identify incorrect info and score more highly on the test (and earn more cash), people who saw inaccurate answers being flagged ended up rating the tool much lower in terms of satisfaction, willingness to reuse, and accuracy.

Study 3?

Study 2’s main limitation is ecological validity. Paid participants were viewing a static mockup and were reading an AI tool’s answers to questions that they did not ask. This means they did not have an intrinsic, genuine incentive to pursue accuracy (other than the cash bonus for accurate quiz scores). In a real session, a person’s own motivation to get the answer right would be stronger. It’s hard to know how that would affect behavior and attitudes. It could be that real users in a real interface would be more appreciative of careful confidence calibrations if they are intrinsically motivated to get the right answer. Or it could be that real users would

Either way, this is a valuable feature to continue testing. With some strategic framing, perhaps a product team could introduce users to this feature in a way that emphasizes its benefits and mitigates some of the negative response to low-confidence flags. If users understand clearly that a low-confidence flag is actually the tool doing its job (and benefiting the user), then maybe we can arrive at a place where users verify low-confidence results AND reflect on that experience positively. Another potential angle for future development is to reduce the friction and cost of verification. Perhaps tools can integrate in-line source links, evidence summaries, or one-click fact-checking to streamline the verification process. That way, users may see the prompt to verify less as an inconvenient obstacle and more as a helpful enhancement.

The best next step is to run Study 3 in a live product setting with real users. Let’s do it!

Research Materials

I designed and ran both studies independently. Survey programming was done in Qualtrics, recruitment through Prolific, quantitative analysis in R and SPSS, and thematic coding of the open-ended responses by hand. The confidence-indicator stimuli, secondary sources, comprehension quizzes, and survey instrument were all developed specifically for this project.