I’ve been running a series of informal experiments with my team to understand how we should advise people on using AI tools like ChatGPT, Claude, and Gemini in our work.

In our last session, I gave everyone the same task: ask your AI tool to summarize a single article in plain language. Every person got a factually accurate summary, but the style and depth of analysis varied widely from person to person. We tested this further by entering the same prompt multiple times. Each time, the output was slightly different. We copied and pasted everything into a shared document so we could compare side by side.

Later that day, I asked everyone to reopen the tool they’d used that morning and repeat the prompt that had worked best: summarize this article in plain language. The outputs shifted again. Some changed dramatically. One team member whose morning summary had been light and conversational received a much more formal, serious version in the afternoon.

I then asked Claude to compare the morning and afternoon versions of one summary. Its reply: “In short, the morning version emphasizes practical implications and accessibility; the afternoon version prioritizes completeness and specificity. They complement each other well — someone reading both would get a fuller picture than either one alone provides.”

Why would Claude produce these differences despite the same source material and prompt?

I asked the source.

Even with the same prompt and source material, LLMs don't produce identical outputs each time. This is by design: there's a degree of randomness in how the model selects each word, controlled by a setting called "temperature," which means each run takes a slightly different path through the text.
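For the curious, here's a minimal sketch in Python of how temperature-scaled sampling works. The vocabulary and scores are invented, but the mechanism is the standard one: divide the model's scores by the temperature, convert them to probabilities, and sample. A temperature near zero makes the top choice dominate (nearly the same output every run); higher temperatures let runner-up words through more often, which is why identical prompts can yield different wording.

```python
import numpy as np

def sample_next_token(logits, temperature=1.0, rng=None):
    """Sample one token index from temperature-scaled scores.

    Near-zero temperature approaches greedy decoding (same pick every run);
    higher temperature flattens the distribution (more run-to-run variety).
    """
    rng = rng or np.random.default_rng()
    scaled = np.asarray(logits, dtype=float) / max(temperature, 1e-8)
    probs = np.exp(scaled - scaled.max())   # numerically stable softmax
    probs /= probs.sum()
    return rng.choice(len(probs), p=probs)

# Toy scores for the word that follows "The article ..." (invented numbers).
vocab = ["argues", "explains", "suggests", "claims"]
logits = [2.0, 1.8, 1.2, 0.5]

for run in range(3):  # same prompt, same scores, potentially different words
    word = vocab[sample_next_token(logits, temperature=0.8)]
    print(f"run {run + 1}: The article {word} ...")
```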

But the differences here go beyond random word choices. The morning version landed on a journalistic structure while the afternoon version landed on an academic one. That likely reflects what the paper calls sensitivity to framing and order effects — one of the cognitive biases it catalogs. Small, invisible differences in how the model’s internal processing unfolds at the start of generation can cascade into meaningfully different organizational choices. Once the morning version opened with a news-style framing, the rest followed that pattern. Once the afternoon version opened with a descriptive academic tone, it followed that thread instead.
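To make that cascade concrete, here is a deliberately oversimplified sketch in Python (the openings and continuations are invented): only the first word is chosen at random, yet the whole output inherits the style of that choice, much as a real model conditions every later token on the text it has already written.

```python
import random

# A deliberately tiny "model": each opening word maps to one fixed
# continuation. Both openings and continuations are invented for illustration.
continuations = {
    "Breaking": "Breaking research offers practical takeaways for everyday readers.",
    "This": "This paper systematically catalogs cognitive biases observed in LLMs.",
}

def generate_summary():
    # Only the first word is sampled; everything after it is conditioned on
    # that choice, the way later tokens in a real model depend on the text
    # already produced.
    first_word = random.choice(list(continuations))
    return continuations[first_word]

print("morning:  ", generate_summary())
print("afternoon:", generate_summary())
```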

This is actually a concrete example of the robustness issues the paper identifies. The underlying reasoning about the paper’s content is consistent across both summaries — neither gets the facts wrong. But the presentation varies in ways that weren’t requested, which means someone relying on just one version would get a subtly incomplete picture depending on which run they happened to see.

It’s a useful reminder that LLM outputs are one plausible version of a response, not the definitive one.

It’s important to understand how these tools work. What looks like an authoritative answer is really just one of many possible outputs the system could have generated. The text will sound plausible whether it’s accurate or not, and you may get different results each time you ask.
