Last November, I asked Claude to build a custom reference file summarizing some new, emerging writing preferences. Claude created the file, confirmed it was saved, and I moved on. In the months that followed, I kept running into a frustrating pattern: the LLM overrode my explicit instructions, and I assumed the problem was my prompting. I adjusted, rewrote instructions, created new custom skills, and kept having the same problem.
I finally caught a reference to the ghost skill in chat. Claude confirmed the failure, and I asked how I could prompt better to avoid this. Ultimately it’s not an issue of prompting, it replied. So I asked: how am I supposed to trust this tool when its failures are invisible? Claude’s own assessment: it is more reliable as a drafting tool than a knowledge tool, and the product’s marketed capabilities exceed what it can consistently deliver.
It turns out the skill file had been created in a temporary container filesystem that resets between sessions. It never persisted past the conversation where I built it, but the memory edit telling Claude to reference that file did persist, across every conversation, for months. Each new session started with an instruction to consult the ghost, crowding out the active skill resources I was editing in real time. When it couldn’t find the working file, it didn’t say so. It either worked around the failure silently or implied it was drawing on the new skill when it wasn’t. This is Claude’s explanation, so I can’t independently verify it. Ultimately nothing serious happened, but wow. The fix required me to delete a months-old, throwaway conversation in Claude’s memory that triggered the ghost. How would someone without a coding background recognize any of this?
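The mechanics of the ghost, as Claude described them, boil down to a persistent pointer aimed at ephemeral storage. A minimal sketch of that failure mode (all names and functions here are hypothetical illustrations, not Claude’s actual architecture):

```python
import os
import shutil
import tempfile

# Stand-in for memory that persists across sessions (hypothetical).
MEMORY = {}

def session_one():
    # The skill file lands in a temp directory that vanishes
    # when the session's container is torn down.
    workspace = tempfile.mkdtemp()
    skill_path = os.path.join(workspace, "writing_prefs.md")
    with open(skill_path, "w") as f:
        f.write("# My writing preferences\n")
    # But the *pointer* to it is written into persistent memory.
    MEMORY["skill_file"] = skill_path
    return workspace

def session_two():
    # A fresh session dutifully consults the persistent pointer...
    skill_path = MEMORY["skill_file"]
    if not os.path.exists(skill_path):
        # ...and when the file is gone, works around it silently
        # instead of surfacing the failure to the user.
        return "proceeding without skill file (no warning shown)"
    with open(skill_path) as f:
        return f.read()

workspace = session_one()
shutil.rmtree(workspace)  # simulate the container reset between sessions
print(session_two())
```

The point of the sketch is the asymmetry: the write to `MEMORY` outlives the session, the write to the filesystem does not, and nothing in the second session reconciles the two.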
This persistent gap between marketing and performance is a big deal. LLMs handle a striking range of tasks at “pretty good,” which is genuinely unprecedented for consumer software. But “good enough sometimes,” “always good,” “good enough for warfare,” and “good enough to replace American workers en masse” are different claims, and the gap between them widens fast when you apply the tools to real work in the real world, in non-standardized, non-optimized environments with non-optimal data. Ultimately LLMs are going to be good enough for some things and not others, the way tools always are.
Most consumer LLM tools right now are general-purpose engines wrapped in different marketing stories. The underlying models do roughly the same thing, so differentiation comes down to interface, user experience, and brand positioning. Consensus is building among users that Claude is the preferred model for writing and code, while Gemini has an edge on visuals, but this shifts with tastes and current events. All of it could change tomorrow and probably will.
In the meantime, Anthropic, OpenAI, and Google are burning extraordinary resources on compute and talent, pouring them into massive generalist tools rather than narrower applications of the technology, which can be quite elegant. A narrow tool, even a very good one, doesn’t support billion-dollar valuations, so the scope and the pitch have to be enormous to sustain the business model. The everything-app rhetoric isn’t strictly untrue (LLMs are “good enough” for a lot of reasonable things and very likely to keep improving), but the marketing front-loads the most impressive demonstrations and buries the failures, which happen often enough in practice to give any risk-averse businessperson real concerns about outsourcing the finer details of work that depends on accuracy and trust.