I asked Claude.ai for a plain language summary of this buzzy new research paper. Claude offers the following:
Researchers from Caltech and Stanford conducted the first comprehensive survey of how and why large language models (LLMs) — AI systems like ChatGPT or Claude — fail at reasoning tasks. Rather than focusing on what AI can do well, they cataloged and organized the ways it goes wrong.

Why it matters
LLMs are increasingly used in high-stakes settings, but their reasoning failures are often unpredictable and poorly understood. The researchers argue that systematically studying failures — like accident investigations in aviation — is essential for building more reliable AI.

The researchers created a framework with two dimensions:
• Type of reasoning: informal (intuitive, social), formal (logic, math), and embodied (physical, spatial)
• Type of failure: fundamental flaws built into the architecture; domain-specific weaknesses; and robustness problems (inconsistent performance when small details change)

Informal/intuitive reasoning
LLMs exhibit human-like cognitive biases — confirmation bias, anchoring, framing effects — but without the human ability to recognize and correct for them. They also struggle with “theory of mind” (understanding what others believe or intend), and with applying consistent moral or ethical reasoning.

Formal/logical reasoning
LLMs often can’t reverse simple logical relationships (if they know “A is B,” they may not infer “B is A”). They struggle to chain multiple reasoning steps together. Basic counting and arithmetic fail in ways that seem surprising given their other capabilities.

Embodied/physical reasoning
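The reversal failure can be pictured with a toy analogy: a system that only indexes facts in the direction it saw them can answer forward lookups but not reversed ones. This is a minimal illustrative sketch (the names are hypothetical, and a one-way dictionary is an analogy, not how LLMs actually store knowledge):

```python
# Analogy only: a fact store that, like sequence training data,
# records "A is B" in one direction and never builds the reverse index.
facts = {}  # maps subject -> object, forward direction only

def learn(subject, obj):
    """Store the fact "subject is obj" exactly as encountered."""
    facts[subject] = obj

def query(subject):
    """Answer "who/what is <subject>?" using the forward index only."""
    return facts.get(subject)

learn("the capital of France", "Paris")

print(query("the capital of France"))  # "Paris" — forward lookup works
print(query("Paris"))                  # None — the reverse was never indexed
```

The point of the sketch is that nothing is "forgotten": the information is present, but only reachable in the direction it was stored.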
LLMs have poor intuitions about the physical world — gravity, spatial relationships, object properties — because they’ve learned only from text, not from physical experience. This extends to visual AI systems as well.

Many failures trace back to how LLMs are trained: they predict the next word in a sequence rather than reasoning deliberately. This makes them good at pattern-matching but unreliable when tasks require genuine logical inference, especially under slight variations in how a question is phrased.
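"Predicting the next word" can be made concrete with a deliberately tiny stand-in: a bigram model that picks whichever word most often followed the previous one in its training text. Real LLMs are vastly more sophisticated, but this sketch (corpus invented for illustration) shows why the objective rewards pattern-matching over seen sequences rather than deliberate inference:

```python
from collections import Counter, defaultdict

# Toy next-word predictor: count which word follows which in a corpus,
# then always emit the most frequent continuation. No reasoning occurs;
# the prediction is purely a statistic of observed word pairs.
corpus = "the cat sat on the mat the cat ate the fish".split()

counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    counts[prev][nxt] += 1

def predict_next(word):
    """Return the continuation seen most often after `word`."""
    return counts[word].most_common(1)[0][0]

print(predict_next("the"))  # "cat" — it followed "the" twice; "mat" and "fish" once each
```

A question phrased slightly differently changes which patterns the model has seen, which is one intuition for why small rewordings can flip an LLM's answer.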
Researchers have proposed fixes including better training data, techniques that force step-by-step reasoning (like “chain-of-thought” prompting), connecting LLMs to external tools like calculators or physics simulators, and architectural changes. However, no single fix is comprehensive — many improvements in one area don’t transfer to others.
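The "external tools" fix mentioned above can be sketched in a few lines: detect arithmetic in a question and hand it to an exact calculator instead of letting the model guess. The routing rule and function names here are illustrative assumptions, not a specific system from the paper:

```python
import re

def calculator(expression):
    """Exact arithmetic on a simple 'a <op> b' integer expression."""
    match = re.fullmatch(r"\s*(\d+)\s*([+\-*/])\s*(\d+)\s*", expression)
    if not match:
        raise ValueError("unsupported expression")
    a, op, b = int(match.group(1)), match.group(2), int(match.group(3))
    return {"+": a + b, "-": a - b, "*": a * b, "/": a / b}[op]

def answer(question):
    """Route arithmetic to the tool; anything else would go to the LLM."""
    match = re.search(r"(\d+\s*[+\-*/]\s*\d+)", question)
    if match:
        return calculator(match.group(1))
    return "(would be sent to the LLM)"

print(answer("What is 137 * 24?"))  # 3288 — computed exactly, not pattern-matched
```

The design point is the division of labor: the language model handles language, while a deterministic tool handles the operation the model is unreliable at.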