AI researchers rebut Apple's claims about the 'limits of AI reasoning capabilities'



In response to Apple's paper, which argued that AI reasoning capabilities have been over-hyped and are not as strong as claimed, AI researchers have published a paper countering that most of Apple's findings stem from flaws in experimental design rather than from fundamental limits on reasoning.

The Illusion of the Illusion of Thinking: A Comment on Shojaee et al. (2025)

https://arxiv.org/html/2506.09250v1



New paper pushes back on Apple's LLM 'reasoning collapse' study - 9to5Mac

https://9to5mac.com/2025/06/13/new-paper-pushes-back-on-apples-llm-reasoning-collapse-study/

In June 2025, Apple published a paper called 'The Illusion of Thinking.' In it, Apple tested the reasoning capabilities of AI models such as Anthropic's 'Claude 3.7 Sonnet,' OpenAI's 'o1' and 'o3,' DeepSeek's 'DeepSeek-R1,' and Google's 'Gemini,' and concluded that their ability to reproduce human-like reasoning was not as strong as advertised.

Apple details the limitations of top-level AI models and large-scale reasoning models like OpenAI's 'o3' - GIGAZINE



Alex Lawsen, who works on AI governance and policy at Open Philanthropy, a nonprofit that evaluates philanthropic programs, has published a pointed rebuttal. Lawsen titled his paper 'The Illusion of the Illusion of Thinking,' a direct riposte to Apple's paper.

Lawsen acknowledges that even current large reasoning models (LRMs) struggle with complex puzzles like the Tower of Hanoi, but he argues that Apple's paper conflates genuine reasoning failures with practical output constraints and flawed evaluation setups.

For example, Apple claims that LRMs could barely complete Tower of Hanoi instances with eight or more disks, but Lawsen points out that Claude was simply hitting its token output limit. In fact, Claude's output explicitly noted that it was stopping there to save tokens.
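To see why the token budget becomes the binding constraint, note that the optimal Tower of Hanoi solution requires 2^n - 1 moves, so an exhaustive move list grows exponentially with the disk count. The following is a minimal, illustrative sketch; the tokens-per-move figure is an assumption for illustration, not a number from either paper.

```python
# Minimal sketch: how the Tower of Hanoi move list grows with disk count.
# TOKENS_PER_MOVE is an assumed, illustrative figure, not from either paper.
TOKENS_PER_MOVE = 5

for n in (8, 10, 12, 15):
    moves = 2 ** n - 1  # optimal solution length for n disks
    print(f"{n:>2} disks: {moves:>6} moves ≈ {moves * TOKENS_PER_MOVE:>7} tokens of output")
```

At 15 disks, even a modest per-move cost puts the exhaustive answer far beyond typical output limits, which is consistent with a model cutting itself off rather than failing to reason.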



In addition, some of the River Crossing instances Apple presented were mathematically unsolvable. Even when a model correctly recognized that no solution existed, Apple's evaluation still scored it as a failure, Lawsen points out.
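Such unsolvable configurations can be detected mechanically before any model is scored. The sketch below is a hypothetical solvability check for the actor/agent River Crossing puzzle, modeled here like the classic 'jealous couples' crossing with the constraint that an actor may not share a bank with another pair's agent unless their own agent is also present; it illustrates the idea and is not the evaluation code used by either paper.

```python
from collections import deque
from itertools import combinations

def solvable(n_pairs: int, boat_capacity: int) -> bool:
    """BFS solvability check: can all actor/agent pairs cross the river
    without an actor ever sharing a bank with another pair's agent
    while their own agent is absent?"""
    people = [("actor", i) for i in range(n_pairs)] + [("agent", i) for i in range(n_pairs)]

    def bank_ok(group):
        agents = {i for role, i in group if role == "agent"}
        for role, i in group:
            if role == "actor" and agents and i not in agents:
                return False
        return True

    # State: (frozenset of people on the left bank, boat side), boat 0 = left bank.
    start = (frozenset(people), 0)
    seen = {start}
    queue = deque([start])
    while queue:
        left, boat = queue.popleft()
        if not left and boat == 1:
            return True  # everyone has reached the right bank
        here = left if boat == 0 else frozenset(people) - left
        for size in range(1, boat_capacity + 1):
            for movers in combinations(here, size):
                movers = frozenset(movers)
                new_left = left - movers if boat == 0 else left | movers
                new_right = frozenset(people) - new_left
                if not (bank_ok(new_left) and bank_ok(new_right)):
                    continue
                state = (new_left, 1 - boat)
                if state not in seen:
                    seen.add(state)
                    queue.append(state)
    return False

if __name__ == "__main__":
    print(solvable(3, 2))  # classic three-pair puzzle: a solution exists
    print(solvable(6, 3))  # the six-pair, three-seat configuration the rebuttal reports as unsolvable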

Furthermore, according to Lawsen, Apple's evaluation script unfairly treated all partial and strategic outputs as complete failures.

When Lawsen reran the Tower of Hanoi test in a way that sidesteps the output limit, by asking the models to produce a short program that generates the solution instead of an exhaustive move list, at least the Claude, Gemini, and OpenAI models were able to solve the 15-disk version.
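The compact-output idea is easy to picture: instead of printing all 2^15 - 1 = 32,767 moves, the model only has to produce a short generator routine. The sketch below is a generic Python illustration (the rebuttal reportedly asked for a small Lua function); it is not the exact prompt or answer from the retest.

```python
def hanoi(n: int, source: str = "A", target: str = "C", spare: str = "B"):
    """Yield every move of an n-disk Tower of Hanoi solution without
    ever writing the full move list out as text."""
    if n == 0:
        return
    yield from hanoi(n - 1, source, spare, target)
    yield (source, target)  # move the largest remaining disk
    yield from hanoi(n - 1, spare, target, source)

# A 15-disk instance is fully described by this ~10-line function,
# even though expanding it produces 2**15 - 1 = 32,767 individual moves.
assert sum(1 for _ in hanoi(15)) == 2 ** 15 - 1
```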

Lawsen also notes that his retest was preliminary, and he suggests that future research should focus on designing evaluations that distinguish reasoning ability from output constraints, verifying that a puzzle is solvable before scoring model performance on it, using complexity metrics that reflect computational difficulty rather than just solution length, and considering multiple solution representations to separate understanding of an algorithm from its execution. 'The question is not whether LRMs can reason, but whether our evaluations can distinguish reasoning from output,' he writes.

in Note, Posted by logc_nt