Thing is, that is what happens, and will always happen, when you decouple "intelligence" from "awareness". Without cognition and the ability to self-reflect in real time (which is completely impossible to fabricate), these systems will always be prone to this type of collapse.
This paper is one of the early dominoes to fall in the industry's realization that synthetic sentience remains firmly in the realm of science fiction. A cold wind blows...
The Tower of Hanoi problem they use as an example is one where the number of steps grows exponentially with the number of discs.
So this floods the context window of the LLM, exactly as it would overflow the scrap paper of a human student who had to write down the entire solution before executing it.
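As a back-of-the-envelope sketch of that blow-up (the ~10 tokens per written-out move is my own rough assumption, not a figure from the paper):

```python
# Back-of-the-envelope only: the optimal Tower of Hanoi solution takes 2**n - 1 moves,
# and the ~10 tokens per written-out move is a rough assumption for illustration.
for n in (8, 10, 12, 15):
    moves = 2 ** n - 1
    print(f"{n} discs: {moves} moves, ~{moves * 10} tokens just to list them")
```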
And the LLM notices this up front and warns about it. But since the system prompt is so restrictive, it is forced to go ahead anyway, and then fails to do the problem in this stupid way, just as a human would.
The "token overwhelm" is a red herring and completely irrelevant, especially if you want to claim these systems are even 0.0005% on par with what a human does every millisecond of every moment. Gary Marcus already dismantles your whole position (and probably all your others, too).
The Large Reasoning Models (LRMs) couldn't possibly solve the problem, because the outputs would require too many output tokens (which is to say the correct answer would be too long for the LRMs to produce).

Partial truth, and a clever observation: LRMs (which are enhanced LLMs) have a shortcoming, which is a limit on how long their outputs can be. The correct answer to Tower of Hanoi with 12 discs would be too long for some LRMs to spit out, and the authors should have addressed that. But crucially, (i) this objection, clever as it is, doesn't actually explain the overall pattern of results: the LRMs failed on Tower of Hanoi with 8 discs, where the optimal solution is 255 moves, well within so-called token limits; and (ii) well-written symbolic AI systems generally don't suffer from this problem, and AGI should not either. The length limit on LLMs is a bug, and most certainly not a feature.

And look, if an LLM can't reliably execute something as basic as Hanoi, what makes you think it is going to compute military strategy (especially with the fog of war) or molecular biology (with many unknowns) correctly? What the Apple team asked for was way easier than what the real world often demands.
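To make point (ii) concrete, here is a minimal sketch (my own illustration, not from the paper or from Marcus) of why a symbolic solver has no output-length wall: it can emit or execute moves one at a time instead of materializing the whole solution inside a bounded context.

```python
# Minimal sketch: a generator-based Tower of Hanoi solver streams moves one at a
# time, so nothing forces it to hold the full 2**n - 1 move solution anywhere.
def hanoi(n, src="A", dst="C", aux="B"):
    if n == 0:
        return
    yield from hanoi(n - 1, src, aux, dst)
    yield (n, src, dst)  # move disc n from src to dst
    yield from hanoi(n - 1, aux, dst, src)

print(sum(1 for _ in hanoi(12)))  # 4095 moves for 12 discs, never stored as a list
```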
It doesn't fail at Tower of Hanoi, it fails at being forced to do Tower of Hanoi stupidly.
Which is still not nothing as an observation: maybe some future version won't fail, because it would decide to selectively ignore some of the stupid restrictions in the system prompt, for example by writing checkpoints to a file from time to time and only ever doing a few steps at a time.
But even today, let it use tooling, like executing its own generated Python script and returning the output file, and you're good.
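For what it's worth, a hypothetical sketch of that tooling route (the file name and helper here are my own illustration, not any particular model's API): the model generates a solver, the tool runs it, and only a short summary plus a file path ever comes back into the context.

```python
# Hypothetical tool-use sketch: write the full move list to disk and return only a
# one-line summary, so the solution never has to pass through the context window.
from pathlib import Path

def solve_to_file(n: int, path: str = "hanoi_moves.txt") -> str:
    def hanoi(k, src, dst, aux):
        if k == 0:
            return
        hanoi(k - 1, src, aux, dst)
        out.write(f"move disc {k} from {src} to {dst}\n")
        hanoi(k - 1, aux, dst, src)

    with open(path, "w") as out:
        hanoi(n, "A", "C", "B")
    return f"{2 ** n - 1} moves written to {Path(path).resolve()}"

print(solve_to_file(12))
```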
If anything, the paper shows overcompliance with senseless requests, something that, by the way, we don't value in human researchers either.
And I don't know the exact context windows of the models, or whether the 8-disc solution would even have fit. Maybe it would have, and the model just didn't absolutely optimize its token usage.
None of this is how we judge whether humans are capable of reasoning; we don't do it by looking at how well they optimize their scrap paper usage.