Has DeepSeek challenged the very weird hallucination problem? Reducing hallucinations now seems to be the remaining fundamental issue that needs scientific research. Everything else feels like an engineering problem.
To me, the second biggest problem is that the models aren't really conversational yet. They can maintain some state between prompt and response, but in normal human-human interactions responses can be interrupted by either party with additional detail or context provided.
"Write Python code for the game of Tetris" resulting in working code that resembles Tetris is great. But the back and forth asking for clarification or details (or even post-solution adjustments) isn't there. The models dwell and draw almost entirely self-referentially from their own reasoning through the entire exchange.
"Do you want to keep score?" "How should scoring work?" "Do you want aftertouch?" "What about pushing down, should it be instantaneous or at some multiple of the normal drop speed?" "What should that multiple be?"
as well as questions from the prompter that inquire about capabilities and possibilities: "Can you add one 5-part piece that shows up randomly, on average once every 100 pieces?" or "Is it possible to make the drop speed function as an acceleration rather than a linear drop speed?" These are somewhat possible, but sometimes require the model to re-reason the entire solution.
So right now, even the best models may or may not provide working code that generates something that resembles a Tetris game, but have no specifics beyond what some internal self-referential reasoning provides, even if that reasoning happens in stages.
Such a capability would help users of these models troubleshoot or fix specific problems or express specific desires: the Tetris game works but has no left-hand L blocks, for example. Or the scoring makes no sense. Everything happens in a sort of highly superficial way, where the reasoning is used to fill in gaps in the model's top-down understanding of the problem it is working on.
I have my own automated LLM developer tool. I give it a project description, and the script repeatedly asks the LLM for a code attempt, runs the code, returns the output to the LLM, and asks whether it passes or fails the project description, repeating until the LLM judges the output a pass. Once/if it thinks the code is complete, it asks the human user to provide feedback, or to press enter to accept the last iteration and exit.
For example, I can ask it to write a Python script to get the public IP, geolocation, and weather, trying different known free public APIs until it succeeds. But the first successful try dumped a ton of weather JSON to the console, so I gave feedback to make it human-readable, with one line each for IP, location, and weather, plus a few details for the location and weather. That worked, but it used the wrong units for the region, so I asked it to also use local units, and then both the LLM and I judged the project complete. Now, if I want to accomplish the same project in fewer prompts, I know to specify human-readable output in region-appropriate units.
This only uses text-based LLMs, but the logical next step would be to have a multimodal network review images or video of the running program to continue to self-evaluate and improve.
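The pass/fail loop described above can be sketched roughly as follows. This is a minimal, hypothetical version: `ask_llm` is a stand-in for whatever chat-completion client you use, and the prompt wording is illustrative, not the tool's actual prompts.

```python
import subprocess
import sys
import tempfile

def dev_loop(project_description, ask_llm, max_iters=5):
    """Ask the model for code, run it, feed the output back, and repeat
    until the model judges the result a PASS (or we hit max_iters)."""
    transcript = [f"Project: {project_description}. Reply with Python code only."]
    code, output = "", ""
    for _ in range(max_iters):
        code = ask_llm(transcript)          # model's code attempt
        transcript.append(code)
        with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
            f.write(code)
            path = f.name
        run = subprocess.run([sys.executable, path], capture_output=True,
                             text=True, timeout=30)
        output = run.stdout + run.stderr    # return the program's output to the model
        transcript.append(f"Output:\n{output}\n"
                          "Does this satisfy the project description? "
                          "Answer PASS or FAIL.")
        if ask_llm(transcript).strip().upper().startswith("PASS"):
            break                           # model judges its own output a pass
    return code, output
```

At that point the human review step kicks in: show the user the last output, accept feedback, and append it to the transcript for another round.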
> The models dwell and draw almost entirely self-referentially from their own reasoning through the entire exchange.
Sounds like a deficiency in theory of mind.
Maybe that explains some of the outputs I've seen from DeepSeek, where it conjectures about the reasons why you said whatever you said. Perhaps that's the current state of mitigations for what you've noticed.
If your prompt instructs the model to ask such questions along the way, the model will, in fact, do so!
But yes, it would be nice if the model were smart enough to realize when it's in a situation where it should ask the user a few questions, and when it should just get on with things.
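As a concrete illustration (hypothetical prompt wording, assuming a standard chat-messages format), an instruction like this in the system message is usually enough to elicit the clarifying questions up front:

```python
# Hypothetical system prompt; the exact wording is an assumption, not a recipe.
CLARIFY_FIRST = (
    "Before writing any code, ask the user up to three clarifying questions "
    "(e.g., about scoring, controls, or drop speed). "
    "Only produce code after the user has answered."
)

messages = [
    {"role": "system", "content": CLARIFY_FIRST},
    {"role": "user", "content": "Write Python code for the game of Tetris."},
]
```

The open problem is the judgment call: deciding, without being told, when a request is underspecified enough to warrant questions and when to just proceed.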
>Everything else feels like an engineering problem.
That's probably the key to understanding why the hallucination "problem" isn't going to be fixed: language models are probabilistic models, so it's an inherent feature, and they were never designed to be expert systems in the first place.
Building a knowledge representation system that can properly model the world is more a question of the foundations of mathematics and logic than of engineering. The current frameworks, like FOL, are very lacking here, and there aren't many people in the world working on such problems.
> the open-domain Frame Problem is equivalent to the Halting Problem and is therefore undecidable.
Diaconescu's Theorem will help in understanding where Rice's theorem comes into play here.
Littlestone and Warmuth's work will explain where PAC learning really depends on a many-to-one reduction that is similar to fixed points.
Viewing supervised learning as parametric linear regression, which depends on IID, and unsupervised learning as clustering, which depends on AC, will help with the above.
That both IID and AC imply PEM is another lens.
Basically, for problems like protein folding, whose rules have the Markovian and ergodic properties, it will work reliably well for science.
The basic trio of properties (confident, competent, and inevitably wrong) will always be with us.
Doesn't mean we can't do useful things with them, but if you are waiting for the hallucination problem to be 'solved', you will be waiting a very long time.
What this new combination of elements does do is seriously help with leveraging base models to do very powerful things, without waiting for some huge group to train a general model that fits your needs.
This is a 'no effective procedure/algorithm exists' problem. Leveraging LLMs for frontier search will open up possible paths, but the limits of the tool will still be there.
The stability of planetary orbits is another example of a limit of math, but JPL still does a great job, as an example.
Obviously someone may falsify this paper... but the safe bet is that it holds.
The problem is confabulations. In my benchmark (https://github.com/lechmazur/confabulations/), you see models produce non-existent answers in response to misleading questions that are based on provided text documents. This can be addressed.
> the open-domain Frame Problem is equivalent to the Halting Problem and is therefore undecidable.
Thank you. Code-as-data problems are innate to the von Neumann architecture, but I could never articulate how LLMs are so huge that they are essentially Turing-complete and computationally equivalent.
You _can_ enumerate through them, just not in our universe.
this is very wrong. LLMs are very much not Turing complete, but they are algorithms on a computer so they definitely can't compute anything uncomputable
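One way to make the point concrete (toy numbers, hypothetical `toy_model`): a deterministic model with a fixed context length over a finite vocabulary has only finitely many possible inputs, so it can in principle be replaced by a finite lookup table, which no Turing-complete system can be.

```python
from itertools import product

V, N = 4, 3  # toy vocabulary size and context length

def toy_model(tokens):
    # Stand-in for any deterministic fixed-context model; here, sum mod V.
    return sum(tokens) % V

# Enumerate the entire input space and tabulate the model.
table = {inp: toy_model(inp) for inp in product(range(V), repeat=N)}

# The domain is finite (V**N = 64 inputs), so the table reproduces the
# model exactly: a single forward pass computes a function on a finite
# domain, unlike a Turing machine with an unbounded tape.
assert len(table) == V ** N
assert all(table[inp] == toy_model(inp) for inp in table)
```

(The caveat, of course, is that this describes one forward pass; wrapping a model in an external loop with unbounded memory is a different system.)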
Turing machines are typically described as having an infinite tape. A physical machine may not be able to access all of that tape in finite time, but the tape is not bounded.
But it doesn't matter, it is an abstract model of computation.
But it doesn't matter: the Church–Turing thesis states that a function on the natural numbers can be calculated by an effective method if and only if it is computable by a Turing machine.
It doesn't matter whether you put the algorithm on paper, on one tape, on k tapes, etc.
Rice's theorem, which I mentioned above, is like the Scott–Curry theorem in lambda calculus. Lambda calculus is Turing complete; that is, it is a universal model of computation that can simulate any Turing machine.
The analogous problems with 'trivial properties' in TMs end up being recursively inseparable sets in lambda calculus.
From what I see, the DeepSeek R1 model seems to be better calibrated (knowing what it knows) than any other model, at least on the HLE benchmark: https://lastexam.ai/
There's nothing weird about so-called hallucination ("confabulation" would be a better term); it's the expected behavior. If your use case can't tolerate it, it's not a good use case for these models.
And yes, if you thought this means these models are being commonly misapplied, you'd be correct. This will continue until the bubble bursts.
There was an amusing tongue-in-cheek comment from a recent guest (Prof. Rao) on MLST .. he said that reasoning models no longer hallucinate - they gaslight you instead... give a wrong answer and try to convince you why it's right! :)