I'm a beginner, so I'm unable to tell if it's hallucinating or not. Do you find ...

unshavedyak · on April 13, 2023

I've been using ChatGPT 4 the past couple weeks and also Phind just last night with a new library version. While yes, i did find that Phind was wrong a lot (though i don't think it was fully hallucinations, just wrong library version combinations), i think there's a more important point to be made.

Unless we get a very near breakthrough on self-validating accuracy of these models or models+plugin combinations, i suspect it may be a useful skill to learn to use LLMs to explore ideas even when hallucination is a risk.

Ie searching with Google is a skill we have to acquire. Validating results from Google is yet another skill. Likewise i feel it could be very useful to find a way to use LLMs in a way where you get the benefits while managing to mitigate the risk.

For me these days that usually translates to low risk environments. Things i can validate easily. ChatGPT was a good starting off point for researching ideas. It's also very useful to know how niche your subject matter is. The less results you find on Google for your specific edge case the more likely ChatGPT will struggle to have real or complete thoughts on the matter.

Likewise i imagine similarly this is true for Phind. Yea, it can search the web, but as my tests last night showed it still happily strings together incorrect data. Old library versions, notably. I'd say "Given Library 1.15, how do i do X?". It did eventually give me the right answer, but it happily wrote up coding examples that were a mix of library versions.

I imagine Phind will, to me, be similarly useful (if not more?) than ChatGPT, but you really have to be aware of what it might do wrong. .. because it will, heh.

rushingcreek · on April 13, 2023

We definitely still have work to do in this area and the feedback we've gotten here is incredibly helpful. Having the AI be explicitly aware of specific library versions so it doesn't mix-and-match is a high priority.

bentcorner · on April 13, 2023

I just tried it for a problem I solved in Azure Data Explorer and it solved it by making up some APIs that don't exist. It got close to how I solved the problem but cheated even with Expert mode enabled.

Rastonbury · on April 13, 2023

Seems like accuracy is the next killer feature for LLM search and teaching, will try again in 6 months

LeonenTheDK · on April 13, 2023

What a time to be alive where we likely need wait only a few months for the next big hurdle to be accomplished.

Exhilarating and terrifying at the same time.

kedean · on April 13, 2023

I dunno about that in this case. The "confidently incorrect" problem seems inherent to the underlying algorithm to me. If it were solved, I suspect that would be a paradigm shift of the sort that happens on the years scale at best.

mrtranscendence · on April 13, 2023

Yes, the "confidently incorrect" issue will be a tough nut to crack for the current spate of generative text models. LLMs have no ability to analyze a body of text and determine anything about it (e.g. how likely it is to be true); they are clever but at bottom can only extrapolate from patterns found in the training data. If no one has said anything like "X, and I'm 78% certain about it", then it's tough to imagine how an LLM could generate reasonably correct probability estimates.

famouswaffles · on April 14, 2023

What you're alluding to is calibration and base gpt-4 had excellent calibration before RlHF.

marginalia_nu · on April 13, 2023

It seems to be sort of a bit wrong more often than it hallucinates.

I've had it straight up invent a library that doesn't exist once, but that seems to be quite rare and you need to be deep in the weeds with a rare problem domain to get that.

More often I ask it how to do something, and it sort of provides an answer, but not quite. So I point out the flaw, and it fixes it, but not quite. Rinse and repeat. After anywhere between 4-10 iterations it's usually quite good. The experience is like code reviewing a really apologetic and endlessly patient junior developer.

Although I think what might be a beginner's saving grace is that it seems to be better at beginner questions than advanced questions, since there are more of them in the training data.