There is much irony in the certainty this article displays. There are no caveats, no qualifications, and no attempt to grasp why anyone would use an LLM. The possibility that LLMs might be useful in certain scenarios never threatens to enter the author's mind. They are cozy in the safety of their own knowledge.
The other irony is that Dunning-Kruger is a terrible piece of research that doesn't show what it's claimed to show. It's not even clear the DK effect exists at all. A classic of 90s pop psychology from before the replication crisis had reached public awareness.
It's worth reading the original paper sometime. It has all the standard problems, like:
1. It uses a tiny sample size.
2. It assumes American psych undergrads are representative of the entire human race.
3. It uses stupid and incredibly subjective tests, then combines that with cherry-picking. The test of competence was whether you rated jokes as funny or unfunny. To be considered competent, your assessments had to match those of a panel of "joke experts" that DK just assembled by hand.
This study design has an obvious problem that did actually happen: what if their hand-picked experts didn't agree on which of their hand-picked jokes were funny? No problem. Rather than treat this as evidence that their study design was bad, they just tossed the outlier:
"Although the ratings provided by the eight comedians were moderately reliable (a = .72), an analysis of interrater correlations found that one (and only one) comedian's ratings failed to correlate positively with the others (mean r = -.09). We thus excluded this comedian's ratings in our calculation of the humor value of each joke"
This runs into circular reasoning. Subjects are assessed on whether their self-perceived "expertise" matches the experts, but the "experts" don't agree with each other. By the study's own logic, the comedian who disagreed would be considered to be suffering from a competence delusion, yet he was chosen specifically because he was considered competent.
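The mechanics of that exclusion are easy to sketch with invented numbers (this is an illustration, not DK's actual data): one anti-correlated rater drags down both the mean interrater correlation and Cronbach's alpha, so dropping him makes the panel look far more "reliable":

```python
# Illustration with invented ratings: 7 "comedians" rate 20 jokes
# similarly, while an 8th rates them in roughly the opposite way.
import numpy as np

rng = np.random.default_rng(0)
true_funniness = rng.normal(size=20)            # latent joke quality
agree = true_funniness + rng.normal(scale=0.5, size=(7, 20))
dissent = -true_funniness + rng.normal(scale=0.5, size=(1, 20))
ratings = np.vstack([agree, dissent])           # shape: (8 raters, 20 jokes)

def cronbach_alpha(r):
    """Standard alpha, treating each rater as an 'item'; r is (raters, jokes)."""
    k = r.shape[0]
    item_vars = r.var(axis=1, ddof=1).sum()
    total_var = r.sum(axis=0).var(ddof=1)
    return k / (k - 1) * (1 - item_vars / total_var)

corr = np.corrcoef(ratings)                     # rater-by-rater correlations
mean_r_dissenter = (corr[7].sum() - 1) / 7      # mean r with the other 7

print(f"dissenter's mean r with others: {mean_r_dissenter:.2f}")
print(f"alpha, all 8 raters:  {cronbach_alpha(ratings):.2f}")
print(f"alpha, dissenter out: {cronbach_alpha(ratings[:7]):.2f}")
```

The point isn't the exact numbers; it's that "reliability" computed after removing whoever disagrees will always look good, which is exactly the circularity described above.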
There are also claims that the effect they did find is just a statistical artifact to begin with:
"Our data show that peoples' self-assessments of competence, in general, reflect a genuine competence that they can demonstrate. That finding contradicts the current consensus about the nature of self-assessment."
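One common version of the artifact argument can be demonstrated with a simulation (all data invented for illustration): if self-assessments are just actual skill plus symmetric noise, the score boundaries alone mechanically produce the classic DK shape, with the bottom quartile "overestimating" and the top quartile "underestimating":

```python
# Simulated "DK effect" from pure noise: self-estimate = true percentile
# plus symmetric noise, clipped to the 0-100 scale. No actual
# miscalibration is built in, yet the familiar pattern appears.
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
skill = rng.uniform(0, 100, n)                              # true percentile
estimate = np.clip(skill + rng.normal(scale=25, size=n), 0, 100)

for q in range(4):
    mask = (skill >= q * 25) & (skill < (q + 1) * 25)
    print(f"quartile {q + 1}: actual {skill[mask].mean():5.1f}, "
          f"self-estimate {estimate[mask].mean():5.1f}")
```

Because estimates near the bottom can only err upward (you can't rate yourself below the 0th percentile) and estimates near the top can only err downward, binning by actual score yields the DK plot even when nobody is systematically deluded.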
> It assumes American psych undergrads are representative of the entire human race.
(1) Since it can't document an effect in them, it doesn't really matter whether they're representative or not.
> The test of competence was whether you rated jokes and funny or unfunny. To be considered competent your assessments had to match that of a panel of "joke experts" that DK just assembled by hand.
(2) This is a major problem elsewhere. Not just elsewhere in psychology; pretty much everywhere.
There's a standard test of something like "emotional competence" where the testee is shown pictures and asked to identify what emotion the person in the picture is feeling.
But, if you worry about the details of things like this, there is no correct answer. The person in each picture is a trained actor who has been instructed to portray a given emotion. Are they actually feeling that emotion? No.
Would someone else look similar if they were actually feeling that emotion? No. Actors do some standard things that cue you as to what you're supposed to imagine them feeling. People in reality don't. They express their emotions in all kinds of different ways. Any trial lawyer will be happy to talk your ear off about how a jury expects someone who's telling the truth to show a set of particular behaviors, and witnesses just won't do that whether they're telling the truth or not.
Sometimes I envy that. But not today.