
I think you're assuming that OpenAI is incentivized to benchmark honestly. Like every other company for which a benchmark is a goal, they are not.


Also, for a topic like this, subjectivity is all there really is. Even if you create some metric, what you prioritize is going to be subjective. Performance varies across different sorts of tasks, and there is a literally infinite number of task categories, so you can never truly get a fair sampling.

Because of this, a sample of subjective opinions is probably much more valuable than any official metric, especially if that metric comes from, as you mentioned, individuals/orgs who are highly motivated to game it endlessly. Even when the metric comes from an external source, you end up with a similar risk of it being gamed. It's like how old-school Google puzzle interviews went from finding who was most clever [in that domain] to finding who had studied up the most.


Well, no, we have the HumanEval results for the June release.


Which is both (1) a subjective selection of tasks for measuring the effectiveness of various chatbots and (2) now subject to gaming by companies using opaque/closed/inaccessible/unverifiable systems, like OpenAI.
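
For context on what's being compared: HumanEval scores are usually reported as pass@k, estimated per problem from n sampled completions of which c pass the unit tests. A minimal sketch of that estimator (the standard formula from the HumanEval paper, not OpenAI's actual evaluation harness; the example numbers are made up):

    from math import comb

    def pass_at_k(n: int, c: int, k: int) -> float:
        # Unbiased pass@k estimator: probability that at least one of k
        # completions drawn from n is correct, given c of the n passed.
        if n - c < k:
            return 1.0
        return 1.0 - comb(n - c, k) / comb(n, k)

    # e.g. 200 samples per problem, 61 pass the tests, report pass@10
    print(pass_at_k(n=200, c=61, k=10))

The gaming concern is exactly that nothing in this number tells you whether the benchmark problems (or close variants) were in the training data.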



