They might set a new SOTA because the metric is accuracy, but if the metric were accuracy weighted by sample efficiency, that SOTA would look a lot less impressive.

Simplest way to weight by sample efficiency: multiply accuracy by the ratio of test set size to training set size. Everyone's training/testing on 80/20 splits, so the multiplier is 20/80 = 1/4 and everybody's SOTA would go down by three quarters.
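A rough sketch of that weighting, just to make the arithmetic concrete. The function name `weighted_accuracy` and the example numbers (96% accuracy, 80,000/20,000 split) are made up for illustration, not taken from any benchmark:

```python
def weighted_accuracy(accuracy: float, n_train: int, n_test: int) -> float:
    """Scale accuracy by the test-to-train size ratio (hypothetical metric)."""
    if n_train <= 0:
        raise ValueError("n_train must be positive")
    return accuracy * (n_test / n_train)


# Illustrative numbers only: an 80/20 split gives a multiplier of 0.25.
raw = 0.96
score = weighted_accuracy(raw, n_train=80_000, n_test=20_000)
print(f"raw={raw:.2f} weighted={score:.2f}")
```

Since every 80/20 result gets the same 1/4 multiplier, the ranking between such models doesn't change; the metric only starts to separate them once somebody trains on a smaller share of the data.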