They might get a new SOTA because the metric is accuracy, but if the metric were accuracy weighted by sample efficiency, that SOTA would look a lot less impressive.
Simplest way to weight by sample efficiency: multiply accuracy by the ratio of test set size to training set size. Everyone's training/testing on 80/20 splits, so the multiplier is 20/80 = 1/4, and everybody's SOTA would go down by three quarters.
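To make the arithmetic concrete, here's a rough sketch of that weighting. The function name `efficiency_weighted_accuracy` and the numbers (92% accuracy, 8,000 training and 2,000 test samples) are made up for illustration, not taken from any real benchmark.

```python
def efficiency_weighted_accuracy(accuracy: float, n_train: int, n_test: int) -> float:
    """Scale accuracy by the test-to-train size ratio (hypothetical metric)."""
    if n_train <= 0:
        raise ValueError("n_train must be positive")
    return accuracy * (n_test / n_train)


# Illustrative numbers only: an 80/20 split gives a multiplier of 2000 / 8000 = 0.25.
raw_accuracy = 0.92
weighted = efficiency_weighted_accuracy(raw_accuracy, n_train=8000, n_test=2000)

# Fraction of the raw score lost to the weighting; 0.75 for any 80/20 split.
relative_drop = 1 - weighted / raw_accuracy

print(f"raw={raw_accuracy:.2f} weighted={weighted:.2f} drop={relative_drop:.0%}")
```

Note the drop is the same 75% for every model on an 80/20 split, so this weighting only changes the ranking when people train on different splits.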