Good performance on the benchmark should imply robust in-domain performance on the task.
For example, there should be one dedicated metric for evaluating comprehension and another for evaluating inference, but current metrics can only measure overall performance (see the sketch below).
For example, the dataset should have sufficient variance along every relevant dimension.
For example, the annotation process should not rely too heavily on humans, so that human performance does not become a ceiling on machine performance.
The paper also surveys four current approaches to dataset construction and finds that none of them satisfies the validity criterion.
Solution: We need more work on dataset design and data collection methods. (Which is stating the obvious.)
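To make the per-capability metric idea concrete, here is a minimal sketch that reports one score per probed capability instead of a single aggregate number. The capability tags, records, and layout are hypothetical illustrations, not a scheme from the paper.

```python
from collections import defaultdict

# Hypothetical tagged test set: each example records which capability it
# probes and whether the model got it right (illustrative data only).
results = [
    {"capability": "comprehension", "correct": True},
    {"capability": "comprehension", "correct": False},
    {"capability": "inference", "correct": True},
    {"capability": "inference", "correct": True},
]

by_capability = defaultdict(list)
for r in results:
    by_capability[r["capability"]].append(r["correct"])

# Report one score per capability instead of a single aggregate number.
for cap, outcomes in sorted(by_capability.items()):
    print(f"{cap}: {sum(outcomes) / len(outcomes):.2f}")
```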
Reliable Annotation:
Benchmark examples should be accurately and unambiguously annotated.
Avoid these three failure modes: (i) examples that are carelessly mislabeled, (ii) examples that have no clear correct label due to unclear or underspecified task guidelines, and (iii) examples that have no clear correct label under the relevant metric due to legitimate disagreements in interpretation among annotators.
Solution: Test examples should be validated thoroughly enough to detect ambiguous or mislabeled cases.
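A minimal sketch of such a validation pass, assuming each test example is relabeled by five annotators (the data, label names, and the 0.8 threshold are illustrative assumptions). Low agreement points to failure modes (ii) or (iii); high agreement on a label that contradicts the original gold label points to mode (i).

```python
from collections import Counter

# Original gold labels and a hypothetical re-annotation by five annotators.
gold = {"ex1": "entailment", "ex2": "entailment", "ex3": "neutral"}
annotations = {
    "ex1": ["entailment"] * 5,                        # clean example
    "ex2": ["entailment", "neutral", "entailment",
            "neutral", "contradiction"],              # annotators split
    "ex3": ["contradiction"] * 4 + ["neutral"],       # consensus disagrees with gold
}

AGREEMENT_THRESHOLD = 0.8  # fraction of annotators agreeing on one label

for ex_id, labels in annotations.items():
    majority, count = Counter(labels).most_common(1)[0]
    agreement = count / len(labels)
    if agreement < AGREEMENT_THRESHOLD:
        # Failure mode (ii) or (iii): ambiguous guidelines or legitimate
        # interpretive disagreement. Flag for review or removal.
        print(f"{ex_id}: ambiguous (agreement={agreement:.2f})")
    elif majority != gold[ex_id]:
        # Failure mode (i): annotators converge on a different label, so
        # the original annotation was likely careless.
        print(f"{ex_id}: suspected mislabel ({gold[ex_id]!r} -> {majority!r})")
    else:
        print(f"{ex_id}: validated")
```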
Statistical Power:
Benchmarks should offer adequate statistical power.
Benchmark evaluation datasets should be large and discriminative enough to detect any qualitatively relevant performance difference between two models.
Solution: Benchmark datasets need to be much harder and/or much larger.
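A back-of-the-envelope sketch of why size matters: treating two models that both score near 90% accuracy as independent binomials (a simplifying assumption; a paired test such as McNemar's on the shared test set is more sensitive), the smallest gap a test set can resolve shrinks only with the square root of its size. All numbers below are illustrative.

```python
import math

def min_detectable_gap(n, acc=0.9, z=1.96):
    """Approximate smallest accuracy gap resolvable at ~95% confidence on
    a test set of n examples, for two independent models scoring near
    `acc`. Standard two-proportion normal approximation."""
    se = math.sqrt(2 * acc * (1 - acc) / n)  # standard error of the gap
    return z * se

for n in (1_000, 10_000, 100_000):
    print(f"n={n:>7,}: detectable gap ≈ {min_detectable_gap(n):.3%}")
```

At n = 1,000 the detectable gap is roughly 2.6 accuracy points, so a benchmark of that size cannot reliably rank models whose true difference is smaller, which is exactly the regime of crowded leaderboards.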
Disincentives for Biased Models:
Benchmarks should reveal potentially harmful social biases in systems, and should not incentivize the creation of biased systems.
Because benchmarks are often built around naturally-occurring or crowd-sourced text, a system can often improve its performance by adopting heuristics that reproduce potentially harmful biases.
Solution: We need to better encourage the development and use of auxiliary bias evaluation metrics.
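A minimal sketch of one plausible auxiliary metric: per-subgroup accuracy and the worst-case gap, reported next to the headline score so that a model cannot "win" by widening the gap. The subgroups, records, and gap metric are illustrative assumptions, not the paper's concrete proposal.

```python
from collections import defaultdict

# Hypothetical evaluation records: (prediction, gold label, subgroup).
records = [
    ("pos", "pos", "group_a"), ("neg", "neg", "group_a"), ("pos", "pos", "group_a"),
    ("pos", "neg", "group_b"), ("neg", "neg", "group_b"), ("pos", "neg", "group_b"),
]

correct = defaultdict(int)
total = defaultdict(int)
for pred, label, group in records:
    correct[group] += pred == label
    total[group] += 1

per_group = {g: correct[g] / total[g] for g in total}
overall = sum(correct.values()) / sum(total.values())

# Publish the auxiliary bias metric alongside the headline number.
print(f"overall accuracy:   {overall:.2f}")
print(f"per-group accuracy: {per_group}")
print(f"max subgroup gap:   {max(per_group.values()) - min(per_group.values()):.2f}")
```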