What Will it Take to Fix Benchmarking in Natural Language Understanding?

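  • Source: arXiv 2104.02145v1
  • Motivation: evaluation for NLU tasks is in poor shape. The paper argues that a good benchmark (that is, a dataset together with its task definition and metric) should satisfy four criteria, and discusses how benchmarks could be improved on each of them.
  • The four criteria and proposed fixes: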
    1. Validity:
      • Good performance on the benchmark should imply robust in-domain performance on the task.
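      • For example, there should be one dedicated metric for comprehension ability and another for inference ability, but current metrics only measure overall performance.
      • For example, a dataset should have enough variance along every relevant dimension.
      • For example, the annotation procedure should not rely too heavily on humans, so that human performance does not become the upper bound on machine performance.
      • The paper also reviews four current approaches to dataset construction and finds that none of them satisfies the validity criterion.
      • Fix: We need more work on dataset design and data collection methods. (This is a platitude.)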
    2. Reliable Annotation:
      • Benchmark examples should be accurately and unambiguously annotated.
      • Avoid these three cases: (i) examples that are carelessly mislabeled, (ii) examples that have no clear correct label due to unclear or underspecified task guidelines, and (iii) examples that have no clear correct label under the relevant metric due to legitimate disagreements in interpretation among annotators.
      • Fix: Test examples should be validated thoroughly enough to detect ambiguous or mislabeled cases.
    3. Statistical Power:
      • Benchmarks should offer adequate statistical power.
      • Benchmark evaluation datasets should be large and discriminative enough to detect any qualitatively relevant performance difference between two models.
      • Fix: Benchmark datasets need to be much harder and/or much larger.
    4. Disincentives for Biased Models:
      • Benchmarks should reveal potentially harmful social biases in systems, and should not incentivize the creation of biased systems.
      • Because benchmarks are often built around naturally-occurring or crowd-sourced text, a system can often improve its performance by adopting heuristics that reproduce potentially harmful biases.
      • Fix: We need to better encourage the development and use of auxiliary bias evaluation metrics.
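
A rough illustration of the statistical power criterion (item 3): the sketch below estimates how large a test set must be before a given accuracy gap between two models becomes detectable. It is not taken from the paper. The function names `approx_power` and `min_test_size` are hypothetical, and the calculation assumes a two-sided two-proportion z-test under a normal approximation that treats the two models' errors as independent. On a shared test set the errors are usually correlated, so a paired test would typically need fewer examples than this estimate.

```python
from math import sqrt
from statistics import NormalDist


def approx_power(acc_a: float, acc_b: float, n: int, alpha: float = 0.05) -> float:
    """Approximate power of a two-sided two-proportion z-test.

    Assumption (not from the paper): each model is scored on n examples
    and the two models' errors are treated as independent.
    """
    if not (0.0 < acc_a < 1.0 and 0.0 < acc_b < 1.0):
        raise ValueError("accuracies must lie strictly between 0 and 1")
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    # Standard error of the difference between the two observed accuracies.
    se = sqrt(acc_a * (1 - acc_a) / n + acc_b * (1 - acc_b) / n)
    shift = abs(acc_a - acc_b) / se
    # Probability that the test statistic lands in either rejection region.
    return std_normal.cdf(shift - z_crit) + std_normal.cdf(-shift - z_crit)


def min_test_size(acc_a: float, acc_b: float,
                  target_power: float = 0.8, alpha: float = 0.05) -> int:
    """Smallest n for which approx_power reaches target_power."""
    if acc_a == acc_b:
        raise ValueError("no test size can detect a zero difference")
    # Power grows monotonically with n, so double until the target is met,
    # then binary-search between the last failing and first passing size.
    hi = 1
    while approx_power(acc_a, acc_b, hi, alpha) < target_power:
        hi *= 2
    lo = hi // 2
    while lo + 1 < hi:
        mid = (lo + hi) // 2
        if approx_power(acc_a, acc_b, mid, alpha) >= target_power:
            hi = mid
        else:
            lo = mid
    return hi


if __name__ == "__main__":
    # Illustrative numbers only: a 1-point gap near 90% accuracy.
    print(min_test_size(0.90, 0.91))
```

Under these assumptions, separating 90% from 91% accuracy with 80% power takes more than ten thousand test examples, which is consistent with the note that benchmarks need to be much larger, or much harder so that the gaps between models widen.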