On Position Embeddings in BERT

• Source: ICLR 2021 https://openreview.net/pdf/be0283e323f1b118c975dbc46f7f75c59b467fe0.pdf
• This paper formulates three properties of position embeddings (PEs) and compares different PEs across different tasks
• The three properties of PEs: monotonicity, translation invariance, and symmetry
• 1) neighboring positions are embedded closer than faraway ones;
• 2) distances of two arbitrary m-offset position vectors are identical;
• 3) the metric (distance) itself is symmetric.
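The three properties above can be checked numerically. The following is a minimal sketch, assuming the fixed sinusoidal APE of Vaswani et al. (2017) and Euclidean distance as the metric; the names `sinusoidal_pe` and `dist` and the sizes are hypothetical choices for illustration, not the paper's evaluation code. Monotonicity is only checked for a few small offsets, and the sketch makes no claim that it holds at larger distances.

```python
import math

def sinusoidal_pe(pos, dim=64, base=10000.0):
    """Fixed sinusoidal embedding of a single position (assumed form)."""
    vec = []
    for i in range(dim // 2):
        freq = base ** (-2 * i / dim)   # frequency of the i-th sin/cos pair
        vec.append(math.sin(pos * freq))
        vec.append(math.cos(pos * freq))
    return vec

def dist(x, y):
    """Euclidean distance between the embeddings of positions x and y."""
    px, py = sinusoidal_pe(x), sinusoidal_pe(y)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(px, py)))

# 1) Monotonicity, checked locally: a nearer offset gives a smaller distance.
monotone_local = all(dist(0, k) < dist(0, k + 1) for k in range(0, 4))

# 2) Translation invariance: the same offset (5) at different absolute positions.
translation_invariant = math.isclose(dist(3, 8), dist(40, 45), rel_tol=1e-9)

# 3) Symmetry: swapping the two positions leaves the distance unchanged.
symmetric = math.isclose(dist(3, 8), dist(8, 3), rel_tol=1e-9)

print(monotone_local, translation_invariant, symmetric)
```

For this embedding the squared distance reduces to a sum of 2 − 2cos(ω_i (x − y)) terms, so it depends only on the offset x − y; that is why properties 2 and 3 hold up to floating-point error.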
• Two types of PE
• absolute PEs (APEs): single positions are mapped to elements of the representation space
• relative PEs (RPEs): the difference between positions (i.e., x − y for x, y ∈ N) is mapped to elements of the embedding space.
• (Here WE refers to word embedding, not two matrices.)
• The paper studies four PEs: (1) the fully learnable APE (Gehring et al., 2017), (2) the fixed sinusoidal APE (Vaswani et al., 2017), (3) the fully learnable RPE (Shaw et al., 2018), and (4) the fixed sinusoidal RPE (Wei et al., 2019).
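The APE/RPE distinction comes down to what indexes the embedding table. Below is a minimal sketch under stated assumptions: both tables are randomly initialized stand-ins for learned parameters, the sizes `MAX_LEN`, `MAX_OFFSET`, and `DIM` are hypothetical, and the offset is clipped to a fixed window as one common way to keep the RPE table finite.

```python
import random

random.seed(0)
MAX_LEN, MAX_OFFSET, DIM = 8, 3, 4   # hypothetical sizes for illustration

def random_table(rows, dim):
    """Stand-in for a learnable embedding table (random initialization)."""
    return [[random.gauss(0.0, 0.02) for _ in range(dim)] for _ in range(rows)]

ape_table = random_table(MAX_LEN, DIM)              # one row per absolute position
rpe_table = random_table(2 * MAX_OFFSET + 1, DIM)   # one row per clipped offset

def ape(x):
    """Absolute PE: look up the single position x."""
    return ape_table[x]

def rpe(x, y):
    """Relative PE: look up the offset x - y, clipped to [-MAX_OFFSET, MAX_OFFSET]."""
    offset = max(-MAX_OFFSET, min(MAX_OFFSET, x - y))
    return rpe_table[offset + MAX_OFFSET]

# Pairs with the same offset share one RPE row; APE rows differ per position.
print(rpe(2, 5) is rpe(4, 7))   # same offset (-3)
print(ape(2) is ape(4))         # different absolute positions
```

Because the RPE lookup sees only x − y, shifting both positions by the same amount cannot change the result, which is the translation invariance property by construction.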
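• The sinusoidal PE appears to simply use a sin function, with the embedding as its parameters, as the PE:
• https://tva1.sinaimg.cn/large/008eGmZEgy1gp35q27c08j31g80niq86.jpg
• The authors propose a learnable sinusoidal PE, which makes the amplitude learnable rather than following a circle (this single figure is actually easier to read than all of the formulas above)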
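For reference, the fixed sinusoidal APE of Vaswani et al. (2017) is usually written as below, where pos is the position, i indexes the sine/cosine pairs, and d_model is the embedding width. This is the standard formulation from that paper; its notation may differ from the formula in the figure linked above, and it does not cover the learnable variant.

```latex
PE_{(pos,\,2i)}   = \sin\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right), \qquad
PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)
```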
• Conclusions
• RPEs perform better in span prediction tasks since they meet better translation invariance, monotonicity, and asymmetry; the fully-learnable APE which does not strictly have the translation invariance and monotonicity properties during parameterizations (as it also performed worse in measuring translation invariance and local monotonicity than other APEs and all RPEs) still performs well because it can flexibly deal with special tokens (especially, unshiftable [CLS]).