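This is an old paper from 2007. It argues that information extraction should not be confined to single sentences and needs to be extended to the document level. Specifically, it identifies the following kinds of cases that cannot be handled within a sentence: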
- Cross-sentence anaphora. "The most straightforward of these is when an anaphoric expression is used to refer to one of the fact’s fields."
- The components of an event are scattered across different sentences, with "various parts of the description being linked by anaphoric expressions or alternative description. These cases are referred to as connected multiple sentence facts. For cases such as these some inference will be required to combine together all the parts of the fact description."
- Human-like reasoning and understanding. For example, in a "murder" event, a 🔫 is very likely the murder weapon. "in other cases there may be no direct connection between the sentences describing the fact. These facts can only be identified using a deeper understanding of the text such as discourse analysis or the application of world knowledge".
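- Commonsense reasoning. For example, there can be only one president, so when one person takes office the other has to step down. "There are other cases in which information is not mentioned in the text but has to be inferred using world knowledge".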
This taxonomy strikes me as better than the categorization used in DocRED!
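
To make the four cases concrete, here is a minimal sketch of how they might be told apart on annotated data. This is my own reading of the taxonomy, not a procedure from the paper: the `FactAnnotation` record, its flags, and the `classify` rules are all hypothetical, and a real annotation scheme would need much finer distinctions.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, Optional


class CrossSentenceCase(Enum):
    ANAPHORA = "field expressed by an anaphor"
    CONNECTED = "parts in several sentences, explicitly linked"
    UNCONNECTED = "parts in several sentences, no direct link"
    INFERRED = "field never stated, needs world knowledge"


@dataclass
class FactAnnotation:
    # Hypothetical annotation record; none of these names come from the paper.
    # Maps each field to the index of the sentence that states it,
    # or None when the text never states it.
    field_sentence: Dict[str, Optional[int]]
    # True if some field is realised as an anaphor (e.g. a pronoun).
    has_anaphoric_field: bool = False
    # True if the parts are tied together by anaphora or alternative descriptions.
    explicitly_linked: bool = False


def classify(fact: FactAnnotation) -> Optional[CrossSentenceCase]:
    """Return the cross-sentence case, or None for a purely sentence-internal fact."""
    indices = list(fact.field_sentence.values())
    if any(i is None for i in indices):
        return CrossSentenceCase.INFERRED
    if len(set(indices)) == 1:
        return CrossSentenceCase.ANAPHORA if fact.has_anaphoric_field else None
    if fact.explicitly_linked:
        return CrossSentenceCase.CONNECTED
    return CrossSentenceCase.UNCONNECTED


# The murder example: victim and weapon in unrelated sentences.
murder = FactAnnotation({"victim": 0, "weapon": 2})
# The president example: the outgoing president is never mentioned.
succession = FactAnnotation({"new_president": 1, "outgoing_president": None})
print(classify(murder).name)      # UNCONNECTED
print(classify(succession).name)  # INFERRED
```

The ordering of the checks is a design choice: a missing field is tested first because no amount of linking between sentences can recover information the text never states.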