True, but that just goes to show how brittle these models are - how thin the dividing line is between primary facts that are (hopefully consistently) present in the training set, and derived facts that are potentially more suspect.
To make things worse, I don't think we can even assume that primary facts are always represented in abstract semantic terms independent of the source text. The model may have been trained on a fact but still fail to reliably recall/predict it because of "lookup failure" (the model fails to reduce the query text to the necessary abstract lookup key).
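You can see this kind of lookup failure by probing the same fact under different phrasings - a toy sketch below, nothing rigorous. The model name, prompts, and expected answer are placeholders; the point is just that if recall went through a clean abstract key, the probability assigned to the answer wouldn't swing with surface wording.

```python
# Toy probe for phrasing-sensitive recall. Model, prompts, and the expected
# answer token are illustrative placeholders, not a real evaluation setup.
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_name = "gpt2"  # stand-in; any local causal LM would do
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

# Same underlying fact, several surface forms.
paraphrases = [
    "The capital of France is",
    "France's capital city is called",
    "Q: What is the capital of France? A:",
]
expected = " Paris"
answer_id = tokenizer.encode(expected)[0]  # first token of the expected answer

for prompt in paraphrases:
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits[0, -1]  # next-token logits
    probs = torch.softmax(logits, dim=-1)
    # How much probability mass lands on the expected answer token
    # varies with phrasing, even though the "fact" is the same.
    print(f"{prompt!r:<45} P({expected!r}) = {probs[answer_id].item():.3f}")
```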