Tensor Memory Bypass Cache?

It seems to me that a good speedup would be possible if the GPU and HBM had a 'cache bypass'. That is, there are likely a large number of frequently repeated matrix multiplies whose results could be returned by a hardware lookup rather than an actual multiply. Such a pre-multiply cache would free up more of the actual multiply hardware by substituting the cached response for the computed result.

This memoization trick is widely used in compute-intensive situations, but I'm unaware of any GPU/HBM hardware that supports it.
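
To make the idea concrete, here is a minimal software sketch of memoizing a matrix multiply, using Python's standard `functools.lru_cache`. It is only an analogue of the hardware lookup I have in mind, not a description of any existing GPU feature. The function name `matmul_cached`, the cache size of 1024, and the tuple-of-tuples representation are my own assumptions for illustration.

```python
from functools import lru_cache

# Hypothetical software analogue of the proposed hardware lookup.
# Matrices are tuples of tuples so that they are hashable cache keys.
@lru_cache(maxsize=1024)  # cache size is an arbitrary assumption
def matmul_cached(a, b):
    """Multiply two small integer matrices, reusing earlier results."""
    cols_b = list(zip(*b))  # columns of b
    return tuple(
        tuple(sum(x * y for x, y in zip(row, col)) for col in cols_b)
        for row in a
    )

a = ((1, 2), (3, 4))
b = ((5, 6), (7, 8))

first = matmul_cached(a, b)    # cache miss: actually computed
second = matmul_cached(a, b)   # cache hit: returned by lookup
info = matmul_cached.cache_info()
print(first, info.hits, info.misses)
```

The second call never runs the multiply loop; the result comes straight from the lookup keyed on the operands, which is the behavior I am imagining in hardware.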

Given that multiplies are now computed at 4- or 8-bit precision, it seems that a reasonable number of matrix multiplies could be cached.
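
As a rough way to size such a cache, the sketch below counts how many distinct operand pairs exist for a given bit width and tile size, and builds the complete product table for the scalar 4-bit case. The helper name `distinct_inputs`, the use of unsigned values, and the square n x n tiles are my assumptions, not properties of any particular hardware.

```python
def distinct_inputs(bits, n):
    """Count distinct (A, B) operand pairs for two n x n tiles of `bits`-bit entries.

    Each tile holds n*n entries of `bits` bits, and there are two tiles,
    so the pair is described by bits * 2 * n * n bits in total.
    """
    return 2 ** (bits * 2 * n * n)

# Scalar case (n = 1), assuming unsigned 4-bit values 0..15:
# every possible product fits in a small table.
table_4bit = {(x, y): x * y for x in range(16) for y in range(16)}

print(len(table_4bit))        # entries in the full 4-bit scalar table
print(distinct_inputs(4, 1))  # 4-bit scalars
print(distinct_inputs(8, 1))  # 8-bit scalars
print(distinct_inputs(4, 2))  # 2 x 2 tiles of 4-bit entries
```

The count grows with the tile size, so a cache of whole matrix multiplies would presumably hold only the most frequent operand pairs rather than every possible one.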