Hacker News
Tensor Memory Bypass Cache?
2 points by daly on May 23, 2024
It seems to me that a good speedup would be possible if the GPU and HBM had a 'cache bypass'. That is, there are likely a large number of frequently repeated matrix multiplies whose results could be fetched by a hardware lookup rather than computed by an actual multiply. Such a pre-multiply cache would substitute the cached response for the result, freeing up the actual multiply hardware for other work.

This 'memoizing' trick is widely used in compute-intensive situations, but I'm unaware of any GPU/HBM hardware that supports it.
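
To make the idea concrete, here is a minimal software sketch of memoizing a small matrix multiply. It is only an analogy for the proposed hardware lookup, not a description of any real GPU feature; the class name `MemoizedMatmul` and its `hits`/`misses` counters are made up for illustration, and matrices are plain tuples of tuples so they can be used as dictionary keys.

```python
class MemoizedMatmul:
    """Hypothetical software stand-in for a 'pre-multiply cache'."""

    def __init__(self):
        self.table = {}   # (a, b) operand pair -> previously computed product
        self.hits = 0
        self.misses = 0

    def matmul(self, a, b):
        key = (a, b)  # tuples of tuples are hashable
        if key in self.table:
            # Lookup replaces the multiply entirely.
            self.hits += 1
            return self.table[key]
        # Cache miss: do the actual multiply and remember the result.
        self.misses += 1
        result = tuple(
            tuple(sum(x * y for x, y in zip(row, col)) for col in zip(*b))
            for row in a
        )
        self.table[key] = result
        return result


cache = MemoizedMatmul()
a = ((1, 2), (3, 4))
b = ((5, 6), (7, 8))
first = cache.matmul(a, b)    # computed: one miss
second = cache.matmul(a, b)   # served from the table: one hit
print(first, cache.hits, cache.misses)
```

The second call never touches the multiply path, which is the effect the post is asking hardware to provide.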

Given that the multiplies are now computing 4- or 8-bit results, it seems that a reasonable number of matrix multiplies could be cached.
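
As a rough way to size such a cache, the sketch below counts how many distinct operand pairs exist for square tiles with low-bit elements. The tile sizes are assumptions chosen for illustration (the post does not name any), and `operand_pairs` is a hypothetical helper. An n-by-n tile of b-bit elements can take 2^(b*n*n) values, so a pair of tiles can take the square of that.

```python
def operand_pairs(bits, n):
    """Number of distinct (A, B) pairs for n-by-n tiles of `bits`-bit elements."""
    matrices = 2 ** (bits * n * n)  # distinct values of one tile
    return matrices ** 2            # distinct pairs of tiles


# Assumed tile sizes, for illustration only.
for bits in (4, 8):
    for n in (2, 4):
        print(f"{bits}-bit, {n}x{n} tiles: {operand_pairs(bits, n)} operand pairs")
```

For 4-bit 2x2 tiles this gives 2^32 pairs, and for 4-bit 4x4 tiles 2^128. The count grows quickly with tile size, so how many multiplies can usefully be cached likely depends on how small the tiles are and how often the same operand pairs actually recur.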


