
Hi Mark, the library looks cool, excited to try it out. Coincidentally I am starting work on a project investigating a lot of post-training quantization (PTQ) methods. I read the blog and I am curious to understand what kind of overheads are involved in quantizing a layer?


There's a bunch of overhead associated with PTQ, but the TL;DR is that much of that overhead goes away when you're using `torch.compile()` and `torchao.autoquant()`.

Essentially, the latency overhead comes from quantizing and dequantizing weights and activations. For large layers this overhead is small because quantizing your weights, for example, reduces memory bandwidth pressure. For small layers, though, the cost of potentially looking up a table, reading scaling factors, quantizing/dequantizing, and finally handling zero points might not be worth it.
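
To make the scale/zero-point bookkeeping concrete, here's a minimal sketch of per-tensor affine int8 quantization. The helper names are made up for illustration and are not torchao's API; every extra step is exactly the kind of overhead a small layer may not amortize.

  import torch

  def quantize_affine_int8(x: torch.Tensor):
      # Hypothetical helper: per-tensor affine (asymmetric) int8 quantization.
      # Computing/reading the scale, handling the zero point, rounding,
      # clamping, and casting are all per-layer overhead.
      qmin, qmax = -128, 127
      scale = (x.max() - x.min()) / (qmax - qmin)
      zero_point = qmin - torch.round(x.min() / scale)
      q = torch.clamp(torch.round(x / scale) + zero_point, qmin, qmax).to(torch.int8)
      return q, scale, zero_point

  def dequantize_affine_int8(q, scale, zero_point):
      # Dequantizing back to float adds another read of scale/zero_point plus a cast.
      return (q.to(torch.float32) - zero_point) * scale

  w = torch.randn(4096, 4096)          # a "large" layer: int8 weights cut memory traffic ~4x
  q, s, zp = quantize_affine_int8(w)
  w_hat = dequantize_affine_int8(q, s, zp)
  print((w - w_hat).abs().max())       # reconstruction error stays small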

However, even with that overhead you can still quantize your model and make it smaller; the problem is that it might not be faster. We solve the speed problem in two ways: `torch.compile()` fuses operations like a dequant and a matmul into a single kernel, and `torchao.autoquant()` does kernel-level profiling to see whether a layer is actually made faster by quantizing, and if not it skips quantizing that layer.
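
A minimal sketch of how the two pieces combine, following the usage pattern from torchao's README (exact API details may vary by version; the toy model, shapes, and CUDA assumption are mine):

  import torch
  import torchao

  # Toy model: one large linear (likely worth quantizing) and one small one
  # (autoquant's profiling may decide to leave it in full precision).
  model = torch.nn.Sequential(
      torch.nn.Linear(4096, 4096),
      torch.nn.ReLU(),
      torch.nn.Linear(4096, 16),
  ).cuda().eval()  # assumes a CUDA GPU is available

  # torch.compile fuses dequant + matmul into single kernels; autoquant
  # benchmarks candidate quantization schemes per layer and skips layers
  # that don't actually get faster.
  model = torchao.autoquant(torch.compile(model, mode="max-autotune"))

  x = torch.randn(8, 4096, device="cuda")
  out = model(x)  # first call triggers profiling/compilation; later calls are fast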


I see, thank you for the explanation!




