
Pardon my ignorance, but how do matrix operations on quantized data work? Is hardware support needed?

AFAIU int4 matrix multiplication is supported by CUDA, but I'm not sure about other operations. The blog post mentioned fp6, and I don't think that is supported by CUDA. Or maybe the data are upscaled to something common like fp16 before doing the math?



It's a great question! Int4 is an easy one to understand. PyTorch supports int8 but not int4, so what you can do is "pack" two int4 values into a single int8 value. You still get speedups even without hardware support because you're sending less data to the GPU, and workloads like small-batch-size LLM inference are memory-bandwidth bound rather than compute bound. So your intuition is correct: you pack the values, and before doing a matmul you "unpack" them back into int8 and then upcast to fp16 to do the matmul.
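A minimal sketch of that pack/unpack idea in plain PyTorch (not the library's actual kernels; the pack_int4/unpack_int4 helpers and the per-tensor scale are just illustrative):

  import torch

  def pack_int4(x: torch.Tensor) -> torch.Tensor:
      # x holds int4 values in [-8, 7] stored in an int8 tensor; last dim must be even
      lo = (x[..., 0::2] & 0x0F).to(torch.uint8)   # low nibble
      hi = (x[..., 1::2] & 0x0F).to(torch.uint8)   # high nibble
      return lo | (hi << 4)                        # two 4-bit values per byte

  def unpack_int4(packed: torch.Tensor) -> torch.Tensor:
      lo = (packed & 0x0F).to(torch.int16)
      hi = ((packed >> 4) & 0x0F).to(torch.int16)
      vals = torch.stack([lo, hi], dim=-1).flatten(-2)
      # sign-extend from 4 bits: raw values 8..15 represent negatives
      vals = torch.where(vals >= 8, vals - 16, vals)
      return vals.to(torch.int8)

  # Quantized weights plus a per-tensor scale (a toy quantization scheme for illustration)
  w_int4 = torch.randint(-8, 8, (128, 256), dtype=torch.int8)
  scale = 0.05
  w_packed = pack_int4(w_int4)                 # stored / moved at 4 bits per weight
  assert torch.equal(unpack_int4(w_packed), w_int4)

  # fp16 on GPU; fall back to fp32 on CPU so the demo runs anywhere
  device = "cuda" if torch.cuda.is_available() else "cpu"
  dtype = torch.float16 if device == "cuda" else torch.float32

  x = torch.randn(1, 128, dtype=dtype, device=device)
  w = unpack_int4(w_packed).to(device=device, dtype=dtype) * scale  # upcast just before the matmul
  y = x @ w

The matmul itself still runs in fp16; the savings come from storing and transferring the weights at 4 bits each, which is what matters when inference is memory-bandwidth bound.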




