With quantization, the 20B model fits comfortably in 24 GB of VRAM.
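
A rough back-of-envelope sketch of why that works (the exact numbers depend on the quantization scheme and context length; the overhead figure here is an assumption):

    # Rough VRAM estimate for a 20B-parameter model at ~4-bit quantization.
    params = 20e9                   # 20B parameters
    bytes_per_param = 0.5           # 4 bits per weight
    weights_gb = params * bytes_per_param / 1e9   # ~10 GB of weights
    overhead_gb = 3                 # rough allowance for KV cache, activations, buffers
    print(f"~{weights_gb + overhead_gb:.0f} GB total -> fits on a 24 GB GPU")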

With quantization plus CPU offloading, non-thinking models run acceptably (at about 2-5 tokens per second) even with 8 GB of VRAM.
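
For example, with llama-cpp-python you can offload only part of the model to the GPU and run the remaining layers on the CPU. A minimal sketch; the model filename and layer count are illustrative, so tune n_gpu_layers to whatever fits your card:

    from llama_cpp import Llama

    # Offload only as many layers as fit in 8 GB of VRAM; the rest run on the CPU.
    llm = Llama(
        model_path="gpt-oss-20b-Q4_K_M.gguf",  # hypothetical quantized GGUF file
        n_gpu_layers=12,   # partial offload; -1 would offload all layers
        n_ctx=4096,
    )
    out = llm("Hello", max_tokens=32)
    print(out["choices"][0]["text"])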

Sure, it would be nice to have models at every size imaginable (7B/13B/24B/32B/70B/100B+/1000B+), but 20B and 120B are great sizes.


