Hacker News

> The 20B model runs on my Mac laptop using less than 15GB of RAM.

I was about to try the same. What TPS are you getting and on which processor? Thanks!



gpt-oss-20b: 9 threads, 131072 context window, 4 experts - 35-37 tok/s on M2 Max via LM Studio.


Interestingly, I am also on an M2 Max, and I get ~66 tok/s in LM Studio with the same 131072 context window. I have full offload to GPU, and I also turned on flash attention in advanced settings.


Thank you! Flash attention gives me a boost to ~66 tok/s indeed.


55 tok/s here on an M4 Pro; turning on flash attention puts it at 60 tok/s.


I got 70 tok/s on an M4 Max.


That M4 Max is really something else. I also get 70 tok/s on eval on an RTX 4000 SFF Ada server GPU.
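For anyone reproducing these numbers: LM Studio exposes an OpenAI-compatible server (by default on http://localhost:1234/v1), so throughput can be checked with a stdlib-only script. This is a sketch, not LM Studio's own benchmark; the model identifier below is an assumption, so check the one your install reports (e.g. via GET /v1/models).

```python
import json
import time
import urllib.request


def tokens_per_second(completion_tokens: int, elapsed_s: float) -> float:
    """Throughput in generated tokens per second."""
    return completion_tokens / elapsed_s


def benchmark(base_url: str = "http://localhost:1234/v1",
              model: str = "openai/gpt-oss-20b") -> float:
    """Time one non-streaming completion against a local LM Studio server.

    `model` is an assumed identifier; substitute whatever name your
    LM Studio instance lists for the downloaded model.
    """
    payload = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": "Write a haiku about laptops."}],
        "max_tokens": 256,
        "stream": False,
    }).encode()
    req = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    start = time.perf_counter()
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    elapsed = time.perf_counter() - start
    # The OpenAI-style response reports generated tokens under usage.
    return tokens_per_second(body["usage"]["completion_tokens"], elapsed)


if __name__ == "__main__":
    print(f"{benchmark():.1f} tok/s")
```

Note that wall-clock timing here includes prompt processing, so the result will read slightly below LM Studio's generation-only tok/s figure.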



