
gpt-oss:20b is ~14GB on disk [1], so it fits nicely on a 16GB VRAM card.

[1] https://ollama.com/library/gpt-oss



You also need VRAM for the KV cache that supports the context window; a model with 14GB of parameters might only fit a small (~8k maybe?) context window on a 16GB card.
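
For a rough sense of the budget, here's a back-of-envelope KV cache estimate. The layer/head/dim numbers below are illustrative placeholders, not gpt-oss's published config, and I'm assuming an fp16 cache:

    # Rough KV-cache size estimate; architecture numbers are
    # illustrative placeholders, not gpt-oss's actual config.
    def kv_cache_gb(context_len, n_layers=24, n_kv_heads=8,
                    head_dim=64, bytes_per_elem=2):  # fp16 K and V
        # 2x for keys and values, per layer, per cached token
        total = (2 * n_layers * n_kv_heads * head_dim
                 * bytes_per_elem * context_len)
        return total / 1024**3

    weights_gb = 14  # on-disk size from the Ollama page above
    budget_gb = 16   # card VRAM
    for ctx in (8_192, 32_768, 131_072):
        kv = kv_cache_gb(ctx)
        print(f"{ctx:>7} ctx -> ~{kv:.2f} GB KV cache, "
              f"fits in {budget_gb} GB: {weights_gb + kv <= budget_gb}")

Even with made-up numbers the shape is clear: the cache grows linearly with context length, so a card that holds the weights comfortably at 8k can run out well before the maximum context.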


Thanks, this part is clear to me.

But I still need to understand the throughput part: 20 concurrent streams at ~1k tokens/sec each.

I assume it just might be too early to know the answer.


I legitimately cannot think of any hardware I know of that will get you to that throughput over that many streams (I don't work in the server space, so there may be some new stuff I'm unaware of).


Oh, I totally understand that I'd need multiple GPUs. I'd just want to know which GPU specifically, and how many.


I don't think you can get 1k tokens/sec on a single stream with a 20b model on any consumer-grade GPU. Maybe you could with an H100 or better, but I somewhat doubt it.
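
To put numbers on that: single-stream decode is roughly memory-bandwidth bound, because each generated token has to read every active parameter once. A minimal sanity check, assuming gpt-oss:20b's ~3.6b active parameters per token (it's MoE) and 4-bit weights; treat both as assumptions, and note that batching across streams changes the picture:

    # Back-of-envelope bandwidth needed for 1k tok/s on one stream.
    # Assumptions: ~3.6e9 active params/token (MoE), 4-bit weights.
    active_params = 3.6e9
    bytes_per_param = 0.5
    target_tps = 1_000
    needed = active_params * bytes_per_param * target_tps  # bytes/sec
    print(f"~{needed / 1e12:.1f} TB/s")  # ~1.8 TB/s

    # For scale: RTX 3090 ~0.94 TB/s, RTX 4090 ~1.0 TB/s,
    # H100 SXM ~3.35 TB/s of memory bandwidth.

So ~1.8 TB/s just for weight reads, which is why consumer cards look implausible and even an H100 looks borderline.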

My 2x 3090 setup will get me ~6-10 streams at ~20-40 tokens/sec each (generation) and ~700-1000 tokens/sec (prompt input) with a 32b dense model.
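
If you want to measure this on your own hardware, here's a minimal concurrent-stream benchmark sketch against an OpenAI-compatible endpoint. The URL and model name assume Ollama's defaults; adjust for your server:

    import asyncio, time
    import httpx  # pip install httpx

    URL = "http://localhost:11434/v1/chat/completions"  # Ollama default
    MODEL = "gpt-oss:20b"
    N_STREAMS = 6
    MAX_TOKENS = 256

    async def one_stream(client):
        # One non-streaming completion; returns generated token count.
        r = await client.post(URL, json={
            "model": MODEL,
            "messages": [{"role": "user", "content": "Write a short story."}],
            "max_tokens": MAX_TOKENS,
        }, timeout=300)
        r.raise_for_status()
        return r.json()["usage"]["completion_tokens"]

    async def main():
        async with httpx.AsyncClient() as client:
            start = time.perf_counter()
            counts = await asyncio.gather(
                *(one_stream(client) for _ in range(N_STREAMS)))
            elapsed = time.perf_counter() - start
        total = sum(counts)
        print(f"{total} tokens / {elapsed:.1f}s = "
              f"{total / elapsed:.0f} tok/s aggregate "
              f"({total / elapsed / N_STREAMS:.0f} per stream)")

    asyncio.run(main())

Note this measures end-to-end time including prompt processing, so the aggregate generation number will read slightly low.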



