
Do they bake in the actual weights or the architecture? If it's just the architecture I don't understand where a speedup that considerable can come from.


From their announcement's FAQ item, "Isn't inference bottlenecked on memory bandwidth, not compute?", it sounds like the weights still live in memory; the chip may only have limited on-chip cache for compute. Input tokens go through a batched pipeline to relieve the memory bottleneck, similar to Groq.
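A back-of-envelope sketch of why batching relieves the memory-bandwidth bottleneck: each decode step has to stream the full weight set from memory once regardless of batch size, so throughput scales with batch until compute becomes the limit. All numbers below (model size, bandwidth, FLOP rate) are hypothetical, not from the announcement:

```python
def decode_throughput_tokens_per_s(
    weight_bytes: float,         # total model weight size in bytes
    mem_bw_bytes_per_s: float,   # memory bandwidth
    flops_per_token: float,      # roughly 2 * params per token for a dense model
    peak_flops_per_s: float,     # chip compute throughput
    batch: int,
) -> float:
    # One decode step reads the weights once and computes for `batch` tokens;
    # the step takes whichever is longer, the memory time or the compute time.
    step_time = max(
        weight_bytes / mem_bw_bytes_per_s,           # memory-bound time
        batch * flops_per_token / peak_flops_per_s,  # compute-bound time
    )
    return batch / step_time

# Hypothetical: a 70B-parameter model in 8-bit weights on a chip with
# 3 TB/s of memory bandwidth and 1 PFLOP/s of compute.
params = 70e9
for b in (1, 16, 256):
    tps = decode_throughput_tokens_per_s(params, 3e12, 2 * params, 1e15, b)
    print(f"batch={b:4d}  ~{tps:.0f} tok/s")
```

With these made-up numbers, batch 1 is fully memory-bound (~43 tok/s total), and throughput grows nearly linearly with batch until the compute term takes over around batch ~160.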


They mention something about not wasting die area on unnecessary memory.



