Do they bake in the actual weights or the architecture? If it's just the architecture I don't understand where a speedup that considerable can come from.
From their announcement ("Isn't inference bottlenecked on memory bandwidth, not compute?"), it seems the weights still live in external memory; the chip likely has only limited on-chip cache for compute. Input tokens go through a batched pipeline to relieve the memory-bandwidth bottleneck, similar to Groq.
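
A rough back-of-the-envelope of why batching relieves the bandwidth bottleneck during decode: each step has to stream the full set of weights once regardless of batch size, so a larger batch amortizes that memory traffic across more tokens. All numbers below are made-up assumptions, not Etched or Groq specs:

    # Illustrative roofline sketch (assumed numbers, not real hardware specs)
    WEIGHT_BYTES = 140e9          # ~70B params at fp16
    MEM_BW = 3e12                 # bytes/s of memory bandwidth (assumed)
    COMPUTE = 1e15                # FLOP/s available (assumed)
    FLOPS_PER_TOKEN = 2 * 70e9    # ~2 FLOPs per parameter per token

    def decode_tokens_per_sec(batch):
        # One decode step streams the full weights once, whatever the batch,
        # so memory time is fixed while compute time grows with batch size.
        t_mem = WEIGHT_BYTES / MEM_BW
        t_compute = batch * FLOPS_PER_TOKEN / COMPUTE
        step_time = max(t_mem, t_compute)   # whichever resource is the bottleneck
        return batch / step_time

    for b in (1, 8, 64, 256):
        print(f"batch {b:4d}: ~{decode_tokens_per_sec(b):,.0f} tokens/s")

With these assumed numbers, batch 1 is completely bandwidth-bound (~21 tokens/s), and throughput scales nearly linearly with batch size until the compute side becomes the limit.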