
Do they bake in the actual weights or the architecture? If it's just the architecture I don't understand where a speedup that considerable can come from.


From their announcement's FAQ item, "Isn't inference bottlenecked on memory bandwidth, not compute?", it sounds like the weights still live in memory; the chip may only have limited on-chip cache for compute. Input tokens go through a batched pipeline to relieve the memory bottleneck, similar to Groq.
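A back-of-envelope sketch of why batching relieves the memory-bandwidth bottleneck: each decode step has to stream the full weight set from memory once regardless of batch size, so throughput scales with batch until compute becomes the limit. All numbers below (model size, bandwidth, FLOP rate) are hypothetical, not from the announcement:

```python
def decode_throughput_tokens_per_s(
    weight_bytes: float,         # total model weight size in bytes
    mem_bw_bytes_per_s: float,   # memory bandwidth
    flops_per_token: float,      # roughly 2 * params per token for a dense model
    peak_flops_per_s: float,     # chip compute throughput
    batch: int,
) -> float:
    # One decode step reads the weights once and computes for `batch` tokens;
    # the step takes whichever is longer, the memory time or the compute time.
    step_time = max(
        weight_bytes / mem_bw_bytes_per_s,           # memory-bound time
        batch * flops_per_token / peak_flops_per_s,  # compute-bound time
    )
    return batch / step_time

# Hypothetical: a 70B-parameter model in 8-bit weights on a chip with
# 3 TB/s of memory bandwidth and 1 PFLOP/s of compute.
params = 70e9
for b in (1, 16, 256):
    tps = decode_throughput_tokens_per_s(params, 3e12, 2 * params, 1e15, b)
    print(f"batch={b:4d}  ~{tps:.0f} tok/s")
```

With these made-up numbers, batch 1 is fully memory-bound (~43 tok/s total), and throughput grows nearly linearly with batch until the compute term takes over around batch ~160.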


They mention something about not wasting die area on unnecessary memory.



