Do they bake in the actual weights or the architecture? If it's just the architecture I don't understand where a speedup that considerable can come from.
From their announcement, the section "Isn't inference bottlenecked on memory bandwidth, not compute?" suggests the weights still live in off-chip memory; the chip may only have limited on-chip cache for compute. Input tokens go through a batched pipeline to relieve the memory bottleneck, similar to Groq.
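A back-of-envelope sketch of why batching relieves the memory bottleneck (all figures below are my assumptions, not Etched's: a hypothetical 70B-parameter fp16 model and the H100's spec-sheet HBM bandwidth): one pass over the weights serves every sequence in the batch, so decode throughput scales with batch size until compute becomes the limit.

```python
WEIGHT_BYTES = 70e9 * 2        # hypothetical 70B-parameter model in fp16
HBM_BANDWIDTH = 3.35e12        # H100 SXM HBM3 bandwidth, bytes/s (spec sheet)

# If every decode step must stream all weights from HBM, bandwidth caps
# the number of forward passes per second, regardless of FLOPs available.
passes_per_sec = HBM_BANDWIDTH / WEIGHT_BYTES

for batch in (1, 8, 64):
    # Each pass produces one token per sequence in the batch.
    tokens_per_sec = passes_per_sec * batch
    print(f"batch {batch:>2}: ~{tokens_per_sec:,.0f} tokens/s (bandwidth-bound)")
```

At batch 1 that is only ~24 tokens/s no matter how much compute the chip has, which is why both Groq and (apparently) Sohu lean on pipelining many sequences through at once.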
The short answer to your first question (in my opinion) is no: ASICs and FPGAs are not really equivalent in how they are used across applications. An ASIC is a baked-in circuit; once you make it, it is fixed. You can still build dynamic, runtime-configurable behavior into that fixed circuit (a CPU, for example, or a more complex case, a domain-specific runtime accelerator like Sohu). An FPGA, by contrast, lets you reprogram the circuit itself, via one extra level of abstraction: the logic and routing configuration inside it. In essence, an FPGA makes the ultimate tradeoff of being fully reconfigurable down to the "gate" level at the cost of other things like clock speed, area, transistor size, and power. Those tradeoffs can be so significant that GPUs and ASICs look like the better choice for deep learning inference.
In general, you can implement deep learning accelerators on both FPGAs and ASICs. Xilinx (now AMD) has been slowly adding more and more to its FPGA chips, like an AI engine (vector processors and a network-on-chip) and high-bandwidth memory, on top of the configurable logic already there, to make them viable for companies that want to deploy deep learning alongside regular FPGA processing. I don't know how that shakes out in industry, but I do know that many academics use FPGAs as a good platform for experimenting with and prototyping accelerator architectures.
ASIC just stands for Application-Specific Integrated Circuit. So yes, it is like an FPGA, but it takes longer to turn around a new version because you have to wait for somebody to etch you some silicon; in exchange, you may get higher density than with an FPGA. You can do (very) small volumes on old process nodes for cheap, but if you are trying to track the front of the technology wave with commercially viable shipping quantities, you often need tens of millions of dollars per generation. That means these folks have room for maybe 1-3 generations before their money is gone.
LLMs running on conventional CPUs need lots of fast memory because they perform very large matrix operations with very few arithmetic units, which implies a lot of data motion. Changing that architecture might reduce how much data you need to move, but it isn't at all clear what these people are proposing.
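To make the data-motion point concrete, here is a toy arithmetic-intensity calculation (the layer width is an arbitrary example): an unbatched matrix-vector product does only about one FLOP per weight byte loaded, so it is memory-bound; batching reuses each loaded weight across the whole batch.

```python
def arithmetic_intensity(n, batch):
    """FLOPs per weight byte for an (n x n) fp16 matrix times an (n x batch) input."""
    flops = 2 * n * n * batch   # one multiply-accumulate per weight per column
    bytes_moved = 2 * n * n     # fp16 weight bytes; dominant traffic at small batch
    return flops / bytes_moved

print(arithmetic_intensity(8192, 1))    # matrix-vector: ~1 FLOP per byte
print(arithmetic_intensity(8192, 64))   # batched matmul: ~64 FLOPs per byte
```

At one FLOP per byte, even a modest arithmetic unit starves waiting on memory, which is why the memory system, not raw compute, dominates this workload.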
It also isn't at all obvious why their stuff would be any better than an ordinary vectorized arithmetic unit (often provocatively called a "tensor" chip).
On their announcement page, the section "How can we fit so much more FLOPS on our chip than GPUs?" gives some details. It says "only 3.3% of the transistors on an H100 GPU are used for matrix multiplication". They trade programmability for computation density. And from the "Isn't inference bottlenecked on memory bandwidth, not compute?" section, I'd guess they use tricks similar to Groq's. Looking forward to more architecture details and a comparison with Groq.
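A toy model of that compute-density claim (the 3.3% figure is quoted from the announcement; the specialized chip's share is my assumption): if peak matmul FLOPs scale roughly with the transistor share devoted to multiply units, the headroom is large.

```python
H100_MATMUL_SHARE = 0.033   # fraction of H100 transistors doing matmul, per the post
ASIC_MATMUL_SHARE = 0.50    # hypothetical share on a transformer-only die

# Crude first-order scaling: same transistor budget, bigger matmul share.
speedup = ASIC_MATMUL_SHARE / H100_MATMUL_SHARE
print(f"~{speedup:.0f}x more matmul FLOPs per transistor")
```

This ignores clock, memory, and utilization effects entirely, but it shows why a single-workload chip can claim an order-of-magnitude gap on paper.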
The results sort of speak for themselves: custom ASICs are the way of the future. How hard would it be, though, for Nvidia to design a custom ASIC like this?
No one is sure that the Transformer is the final, best architecture. But you can still run today's AI models on an RTX 3090 or an older RTX 2080, whether the network is an LSTM, RNN, or Transformer. Programmability and compatibility have real value from an economic standpoint.
I hope that in the future, when chip manufacturing cost is no longer the bottleneck for AI, we will have more options.
If you can design and ship ASICs for new problems faster than anyone else, you can become the dominant player. All the new LLM startups just use the OpenAI API interface because OpenAI was first to market.
Actually, it is more along the lines of "if you don't HAVE to keep shipping new ASICs for each new problem because you have a better architecture, you can become the dominant player".
That is how FPGAs became important. And general-purpose CPUs and GPUs, too.