Do they bake in the actual weights or the architecture? If it's just the architecture I don't understand where a speedup that considerable can come from.
From their announcement, the section "Isn't inference bottlenecked on memory bandwidth, not compute?" suggests the weights still live in off-chip memory; the chip may only have limited on-chip cache for compute. Input tokens go through a batched pipeline to relieve the memory bottleneck, similar to Groq.
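A back-of-envelope sketch of why batching relieves the memory bottleneck (all figures below are my assumptions, not Etched's: a hypothetical 70B-parameter fp16 model and the H100's spec-sheet HBM bandwidth): one pass over the weights serves every sequence in the batch, so decode throughput scales with batch size until compute becomes the limit.

```python
WEIGHT_BYTES = 70e9 * 2        # hypothetical 70B-parameter model in fp16
HBM_BANDWIDTH = 3.35e12        # H100 SXM HBM3 bandwidth, bytes/s (spec sheet)

# If every decode step must stream all weights from HBM, bandwidth caps
# the number of forward passes per second, regardless of FLOPs available.
passes_per_sec = HBM_BANDWIDTH / WEIGHT_BYTES

for batch in (1, 8, 64):
    # Each pass produces one token per sequence in the batch.
    tokens_per_sec = passes_per_sec * batch
    print(f"batch {batch:>2}: ~{tokens_per_sec:,.0f} tokens/s (bandwidth-bound)")
```

At batch 1 that is only ~24 tokens/s no matter how much compute the chip has, which is why both Groq and (apparently) Sohu lean on pipelining many sequences through at once.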
The short answer to your first question (in my opinion) is no: ASICs and FPGAs are not really equivalent in how they are used across applications. An ASIC is a baked-in circuit; once you make it, it is fixed. You can still build dynamic, runtime-configurable behavior into that fixed circuit (a CPU, for example, or a more complex case, a domain-specific runtime accelerator like Sohu). An FPGA, by contrast, lets you reprogram the circuit itself, via one extra level of abstraction: the logic and routing configuration inside it. In essence, an FPGA makes the ultimate tradeoff of being fully reconfigurable down to the "gate" level at the cost of other things like clock speed, area, transistor size, and power. Those tradeoffs can be so significant that GPUs and ASICs look like the better choice for deep learning inference.
In general, you can implement deep learning accelerators on both FPGAs and ASICs. Xilinx (now AMD) has been slowly adding more and more to its FPGA chips, like an AI engine (vector processors and a network-on-chip) and high-bandwidth memory, on top of the configurable logic already there, to make them viable for companies that want to deploy deep learning alongside regular FPGA processing. I don't know how that shakes out in industry, but I do know that many academics use FPGAs as a good platform for experimenting with and prototyping accelerator architectures.
ASIC just stands for Application-Specific Integrated Circuit. So yes, it is like an FPGA, but it takes longer to turn around a new version because you have to wait for somebody to etch you some silicon; in exchange, you may get higher density than with an FPGA. You can do (very) small volumes on old process nodes for cheap, but if you are trying to track the front of the technology wave with commercially viable shipping quantities, you often need tens of millions of dollars per generation. That means these folks have room for maybe 1-3 generations before their money is gone.
LLMs running on conventional CPUs need lots of fast memory because they perform very large matrix operations with very few arithmetic units, which implies a lot of data motion. Changing that architecture might reduce how much data you need to move, but it isn't at all clear what these people are proposing.
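To make the data-motion point concrete, here is a toy arithmetic-intensity calculation (the layer width is an arbitrary example): an unbatched matrix-vector product does only about one FLOP per weight byte loaded, so it is memory-bound; batching reuses each loaded weight across the whole batch.

```python
def arithmetic_intensity(n, batch):
    """FLOPs per weight byte for an (n x n) fp16 matrix times an (n x batch) input."""
    flops = 2 * n * n * batch   # one multiply-accumulate per weight per column
    bytes_moved = 2 * n * n     # fp16 weight bytes; dominant traffic at small batch
    return flops / bytes_moved

print(arithmetic_intensity(8192, 1))    # matrix-vector: ~1 FLOP per byte
print(arithmetic_intensity(8192, 64))   # batched matmul: ~64 FLOPs per byte
```

At one FLOP per byte, even a modest arithmetic unit starves waiting on memory, which is why the memory system, not raw compute, dominates this workload.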
It also isn't at all obvious why their stuff would be any better than an ordinary vectorized arithmetic unit (often provocatively called a "tensor" chip).
On their announcement page, the section "How can we fit so much more FLOPS on our chip than GPUs?" gives some details. It says "only 3.3% of the transistors on an H100 GPU are used for matrix multiplication". They trade programmability for computation density. And from the "Isn't inference bottlenecked on memory bandwidth, not compute?" section, I'd guess they use tricks similar to Groq's. Looking forward to more architecture details and a comparison with Groq.
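A toy model of that compute-density claim (the 3.3% figure is quoted from the announcement; the specialized chip's share is my assumption): if peak matmul FLOPs scale roughly with the transistor share devoted to multiply units, the headroom is large.

```python
H100_MATMUL_SHARE = 0.033   # fraction of H100 transistors doing matmul, per the post
ASIC_MATMUL_SHARE = 0.50    # hypothetical share on a transformer-only die

# Crude first-order scaling: same transistor budget, bigger matmul share.
speedup = ASIC_MATMUL_SHARE / H100_MATMUL_SHARE
print(f"~{speedup:.0f}x more matmul FLOPs per transistor")
```

This ignores clock, memory, and utilization effects entirely, but it shows why a single-workload chip can claim an order-of-magnitude gap on paper.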
The results sort of speak for themselves: custom ASICs are the way of the future. How hard would it be, though, for Nvidia to design a custom ASIC like this?
No one is sure that the Transformer is the final, best architecture. But you can still run today's AI models on an RTX 3090 or an older RTX 2080, whether the network is an LSTM, RNN, or Transformer. Programmability and compatibility have real value from an economic standpoint.
I hope that in the future, when chip manufacturing cost is no longer the bottleneck for AI, we will have more options.
If you can design and ship ASICs for new problems faster than anyone else, you can become the dominant player. All the new LLM startups just use the OpenAI API interface because OpenAI was first to market.
Actually, it is more along the lines of "if you don't HAVE to keep shipping new ASICs for each new problem because you have a better architecture, you can become the dominant player".
That is how FPGAs became important. And general-purpose CPUs and GPUs, too.