
First author here. The hardware architectures are realistic - we developed and evaluated real example hardware implementations for them, validated them on FPGA, and they achieved state-of-the-art ResNet performance in a deep learning accelerator system implementation compared to prior accelerators evaluated on similar FPGAs. See the associated accelerator system source code here:

https://github.com/trevorpogue/algebraic-nnhw

The hardware architectures focused on in the paper are systolic array designs, an efficient type of hardware design for matrix multiplication (e.g., the Google TPU uses this), as opposed to more SIMD-like vector architectures such as GPUs. It may be possible to extend the proposed KMM algorithm to other types of hardware architectures in future work. Regarding floating point: this work applies to integer matrix multiplication acceleration; it may be possible to extend the concept to floating-point data types in future work as well.


> systolic array designs, an efficient type of hardware design for matrix multiplication (e.g., the Google TPU uses this), as opposed to more SIMD-like vector architectures like GPUs

this is wrong. TPUv4 has tensor cores just like NVIDIA has tensor cores just like AMD has tensor cores. no one uses a systolic array because bandwidth/connectivity is much scarcer than compute. the only people that keep talking about them are academics that don't actually fab/sell chips.

https://cloud.google.com/tpu/docs/v4

https://www.nvidia.com/en-us/data-center/tensor-cores/

https://rocm.docs.amd.com/projects/rocWMMA/en/latest/what-is...

ninja edit: before you gotcha me with "a tensor core is a systolic array!!!" - most tensor cores are actually outerproduct engines not riffle shuffle engines (or whatever you wanna call the topology corresponding to a systolic array).


https://cloud.google.com/tpu/docs/system-architecture-tpu-vm...

>The primary task for TPUs is matrix processing, which is a combination of multiply and accumulate operations. TPUs contain thousands of multiply-accumulators that are directly connected to each other to form a large physical matrix. This is called a systolic array architecture. Cloud TPU v3, contain two systolic arrays of 128 x 128 ALUs, on a single processor.


I don't see any contradiction between your claim that TPU v3 uses systolic arrays and the parent post's claim that TPU v4 does not.


The TPU obviously uses a systolic array: https://jax-ml.github.io/scaling-book/tpus/


Fair enough - my understanding was they moved away from systolic arrays. I stand corrected. I will also say it is well-known they're basically impossible to program/build a compiler for.


This is why Google has 500 people working on the TPU compiler team.


I have used Karatsuba's & Winograd's inner-product algorithm [0] in my work on wide multi-SIMD integer multipliers and matrix multiplication HW for DSPs. The latter cuts the multiplications roughly in half - n^3/2 instead of n^3. I think the paper talks about its derivative - FFIP.

The issue is memory bandwidth. These techniques indeed help you save multiplier area however the performance is still bandwidth limited - you'd need to be able to feed more data per cycle to increase performance.

One thing the paper doesn't talk about is energy. For DNNs, at the network level, the energy consumed by integer MACs is not that high. Localizing data and computation would have a much greater impact on energy reduction than optimizing MACs.

[0] https://ieeexplore.ieee.org/document/1687427
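The Winograd inner-product trick described above can be sketched in a few lines of Python (function names are mine; this is an illustration of the idea, not the paper's FFIP hardware formulation, and it assumes even-length vectors):

```python
def naive_dot(a, b):
    # n multiplications for length-n vectors
    return sum(x * y for x, y in zip(a, b))

def winograd_dot(a, b):
    # Winograd's inner product: n/2 "pair" multiplications, plus
    # correction terms that depend only on a or only on b. In a matrix
    # multiply the corrections are shared across a whole row/column,
    # which is where the ~2x reduction in multiplications comes from.
    assert len(a) == len(b) and len(a) % 2 == 0
    s = 0
    for i in range(0, len(a), 2):
        s += (a[i] + b[i + 1]) * (a[i + 1] + b[i])
    a_corr = sum(a[i] * a[i + 1] for i in range(0, len(a), 2))
    b_corr = sum(b[i] * b[i + 1] for i in range(0, len(b), 2))
    return s - a_corr - b_corr
```

For integer inputs the result is bit-exact with the naive dot product, e.g. both give 70 for a = [1, 2, 3, 4], b = [5, 6, 7, 8].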


On an FPGA integer adders are much more abundant than integer multipliers. So this algorithm definitely helps get more utilization out of the FPGA. Once the multiplier is small enough, say 3 bits by 3 bits, it can fit into several LUT6's.
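A toy Python illustration of why small multipliers map well to LUTs: a 3-bit-by-3-bit unsigned multiply has only 6 input bits, so it can be expressed as a 64-entry lookup table, which is essentially what packing it into LUT6s does in fabric:

```python
# 3x3-bit unsigned multiplier as a lookup table: 6 input bits -> 64
# entries, each a 6-bit product. This mirrors how a small multiplier
# can be absorbed into FPGA LUTs instead of consuming a DSP block.
MUL3x3 = [a * b for a in range(8) for b in range(8)]

def mul3(a, b):
    assert 0 <= a < 8 and 0 <= b < 8
    return MUL3x3[(a << 3) | b]  # concatenate the two 3-bit inputs
```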


It produces results identical/bit-equivalent to conventional/naive matrix multiplication for integer/fixed-point data types.


For everyone discussing the reduced accuracy/numerical stability of the algorithms in floating point: this is true. But note that the work explores applying the algorithms to fixed-point MM/quantized integer NN inference, not floating-point MM/inference. Hence, there is no reduction in accuracy for that application compared to conventional fixed-point MM.


"Conventional fixed-point MM" is a large suite of algorithms. It is correct that this is a 2x reduction in MULs compared to naive fixed-point matrix multiply, but there is a large body of literature out there with other circuits. This is a cool trick to add to the group.


Inference world is gradually switching from INT formats to FP formats. FP8 is already supported in modern hardware, and FP4 support is coming. In my experiments I get better perplexity in language models with FP4 than with INT4.


How is FP fundamentally different from integers? I've done FPGA programming, and it just seems like the programmer has to decide where/when to do the shifting based on the expected range of the data. I'm not sure how this is "natively supported" in hardware.


If you have designed FPUs you should know that FP computation involves a lot more additional operations than just shifting (e.g. rounding, subnormals, and special value handling). That’s why, for example, CPUs use different hardware blocks for INT vs FP computation.

But that’s not the point. The point is, this particular method to speed up matmul is not suitable for FP.


I'm no expert but I suspect this is wrong. To me, this is like saying you don't need to worry about integer overflow because your operations are only working on fixed integers. Really? You don't care if you multiply or add two large numbers and they spill over?

The more appropriate answer, I suspect, is that the numerical precision and stability sacrifices are more than adequate for normal usage.

If I'm wrong about this, I would certainly like to know.


In hardware, you control your integer widths completely, so if you add two 32-bit ints to a 33-bit int, there is no chance of overflow. The same goes for multiplications, etc.


Yeah with shifts you can guarantee no overflow, but you have to decide under what circumstances is avoiding overflow/clipping worth the loss of precision.

Fixed point typically requires a lot more programming, but sometimes it's worth it if you know what the data ranges are.
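The shift-vs-precision tradeoff being discussed can be sketched with a toy Q-format fixed-point multiply in Python (the Q8 format and names are mine, purely for illustration):

```python
FRAC = 8  # Q8: values stored as integers scaled by 2**8

def to_fixed(x):
    return round(x * (1 << FRAC))

def fixed_mul(a, b):
    # The raw product carries 2*FRAC fractional bits; shifting right
    # by FRAC renormalizes to Q8 but discards the low bits - this is
    # the precision given up in exchange for a bounded word width.
    return (a * b) >> FRAC

def to_float(x):
    return x / (1 << FRAC)
```

For values exactly representable in Q8 the result is exact, e.g. 1.5 * 2.25 comes back as 3.375; values needing more than 8 fractional bits get quantized.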


> you have to build hardware that matches the dimensions of the algorithm

Yes, the benefits are realized in custom hardware designs as opposed to software. However, the hardware architectures work for multiplying matrices of arbitrary dimensions by splitting larger matrices into smaller tiles, then summing the tile products to form the final larger matrix product (i.e., GEMM).
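The tiling described above can be sketched in Python (a generic blocked GEMM, not the paper's specific hardware mapping; the tile size T stands in for the fixed array dimensions):

```python
def matmul_tiled(A, B, T=2):
    # Multiply arbitrary-size matrices by splitting them into T x T
    # tiles and accumulating the tile products - the same blocking a
    # fixed-size hardware array uses to handle larger GEMM problems.
    n, k, m = len(A), len(B), len(B[0])
    C = [[0] * m for _ in range(n)]
    for i0 in range(0, n, T):
        for j0 in range(0, m, T):
            for p0 in range(0, k, T):
                for i in range(i0, min(i0 + T, n)):
                    for j in range(j0, min(j0 + T, m)):
                        for p in range(p0, min(p0 + T, k)):
                            C[i][j] += A[i][p] * B[p][j]
    return C
```

Any tile size gives the same result as the untiled triple loop; only the order of the partial-sum accumulation changes.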


Thanks, good summary. Regarding numerical stability: the application is fixed-point arithmetic, so numerical stability is not an issue (the result is identical to that of the conventional inner product).


IMHO, for fixed-point MM accelerators there is no catch; I think it's an overlooked algorithm. It's based on an algorithm by Winograd, who coincidentally also proposed another, unrelated algorithm that later became very popular for CNN acceleration and may have taken some visibility away from this one. But that is speculative.


On the other hand, if you tried it with floating point, you'd lose significant digits. Since the approach is to sum (a[i] + b[i+1])(a[i+1] + b[i]) and subtract the sums of a[i]a[i+1] and b[i]b[i+1] in the end to get a[i]b[i] + a[i+1]b[i+1], you may be taking the difference of two large values to get a small value, losing precision.
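A small numeric illustration of that cancellation in Python (the values are mine, chosen to sit at the edge of double precision, where the effect is easy to trigger):

```python
# In exact arithmetic, (a0+b1)*(a1+b0) - a0*a1 - b0*b1 == a0*b0 + a1*b1.
# In floating point the intermediates are larger than the result, so
# low-order bits can be lost before the subtractions cancel them back.
a = [2.0**53 - 1, 1.0]
b = [1.0, 1.0]

naive = a[0] * b[0] + a[1] * b[1]  # exact here: 2**53
wino = (a[0] + b[1]) * (a[1] + b[0]) - a[0] * a[1] - b[0] * b[1]
# wino comes out one ulp low: the intermediate difference lands on
# 2**53 + 1, which a double cannot represent, and rounds down.
```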


On a tangent, go is so elegant.


LLM hype and this submission in particular keep making me think of a lecturer I had for Topics in Large Dimensional Data Processing, circa 2016: as I recall he was enthusiastically adamant that the most important thing, breakthroughs etc., in years/decades to come was going to be faster matrix operations. Anyway, I'm pretty sure I recognise FIP (not FFIP of course) from that course.

I wish I could remember his name; I believe he left academia after my year and went to work in industry, and I'd be curious to see what he's up to now. I'm not saying it was a particularly novel or prescient comment - we may not have had quite such ML hype then, but 'big data' was certainly all the rage at the time - it's just something that's stuck in my mind. One of those areas I always meant to study more; realistically I probably never had the mathematical chops for it, and those I did have have certainly atrophied.


Maybe I’m joking, but: our society is just a vehicle for economics at this point, our economy is built around science, our science has mostly been turned into observations about engineering, some time ago we changed all of engineering into differential equations, and differential equations can be solved by discretizing them and doing linear algebra, and most of linear algebra can be done with matrix multiplications (triangular solves and orthonormalizations if you are fancy). All you need is matmul.


> our science has mostly been turned into observations about engineering

You may be joking but that in particular seems pretty astute.

Superficially it seems accurate, and reasonably ascribable to economic forces: fewer concentrations of capital in individuals (aristocrats) spending it on a hobby interest or academic pursuit of their own. Today's equivalents mostly prefer philanthropy (Musk is, I suppose, for whatever else you might think of him, a notable exception, preferring to explore space, AI, etc., not really, it seems, for personal monetary gain). But I wonder if that is fair to modern scientists, or if it's just that all the low-hanging stuff has been done?


For the life sciences you need grad students/postdocs to do the grunt work of pipetting, dissecting, plating, etc. And whatever the equivalent is in chemistry (titration/GC/mass transfer, I guess)?

But those tools created by engineers are pretty darn important, and allow plenty of experiments/observations to be performed that were previously out of reach.


So, what you're saying is ...

... that the Matrix creates the world around us.

Thanks.


Unless he was a guest lecturer, if the course was for credit, wouldn't his name appear on your official transcript?


I don't think so, this may be the case in your country of course. I may well have recorded it in my notes if I dig them out, but this was a fourth year course and they certainly degraded over the years.


Personally I love my overpriced Samsung Z Fold. I don't use a laptop anymore (just a desktop), I can easily read double-column research articles wherever I am, it's great for drawing diagrams, and all of that without having to remember to bring both a phone and a tablet everywhere you go.


CCX stands for Core Complex

CCD stands for Core Complex Die (and neither term refers to the IO die)


Whoof, oops, thanks. I'd been using CCD and IOD as terms until this post, but had a "omgosh, I've been doing it wrong" panic & changed into what we have here. My mistake. Thank you for correcting us back!!

https://hn.algolia.com/?query=rektide%20ccd&type=comment


This can be resolved by changing the cells' format from General to Text. This makes the cells display the text exactly as entered. Select the relevant cells -> right click on them -> Format Cells... -> Text -> Ok


One good approach could be to base the architecture on the TPU v1 from [1]. There are also open-source accelerators you could draw inspiration from, for example [2][3]. If you want to do less work / not hand-code the RTL yourself, you could look into methods for automatically mapping OpenCL to an FPGA accelerator architecture (a service like [3] also provides pre-designed architectures for multiple FPGAs).

[1] https://arxiv.org/abs/1704.04760

[2] https://github.com/jofrfu/tinyTPU

[3] https://github.com/tensil-ai/tensil


Thanks!

