Hacker News | danlark1's comments

Surprisingly, I didn't put 2023; it was merged with another submission, possibly with the help of the mods.


FYI: It's possible for that to be edited by others.


Disclaimer: the author of the comment is the founder and CTO of ClickHouse


And all their comments are shilling Clickhouse either directly or via a project built on top of it, without disclosure.


Considering that it's an open source tool, I don't know if it's that bad to be shilling for the commons, basically.


I've edited my profile to provide a link to GitHub.


Disclosure, not disclaimer.

They want to own the claims made.


Yes, sorry, it should be "disclosure"


Great work!

There's a popular narrative that NEON does not have a movemask alternative. Some time ago I published an article on simulating the popular bitmask use cases with NEON in 1-2 instructions. It does not cover the unpacking cases, but it works great for real-world patterns like compare+find, compare+iterate, and compare+test.

https://community.arm.com/arm-community-blogs/b/servers-and-...
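
For reference, a minimal sketch of the vshrn-based bitmask trick described there (my own helper name; it assumes a uint8x16_t comparison result as input and yields 4 bits per byte lane):

  #include <arm_neon.h>
  #include <stdint.h>

  // Narrowing shift by 4 packs the 16-byte comparison mask into a 64-bit
  // value with one nibble (all ones or all zeros) per original byte lane.
  static inline uint64_t neon_movemask_nibbles(uint8x16_t eq_mask) {
    uint8x8_t packed = vshrn_n_u16(vreinterpretq_u16_u8(eq_mask), 4);
    return vget_lane_u64(vreinterpret_u64_u8(packed), 0);
  }

The first match index is then `__builtin_ctzll` of the returned value divided by 4, mirroring the movemask+tzcnt idiom on x86.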


I never understood why they couldn't just include a movmskb instruction to begin with. It's massively useful for integer tasks, not expensive to implement as far as I know, and the vshrn+mov trick often requires an extra instruction either in front or after (in addition to just being, well, pretty obscure).

NEON in general is a bit sad, really; it's built around the idea of being implementable with a 64-bit ALU, and it shows. And SVE support is pretty much non-existent on the client.


Not having (1 << k) - 1 as a single instruction sucks when it HAS to be in a hot loop, but you can usually hoist it to the loop prologue: my stuff uses dummy inline assembly hints to force compilers to do this, e.g. `asm volatile("" : "+w"(m));`.
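
A minimal sketch of that kind of hint, with a hypothetical NEON kernel (the loop body and names are made up; the point is only the dummy asm that pins the hoisted mask in a register):

  #include <arm_neon.h>
  #include <stddef.h>
  #include <stdint.h>

  void mask_low_nibbles(uint8_t *dst, const uint8_t *src, size_t n) {
    uint8x16_t m = vdupq_n_u8(0x0f);
    asm volatile("" : "+w"(m));  // opaque to the optimizer: m stays in a register
    for (size_t i = 0; i + 16 <= n; i += 16) {
      vst1q_u8(dst + i, vandq_u8(vld1q_u8(src + i), m));
    }
  }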

I personally think calibrating ARM's ISA on smaller VL was a good choice: you get much better IPC. You also have an almost-complete absence of support for 8-bit elements with x86 ISAs, so elements per instruction is tied. And NEON kind-of-ish makes up for its small VL with multi-register TBL/TBX and LDP/STP.

Also: AVX-512 is just as non-existent on clients as SVE2, although that's not really relevant for the server-side targets I'm optimising for (mostly OLAP).


8-bit is not absent in x86 SIMD; it is slightly less covered than 32- and 16-bit, but you can fully implement all the common 8-bit ops and most are a single instruction (with AVX2). There are even various horizontal ops on 8-bit values (avg, dot, etc.).

Also, AVX-512 is way more common than SVE2: all Zen 4 and Zen 5 chips support it.


More specifically, basically the only absent 8-bit ops that have 32-bit equivalents in AVX2 are shifts and multiplies. Shifts are quite annoying (though, with a uniform shift they can be emulated on AVX-512 via GFNI abuse in 1 instr), multiplies are rather rare (though note that there is vpmaddubsw for an 8-bit→16-bit multiply-add). There's even a case of the opposite - saturating add/sub exist for 8-bit and 16-bit ints, but not wider.
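
For example, a common AVX2 workaround for the missing 8-bit shift (not the GFNI trick, which needs AVX-512) is a 16-bit shift followed by a mask that clears the bits that leaked across byte boundaries:

  #include <immintrin.h>

  // Shift every byte left by the compile-time constant C.
  template <int C>
  __m256i slli_epi8(__m256i v) {
    const __m256i mask = _mm256_set1_epi8((char)(0xFF << C));
    return _mm256_and_si256(_mm256_slli_epi16(v, C), mask);
  }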


GFNI is distinct from AVX-512; it was merely introduced in cores that also had AVX-512.


Does _any_ SIMD instruction set have (1 << k) - 1 as a single instruction?


Not sure in which context this is used, but you can do -1 << k in most ISAs; that still requires a bit-not. However, if you want to use the value in a bitwise instruction, there are often variants that can invert the input operand.

E.g. in RVV, instead of vand.vv(a, vadd.vi(vsll.vv(1,k), -1)) you could do vandn.vv(a, vsll.vv(-1,k)).

AVX-512 can do this with any binary or ternary bitwise logic function via vpternlog
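
For instance, keeping the low k bits of each lane is one shift plus one vpternlogd (a sketch assuming AVX-512F; imm8 0x30 encodes f(A, B, C) = A AND NOT B):

  #include <immintrin.h>

  __m512i keep_low_k_bits(__m512i a, __m512i k) {
    __m512i hi = _mm512_sllv_epi32(_mm512_set1_epi32(-1), k);  // -1 << k per lane
    return _mm512_ternarylogic_epi32(a, hi, hi, 0x30);         // a & ~(-1 << k)
  }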


I don't know either; I was talking about the lack of PMOVMSKB in NEON, and then the comment I replied to started talking about how not having (1 << k) - 1 as a single instruction sucks. And I don't think it is a single instruction in any non-NEON ISA either.


Nice article! I personally find the ARM ISA far more cohesive than x86's: far fewer historical quirks. I also really appreciate the ubiquity of support for 8-bit elements in ARM and the absence of SMT (it makes performance much more predictable).


I am the author of the partial sorting and selection optimization in ClickHouse. It uses the Floyd-Rivest algorithm; we tried a lot of different things at the time, see [1].

Overall, ClickHouse reads blocks of a fixed size (64k), finds the top elements of each, and then takes the top of the tops until it converges.

[1] https://danlark.org/2020/11/11/miniselect-practical-and-gene...
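
Roughly, the block-wise scheme looks like this (an illustrative sketch, not the actual ClickHouse code; ClickHouse plugs a Floyd-Rivest selection in where std::nth_element appears here):

  #include <algorithm>
  #include <cstddef>
  #include <vector>

  std::vector<int> top_k_blockwise(const std::vector<int>& data, std::size_t k,
                                   std::size_t block_size = 65536) {
    std::vector<int> best;  // current candidates, at most k + block_size of them
    for (std::size_t i = 0; i < data.size(); i += block_size) {
      std::size_t end = std::min(i + block_size, data.size());
      best.insert(best.end(), data.begin() + i, data.begin() + end);
      if (best.size() > k) {
        std::nth_element(best.begin(), best.begin() + k, best.end());
        best.resize(k);  // "top of the top": only the winners survive
      }
    }
    std::sort(best.begin(), best.end());
    return best;
  }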


Hi, I thought this was not the most interesting part compared to all the benchmarks. For correctness, all we need is to check the quotient and the remainder: dividend = quotient * divisor + remainder, with remainder < divisor and the multiplication not overflowing; that check is itself free of division operations.

Still, I added several tests (dividend < divisor, remainders close to zero, a lot of random inputs) just to make sure that each time I add a new approach, it is correct.
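
In code, the whole division-free check is tiny (a sketch using GCC/Clang overflow builtins; the (q, r) pair would come from the routine under test):

  #include <cassert>

  using u128 = unsigned __int128;

  // (q, r) is the correct result iff q * divisor + r == dividend, r < divisor,
  // and neither the multiplication nor the addition overflows.
  void check_udivmod(u128 dividend, u128 divisor, u128 q, u128 r) {
    u128 prod, sum;
    assert(divisor != 0 && r < divisor);
    assert(!__builtin_mul_overflow(q, divisor, &prod));
    assert(!__builtin_add_overflow(prod, r, &sum));
    assert(sum == dividend);
  }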


Nicely done.

FWIW, I dimly recall a numeric library, maybe a float to string, which tested everything for verification. Took a few days to run.

Then maybe use the spot checks to test for regressions. Weird compiler, toolchain, processor combos. That sort of thing.


> FWIW, I dimly recall a numeric library, maybe a float to string, which tested everything for verification. Took a few days to run.

You can definitely test against all ~2^31 IEEE 754 binary32 values to make sure that float-to-decimal conversion is correct; that's what I've done with the Rust standard library (it took 2 hours per test). I believe testing all ~2^63 binary64 values is also feasible by now, but only with dedicated hardware. For that reason I believe the library had only been tested with binary32.
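
As an illustration, an exhaustive binary32 check fits in a few lines; this is a self-contained round-trip sketch (printf with 9 significant digits, then strtof back), not the actual test used for Rust:

  #include <cassert>
  #include <cstdint>
  #include <cstdio>
  #include <cstdlib>
  #include <cstring>

  int main() {
    for (uint64_t bits = 0; bits <= UINT32_MAX; ++bits) {
      uint32_t b = (uint32_t)bits;
      float x;
      std::memcpy(&x, &b, sizeof x);
      if (x != x) continue;  // NaN payloads are not expected to round-trip
      char buf[64];
      std::snprintf(buf, sizeof buf, "%.9g", x);
      float y = std::strtof(buf, nullptr);
      assert(std::memcmp(&x, &y, sizeof x) == 0);  // bit-exact round-trip
    }
  }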


What kind of hardware can crank through the 2^64 space in reasonable time?


It’s actually surprisingly feasible. 1,000 cores running at 3GHz for a week do ~2^64 cycles.


Don't you mean 10,000 cores?


Whoops, yes! Too late to edit.
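
For the record, the corrected back-of-the-envelope:

  10,000 cores × 3×10^9 cycles/s × 604,800 s/week ≈ 1.8×10^19 ≈ 2^64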


Hi, I believe the same can be done for 32-bit platforms with the multiword division from Knuth (https://skanthak.homepage.t-online.de/division.html), which is what I chose as the fallback for non-x86_64 platforms.
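
For intuition, the generic no-hardware-divider path boils down to binary long division (a schoolbook sketch only; the real fallback is the much faster Knuth Algorithm D, and __int128 is used here just for brevity):

  using u128 = unsigned __int128;

  // Bit-at-a-time: shift the remainder left, bring down one dividend bit,
  // and subtract the divisor whenever it fits. Assumes d != 0.
  u128 udiv128_bitwise(u128 n, u128 d, u128* rem) {
    u128 q = 0, r = 0;
    for (int i = 127; i >= 0; --i) {
      r = (r << 1) | ((n >> i) & 1);
      if (r >= d) {
        r -= d;
        q |= (u128)1 << i;
      }
    }
    *rem = r;
    return q;
  }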

I will try to implement the same optimizations in Rust in the upcoming weeks

UPD And we opened an issue :) https://github.com/rust-lang/compiler-builtins/issues/368


Many years ago I implemented a set of multiple-precision integer routines in Fortran. For division, I slavishly copied Knuth's routine, using all the sample divisions in his book as tests, and it worked fine.


Hi everyone, the author here. Yes, I believe the title should be changed to `Optimizing 128-bit Division`.

Still, I was not expecting it to show up here. Overall, I put into the article some knowledge hidden in the Hacker's Delight book, Knuth, GMP and GNU, together with my own knowledge of low-level optimizations. In the end it turned out to be a good thing to write and to submit to LLVM.


You might also want to check that 128-bit multiplies are implemented as only two 64-bit multiplies and a couple of additions (assuming no overflow; three multiplies if you need to get the right answer mod 2^128).
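
A sketch of that decomposition (the a_hi * b_hi term only affects bits 128 and above of the full product, so it is never needed for the low half):

  #include <cstdint>

  using u128 = unsigned __int128;

  // Low 128 bits of a 128x128 product: one widening 64x64->128 multiply
  // for lo*lo plus two plain 64-bit multiplies for the cross terms.
  u128 mul128_lo(u128 a, u128 b) {
    uint64_t a_lo = (uint64_t)a, a_hi = (uint64_t)(a >> 64);
    uint64_t b_lo = (uint64_t)b, b_hi = (uint64_t)(b >> 64);
    u128 lo = (u128)a_lo * b_lo;
    uint64_t cross = a_lo * b_hi + a_hi * b_lo;  // only the low 64 bits matter
    return lo + ((u128)cross << 64);
  }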


There is an `if` statement in a `for` loop that looks like it could be eliminated. Abbreviating:

  if (dd.all >= dr.all) {
    dd.all -= dr.all;
    quo.s.low |= 1;
  }
could be

  bool c = dd.all >= dr.all;
  dd.all -= (-c & dr.all);
  quo.s.low |= c;

Clang might well compile either version to `cmov` instructions.


Great article, I read to the end.

I faced a similar problem implementing 64/64 division on 32-bit x86. I cribbed some code off the internet and came up with this:

https://github.com/titzer/virgil/blob/2c674dbe3512da23c4a644...

This is the code that my compiler generates for a 64/64 divide when it doesn't know that the divisor fits in 32 bits. It works by using the x86 64/32 div instruction. The same trick would work on x86-64, using its 128/64 div.
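
For reference, a sketch of that x86-64 analogue with inline asm (`divq` divides rdx:rax by a 64-bit operand; it faults if the quotient doesn't fit in 64 bits, so the caller must guarantee hi < divisor):

  #include <cstdint>

  inline uint64_t div128_by_64(uint64_t hi, uint64_t lo, uint64_t divisor,
                               uint64_t* remainder) {
    uint64_t quot, rem;
    __asm__("divq %[d]"
            : "=a"(quot), "=d"(rem)
            : [d] "r"(divisor), "a"(lo), "d"(hi));
    *remainder = rem;
    return quot;
  }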

I am not sure how good the machine code you get out of your C++ is in the end, but it's hard to beat this handwritten assembly. Of course, you can do better on x86-64 because there are more registers and you don't have to spill onto the stack.


What's the license? Hopefully, Boost or Boost compatible?



Thank you. We've standardized on the Boost license for D, as it is the least restrictive, and well-accepted in the C++ community.

Is it possible you can make a Boost licensed version of it so we can add it to D?

https://www.boost.org/LICENSE_1_0.txt


That's a difficult question. As I work at Google, I need to get approval for each piece of open source I want to publish on my own behalf. (This does not apply to contributions to projects on the approved list; for those I only need to report the contributions.)

If you want to work around this for now, I suggest looking into libdivide (https://github.com/ridiculousfish/libdivide); it is published under the Boost license, and the library contains all the needed pieces I described in the article (unfortunately, not combined).


Thanks for the pointers. I suspect Google would be ok with Boost, since they are a C++ house and Boost is the major library. A big reason D picked it was because Boost is corporate lawyer approved.

