Hacker News | danlark1's comments

Surprisingly, I didn't put 2023; it was merged with another submission, possibly with the help of the mods.


FYI: It's possible for that to be edited by others.


Disclaimer: the author of the comment is the founder and CTO of ClickHouse


And all their comments are shilling Clickhouse either directly or via a project built on top of it, without disclosure.


Considering that it's an open source tool, I don't know if it's that bad to be shilling for the commons, basically.


I've edited my profile to provide a link to GitHub.


Disclosure, not disclaimer.

They want to own the claims made.


Yes, sorry, it should be "disclosure"


Great work!

There's a popular narrative that NEON does not have a movemask alternative. Some time ago I published an article on simulating the popular bitmask use cases with NEON in 1-2 instructions. It does not cover the unpacking cases, but it works great for real-world patterns like compare+find, compare+iterate, and compare+test.

https://community.arm.com/arm-community-blogs/b/servers-and-...
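
For reference, a minimal sketch of the vshrn-based bitmask trick described there (my own helper name; it assumes a uint8x16_t comparison result as input and yields 4 bits per byte lane):

  #include <arm_neon.h>
  #include <stdint.h>

  // Narrowing shift by 4 packs the 16-byte comparison mask into a 64-bit
  // value with one nibble (all ones or all zeros) per original byte lane.
  static inline uint64_t neon_movemask_nibbles(uint8x16_t eq_mask) {
    uint8x8_t packed = vshrn_n_u16(vreinterpretq_u16_u8(eq_mask), 4);
    return vget_lane_u64(vreinterpret_u64_u8(packed), 0);
  }

The first match index is then `__builtin_ctzll` of the returned value divided by 4, mirroring the movemask+tzcnt idiom on x86.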


I never understood why they couldn't just include a movmskb instruction to begin with. It's massively useful for integer tasks, not expensive to implement as far as I know, and the vshrn+mov trick often requires an extra instruction either in front or after (in addition to just being, well, pretty obscure).

NEON in general is a bit sad, really; it's built around the idea of being implementable with a 64-bit ALU, and it shows. And SVE support is pretty much non-existent on the client.


Not having (1 << k) - 1 as a single instruction sucks when it HAS to be in a hot loop, but you can usually hoist it to the loop prologue: my stuff uses dummy inline assembly hints to force compilers to do this, e.g. `asm volatile("" : "+w"(m));`.
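
A minimal sketch of that kind of hint, with a hypothetical NEON kernel (the loop body and names are made up; the point is only the dummy asm that pins the hoisted mask in a register):

  #include <arm_neon.h>
  #include <stddef.h>
  #include <stdint.h>

  void mask_low_nibbles(uint8_t *dst, const uint8_t *src, size_t n) {
    uint8x16_t m = vdupq_n_u8(0x0f);
    asm volatile("" : "+w"(m));  // opaque to the optimizer: m stays in a register
    for (size_t i = 0; i + 16 <= n; i += 16) {
      vst1q_u8(dst + i, vandq_u8(vld1q_u8(src + i), m));
    }
  }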

I personally think calibrating ARM's ISA on smaller VL was a good choice: you get much better IPC. You also have an almost-complete absence of support for 8-bit elements with x86 ISAs, so elements per instruction is tied. And NEON kind-of-ish makes up for its small VL with multi-register TBL/TBX and LDP/STP.

Also: AVX-512 is just as non-existent on clients as SVE2, although that's not really relevant for the server-side targets I'm optimising for (mostly OLAP).


8-bit is not absent in x86 SIMD; it is slightly less covered than 32- and 16-bit, but you can fully implement all the common 8-bit ops and most are a single instruction (with AVX2). There are even various horizontal ops on 8-bit values (avg, dot, etc.).

Also, AVX-512 is way more common than SVE2: all Zen 4 and Zen 5 chips support it.


More specifically, basically the only absent 8-bit ops that have 32-bit equivalents in AVX2 are shifts and multiplies. Shifts are quite annoying (though, with a uniform shift they can be emulated on AVX-512 via GFNI abuse in 1 instr), multiplies are rather rare (though note that there is vpmaddubsw for an 8-bit→16-bit multiply-add). There's even a case of the opposite - saturating add/sub exist for 8-bit and 16-bit ints, but not wider.
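
For example, a common AVX2 workaround for the missing 8-bit shift (not the GFNI trick, which needs AVX-512) is a 16-bit shift followed by a mask that clears the bits that leaked across byte boundaries:

  #include <immintrin.h>

  // Shift every byte left by the compile-time constant C.
  template <int C>
  __m256i slli_epi8(__m256i v) {
    const __m256i mask = _mm256_set1_epi8((char)(0xFF << C));
    return _mm256_and_si256(_mm256_slli_epi16(v, C), mask);
  }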


GFNI is distinct from AVX-512; it was merely introduced in cores that also had AVX-512.


Does _any_ SIMD instruction set have (1 << k) - 1 as a single instruction?


Not sure in which context this is used, but you can do -1 << k in most ISAs; that still requires a bit-not. However, if you want to use the value in a bitwise instruction, there are often variants that can invert the input operand.

E.g. in RVV, instead of vand.vv(a, vadd.vi(vsll.vv(1,k), -1)) you could do vandn.vv(a, vsll.vv(-1,k)).

AVX-512 can do this with any binary or ternary bitwise logic function via vpternlog
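
For instance, keeping the low k bits of each lane is one shift plus one vpternlogd (a sketch assuming AVX-512F; imm8 0x30 encodes f(A, B, C) = A AND NOT B):

  #include <immintrin.h>

  __m512i keep_low_k_bits(__m512i a, __m512i k) {
    __m512i hi = _mm512_sllv_epi32(_mm512_set1_epi32(-1), k);  // -1 << k per lane
    return _mm512_ternarylogic_epi32(a, hi, hi, 0x30);         // a & ~(-1 << k)
  }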


I don't know either; I was talking about the lack of PMOVMSKB in NEON, and then the comment I replied to started talking about how not having (1 << k) - 1 as a single instruction sucks. And I don't think it is a single instruction in any non-NEON ISA either.


Nice article! I personally find the ARM ISA far more cohesive than x86's: far fewer historical quirks. I also really appreciate the ubiquity of support for 8-bit elements in ARM and the absence of SMT (it makes performance much more predictable).


I am the author of the partial sorting and selection optimization in ClickHouse. It uses the Floyd-Rivest algorithm; we tried a lot of different things at the time, see [1].

Overall, ClickHouse reads blocks of a fixed size (64k), finds the top elements of each, and then takes the top of the tops until it converges.

[1] https://danlark.org/2020/11/11/miniselect-practical-and-gene...
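
Roughly, the block-wise scheme looks like this (an illustrative sketch, not the actual ClickHouse code; ClickHouse plugs a Floyd-Rivest selection in where std::nth_element appears here):

  #include <algorithm>
  #include <cstddef>
  #include <vector>

  std::vector<int> top_k_blockwise(const std::vector<int>& data, std::size_t k,
                                   std::size_t block_size = 65536) {
    std::vector<int> best;  // current candidates, at most k + block_size of them
    for (std::size_t i = 0; i < data.size(); i += block_size) {
      std::size_t end = std::min(i + block_size, data.size());
      best.insert(best.end(), data.begin() + i, data.begin() + end);
      if (best.size() > k) {
        std::nth_element(best.begin(), best.begin() + k, best.end());
        best.resize(k);  // "top of the top": only the winners survive
      }
    }
    std::sort(best.begin(), best.end());
    return best;
  }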


Hi, I thought this was not the most interesting part compared to all the benchmarks. For correctness, all we need is to check the quotient and the remainder: dividend = quotient * divisor + remainder, with remainder < divisor and the multiplication not overflowing; that check is itself free of division operations.

Still, I added several tests (dividend < divisor, remainders close to zero, a lot of random inputs) just to make sure that each time I add a new approach, it is correct.
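
In code, the whole division-free check is tiny (a sketch using GCC/Clang overflow builtins; the (q, r) pair would come from the routine under test):

  #include <cassert>

  using u128 = unsigned __int128;

  // (q, r) is the correct result iff q * divisor + r == dividend, r < divisor,
  // and neither the multiplication nor the addition overflows.
  void check_udivmod(u128 dividend, u128 divisor, u128 q, u128 r) {
    u128 prod, sum;
    assert(divisor != 0 && r < divisor);
    assert(!__builtin_mul_overflow(q, divisor, &prod));
    assert(!__builtin_add_overflow(prod, r, &sum));
    assert(sum == dividend);
  }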


Nicely done.

FWIW, I dimly recall a numeric library, maybe a float to string, which tested everything for verification. Took a few days to run.

Then maybe use the spot checks to test for regressions. Weird compiler, toolchain, processor combos. That sort of thing.


> FWIW, I dimly recall a numeric library, maybe a float to string, which tested everything for verification. Took a few days to run.

You can definitely test against all ~2^31 IEEE 754 binary32 values to make sure that float-to-decimal conversion is correct; that's what I've done with the Rust standard library (it took 2 hours per test). I believe testing all ~2^63 binary64 values is also feasible by now, but only with dedicated hardware. For that reason I believe the library had only been tested with binary32.
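
As an illustration, an exhaustive binary32 check fits in a few lines; this is a self-contained round-trip sketch (printf with 9 significant digits, then strtof back), not the actual test used for Rust:

  #include <cassert>
  #include <cstdint>
  #include <cstdio>
  #include <cstdlib>
  #include <cstring>

  int main() {
    for (uint64_t bits = 0; bits <= UINT32_MAX; ++bits) {
      uint32_t b = (uint32_t)bits;
      float x;
      std::memcpy(&x, &b, sizeof x);
      if (x != x) continue;  // NaN payloads are not expected to round-trip
      char buf[64];
      std::snprintf(buf, sizeof buf, "%.9g", x);
      float y = std::strtof(buf, nullptr);
      assert(std::memcmp(&x, &y, sizeof x) == 0);  // bit-exact round-trip
    }
  }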


What kind of hardware can crank through the 2^64 space in reasonable time?


It’s actually surprisingly feasible. 1,000 cores running at 3GHz for a week do ~2^64 cycles.


Don't you mean 10,000 cores?


Whoops, yes! Too late to edit.
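
For the record, the corrected back-of-the-envelope:

  10,000 cores × 3×10^9 cycles/s × 604,800 s/week ≈ 1.8×10^19 ≈ 2^64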


Hi, I believe the same can be done for 32-bit platforms with the multiword division from Knuth (https://skanthak.homepage.t-online.de/division.html), which is what I chose as the fallback for non-x86_64 platforms.
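
For intuition, the generic no-hardware-divider path boils down to binary long division (a schoolbook sketch only; the real fallback is the much faster Knuth Algorithm D, and __int128 is used here just for brevity):

  using u128 = unsigned __int128;

  // Bit-at-a-time: shift the remainder left, bring down one dividend bit,
  // and subtract the divisor whenever it fits. Assumes d != 0.
  u128 udiv128_bitwise(u128 n, u128 d, u128* rem) {
    u128 q = 0, r = 0;
    for (int i = 127; i >= 0; --i) {
      r = (r << 1) | ((n >> i) & 1);
      if (r >= d) {
        r -= d;
        q |= (u128)1 << i;
      }
    }
    *rem = r;
    return q;
  }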

I will try to implement the same optimizations in Rust in the upcoming weeks

UPD And we opened an issue :) https://github.com/rust-lang/compiler-builtins/issues/368


Many years ago I implemented a set of multiple-precision integer routines in Fortran. For division, I slavishly copied Knuth's routine, using all the sample divisions in his book as tests, and it worked fine.


Hi everyone, the author here. Yes, I believe the title should be changed to `Optimizing 128-bit Division`.

Still, I was not expecting it to show up here. Overall, I put into the article some knowledge hidden in the Hacker's Delight book, Knuth, GMP and GNU, together with my own knowledge of low-level optimizations. In the end it turned out to be a good thing to write and to submit to LLVM.


You might also want to check that 128-bit multiplies are implemented as only two 64-bit multiplies and a couple of additions (assuming no overflow; three multiplies if you need to get the right answer mod 2^128).
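
A sketch of that decomposition (the a_hi * b_hi term only affects bits 128 and above of the full product, so it is never needed for the low half):

  #include <cstdint>

  using u128 = unsigned __int128;

  // Low 128 bits of a 128x128 product: one widening 64x64->128 multiply
  // for lo*lo plus two plain 64-bit multiplies for the cross terms.
  u128 mul128_lo(u128 a, u128 b) {
    uint64_t a_lo = (uint64_t)a, a_hi = (uint64_t)(a >> 64);
    uint64_t b_lo = (uint64_t)b, b_hi = (uint64_t)(b >> 64);
    u128 lo = (u128)a_lo * b_lo;
    uint64_t cross = a_lo * b_hi + a_hi * b_lo;  // only the low 64 bits matter
    return lo + ((u128)cross << 64);
  }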


There is an `if` statement in a `for` loop that looks like it could be eliminated. Abbreviating:

  if (dd.all >= dr.all) {
    dd.all -= dr.all;
    quo.s.low |= 1;
  }
could be

  bool c = dd.all >= dr.all;
  dd.all -= (-c & dr.all);
  quo.s.low |= c;

Clang might well compile either version to `cmov` instructions.


Great article, I read to the end.

I faced a similar problem implementing 64/64 division on 32-bit x86. I cribbed some code off the internet and came up with this:

https://github.com/titzer/virgil/blob/2c674dbe3512da23c4a644...

This is the code that my compiler generates for a 64/64 divide when it doesn't know that the divisor fits in 32 bits. It works by using the x86 64/32 div instruction. The same trick would work on x86-64, using its 128/64 div.
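
For reference, a sketch of that x86-64 analogue with inline asm (`divq` divides rdx:rax by a 64-bit operand; it faults if the quotient doesn't fit in 64 bits, so the caller must guarantee hi < divisor):

  #include <cstdint>

  inline uint64_t div128_by_64(uint64_t hi, uint64_t lo, uint64_t divisor,
                               uint64_t* remainder) {
    uint64_t quot, rem;
    __asm__("divq %[d]"
            : "=a"(quot), "=d"(rem)
            : [d] "r"(divisor), "a"(lo), "d"(hi));
    *remainder = rem;
    return quot;
  }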

I am not sure how good the machine code you get out of your C++ is in the end, but it's hard to beat this handwritten assembly. Of course, you can do better on x86-64 because there are more registers and you don't have to spill onto the stack.


What's the license? Hopefully, Boost or Boost compatible?



Thank you. We've standardized on the Boost license for D, as it is the least restrictive, and well-accepted in the C++ community.

Is it possible you can make a Boost licensed version of it so we can add it to D?

https://www.boost.org/LICENSE_1_0.txt


That's a difficult question. As I work at Google, I need to get approval for each piece of open source I want to publish on my own behalf. (This does not apply to contributions to projects on the approved list; for those I only need to report the contributions.)

If you want to work around this for now, I suggest looking into libdivide (https://github.com/ridiculousfish/libdivide); it is published under the Boost license, and the library contains all the needed pieces I described in the article (unfortunately, not combined).


Thanks for the pointers. I suspect Google would be ok with Boost, since they are a C++ house and Boost is the major library. A big reason D picked it was because Boost is corporate lawyer approved.

