> I'd bet that the approach used here was tested and found to be faster on Haswell
I'd bet it’s an error.
> if you noticed that the page you linked is Haswell specific
I did. I was disappointed though; I expected to find something newer than Haswell from 2013, like Zen 2 or Skylake. When doing micro-optimizations like that, the exact micro-architecture matters.
> I expected to find something newer than Haswell from 2013, like Zen 2 or Skylake.
I'm sure optimizations for more recent architectures would be appreciated, and Daniel is wonderfully accepting of patches. Be careful though, or you might inadvertently end up as the maintainer of the whole project!
When I have free time, I’m generally more willing to contribute to my own open source projects that no one cares about, like this one: https://github.com/Const-me/Vrmac BTW, I did a substantial amount of SIMD stuff there, for both the 3D and 2D parts.
That “function” compiles into a single CPU instruction. The OP is perfectly aware of that, that’s why really_inline is there.
> on beating popcnt using AVX2 instructions
It’s easy to do with pshufb when you have many values on input. I wrote about it years before that article, see here: https://github.com/Const-me/LookupTables#test-results
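For anyone who hasn't seen the trick: pshufb can serve as a 16-entry lookup table, so you split each byte into two nibbles, look up the bit count of each, and add them. Below is a rough scalar sketch of that idea, not the code from either linked repository. The names `kNibblePopcount`, `popcountByte` and `popcountBuffer` are made up for illustration. A real SIMD version would do the same two lookups on 16 or 32 bytes per instruction, which is why it only pays off when you have many values on input.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Bit counts of the 16 possible nibble values. A pshufb-based popcount keeps
// this same table in a vector register (illustrative name, not from the thread).
constexpr std::array<std::uint8_t, 16> kNibblePopcount = {
    0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4};

// Scalar stand-in for what one pshufb lane does: look up the low and the high
// nibble separately, then add the two partial counts.
inline std::uint8_t popcountByte(std::uint8_t b) {
    return static_cast<std::uint8_t>(kNibblePopcount[b & 0x0F] +
                                     kNibblePopcount[b >> 4]);
}

// Hypothetical helper: total number of set bits in a buffer.
inline std::size_t popcountBuffer(const std::uint8_t* data, std::size_t length) {
    std::size_t total = 0;
    for (std::size_t i = 0; i < length; ++i) {
        total += popcountByte(data[i]);
    }
    return total;
}
```

For a single value the hardware popcnt instruction is still the obvious choice; the table approach is only interesting once the lookups are vectorized.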