But you are also right that it is not a huge difference :) In no-optimization mode, the compiler keeps the imul in for Multiply. The result is 1.2477 vs. 1.2668, relative to Noop.
The site is really cool, thanks for pointing me to it!
Interestingly, gcc-8 does some really weird stuff. Its version of Multiply is 1.4 times slower than its version of Shift, but according to the disassembly, the Shift implementation is actually using an imul, whereas the Multiply implementation is doing a lea/shl.
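For anyone who wants to look at the disassembly themselves, here is a minimal sketch of the kind of function pair being compared. The function names and the constant 8 are my own assumptions, not the actual benchmark code. Compiling it with something like `gcc -O0 -S` and again with `gcc -O2 -S` shows which instructions each version ends up with, and that will likely vary by compiler version and flags.

```c
#include <assert.h>

/* Hypothetical stand-in for the "Multiply" variant: a plain multiplication
   by a constant. The constant 8 is an assumption made for illustration. */
unsigned multiply_by_8(unsigned x) {
    return x * 8u;
}

/* Hypothetical stand-in for the "Shift" variant: the same computation
   written as a left shift. Unsigned is used so the shift is well defined. */
unsigned shift_by_3(unsigned x) {
    return x << 3;
}

/* Sanity check that both variants compute the same value, so any timing
   difference comes from the generated instructions and not the results. */
void check_equivalence(void) {
    for (unsigned x = 0; x < 1000u; ++x) {
        assert(multiply_by_8(x) == shift_by_3(x));
    }
}
```

Since both functions mean the same thing, the compiler is free to pick either instruction sequence for either one, which may be part of why the disassembly looks swapped.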