There's a front-end operating cost to the bookkeeping instructions, and a complexity cost to all the variants of the same instructions. For short vectors, the CPU can use the same hardware it uses today for SIMD; the only difference is that the SIMD work lives in microcode instead of asm. The cost of fetching the permutation vector arg out of L1 isn't terribly high compared to the cost of fetch/decode on the bookkeeping instructions. And the cost of supporting all those instruction variants could instead go toward more functionality in the front-end.
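
To make the "permutation vector arg" idea concrete, here is a rough scalar sketch of what a single permute with its index vector in memory would compute. The function name `permute_u32` and its signature are made up for illustration; this is only the reference semantics I have in mind, not any real ISA or microcode.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical reference semantics: dst[i] = src[perm[i]].
 * perm is the permutation vector that would be fetched from L1.
 * Assumes dst does not alias src and every perm[i] < n. */
void permute_u32(uint32_t *dst, const uint32_t *src,
                 const uint8_t *perm, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        dst[i] = src[perm[i]];
    }
}
```

One data-driven operation like this stands in for a whole family of fixed shuffle/unpack variants; for short `n` the hardware could presumably map it onto the existing SIMD shuffle units.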