> It is very unlikely that such a configuration will perform remotely as fast as a native SIMD implementation

Why? AFAICS, a "real" vector ISA is mostly a superset of a SIMD ISA. What do you think is missing?

Of course, if you want to eke out the absolute maximum performance then you need to be aware of the microarchitecture you're targeting, and a length-agnostic vector ISA like SVE or RVV doesn't buy you that much there. But I don't see how that's worse than having to redo your code whenever the vendor introduces new HW with a new SIMD ISA.
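To make the "redo your code" point concrete, here's a rough sketch in plain C (no real intrinsics). `set_vl` is a hypothetical stand-in I made up for the "how many elements do I get this iteration" step that RVV's `vsetvli` or SVE's predicated loops provide, and `vlmax` is an assumed parameter modelling the hardware vector length. The inner scalar loops just model a single vector instruction.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a "set vector length" step: the hardware grants
   min(remaining, vlmax) elements for this iteration. */
static size_t set_vl(size_t remaining, size_t vlmax) {
    return remaining < vlmax ? remaining : vlmax;
}

/* Length-agnostic style: y[i] += a * x[i], strip-mined.
   vlmax models the hardware vector length; the source never hard-codes it. */
void saxpy_vla(size_t n, float a, const float *x, float *y, size_t vlmax) {
    assert(vlmax > 0);
    while (n > 0) {
        size_t vl = set_vl(n, vlmax);
        for (size_t i = 0; i < vl; i++)   /* models one vector multiply-add over vl elements */
            y[i] += a * x[i];
        x += vl;
        y += vl;
        n -= vl;
    }
}

/* Fixed-width SIMD style: the width (4 here) is baked into the source,
   and the leftover elements need a separate scalar tail loop. */
void saxpy_simd4(size_t n, float a, const float *x, float *y) {
    size_t i = 0;
    for (; i + 4 <= n; i += 4)
        for (size_t j = 0; j < 4; j++)    /* models one 4-wide SIMD multiply-add */
            y[i + j] += a * x[i + j];
    for (; i < n; i++)                    /* scalar tail */
        y[i] += a * x[i];
}
```

Calling `saxpy_vla` with `vlmax` of 4 or 16 gives the same result from the same source, whereas `saxpy_simd4` would need a new main loop for a wider SIMD extension. This says nothing about which one is faster on a given core, only that the vector version isn't obviously missing anything.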

I guess one could argue that an implementation of a vector ISA targeting, say, linear algebra, would be different from an implementation focusing on maximum short-vector performance. Say, by having a vector length >> execution width, and using tricks like vector pipelining etc. to get performance rather than focusing on minimizing short-vector latency. But that's a question of what the implementation is optimized for rather than what the ISA is good for, no?