It's pretty common for runtime libraries to optimize low-level routines like memcpy, math, etc. with multiple different paths chosen on the basis of CPU capability bits. It's not the whole code that's twice the size; it's small functions which are implemented 2 or 3 or 5 times depending on what features are available.