Haha :) Maybe they shipped a better product, but management said "No, it's not possible that this could run that fast. Something must be wrong.", so they put in some "waiting".
Because the CPU (ARM) had only 32 KB of cache in this case, the memcpy was evicting the entire cache several times per iteration as a side effect of doing the work. The loop's actual work had good cache locality: the data was six stack variables totalling about 8 KB.
So? Copying a megabyte inside a loop is really expensive even ignoring caches. (At a memory bandwidth of 24 GB/s, a full-speed 1 MB memcpy takes about 40 microseconds, which is a long time.)
I think it was covered by "naive memory management" and "shitty outsourcing". I'm paid to fix their stuff.