I believe the blocks end up stored as an array-of-structs, where each struct has 8*8 = 64 elements. Doing the DCT on multiple blocks at once requires somehow transposing this into a struct-of-arrays-like format, maybe with a gather of every 64th element (likely a waste of memory bandwidth) or with some sort of unpckl/unpckh-like instructions. Either way, the transpose may impose non-trivial overhead, so the benefits of the extra parallelism could be hidden by it.
(And, of course, that's all assuming there are enough registers, and I don't remember enough about JPEG to make a guess.)
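
To make the layout issue concrete, here's a rough sketch in plain Python. It is not SIMD and not how any actual JPEG library does it; it only shows the index shuffling. The names `aos_to_soa`, `soa_to_aos` and `lanes` are made up for illustration, and I'm assuming the blocks sit back to back in one flat array.

```python
BLOCK_SIZE = 8 * 8  # 64 coefficients per block


def aos_to_soa(flat, lanes):
    """Turn `lanes` back-to-back 64-element blocks into 64 rows of `lanes` values.

    Row k holds coefficient k of every block, which is the layout a
    "one block per SIMD lane" DCT would want to load from.
    """
    if len(flat) != lanes * BLOCK_SIZE:
        raise ValueError("expected lanes * 64 elements")
    # Each row is a strided read (every 64th element), i.e. the "gather".
    return [flat[k::BLOCK_SIZE] for k in range(BLOCK_SIZE)]


def soa_to_aos(rows):
    """Inverse shuffle: interleave the 64 rows back into block-major order."""
    lanes = len(rows[0])
    return [rows[k][b] for b in range(lanes) for k in range(BLOCK_SIZE)]


if __name__ == "__main__":
    blocks = list(range(4 * BLOCK_SIZE))  # 4 fake blocks, values 0..255
    rows = aos_to_soa(blocks, lanes=4)
    print(rows[0])   # coefficient 0 of each block: [0, 64, 128, 192]
    print(rows[63])  # coefficient 63 of each block: [63, 127, 191, 255]
    assert soa_to_aos(rows) == blocks
```

Every element gets moved once on the way in and once on the way out, and that's the extra work that has to be paid for by the wider DCT.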