This makes sense in theory, but is hard to get working in practice.

We tried using nvjpeg to do JPEG decoding on the GPU as an additional baseline, but using it as a drop-in replacement in a standard training pipeline gives huge slowdowns for a few reasons:

(1) Batching: nvjpeg isn't batched; you need to decode one at a time in a loop. This is slow but could in principle be improved with a better GPU decoder.
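For concreteness, here's roughly what the one-image-at-a-time decode looks like through torchvision's nvjpeg-backed decode_jpeg (just a sketch, not our exact pipeline; assumes a torchvision build with CUDA/nvjpeg support, and the file paths are hypothetical):

    # One nvjpeg decode call per image, in a Python loop.
    import torch
    from torchvision.io import read_file, decode_jpeg

    def decode_batch_on_gpu(paths):
        images = []
        for path in paths:
            encoded = read_file(path)                  # raw JPEG bytes as a uint8 CPU tensor
            img = decode_jpeg(encoded, device="cuda")  # nvjpeg decode, one image at a time
            images.append(img)
        return images  # decoded sizes may differ, so no torch.stack here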

(2) Concurrent data loading / model execution: In a standard training pipeline, the CPU loads and augments data for the next batch in parallel with the model running forward / backward on the current batch. Using the GPU for decoding blocks it from running the model concurrently. If you were careful I think you could probably find a way to interleave JPEG decoding and model execution on the GPU, but it's not straightforward. Just naively swapping in nvjpeg in a standard PyTorch training pipeline gives very bad performance.
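A rough sketch of what the naive version looks like (hypothetical model and optimizer; assumes all images decode to the same resolution so they can be stacked): the decode runs on the same device and default stream as the model, so decode and forward / backward end up serialized instead of overlapping the way CPU dataloader workers overlap with the GPU:

    # Naive GPU-decode training step: decoding and model execution are
    # serialized on the GPU instead of overlapping.
    import torch
    import torch.nn.functional as F
    from torchvision.io import decode_jpeg

    def train_step(model, optimizer, encoded_jpegs, labels):
        # GPU is busy decoding here instead of starting the next forward pass.
        batch = torch.stack([
            decode_jpeg(e, device="cuda").float() / 255.0 for e in encoded_jpegs
        ])
        optimizer.zero_grad()
        loss = F.cross_entropy(model(batch), labels.cuda())
        loss.backward()
        optimizer.step()
        return loss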

(3) Data augmentation: If you do DCT -> RGB decoding on the GPU, then you have to think about how and where to do data augmentation. You can augment in DCT either on CPU or on GPU; however, DCT augmentation tends to be more expensive than RGB augmentation (especially for resize operations), so if you are already going to the trouble of decoding to RGB then it's probably much cheaper to augment in RGB. If you augment in RGB on GPU, then you are blocking parallel model execution for both JPEG decoding and augmentation, and problem (2) gets even worse. If you do RGB augmentation on CPU, you end up with an extra GPU -> CPU -> GPU round trip on every model iteration, which again reduces performance.
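A sketch of that last case (GPU decode, then CPU-side RGB augmentation), assuming torchvision >= 0.15 for the v2 transforms; the particular augmentations are just illustrative:

    # GPU decode followed by CPU augmentation forces a GPU -> CPU -> GPU
    # round trip for every image.
    import torch
    from torchvision.io import decode_jpeg
    from torchvision.transforms import v2

    cpu_augment = v2.Compose([
        v2.RandomResizedCrop(224, antialias=True),
        v2.RandomHorizontalFlip(),
    ])

    def load_one(encoded):
        img_gpu = decode_jpeg(encoded, device="cuda")  # GPU: nvjpeg decode
        img_cpu = cpu_augment(img_gpu.cpu())           # copy to CPU, augment in RGB
        return img_cpu.cuda()                          # copy back to GPU for the model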


