This seems really neat. I haven't really dug in yet, but a couple of things I'd be curious to see are whether it facilitates GPU-accelerated deep learning work (either training or inference) on non-NVIDIA GPUs via the OpenCL backend, since those seem poorly supported natively by most current frameworks, and whether it eases deployment for inference in environments where large, complex dependencies are a pain, like Lambda.
The Raspberry Pi results are really cool to see. Sure, for a highly-tuned library like cuDNN the existing operators might be close to as fast as you can get, but it's unlikely that every platform where it may be interesting to deploy a deep learning model can get the same amount of attention. I hope that these results mean that in many cases we can get highly optimized implementations without exhaustive manual effort.
Up until recently at work I had to use Keras on CPU only, and the easiest way to get things working was the TensorFlow backend. Earlier this year the 1.3 update gave me roughly a 15% performance increase just from
pip install tensorflow --upgrade
Fortunately I got a new machine with a cuDNN-capable GPU, but I'll be testing the NNVM backend for Keras when it's working. We might get to squeeze some epochs out of my old machine, now in the hands of a secretary who won't really notice we're training a net while he replies to some emails.
Huh, so it's a fork of Halide that replaces the frontend with a set of adapters for various neural net frameworks. Halide is super cool and really ought to be better known. I wonder if they tried to collaborate with the Halide people at all or if they're just doing their own thing?
Calling it a fork isn't really fair. It just reuses bits and pieces from Halide where it made sense for them to do so, and it's appropriately credited. We're happy for people to build on our stuff.
Part of TVM (https://github.com/dmlc/tvm) is built with primitives from Halide. Halide is indeed super cool, and TVM benefits a lot from its experience. But while Halide optimizes CPU and image-processing workloads well, TVM focuses specifically on deep learning and offers more optimizations for multi-core, GPU, and other hardware. That required rethinking quite a lot of the design.
The NNVM compiler is built on top of TVM with additional graph-level optimizers, and the two form an end-to-end pipeline.
This is going to be a really useful tool for us: production ML systems are incredibly difficult to build when you need to support a bunch of different frameworks. ONNX is nice as a standardized interchange layer, but being able to recompile a model into another final target would be amazing.
How feasible is CPU-only deep learning? I keep hearing about outrageous training times on the order of weeks even with something like 4-8 GPUs; is anyone actually using CPUs instead?
> I keep hearing about outrageous training times for GPUs on the order of weeks with something like 4-8 GPUs
This is for training a competitive model from scratch on a fundamental problem like image recognition. If you don't care about the last 1-2%, it's possible to train a useful model in a few hours (but still on a GPU).
> is anyone actually using CPUs instead
There are useful things you can do without a GPU. For example, "Transfer learning", which can be as simple as chopping off the last layer of someone else's GPU-trained model and substituting your own, can be done on a CPU in reasonable time. This is because you typically need less data and because far fewer parameters need to be fit.
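To make the "chop off the last layer" idea concrete, here's a minimal numpy sketch of that kind of transfer learning. The "pretrained" weights are random stand-ins (in practice you'd load someone else's GPU-trained weights); only the small new head gets gradients, which is why CPU training time stays reasonable.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for the pretrained layers we keep frozen. In a real setting
# these would be loaded from an existing GPU-trained model.
W_frozen = rng.normal(size=(64, 16))

def features(x):
    # Frozen feature extractor -- never updated during our training.
    return np.maximum(x @ W_frozen, 0.0)   # ReLU activations

# Toy data whose labels happen to be linearly separable in feature space.
X = rng.normal(size=(200, 64))
F = features(X)
w_true = rng.normal(size=16)
y = (F @ w_true > 0).astype(float)

# The substituted final layer: a single logistic unit trained from
# scratch. These 17 parameters are the only things we fit.
w, b, lr = np.zeros(16), 0.0, 0.01
for _ in range(1000):
    z = np.clip(F @ w + b, -30, 30)        # clip to keep exp() stable
    p = 1.0 / (1.0 + np.exp(-z))           # sigmoid
    w -= lr * F.T @ (p - y) / len(y)       # gradients only for the head
    b -= lr * (p - y).mean()

acc = ((F @ w + b > 0) == (y > 0.5)).mean()
```

Even on a laptop CPU this loop finishes instantly, which is the point: the expensive representation learning already happened in the frozen layers.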
I wouldn't recommend it for training. CPUs work just fine for low-intensity inference, especially online/interactive inference, where you pass in one example at a time. GPUs excel when you pass data in continuous, predictable batches.
Well... I feel like people are only focusing on training. CPUs for inference are sometimes helpful in online settings where you don't ever really "batch" requests. In that case, they can be cheap.
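A quick sketch of the batching point made above, with a toy dense layer standing in for a real model: a batched forward pass computes exactly the same per-request outputs as one-at-a-time passes, but shares one pass over the weights across all requests. That sharing is the regime where GPUs pay off; with batch size 1 the gap narrows, which is why CPUs can be fine for low-traffic online serving.

```python
import numpy as np

rng = np.random.default_rng(1)
W = rng.normal(size=(256, 256))          # one toy dense "layer"
requests = rng.normal(size=(32, 256))    # 32 queued inference requests

# Online inference: handle each request as it arrives (batch size 1).
online = np.stack([(r[None, :] @ W)[0] for r in requests])

# Batched inference: one matmul covers all 32 requests at once.
batched = requests @ W

assert np.allclose(online, batched)      # identical results either way
```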
My (limited) understanding is that 4-8 GPUs will be around 2 orders of magnitude faster than a CPU at training, and that it's not very realistic to use a CPU for training anything but the smallest models.