Ispc: A SPMD Compiler for High-Performance CPU Programming (ispc.github.io)
44 points by xstartup on Jan 10, 2018 | 9 comments


Used in the implementation of a production raytracer at Dreamworks Animation:

http://tabellion.org/et/paper17/index.html


I think SIMD programming is a simple way to improve the performance of applications that use the CPU. The real solution for applications that are computation bound and not inherently sequential is to perform the computations on an architecture that allows massive parallelism (a manycore architecture like Xeon Phi, a GPU, or even an FPGA).


Sure, but you still need software to manage this massive parallelism. It cannot just be OpenMP or OpenCL with C++. It needs to be a better and faster system, such as ispc or pony. ispc is faster, pony is safer, both are better and faster than C++.


> both are better and faster than C++

"Better" is a matter of opinion, so I'll stay away from that claim.

But if we include AMD's HCC or Microsoft's C++ AMP projects (which take C++ code and run it directly on the GPU), then I think you are mistaken about "faster".

http://rocm-documentation.readthedocs.io/en/latest/ROCm_API_...

https://msdn.microsoft.com/en-us/library/hh265136.aspx

Full GPU acceleration from C++ code. The SIMT model is quite powerful for allowing massive parallelism. And the "tiling" feature indicates that C++ is powerful enough to represent the various memory structures inside of the GPU. (LDS vs Global memory)

------------

Transparency of the memory model is incredibly important. GPUs only get their massive TFlop numbers if they are calculating from their LDS cache / Shared Memory locations. Even the biggest, beefiest GPUs with HBM2 only have ~480GB/s of main-memory performance, which is insufficient to reach 10TFlops. (Remember: a single float is 4-bytes. So 480GB/s bandwidth only gives 120GFlops at best)

Careful programming of memory, transparency of the "LDS" cache, and awareness of the limited register space of the GPU are needed to even approach those peak numbers.


At a glance, ISPC looks a lot like OpenCL. Does it outperform Intel's OpenCL-on-CPU engine? If so, why?


Original ispc author here.

I have no idea how performance compares to OpenCL on CPUs today, but it was in the same ballpark a few years ago.

The big difference is that OpenCL imposes a device model which is (IMHO) ridiculous if you're running everything on the CPU. With ispc, you have:

* Ahead-of-time compilation to binary code (no driver compiler in the way, so you can look at the ASM and know that's what will run.)

* Straightforward interop with C/C++ code: it compiles to the C ABI, so going from whatever other language to ispc code is just a function call. (Similarly, ispc can call out to C code.)

* Straightforward interop with application data structures: you can (and should!) pass pointers back and forth between C/C++ code and ispc code, do computation using the same data structures, etc.

All three of those are much uglier / more painful with the device model.


I'll play OpenCL's advocate:

> a device model which is (IMHO) ridiculous if you're running everything on the CPU

How so? It's rather GPU-flavoured, sure, but is this a problem? My understanding is that it all maps down to CPUs just fine... even if no-one's really using OpenCL purely for parallel CPU work.

> Ahead-of-time compilation to binary code

OpenCL offers this too - `clGetProgramInfo` lets you access the compiled binary, and `clCreateProgramWithBinary` lets you make use of that binary.

> no driver compiler in the way, so you can look at the ASM and know that's what will run

Intel's OpenCL development tooling is really pretty good - it's not hard to inspect the assembly. Same goes for AMD's tooling.

> going from whatever other language to ispc code is just a function call.

Neat. OpenCL can't do either, as everything it does has to work sensibly with GPUs.

> ispc can call out to C code

Same again.

> Straightforward interop with application data structures

Fair point. I don't know if/how OpenCL handles the question of struct layouts, or memory compatibility more generally.

I've passed structs between CPU and GPU with OpenCL, and it worked, but I think that's a hail-Mary situation where there's really no assurance that the two compilers' data layouts will match.

Even the definition of 'int' must be free to vary. I can't see how it couldn't be.


IMHO, the problem with the device model is that it imposes a bunch of unnecessary overhead on the programmer for cases where memory is shared and you're running on the same processor.

If I just want to call a function, pass some parameters, have it do some work, and get a result, things like OpenCL require all sorts of annoying boilerplate just to pass parameter values, map buffers, copy results out, etc. Sure it's all straightforward to write, but it's friction, and it's annoying.

Regarding clGetProgramInfo: does that return actual native executable code or IR? (I assume it's free to do either but in practice returns the latter, and that there's the usual "final driver compiler" between that code and what runs on the hardware, but I don't know.) An issue with that is that you can't be sure of what will run on users' systems; you're at the mercy of the version of the driver they've got installed.


> annoying boilerplate just to pass parameter values, map buffers, copy results out, etc. Sure it's all straightforward to write, but it's friction

Agree. It's quite a lot of work to orchestrate even a simple kernel.

> I assume it's free to do either

Looks that way - it seems AMD's engine lets you configure it. There are a bunch of 'non-native' representations:

* the OpenCL C source itself (which may end up getting stored in the ELF)

* LLVM IR

* AMDIL (based on LLVM IR but not identical)

* HSAIL (again, like LLVM IR but not identical)

* SPIR (yet again, except that later versions of this IR aren't directly based on LLVM)

http://openwall.info/wiki/john/development/AMD-IL

The poorly-documented "-fbin-exe" flag gets you the real native code.

http://developer.amd.com/wordpress/media/2013/07/AMD_Acceler...

I believe there's a way to get it to build for GPUs other than your own. Whether it's exposed through the API, I'm not sure, but I'm fairly sure it can be done with the dev tools.

(That took quite a bit of digging, which I suppose proves your point.)



