
"thousands in the case of SIMD or GPUs" What does 'Single Instruction Multiple Data' have to do with threads?

Also fork join sounds just like normal threads to me in the way you are describing it.

Sorry, what am I missing?



Fork join by and large doesn't use mutexes.

If you need to synchronize, wait for the next fork/join cycle instead of taking explicit mutexes. The "join" provides your synchronization point.

If you have a complicated set of highly-synchronous calculations, then you don't fork at all. You simply use the "Single Thread" to perform all those calculations (and therefore negate the need for cross-thread synchronization).

Fork-join has low utilization, but it's very easy to program. In some cases, fork-join remains efficient (ex: spawn a thread per pixel on the screen), because all pixels do NOT need to synchronize with each other (or call mutexes).
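As a minimal sketch of that per-pixel idea (scaled way down, with invented sizes, and using plain pthreads with one "fork" per row rather than per pixel): every worker writes a disjoint slice of the buffer, so the joins are the only synchronization.

```c
#include <pthread.h>

/* Hypothetical sketch: brighten a tiny "framebuffer", one thread per row.
 * Each thread writes a disjoint slice, so no mutex is needed; the
 * pthread_join calls below are the only synchronization point. */
#define ROWS 4
#define COLS 8

static int framebuffer[ROWS][COLS];

static void *brighten_row(void *arg) {
    long row = (long)arg;
    for (int col = 0; col < COLS; col++)
        framebuffer[row][col] += 10;   /* independent data: no locking */
    return NULL;
}

/* Fork one thread per row, join them all, then read the result serially. */
int brighten_all(void) {
    pthread_t tids[ROWS];
    for (long r = 0; r < ROWS; r++)                /* fork */
        pthread_create(&tids[r], NULL, brighten_row, (void *)r);
    for (int r = 0; r < ROWS; r++)                 /* join = sync point */
        pthread_join(tids[r], NULL);

    int sum = 0;                                   /* safe: all joined */
    for (int r = 0; r < ROWS; r++)
        for (int c = 0; c < COLS; c++)
            sum += framebuffer[r][c];
    return sum;                                    /* 4 * 8 * 10 = 320 */
}
```

The summing loop after the joins is the "use the single thread for the synchronous part" step: it is race-free precisely because every worker has already joined.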

-----

If a sync point within the children is needed, you usually make do with a barrier instead of a mutex.

> What does 'Single Instruction Multiple Data' have to do with threads?

SIMD processors, such as GPUs, emulate a thread in their SIMD units. It's called a "CUDA thread". It's not a true thread in the sense of CPU threads, but the performance by and large scales as if they were real threads (with the exception of the "thread divergence" problem).

Ultimately, the fork-join model translates trivially to SIMD. Any practitioner of CUDA, OpenCL, ROCm, or ISPC can prove that to you easily.


Yeah in this model you wouldn't need a mutex because each "thread" is independent and operates either on independent data or uses constant-but-shared data. As soon as the result of one thread depends on the result of another thread you have to have some mechanism for synchronization.

I mean it's not really any different from writing a C/C++ program that avoids use of mutexes by having each thread operate on separate parts of the process address space. I'm still intrigued but it's not mind-blowing to me to fork a bunch of threads and join them when the function execution completes.


There's nothing mindblowing about Edsger Dijkstra's "Go To Statement Considered Harmful", which largely argued that you should organize your code into easily composable function calls. Kinda obvious in hindsight.

It's more about discipline than anything else. A recognition that fork-join is much easier than other methodologies (such as async).


Thanks for the explanation. The only SIMD programming I've seen is where the programmer would carefully call CPU-brand-specific instructions and painstakingly manage the registers, making sure the numbers to be added, multiplied, etc. are evenly divided and then given to the SIMD ALUs.

Sounds like what you are saying is that the fork-join model translates easily, by the compiler, to these SIMD instructions?

Some compilers can also vectorize plain loops, but you would advocate for fork join?


> Sounds like what you are saying is that the fork-join model translates easily, by the compiler, to these SIMD instructions?

Why do you think CUDA has become so popular recently? That's exactly what CUDA, OpenCL, and ISPC do.

> Some compilers can also vectorize plain loops, but you would advocate for fork join?

CUDA-style / OpenCL-style fork-join is clearly easier than reading compiler output and trying to debug why your loop failed to vectorize. That's the thing about auto-vectorizers: you end up having to grok tons of compiler output, or check the assembly, to make sure it worked.

ALL fork-join style CUDA / OpenCL code automagically compiles into SIMD instructions. Ditto with ISPC. Heck, GPU programmers have been doing this since DirectX 7 HLSL / OpenGL decades ago.

There's no "failed to vectorize". There's no looking up SIMD-instructions or registers or intrinsics. (Well... GPU-assembly is allowed but not necessary). It just works.

-------

If you've never tried it, really try one of those languages. CUDA is for NVidia GPUs. OpenCL for AMD. ISPC for Intel CPUs (instead of SIMD intrinsics, ISPC was developed for an OpenCL-like fork-join SIMD programming environment).

And of course, Julia and Python have some CUDA plugins.


Must admit never tried it. Thanks for the insights I'll have a go at some point.


If you've got an OpenMP 4.5 or later compiler (and GCC / Clang both support OpenMP), you can also use #pragma omp simd.

https://www.openmp.org/spec-html/5.0/openmpsu42.html

It's not as reliable as a dedicated language like OpenCL or ISPC. But this might be easier for you to play with rather than learning another language.

OpenMP is just #pragmas on top of your standard C, C++, or Fortran code. So any C / C++ / Fortran compiler can give this sort of thing a whirl rather easily.
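A minimal sketch of what that looks like (function name and sizes invented): the pragma is just a hint on an otherwise plain C loop, and a compiler invoked without OpenMP support simply ignores the unknown pragma and runs the loop scalar, so the answer is the same either way.

```c
#include <stddef.h>

/* "#pragma omp simd" asks an OpenMP 4.x+ compiler to vectorize this loop.
 * Without -fopenmp the pragma is ignored and the loop runs scalar;
 * the result is identical, only the instructions differ. */
void saxpy(float a, const float *x, float *y, size_t n) {
    #pragma omp simd
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
```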

---------

OpenMP always was a fork-join-model #pragma add-on to C / C++. They eventually realized that their fork-join model works for SIMD, and finally added SIMD explicitly to their specification.


And Fortran co-arrays, no?


Fortran Coarray is far beyond simple Fork-Join. It enables one-sided remote memory access, something that is impossible with OpenMP or CUDA, as far as I am aware, and requires the highest levels of skill to do it right in MPI.


Do you mean it doesn't use explicit mutexes? I don't see any way this model would avoid using mutexes (or some mutex-like construct) under the hood, in which case I'm not sure I see the advantage.

The term "fork/join cycle" is intriguing and meaningless to me as a non-Julia user. What exactly is this cycle?


Well, I only visit Julia now and then. I've been programming various parallel programs though for myself in a variety of languages, trying to grok parallelism better.

> The term "fork/join cycle" is intriguing and meaningless to me as a non-Julia user. What exactly is this cycle?

https://www.researchgate.net/profile/Alina_Kiessling/publica...

There are many forks and joins across a program when you're using the fork-and-join paradigm. Each time the threads need to communicate, you join, and then use the "master" thread to pass data to all of the different units.

-------------

For example, most video-game engines issue a fork to calculate the vertices of all objects in the video game (the fork turns into a GPU call). This is called the vertex shader.

Once all the vertices are calculated, the GPU joins these threads together, and the main program / game engine continues.

The next step is the geometry shaders: the CPU forks (aka: spawns thousands of GPU threads), and joins on the results of the geometry shaders. (Tessellation may spawn more vertices. Ex: you model a rope as a square, but then the geometry shader turns the square into a rope shape at this stage.)

Then the pixel shaders. For every pixel of your 1080 x 1920 screen, a GPU SIMD-thread is forked off, and each pixel's final color is calculated based on the results of vertex-shaders and geometry shaders before it.

Each of these cycles is a fork-join cycle. Thousands of threads spawning, thousands of threads joining, the CPU calculating some synchronization data together, and then spawning thousands of threads again.

(In practice, modern game engines are now async for speed reasons. But the general fork/join model is still kinda there if you squint)
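The cycle structure above can be sketched with plain pthreads (thread count and "work" invented for illustration): each loop iteration forks workers, joins them, and lets the master thread do the serial synchronization step that the next cycle's workers read.

```c
#include <pthread.h>

#define NTHREADS 4

static int partial[NTHREADS];
static int total;   /* written only by the master thread, between joins */

static void *stage_work(void *arg) {
    long id = (long)arg;
    partial[id] = total + (int)id;   /* reads the previous cycle's result */
    return NULL;
}

/* Each iteration is one fork/join cycle: fork workers, join them, then do
 * the serial "master" step that the next cycle depends on. */
int run_cycles(int cycles) {
    total = 0;
    for (int c = 0; c < cycles; c++) {
        pthread_t tids[NTHREADS];
        for (long i = 0; i < NTHREADS; i++)        /* fork */
            pthread_create(&tids[i], NULL, stage_work, (void *)i);
        for (int i = 0; i < NTHREADS; i++)         /* join */
            pthread_join(tids[i], NULL);
        total = 0;                                  /* master-only sync step */
        for (int i = 0; i < NTHREADS; i++)
            total += partial[i];
    }
    return total;
}
```

After one cycle the master holds 0+1+2+3 = 6; the second cycle's workers each start from that 6, giving (6+0)+(6+1)+(6+2)+(6+3) = 30. No mutexes, just forks and joins.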




