Anatomy of a Reducer (clojure.com)
75 points by fogus on May 15, 2012 | 7 comments


Someone should really write a blog post on what this all means for a novice. My impression is (and I might be wrong, since I haven't had the time to study the new reducer library yet) that the point of this library is the following:

Many algorithms are tree-like in nature. Suppose, for instance, you have a path-finding algorithm that searches a map for the best route. In that case, you're going to need to explore different path options recursively, leading to a nested, tree-like algorithm.

Now, this type of algorithm is tricky to parallelize: sometimes the root of your "search tree" might be small (say, 3 items); other times it might be large (1000 items). If you know your computer has 8 cores, it's very hard to mete out an equal amount of work to each core. Sometimes several cores need to split the work of a single branch; other times, a single core might need to handle multiple branches. Also, it may be hard to know ahead of time how much work each branch will require, making the parcelling out of work even more difficult.

However, the work for each branch can be expressed as a collection of smaller "jobs", with a function applied to each. And each item in this collection would then also be a collection with functions applied to it (representing each sub-branch). If you express a nested algorithm in this way, it can be viewed as a nested collection with arbitrary functions mapped, filtered, etc. across different parts of it.

If you use this new Reducer library to implement this type of nested operation, Clojure can automagically use the Java Fork/Join library to parcel out equal amounts of work to each core. In effect, you can think of the reducer library as letting you "flatten" these nested collections (without the memory & CPU cost this would entail) to figure out how to split the work out evenly and in the most efficient manner possible, hiding all the ugliness required for this from the programmer.

Please, if you know this library better than I, let me know where I'm going wrong.


If you haven't read his introductory post, check it out here: http://clojure.com/blog/2012/05/08/reducers-a-library-and-mo...

My understanding is this: the genesis for fold (as opposed to foldl) is from the linked talk: http://vimeo.com/6624203.

If fold (a concurrent reduce) is used instead of foldl (a sequential reduce), then the mapping function can be pushed down and applied concurrently to the various elements. A generic mechanism can then be used for dispatching concurrent operations. These functional operations are composed by first mapping and/or filtering and then reducing.
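The fold-vs-foldl distinction can be sketched with the real reducers API (the collection and numbers are just an illustration):

```clojure
(require '[clojure.core.reducers :as r])

(def v (vec (range 10)))

;; foldl-style: a sequential, left-to-right reduce.
(reduce + 0 (map inc v))     ;=> 55

;; fold-style: r/fold may split v into chunks, reduce each chunk
;; concurrently, then combine the partial results. This is only
;; safe because + is associative, so combination order is free.
(r/fold + (r/map inc v))     ;=> 55
```

The sequential version fixes an evaluation order; the parallel one gives it up, which is exactly what lets the work be dispatched concurrently.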

[EDIT: typo and re-phrase]


I'm not an expert, but here's my take on it: The point of this library is to decouple things that are usually coupled but don't need to be, and provide a uniform interface for combining them back together. The hardest part of this is looking at things from a slightly different perspective and getting used to the new division of responsibilities, but when you get used to it the overall picture is actually simpler.

The idea is that you have four types of things: you have items, you have collections that hold items, you have ways of operating on those collections, and you have operations you want to perform on collections. The library deals with separating the latter two, since separating elements from a container is basically a solved problem.

A "reducer" is a view into a collection, which is responsible for knowing how to apply collection operations (map, filter, reduce) to its elements.

The collection operations (usually map, filter, and reduce) are separate entities, called mappings and filterings in this article. They are real, composable values. I think of them as similar to defining a database query without executing it.

Because reducers and operations are decoupled, you can always take an arbitrary "query" and throw it at an arbitrary reducer, and expect the collection to be able to handle the query and always return the same result regardless of collection type.
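A sketch of that "query" idea, assuming the curried forms the reducers library actually provides (calling `r/map` or `r/filter` with just a function returns a transformation waiting for a collection):

```clojure
(require '[clojure.core.reducers :as r])

;; Build a query without running it. With reducers, the rightmost
;; transformation touches the collection first, so elements are
;; inc'd and then tested for even?.
(def query (comp (r/filter even?) (r/map inc)))

;; The same query runs against different collection types and
;; yields the same answer.
(reduce + 0 (query (vec (range 10))))   ;=> 30
(reduce + 0 (query (range 10)))         ;=> 30
```

`query` here is a first-class value: you can store it, pass it around, and compose it further before anything is ever computed.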

Fork/join is an example use-case: you can create a collection type which is just a thin wrapper over fork/join[1]. You can then wrap this collection type as a view into your data, and run collection queries against it. This allows you to use fork/join as a drop-in replacement for naive data structures. It also keeps tinkering with fork/join separate from both the underlying data structure, and the work being processed.

The other primary use case seems to be defining, storing, and composing queries without running them. This ties in heavily to lazy evaluation. The composability is a key feature because it allows you to pass around partial queries rather than partial results, and it saves greatly on time and space by not creating intermediate data structures (cf. stream fusion in Haskell).

[1] Fork/join would have to be a "foldable" rather than "reducible" collection, which as far as I can tell just means it's a little more lax about how you combine results back from a reduce. These relaxations allow for parallelization.


Fork-join is not a collection (or any data structure for that matter). It is a parallel computation process meant to make good use of all available processors. You can use fork-join to process a collection like an array or a tree in parallel, and you can use it to factor a single integer in parallel.


If I understand correctly, does this mean that a map operation (that is not reduced further; i.e. we want a collection as the output) could be carried out in parallel efficiently using reducers only if there is an efficient (O(1)) way to concatenate the collections produced at each combining step (in the "internal nodes" of the fork-join computation tree)?

This would constrain how collections are implemented to make good use of this technique.
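For what it's worth, the library ships exactly such a cheap combine: `r/foldcat` folds in parallel and stitches the per-chunk results together with `r/cat`. A quick sketch (the data is made up):

```clojure
(require '[clojure.core.reducers :as r])

;; r/foldcat runs the mapped query in parallel and concatenates the
;; per-chunk outputs with r/cat, whose combine is effectively O(1):
;; it links chunks into a tree rather than copying elements.
(def out (r/foldcat (r/map inc (vec (range 1000)))))

(count out)       ;=> 1000
(first (into [] out))  ;=> 1  (order is preserved)
```

So rather than constraining every collection type, the combining step can return this chunked intermediate form and only realize a concrete collection (e.g. via `into`) at the end.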


As I understand it, all of Clojure's internal data structures (save for the list, which can be substituted by a lazy seq) can be combined efficiently already.


True. I'm just wondering if this is indeed a requirement.



