> Functional languages require infrastructure that
> inevitably adds overheads over what can theoretically
> be attained using assembler by hand. In particular,
> first-class lexical closures only work well with
> garbage collection because they allow values to be
> carried out of scope.
Inevitable is an overly broad characterization. For example, in this particular case, Manticore uses a combination of control-flow analysis and reflow analysis to inline higher-order functions in instances when we can determine that the free variables (if there are any) are obtainable in some other static way. This observation extends to the general allocation result; modern functional language compilers do quite a bit to get that under control (see my IFL paper on arity raising/unboxing/ripping apart datatypes, which provides a survey of the state of the art circa 2009) and, particularly given the additional register freedom when you're not wedded to the C calling convention, can do some real magic to avoid performing allocations, particularly in inner loops. GHC's is type-directed unboxing/type unrolling, IIRC, which in practice is on par (modulo separate compilation issues) with the implemented MLton work cited in this post.
So, it's less that poor performance is "inevitable" and more that it requires fairly insane amounts of work by the compiler developer. And, for example, I seem to remember that the Visual C++ team spent more person-hours on template compiler error message clarity in one product cycle than we will spend on the entire Manticore project in its lifetime. And yes, I know the "sufficiently smart compiler" joke.
A more fundamental complaint is that functional language implementations tend to provide _fragile_ performance. That is, without understanding some fairly gritty details of the compiler and runtime, it is difficult to understand why, for example, the second time you made a call to a non-inlined combinator providing a second function argument at the same type, your program became slower in MLton but not in GHC (YMMV; a representative but not authoritative example). A good way to see this phenomenon is to look at optimized benchmarks for OCaml or GHC (e.g. in the PL shootout) and ask, "is this idiomatic ML/Haskell code?" Many benchmarks for functional languages that perform close to C look suspiciously like the original C benchmark, with just a few functional sprinkles on top.
I don't know of a good answer to that performance problem without requiring users to become near-experts with the compiler, and that's not good practice.
I am not convinced that requiring users to become near-experts in their given compiler (and by extension their chosen platform) is a bad practice.
I think it is a fine balance when writing code to find a way to express algorithms such that they a) are understandable to other developers and b) maximize the opportunity for the compiler to optimize the code into machine instructions.
That does not mean that compiler writers can stop after creating an assembler, since it's just turtles all the way up from there; but it does mean that compiler writers should be thinking about how to maximize the machine/developer interface.
Given this view, you certainly cannot separate language design from language compilers, which either leads us to a world where all work is done creating new languages or one where we all develop in lisp ;)
Hopefully the language should be designed so that it's easiest and most natural to write things in a way that the compiler likes. Thus, no requirement to become near-experts.
Assuming that performance is a major goal for the language.
> However, few practical applications benefit from persistence in practice so this is often not advantageous.
This statement is questionable.
What is true is this: many professional software developers are comfortable with imperative/OOP/mutability. This is easier (not simpler) because it's familiar. The benefits of immutability are not obvious/familiar.
Persistent data structures enable immutable data with cheap(ish) operations that can produce updated versions of that data. This means I can expose data to multiple threads, calling functions, etc... without worrying about that data being placed into an inconsistent state or viewed in an inconsistent state by callers.
Back when much of our profession malloc'ed their memory, we always had to answer the question "who is going to free() this chunk of the heap?" And when we answered wrong, or forgot to answer, we leaked memory. Garbage collectors, while imperfect from a controllability and performance standpoint, are much better at this than humans.
An excellent analogy exists with mutable data. When I expose a piece of mutable data via any public interface, I have to ask the question "who can change this data and how will consistency be enforced?" I'm left with the answer "I don't know, I hope no one gets it wrong" or even worse: "they better lock the right semaphore(s) in the correct order". Immutable data (implemented via persistent data structures) enables me to instead say "here's the data, if any caller would like to update, they can a) pass me an updated version which I can validate, or b) call into some part of my public interface and instruct me what update to make"
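The "pass me an updated version which I can validate" pattern can be sketched in Haskell with a persistent map. This is a toy illustration, not code from the thread: the `Inventory` type and `validate` function are my own illustrative names; `Data.Map` is from the standard containers package.

```haskell
import qualified Data.Map as Map

type Inventory = Map.Map String Int

original :: Inventory
original = Map.fromList [("widgets", 3)]

-- A caller proposes an updated version; the original is never mutated.
proposed :: Inventory
proposed = Map.insert "widgets" 5 original

-- The owner validates before accepting the proposed version.
validate :: Inventory -> Maybe Inventory
validate inv
  | all (>= 0) (Map.elems inv) = Just inv
  | otherwise                  = Nothing

main :: IO ()
main = do
  print (Map.lookup "widgets" original)  -- Just 3: the old version is intact
  print (Map.lookup "widgets" proposed)  -- Just 5
```

Because the map is persistent, `proposed` shares most of its structure with `original`, which is what makes the "cheap(ish) updated version" claim above work.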
Or, in the case of Clojure (my blub at the moment), put the runtime in charge of maintaining consistency during mutation using refs and the STM.
There are many benefits to immutability. However, there are only a few smaller additional benefits to functional purity once we're already able to control mutability and enforce immutability. Sometimes I really do want to modify some data internally before returning it, and baked-in purity keeps me from "cooking" my data properly.
When you expose a piece of mutable data, usually it's wrapped up in an interface that enforces the access and consistency of that piece of data so that you have control rather than leaving it up to the callers. If you don't want to have the synchronization wrapper, that's ok. Just document it clearly and let the callers deal with it.
If you have a benchmark comparing C with a functional language then it is almost certainly an extremely simple program. Arguably so simple that it is of little practical relevance today.
A lot of simple tasks are highly relevant. For example, in numerical simulation, you probably want a fast matrix-vector multiply. Conceptually this is extremely simple, and it's easy to benchmark. Examples of practical relevance: climate modeling, simulating combustion in engines, and optimizing control surfaces on airplanes.
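A dense matrix-vector multiply really is conceptually tiny, which is what makes it such a convenient benchmark. A hedged Haskell sketch, assuming the vector package; `matVec` and the flat row-major layout are my own illustrative choices:

```haskell
import qualified Data.Vector.Unboxed as U

-- Dense matrix-vector multiply: rows stored contiguously (row-major)
-- in one flat unboxed vector, so each row is an O(1) slice.
matVec :: Int -> Int -> U.Vector Double -> U.Vector Double -> U.Vector Double
matVec rows cols m v =
  U.generate rows $ \i ->
    U.sum (U.zipWith (*) (U.slice (i * cols) cols m) v)

main :: IO ()
main =
  -- 2x2 matrix [[1,2],[3,4]] times vector [10,20]
  print (matVec 2 2 (U.fromList [1, 2, 3, 4]) (U.fromList [10, 20]))
```

Simple as it is, this is exactly the kind of inner loop where boxing, bounds checks, and allocation decisions show up directly in the timings.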
"Sort" is simple. Does that make it irrelevant?
How about finding the standard deviation of a bunch of numbers?
How relevant are numerical benchmarks when comparing languages? In practice you'll probably use BLAS, LAPACK, etc. for numerical computing and get on with it; you're unlikely to write code that is both faster and numerically stable.
I am not an expert in functional languages, but I reason about it (that is, the question "Are FP languages inherently slow?", not the article) in the following way. I am curious to know if this reasoning is, well, reasonable.
- CPUs are inherently not functional. They are imperative, as they are Turing machines. They don't speak high-level math.
- Therefore, programs written in a functional language must always be translated into imperative code for execution. Compare this with an imperative language where the compiler need not perform super-fancy translation.
- If the problem can be mapped efficiently to machine code by the compiler, the functional program is fast; otherwise, it may be slow. Either way, it may be very hard to know what is going on behind the scenes.
- This is not always a loss; sometimes the compiler can perform extensive optimizations because it speaks math, and the functional language will win handily.
- Writing in an imperative language, you always know how your program will behave; however, you lose the opportunity to have the compiler deduce a better way (assuming there is a better way) by writing in a functional language.
- But if there is no better way, you have the opportunity to make the execution very fast by modeling the imperative solution around the constraints of the hardware.
Of course, there are other dimensions to performance, among them programmer productivity and number of defects.
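The "compiler speaks math" point can be made concrete: GHC lets libraries declare algebraic rewrite rules that fire at compile time. A toy sketch of the map/map fusion law; GHC's real list fusion uses the more general foldr/build rules, so this pragma is purely illustrative:

```haskell
-- With optimization on, GHC may rewrite the two traversals below into
-- one, because the rule states an algebraic law it is allowed to apply.
{-# RULES "map/map" forall f g xs. map f (map g xs) = map (f . g) xs #-}

main :: IO ()
main = print (sum (map (+ 1) (map (* 2) [1 .. 100 :: Int])))
```

An imperative compiler has no license to do this in general: it must prove the two loop bodies have no interfering side effects, which is exactly the proof purity hands you for free.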
It just seems funny to have an argument between the user having to use the machine's phrasing vs. having to use the phrasing of higher math. Higher math is clearly not the only abstraction. Ergonomics for one person coding or for two people collaborating, on a practical task, seem more important to me.
That is true of languages, but it is in fact the case that there are algorithms that run in O(n) time when written impurely, but require O(n log n) or worse when written in a purely functional style, even in state-of-the-art FP languages. Since FP languages provide impure facilities for use in the real world, you can still use the O(n) algorithms, but at the cost of purity.
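A small sketch of that trade-off in Haskell, counting occurrences of small integers: the in-place bucket count is O(n), with the mutation hidden behind `runSTUArray` so the result is still observably pure, while the persistent version pays O(n log n) through a balanced tree. `histArray` and `histMap` are illustrative names of my own.

```haskell
import Control.Monad (forM_)
import Data.Array.ST (newArray, readArray, writeArray, runSTUArray)
import Data.Array.Unboxed (UArray, elems)
import qualified Data.Map.Strict as Map

-- O(n): in-place bucket counts; impurity is sealed inside runSTUArray.
histArray :: Int -> [Int] -> UArray Int Int
histArray bound xs = runSTUArray $ do
  arr <- newArray (0, bound) 0
  forM_ xs $ \x -> do
    c <- readArray arr x
    writeArray arr x (c + 1)
  return arr

-- O(n log n): purely persistent counts via a balanced tree.
histMap :: [Int] -> Map.Map Int Int
histMap = Map.fromListWith (+) . map (\x -> (x, 1))

main :: IO ()
main = do
  let xs = [1, 2, 2, 3, 3, 3]
  print (elems (histArray 3 xs))
  print (Map.toList (histMap xs))
```

The ST version is the "impure facility" being described: the algorithm is imperative inside, but no caller can observe the mutation.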
So the relevant question here really is not "are functional languages inherently slow?", but "is pure functional programming inherently slow?". AFAIK, the answer to that question is not conclusively proven yet.
It's not so simple. Any language (that is actually implemented) is always designed with specific consideration for implementation strategies. In fact, I would go so far as to say the implementation is the design in practice.
For example, the choice of boxed or unboxed value representations is an integral part of language design and has a profound impact on the implementation's performance, both in speed and memory usage.
That may be technically true, but in practice some language features require more overhead than others.
If a language requires a feature where making it "fast" is so difficult that nobody can implement a fast version of it, then I think it's fair to say that language is slower than the alternatives.
Knowing that language X can, in theory, be faster than language Y does me no good at all if there are no implementations of language X that are actually faster than Y.
> the author claims that C limits what can be accomplished.
No he doesn't! What he claims is that benchmarks are almost always chosen to be relatively easy to implement in C. What's easy and what is possible are two very different things.
Note: I don't really have a dog in the FP fight. That said, your statement makes no sense.
The ubiquity of C does not mean that it limits what can be accomplished. Clearly, it can be made to accomplish quite a bit, but that does not mean that it has no limitations.
He appears to be referring to the writing of benchmarks. Specifically for those in which functional programming excels over imperative (for example undo support), it becomes a significant undertaking to write the C version so it just doesn't get done. This means that there is a selection bias towards benchmarks that can be easily written in C. These simpler benchmarks tend to lack any need for the higher-level constructs that functional programming provides, which FP advocates would argue is exactly where FP gives the greatest gain.
You're right, point taken, I clearly misused the word; it has nothing to do with any personal opinion. I was using 'biased' thinking about its meaning in the Machine Learning context [1].
My intention was to state that idiomatic code is not necessarily the most performant, and that the code used in the benchmarks game may contain some not very idiomatic optimizations [2].
That said, it tends to evaluate the implementation of a certain compiler, and not exactly how the constructs and concepts of those languages result in the measured performance.
I did a benchmark. Naive Haskell implementation took 28 seconds, heavily optimized / hinted Haskell version took 11 seconds. Naive C implementation took 0.3 seconds.
I have no idea what is heavily optimized about that Haskell code. I am no Haskell expert, but all you did was use an unboxed array. I added one strictness annotation to the "v" parameter on your iter function and it reduced the time from 10 seconds to 1.5 on my laptop.
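For readers who haven't seen one, the kind of strictness annotation being described looks roughly like this. `iterLazy`/`iterStrict` are illustrative, not the poster's actual code: the bang forces the accumulator at every step instead of letting a chain of unevaluated thunks build up.

```haskell
{-# LANGUAGE BangPatterns #-}

-- Without the bang, each recursive call wraps the accumulator in another
-- unevaluated (+) thunk, which is only forced at the very end.
iterLazy :: Int -> Int -> Int
iterLazy 0 acc = acc
iterLazy n acc = iterLazy (n - 1) (acc + n)

-- The bang pattern forces acc at every step, keeping it a plain machine
-- integer in the loop.
iterStrict :: Int -> Int -> Int
iterStrict 0 !acc = acc
iterStrict n !acc = iterStrict (n - 1) (acc + n)

main :: IO ()
main = print (iterStrict 1000000 0)
```

Both compute the same sum; the difference is purely in when evaluation happens, which is why the fix is one character and yet worth a large constant factor.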
None of the other code you wrote uses a linked list, so I don't know why you used them for Haskell. You used vector types for most of them (which Haskell has, by the way). Pretty disingenuous.
I admit that the prominent use of lists in newbie Haskell tutorials could have misled you, but you seem to know the difference in data types since you used vectors everywhere else.
This is my first Haskell code, so I read several sources, and posted my code to some online sources asking for help. That's what I came up with. I tried a bunch of strictness hints, I found one that cut the time in half. Lots of trial and error. Apparently, there's another one I didn't find. Could you submit a pull request?
I accused you of that, because it seemed like you knew what you were doing with the other implementations, and then just suddenly forgot data structure performance when you got to Haskell. It was an asshole move, and I am sorry.
They aren't linked lists, but you still used a linked list internal to your "iter" function, where you used a vector (or array) there in other implementations.
I replaced the array/"surrounding" list with an unboxed vector, and removed the strictness annotations and it runs at ~0.85 on my machine, while the C code runs at around ~0.5.
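For the curious, the list-versus-unboxed-vector difference being described looks roughly like this sketch. It assumes the vector package, and `sumNeighbours` is an illustrative stand-in for the real surroundings computation, not the actual code from the repo.

```haskell
import qualified Data.Vector.Unboxed as U

-- Summing a cell and its neighbours: with an unboxed vector this is a
-- handful of O(1) indexing operations into a flat buffer, where a linked
-- list would mean pointer-chasing through boxed cons cells.
sumNeighbours :: U.Vector Int -> Int -> Int
sumNeighbours v i =
  sum [v U.! j | j <- [i - 1, i, i + 1], j >= 0, j < U.length v]

main :: IO ()
main = do
  let v = U.fromList [10, 20, 30, 40]
  print (map (sumNeighbours v) [0 .. 3])
```

The representation change, not any exotic compiler trickery, is what closes most of the gap to the C version.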
The code is so incredibly close to your SML code (where you also spell out that you are using the Vector type). I am just blown away by how matter of fact you were in your statement.
Please either change your benchmark to use the same data type as all of the other examples, or stop using it as an example of Haskell's performance.
My point is, I am not Haskell expert, but I know that linked lists aren't an efficient data structure for what we're doing. You didn't use lists in your other implementations, so it was just weird that you did for Haskell (originally, and then sort of again for life-array.hs).
For your life-array.hs implementation, you replaced the Grid with an array, but not the "surroundings" with one, even though you used Vector for both in your SML implementation.
Sorry if I came off a bit assholish. People have been saying forever that you need to be a Haskell expert to get good performance, and in my experience this just isn't true. I will admit that there is clearly a problem with how things are introduced, as it is common that new Haskellers think they can use lists for everything. I am sure part of the problem is that this is one of the few data structures available from the Prelude.
No sweat, thanks for the gist, I'll add it to the repo. I'm impressed, actually, if it brings the runtime under a second, because only a couple languages managed that.
The act of finding the correct strictness hints seems a bit like black magic.
Well, GHC is still slower than C, C++, Ada and Fortran, so the link doesn't really refute his point.
And, looking at the code, a lot of the Haskell is subverting purity and laziness by using things like Data.ByteString.Internal and Data.ByteString.Unsafe, so I think that actually makes the case even stronger.
That's an interesting choice of words. Quite obviously no useful program can be totally pure (otherwise it'd do nothing but heat the machine). And Simon Peyton Jones is on record saying that he's not sure if he'd carry the laziness torch quite so far if he had to do it all over again. Neither is to say that these aren't powerful or useful techniques, but rather that in Haskell they're chosen as defaults -- impure and/or strict code are the outliers that require justification, not the other way around. So given that, in what way is using ByteStrings (safe or unsafe) subversion? Far from it -- use the tool that makes sense.
With most functional languages, the usual technique to writing fast code is to first write it to be correct, and then transform it to be fast. And that's exactly what unsafe ByteStrings are: optimizations. There's nothing wrong with optimized code, and I see no problem with the technique. The only relation it bears to the OP is that it's an explicit unboxing. Perhaps the compiler should automatically do that unboxing, but that's easier said than done.
I guess my biggest problem is all of the people claiming Haskell is pure, lazy and fast. It seems disingenuous when to actually be fast requires swapping out the standard data structures with non-pure, non-lazy versions.
To put it another way, a naive implementation in C will compile and be relatively fast, while a naive implementation in Haskell will, at the very least, require replacing the standard data structures with things like ByteStrings, just to match the default C speed.
Pure and fast is certainly feasible -- you don't always need to use unsafe bytestrings for example. Laziness makes performance hard to predict because you have no clue when something will get evaluated -- there's no contract with the compiler. Maybe one day some group of really smart people will figure out how to take advantage of that to write compilers that produce really fast code, but I think language-wide laziness is not a feature bound for mainstream languages. Targeted laziness? Probably, it's very useful for expressing data structures. But not everywhere by default.
In any case, I don't think people are trying to be disingenuous. Haskell is, or can be, all three of those things. It's just very hard to get them all at the same time.
I'd have to disagree, considering that GHC maintains similar speeds to Mono and JVM languages. The question was about functional languages being slower, but he didn't specify slower than what. If the unspecified alternative is imperative languages, then there are plenty of imperative languages that are slower in that graph and will likely never be as fast as Haskell. Granted, functional languages are never going to be as fast as low-level languages like C, but that doesn't mean they're necessarily slow either.
Semi-OT: is there a good primer on GC people could recommend? His discussion of different GC strategies piqued my interest, but I'm not entirely sure where to start.
The GC book by Jones is the place to start to understand GC in general. He also has a new "GC Handbook," but I haven't looked through it yet to see how well it covers modern parallel and concurrent collectors, particularly for NUMA machines.
Having spent a large portion of the last five years working on a parallel concurrent collector that scales to 48-AMD and 64-Intel cores, I can say the game's totally different on NUMA architectures. Unfortunately, not much has been written yet on it (I have an MSPC paper; the Intel folks have a paper on Pillar that was at ISMM; that's about it). Feel free to ping me directly if you get to advanced material and need more pointers. Many of the systems have not yet been published, as it's shockingly difficult to publish on modern GC implementations these days.
Despite the (not great) title, I think it's actually a fairly interesting article, which isn't so much about the conclusion, but some difficulty and success points w.r.t. functional-language efficiency.