"Just as you can't add quality to your code just before shipping, seldom can you significantly improve performance without dramatic rewriting."
That statement struck me as odd and possibly revealing... You should be able to significantly improve performance with a few changes in code written well enough. Here, the code is modular, so a given sort of slow thing should happen in only one place.
But naturally, in poorly written code, you need a rewrite to optimize, or to improve quality, or to change anything else. It makes me wonder which kind of code he's looking at in making his generalizations. I'd agree it is never too early to write good code.
Yes, I disagreed with that. In almost any project I've worked on that has never been profiled, over 90% of the time is spent in a few functions, which can be easily optimised.
Of course, after a few passes through the profiler, things start getting harder, but that's a separate issue.
I get to deal with the separate issue fairly often. After a few passes, 90% of the time is being spent in 90 functions taking no more than 2% each. At that point, I have to start ripping out layers of std::map<std::string, boost::shared_ptr<boost::any> >. That's even less fun than it sounds...
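For anyone who hasn't had the pleasure, here's a sketch (names made up by me) of the kind of layer I mean, next to the flat thing it eventually gets replaced with. Every read pays for a map traversal with string comparisons, a shared_ptr dereference, and an any_cast, and the cost is smeared thinly across the whole codebase instead of concentrated in one hot function the profiler can point at:

    #include <map>
    #include <string>
    #include <boost/any.hpp>
    #include <boost/shared_ptr.hpp>

    // The over-general layer: a string-keyed bag of boost::any values.
    typedef std::map<std::string, boost::shared_ptr<boost::any> > PropertyBag;

    float get_radius(const PropertyBag& bag) {
        PropertyBag::const_iterator it = bag.find("radius");
        if (it == bag.end()) return 0.0f;           // missing key
        return boost::any_cast<float>(*it->second); // throws on wrong type
    }

    // The flat alternative you get after ripping the layers out:
    // a plain struct, no lookups, no casts.
    struct ParticleSettings {
        float radius;
        float lifetime;
    };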
> You should be able to significantly improve performance with a few changes in code written well enough.
Unfortunately, I don't think that is usually what happens. In an established code base that is reasonably well designed and well written, the performance is rarely dominated by a small number of very expensive bottlenecks that are ripe for local optimisation. Such things have usually been worked out a long time ago and there is no low hanging fruit left to pick.
At that point, if you've got five layers of architecture and each is only running at 75% efficiency because of all the little details, that might take your overall performance down to perhaps 25% of what it could be (0.75^5 is roughly 0.24). If your software is I/O bound anyway or runs in half a second, you might not care. If that is the difference between taking a week to complete your research calculations and taking a month, or it represents a 5x increase in your hardware budget, then you probably care very much.
> ...In an established code base that is reasonably well designed and well written, the performance is rarely dominated by a small number of very expensive bottlenecks that are ripe for local optimisation. Such things have usually been worked out a long time ago and there is no low hanging fruit left to pick.
Uh yeah. But I think this discussion is about creating code and when to optimize it. It is kind of a tautology that when the code has been optimized and the low-hanging fruit has been picked, it is, uh, gone.
Sure, but if you get to "just before shipping" then hopefully you are already well past that point. As the article said, you typically can't just retrofit more performance at that stage without some serious rewriting.
If you don't think ahead, it's easy to expose an API with implicit implications of poor or at least limited performance. The rest of your code base becomes dependent on it, and the month before shipping when you start really focusing on optimizing you realize it's too late to do much of anything to fix it. Sometimes it can be fixed but only by adding a ton of behind-the-scenes code complexity to that API to try to speculatively optimize for anticipated usage patterns.
A simple, innocent-looking example is a string data type that offers (and implicitly encourages) use of random access. If you're using a fixed byte-width for characters that's unproblematic, but it's an issue if you later replace it with a variable width representation like UTF-8. If all your string processing client code is written to rely on random access rather than incremental iterators, you have a problem. You could choose to store an index array alongside the character array. That would roughly double your string storage requirements, but at least you could still offer O(1) random access.
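To make that concrete, here's a rough sketch of the index-array approach (my own names, well-formed UTF-8 assumed, no validation or error handling): one 32-bit byte offset per code point buys you O(1) random access at roughly double the storage.

    #include <cstdint>
    #include <string>
    #include <vector>

    // UTF-8 string with a side table of byte offsets, one per code point.
    class IndexedUtf8String {
    public:
        explicit IndexedUtf8String(const std::string& utf8) : bytes_(utf8) {
            for (std::size_t i = 0; i < bytes_.size(); ) {
                offsets_.push_back(static_cast<std::uint32_t>(i));
                unsigned char lead = static_cast<unsigned char>(bytes_[i]);
                if      (lead < 0x80) i += 1;   // ASCII
                else if (lead < 0xE0) i += 2;   // 2-byte sequence
                else if (lead < 0xF0) i += 3;   // 3-byte sequence
                else                  i += 4;   // 4-byte sequence
            }
        }

        std::size_t length() const { return offsets_.size(); }

        // O(1) random access: decode the code point at the cached offset.
        char32_t operator[](std::size_t i) const {
            std::size_t p = offsets_[i];
            unsigned char lead = static_cast<unsigned char>(bytes_[p]);
            if (lead < 0x80) return lead;
            int extra = (lead < 0xE0) ? 1 : (lead < 0xF0) ? 2 : 3;
            char32_t cp = lead & (0x3F >> extra);
            for (int k = 1; k <= extra; ++k)
                cp = (cp << 6) | (static_cast<unsigned char>(bytes_[p + k]) & 0x3F);
            return cp;
        }

    private:
        std::string bytes_;                  // UTF-8 payload
        std::vector<std::uint32_t> offsets_; // byte offset of each code point
    };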
Or you could add code complexity by having each string remember the last index it looked up (sometimes called a finger), and if the next requested index is nearby, it uses simple O(distance) forward or backward movement to return the nearby character's value. That seems like a good compromise. But what about multithreading? It looks like we need this cached index to be thread local. Not too bad. But what if we're interleaving accesses to the same string within a single thread? Now the different accesses are fighting over the same cached index. It looks like we need a multiple-entry cache to support this usage pattern.
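A minimal sketch of the finger idea, again with my own names and assuming well-formed UTF-8. Note the mutable per-object cache behind a const method: that's exactly where the threading and interleaving problems above come from.

    #include <cstddef>
    #include <string>

    class FingeredUtf8String {
    public:
        explicit FingeredUtf8String(std::string utf8) : bytes_(std::move(utf8)) {}

        // Byte offset of code point `index`, found by walking from the finger.
        // Cost is O(distance from the last access), not O(index).
        std::size_t byte_offset_of(std::size_t index) const {
            while (finger_index_ < index) {              // walk forward
                finger_offset_ += seq_len(finger_offset_);
                ++finger_index_;
            }
            while (finger_index_ > index) {              // walk backward
                do { --finger_offset_; }                 // skip continuation bytes
                while ((static_cast<unsigned char>(bytes_[finger_offset_]) & 0xC0) == 0x80);
                --finger_index_;
            }
            return finger_offset_;
        }

    private:
        std::size_t seq_len(std::size_t p) const {
            unsigned char lead = static_cast<unsigned char>(bytes_[p]);
            if (lead < 0x80) return 1;
            if (lead < 0xE0) return 2;
            if (lead < 0xF0) return 3;
            return 4;
        }

        std::string bytes_;
        mutable std::size_t finger_index_ = 0;   // last resolved code point index
        mutable std::size_t finger_offset_ = 0;  // its byte offset in bytes_
    };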
If you're a good engineer you can probably find the right compromise along the above lines that works for your code base. But it would have been much better if your API design hadn't constrained your future opportunities for changing internal representations and optimizing performance.
Here's another example: OpenGL's immediate mode graphics interface where you can specify vertex data piecemeal. Because everything written in that style is code driven rather than data driven, the driver can't just allocate a single dynamic vertex buffer with a fixed vertex format and push attributes onto it in response to glColor(), glVertex(), etc. What you can do (and what the good OpenGL drivers do) is to speculatively keep around a bunch of staging buffers with different vertex formats behind the scenes that are first created dynamically in response to recurring vertex and primitive patterns. You can see how it goes: It's a lot like the implicit caching in the string random access example, and it has all the same problems, and a whole lot of new ones.
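For anyone who hasn't used both styles, the contrast looks roughly like this (legacy fixed-function GL, trivial triangle, error handling omitted). In the first function the driver only learns the vertex format one call at a time; in the second the whole layout and extent are known up front, so the data can be copied in one block.

    #include <GL/gl.h>

    // Chatty, code-driven: attributes arrive piecemeal.
    void draw_triangle_immediate() {
        glBegin(GL_TRIANGLES);
        glColor3f(1.0f, 0.0f, 0.0f); glVertex3f(-1.0f, -1.0f, 0.0f);
        glColor3f(0.0f, 1.0f, 0.0f); glVertex3f( 1.0f, -1.0f, 0.0f);
        glColor3f(0.0f, 0.0f, 1.0f); glVertex3f( 0.0f,  1.0f, 0.0f);
        glEnd();
    }

    // Bulk, data-driven: format and count are explicit before the draw call.
    void draw_triangle_arrays() {
        static const GLfloat positions[] = { -1,-1,0,  1,-1,0,  0,1,0 };
        static const GLfloat colors[]    = {  1,0,0,   0,1,0,   0,0,1 };
        glEnableClientState(GL_VERTEX_ARRAY);
        glEnableClientState(GL_COLOR_ARRAY);
        glVertexPointer(3, GL_FLOAT, 0, positions);
        glColorPointer(3, GL_FLOAT, 0, colors);
        glDrawArrays(GL_TRIANGLES, 0, 3);
        glDisableClientState(GL_COLOR_ARRAY);
        glDisableClientState(GL_VERTEX_ARRAY);
    }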
These kinds of late-in-the-day performance fixes for early API design decisions (even though the early API is perfectly clean and modular, in fact maybe too modular) are doable but they are costly in complexity and programmer time. They are also fragile. In my two examples, if you're in the sweet spot of the behind-the-scenes caching system, you'll be seeing good performance. But change one small thing that from your point of view shouldn't impact performance, and you might suddenly fall off a performance cliff because the cache is now too small, or whatever.

Incidentally, this problem of fragile automatic optimization isn't restricted to library design. It's also a very serious issue with compilers. Modern compilers rely way too much on a fragile form of "crystal ball optimization", as one of my coworkers likes to call it. It makes it very difficult to write code with predictable high performance and even harder to maintain.
You're right that you can create a nice, modular API and have it be slow as heck if its model doesn't correspond to the underlying hardware or processes.
But I would claim you are still in better shape than if you'd done spaghetti code.
Caching, in a multitude of forms, can solve many problems for classes that treat resources as more available than they really are. You yourself give the solution for your string example - if the code actually needs random access in any performance-driven fashion, you'll need the extra array, so there's no problem. Double buffering and similar stuff have put to rest the problem of repainting pieces of windows in ordinary GUI programming. I'd imagine something similar could solve whatever the problem is you're talking about with OpenGL.
And given that you admit optimizations tend to be fragile constructs, it seems your argument strengthens my argument that they should generally be done last.
> You're right that you can create a nice, modular API and have it be slow as heck if its model doesn't correspond to the underlying hardware or processes.
Well, the underlying problem is that the design of an apparently abstract interface can commit you to a concrete choice of data representation and implementation strategy. If you expose and encourage use of random access in your string data type, you are committing yourself to a fixed byte-width representation unless you want to pile on implementation complexity of the sort I described while still only managing to solve the problem imperfectly. (As an aside, my favorite implementation trade-off for this problem might be Factor's approach where any given string internally has a fixed byte width of either 1 or 4 depending on whether it contains exclusively ASCII or some non-ASCII characters. It gives you some of the benefits of variable-width representations at virtually no additional implementation complexity.)
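Something along these lines, roughly (my own names and a big simplification: the choice here is made once at construction, whereas real Factor also deals with mutation). Scan once, then store either one byte per character or four, and random access stays O(1) either way:

    #include <cstdint>
    #include <string>
    #include <vector>

    class FlatString {
    public:
        explicit FlatString(const std::u32string& chars) {
            bool ascii_only = true;
            for (char32_t c : chars)
                if (c > 0x7F) { ascii_only = false; break; }
            if (ascii_only)
                for (char32_t c : chars) narrow_.push_back(static_cast<char>(c));
            else
                wide_.assign(chars.begin(), chars.end());
        }

        std::size_t size() const {
            return narrow_.empty() ? wide_.size() : narrow_.size();
        }

        // O(1) access regardless of which representation was chosen.
        char32_t operator[](std::size_t i) const {
            return narrow_.empty() ? wide_[i]
                                   : static_cast<char32_t>(narrow_[i]);
        }

    private:
        std::string narrow_;          // 1 byte per character, ASCII-only strings
        std::vector<char32_t> wide_;  // 4 bytes per character otherwise
    };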
> But I would claim you are still in better shape than if you'd done spaghetti code.
Probably, but why the strawman?
> You yourself give the solution for your string example - if the code actually needs random access in any performance-driven fashion, you'll need the extra array, so there's no problem.
But the size of the index array negates any space efficiency advantage a variable width representation might have had in the first place! It would only make sense if you needed to zero-copy pass a UTF-8 or UTF-16 buffer to a foreign API while still supporting constant-time random accesses into the buffer on your side.
> And given that you admit optimizations tend to be fragile constructs, it seems your argument strengthens my argument that they should generally be done last.
Optimizations can definitely be fragile, but in my post I was referring specifically to automagic black box optimizations with my fragility claim. Put another way, performance guarantees should be an explicit part of your API design, along with error conditions and everything else. That's one thing STL got right. You have to lay the groundwork for future performance optimization possibilities with good API design. Those possibilities will be strictly limited if you don't think ahead. That might mean going with a bulk-oriented, data-driven API rather than a chatty, code-driven API, e.g. compare the way render state and shader constants work in Direct3D 10 versus its predecessors.
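As a rough illustration of the chatty-versus-bulk contrast (hypothetical interfaces of my own, not any real graphics API): in the first, the implementation sees one value at a time and has to guess when a batch is complete; in the second, the caller hands over a whole struct, the copy happens in one operation, and the cost model is visible in the API itself.

    #include <cstddef>

    // Chatty, code-driven: one named constant per call.
    struct ChattyRenderer {
        virtual void SetConstantFloat(const char* name, float value) = 0;
        virtual void Draw() = 0;
        virtual ~ChattyRenderer() = default;
    };

    // Bulk, data-driven: all per-frame constants travel together.
    struct FrameConstants {
        float view[16];
        float projection[16];
        float time;
    };

    struct BulkRenderer {
        virtual void UpdateFrameConstants(const FrameConstants& constants) = 0;
        virtual void Draw() = 0;
        virtual ~BulkRenderer() = default;
    };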
This is where experience and forward thinking are necessary. Crazy notion, I know.