Re: 'root causes' -- I find the words somewhat important. Like, if you say you're looking for a root cause, then people tend to be in the sort of moral mindset, and have a harder time seeing it as a collection of contingent events.
Also, in the (truly amazing) "How Complex Systems Fail", he's pretty down on "root cause":
I'm with you on this one. Especially in an iterative context, there is zero value in looking for the true root cause.
The point of the exercise is to identify economical interventions that will get the system to produce better results. If you go much beyond that, people can get off into moral, analytical, or philosophical weeds and get lost.
As long as you do retrospectives and five-whys frequently, you can count on useful analytical depth to come over time. If an issue is really both important and subtle, it will crop up again. The next time you'll have another perspective, so it will be easier to find. And by waiting, you'll have avoided examining all the equally subtle but unimportant things.
Like the idea of focusing on a single core progression for a startup. Very much like the list of Gotchas and Edge Cases. My favorite:
>5. Getting Test Users. I often hear people rationalize a PR push as the only way they can get enough users to test product market fit.
>...
>The solution isn’t PR, it’s go to some events and make some friends in that market.
Yo, the author here. Thanks for the feedback. I totally meant idempotency, drat. (In fact, on Hadoop, thanks to speculative execution of reduce tasks, you also have to worry a bit about reentrancy, but what I was talking about was, in fact, idempotency).
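To make the distinction concrete, here's a tiny hypothetical sketch (not the author's actual pipeline, and the names are made up): an idempotent write keys output by record id, so re-running the same task after a failure overwrites instead of duplicating.

```python
# Hypothetical sketch of idempotent vs. non-idempotent output.
# The "datastore" here is just a dict standing in for the real thing.

store = {}

def idempotent_write(record_id, value):
    # Same input, same effect: running this twice leaves one entry.
    store[record_id] = value

def non_idempotent_append(log, value):
    # Re-running a failed-then-retried task that does this duplicates data.
    log.append(value)

idempotent_write("r1", 42)
idempotent_write("r1", 42)  # retry after a task failure: harmless
assert store == {"r1": 42}
```

Reentrancy (two speculative copies of the task running at once) is the harder property, since even an idempotent write can race with itself mid-update.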
Shutting down the pipeline: I hear you on prod/non-prod. For our setup, the pipeline ends up writing to a datastore, so if we kill the pipeline, the datastore is still up; it just stops updating. Which is working so far. May end up flagging suspect data as you suggest instead of a full stop (or only doing a full stop if more than a very small percentage of the data is suspect).
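A minimal sketch of that "flag instead of full stop" idea, with every name and the threshold invented for illustration: mark suspect records, and only halt if the suspect fraction crosses a small threshold.

```python
# Hypothetical sketch: flag suspect records, halt only past a threshold.
SUSPECT_THRESHOLD = 0.01  # made-up cutoff: halt only if >1% looks bad

def process_batch(records, is_suspect):
    # Tag each record instead of rejecting the whole batch.
    flagged = [{"value": r, "suspect": is_suspect(r)} for r in records]
    suspect_fraction = sum(r["suspect"] for r in flagged) / len(flagged)
    if suspect_fraction > SUSPECT_THRESHOLD:
        raise RuntimeError("too much suspect data; stopping pipeline")
    return flagged  # downstream consumers can skip suspect rows

batch = process_batch(range(100), is_suspect=lambda r: r == 7)
assert sum(r["suspect"] for r in batch) == 1  # flagged, pipeline keeps running
```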
No problem. I am not too familiar with Hadoop, but those speculative reduce tasks sound like a real blast to debug.
I can see why the approach in your blog would have a lot of appeal in that environment. It sounds like some sort of error flagging, in combination with a set of heuristics around what failed, how often, what time of day, etc., would be the way to go.
I find that intelligent monitoring systems like that are ultimately necessary in systems like this anyway; you just usually end up discovering that the hard way (I know I have, several times. It's one of those lessons you are tempted to unlearn in the interests of expediency). Does Hadoop help you out with that sort of thing?
Because I have that same problem -- if someone has nice visual slides, the Slideshare is often kind of useless.