
The lack of investment in reducing on-call pain is a real factor. I work on Oracle Cloud (OCI), and at least some of the orgs (VP-down) have figured out that this is worth focusing on, and on-call gets better and better as a result. My original team averaged something like 50 pager-worthy (sev2) events per week until we got moved into a new org that had the right philosophy. There we relentlessly drove that number down, because management realized that engineers made miserable by mundane ops fake-emergencies would eventually get fed up and leave, and that's not what they wanted (afaik, OCI has no such forced attrition). So we got put on a program of relentlessly tracking and categorizing the sev2 counts and committing to improving those numbers over a period of time. 25% of dev sprints were dedicated to improving ops (tools, better alarms, fixing long-backlogged bugs that led to pages), and now that team's ops are pretty easy and they are free to work on new features, which everyone prefers. I've since moved to another team whose ops had already been through this optimization, and I've never experienced a bad week of on-call there.

I won't pretend OCI is a panacea (lol, google "oracle cloud toxic work environment" for the latest stories), but at least they don't lack this particular piece of wisdom. The sheer number of regions they plan to operate doesn't really allow them to ignore dumb ops problems.


