So You've Been Paged: A Guide to Incident Response (scalyr.com)
46 points by kawera on Nov 14, 2016 | 58 comments


> Pager duty is essentially wage theft.

I disagree: it's a part of the job; I dare say it's an important part of the job ... equally important if you are a developer. You write software and you get paged when it breaks ... how is that wage theft? I find no feedback more effective.

What I do find frustrating, however, is being responsible for alarms/incidents on which I can take no corrective action; this is how friction grows between dev and ops teams.


At the company level, the market has Service Level Agreements (with requisite adjustments in price) for this sort of thing. It's clear that the market assigns a value to different response times. Further, if there weren't a value to having an employee on pager duty then the company would not do it. If that value is not reflected when compensating the employee then from an economic perspective, yes, the employee is losing out.


> If that value is not reflected when compensating the employee then from an economic perspective, yes, the employee is losing out.

I agree, but it is worth noting that the compensation might not be monetary: to some people, getting more time off for being on-call might be a better "value" than extra pay. There are many other avenues for compensation to consider.


Yep.

I am, for all intents and purposes, "on call" 24/7. I really don't mind, though.

In return, I go to bed when I feel like it, I get up when I feel like it, and I work when I feel like it. I work from home, too. I think it was around 7:00 a.m. this morning when I fell asleep and I woke up just before 3:00 p.m. I can't remember the last time I set an alarm. There are times that I'll go for two weeks without talking to my boss. On a nice, beautiful, summer day, I often say "f--k work" and go for a nice ride on the Harley instead. In the evening, after I get home, then I'll "go to work".

From the company's standpoint (we're an ISP), they're glad to have someone available in the middle of the night that can fix any problems that may arise (a.k.a. SHTF). Certainly, there are times when my boss might prefer that my ass was in a chair in our office every day from 8-5. In that case, though, something that broke at 5:05 p.m. wouldn't get fixed until I came in at 8 a.m. the next day.

It's not perfect and it definitely isn't for everybody but it seems to work out for us and, really, that's all that matters.


My quick guide to pager duty.

Step 1: Find a new job.

There are way too many opportunities out there to subject yourself to this nonsense. Pager duty is essentially wage theft.


I think this is because most companies do pager duty wrong. I highly recommend the Google SRE book[0] (notes here[1], chapter 11 covers oncall/pager). One thing mentioned in this book is compensation for being oncall. At Google we get fairly decent pay compensation for holding the pager, enough where it can incentivize people to be on the rotation.

(I'm a software engineer at Google who is oncall at this moment)

[0] http://shop.oreilly.com/product/0636920041528.do

[1] http://danluu.com/google-sre-book/


Thanks for the sources, I'll have to check them out. We've recently spent a considerable amount of time tracking, analyzing and evolving our PagerDuty alerting, and we've shared what we learned in a blog post[0].

[0] https://goshippo.com/blog/evolution-our-pagerduty-playbook-f...


> At Google we get fairly decent pay compensation for holding the pager, enough where it can incentivize people to be on the rotation.

Google is a big company. I expect them to 1) have people in all time zones so that there is no night shift, and 2) have many people on rotation so each individual is rarely on shift.

That's not comparable to smaller companies.


Just an anecdote about my team at Google. I definitely do night shifts (we do 1 week long rotations), but with a 30-minute SLA on responding to pages. I've had bad nights where my first page happens at 11pm, then they keep going until 4am. The better part is that we are oncall once a quarter or so (12 people on the rotation).

The thing is, there is basically a waiting list to join the rotation. Compensation is nice for those that are motivated by it. But it also exposes you to a lot of the infrastructure that you normally don't deal with (so it's a great way to learn).

We have a higher-up-the-stack SRE team that does 12-hour shifts so it's not as bad for them. They handle the larger scale issues that are beyond job specific issues (ex: datacenter issues).

I can understand this sucking if you are on a small engineering team where you can't do things like this (I've got friends at companies that employ < 10 developers, I've heard the stories). I guess I wasn't thinking about it for smaller eng teams where the number of people available to support the product isn't there.


I agree that it is wage theft the majority of the time. I had a previous gig that paid a 20% salary bonus for months you were on pager duty. If the pager went off, you got a 10% weekly bonus that increased with each incident up to 40% maximum. People fought like cats and dogs to get on pager duty at that place...I wish more businesses took that approach.


That seems like the wrong incentive, if you were also the people responsible for making sure pageable events didn't happen.


Completely agree.

If it's unimportant enough that the company can't afford to pay you a large multiple of your normal salary to work on it out-of-hours, it's unimportant enough that it can wait until you're next in the office.


Where in the article is it talking about who is taking the call or how much they are paid at all?


I didn't read the article, I was responding to the comment.


So who should support emergency issues? Nobody?


If you're running 24 hours then you should staff for that. If I work my 8 to 10 hours and then have to work another 2 or 3, it tends to wear one down.

Ideally, there would be a financial incentive to fix production bugs so businesses aren't waking up developers. This isn't true when you have exempt employees (US term for no paid overtime) doing your support. I've seen managers go totally friggin stupid with employees who went above and beyond[2].

I still think some managers think we would "code ourselves a mini van"[1] if they paid us for support.

1) http://dilbert.com/strip/1995-11-13

2) https://news.ycombinator.com/item?id=3015969


I really don't like the model of throwing support over the wall to some team that didn't write the software. There is no incentive to improve the software under those circumstances, and the people that get paged aren't in the position to fix what's broken.

If you can fix the problem at 3am without thinking, then so can a computer program. Write that computer program.
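For what it's worth, here is a minimal Python sketch of that idea, not anyone's actual tooling. The runbook mapping, the alert names, and the `page_human` callback are all hypothetical: known alerts get their no-thinking fix tried first, and a human is paged only when there is no scripted fix or the fix fails.

```python
from typing import Callable, Dict

# Hypothetical runbook: alert name -> a fix that needs no human judgment.
# Each fix returns True if it believes the problem is resolved.
Runbook = Dict[str, Callable[[], bool]]


def auto_remediate(alert: str, runbook: Runbook,
                   page_human: Callable[[str], None]) -> str:
    """Try the scripted fix for an alert; page a human only as a fallback."""
    fix = runbook.get(alert)
    if fix is None:
        # No known mechanical fix: this one really needs a person.
        page_human(alert)
        return "paged"
    try:
        resolved = fix()
    except Exception:
        # A crashing fix counts as a failed fix.
        resolved = False
    if resolved:
        return "auto-fixed"
    page_human(alert)
    return "paged"
```

Every alert that still reaches `page_human` is then a candidate for either a new runbook entry or a real fix in the software.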


I remember working at a utility company. I didn't last long (I was the "web master" - oh for the early 2000s), and nor did much of the support staff.

Why? Mainly because we didn't have "authority" to fix things. That had to come from a PM.

It might also have had something to do with the fact that after Ernst & Young came in to do some IT re-structure consulting, we had the situation where:

Support staff were supposed to be on call rotation for pager duty. PMs were not.

PMs were given company cell phones. Support staff were not.

PMs were given company cars. Support staff were not (note that when I say 'support staff' I don't really mean 'help desk' but programming. And this was a government utility where, rightly or wrongly, nearly everyone had a company car).

Someone observed this. Management's response was that the cars and phones gave the PMs "flexibility". Meanwhile support staff were to take calls on their own phones, and if need be drive in to work at 2am in their personal vehicles. PMs were not to be disturbed out of hours.

A little rant-y, but the underlying point? Companies with poor "pager policies" are likely to be problematic all around.


> There is no incentive to improve the software under those circumstances, and the people that get paged aren't in the position to fix what's broken.

My problem is there is no incentive for a company to fix the software if they aren't staffed for support. Also, I see a lot of DevOps people thinking they are the only ones who can fix or diagnose the problems. Often, software developers aren't experienced in networks, databases, or hardware. Thinking that devops is the be all of 24 hour support is a problem.

> If you can fix the problem at 3am without thinking, then so can a computer program. Write that computer program.

I've noticed a bit of a link between uncompensated support and whether time is allowed for adding fixes to the software.

Frankly, from a hit-by-a-bus problem, if no one but the people who wrote the software can support it, then you really have a problem.


> Frankly, from a hit-by-a-bus problem, if no one but the people who wrote the software can support it, then you really have a problem.

In theory, you have a few people that wrote the software. Who is doing code reviews? If everyone on your team dies in some horrible accident, the software is going to be the least of everyone's concerns.

Vacations are a more reasonable concern. Include vacations in your stated SLO. Support staff tends to want to go on vacation at the Usual Times too (Thanksgiving, Christmas, July 4, whatever.)


> There is no incentive to improve the software under those circumstances, and the people that get paged aren't in the position to fix what's broken.

Sure, one solution is to make developers responsible for support, too. I find it hard to believe that's the best solution.


Personally, I like that solution, but I'm a big fan of "he who breaks it, fixes it".

If I implement a shitty "solution" to a problem and it ends up causing even worse problems then I should be the one to "feel the pain" and it should be my responsibility to fix it/do it over/do it right. Nobody else should suffer because I f--ked up.


> There is no incentive to improve the software under those circumstances

Classy. This comment is 100% exactly the attitude. This is why Step 1 is to find a new job. This poster is punishing you like a parent would a child. Is this the way you want to be looked at and treated? Did you study as hard as you did and spend so much time developing your craft for this?

There are better opportunities out there. 1,000s of them. Avoid the jrockways of the world and you will have a much happier career and life.


You insulted the person without at any point refuting his point about the incentive and ability to respond quickly to outages.


> without at any point refuting his point about the incentive

Grandparent post described a way to create incentive. He posited that without it there wasn't incentive, he did not prove - nor even attempt to argue - that it was the only (or best) way to create incentive.

> ability to respond quickly to outages

Depending on the nature of a problem and training of the people involved, there is zero reason to assume the programmer of an application is more qualified to triage and remedy an outage than someone specializing in operations. The problem could be in the infrastructure around the app. Even if the underlying problem is something the developer is well suited to fix, they still might not have the expertise to deliver the best temporary solution to get online ASAP, and stay there until the real fix can be landed in the codebase.

To conclude, parent poster does not need to waste breath articulating why such obviously fallacious reasoning is bullshit.


> ... there is zero reason to assume the programmer of an application is more qualified to triage and remedy an outage than someone specializing in operations.

That's why we've got AWS, Azure, Google Cloud, and so on, right? The programmer gets to deal with only their code and let someone else worry about the rest of the infrastructure.


Unless you're talking about AWS Lambda, all of those things still have software infrastructure that needs system administrators.


> If you're running 24 hours then you should staff for that.

Right, so what's the problem? The original article isn't talking about putting people who aren't paid for pager duty on pager duty. The OP is whining about the mere idea of it and suggesting that if you have that job you should quit.


The same people who do this in other industries. Paid staff scheduled to work those hours. Honest question: You didn't know that?


Another poster answered the question but your honest question isn't that honest. DevOps is paid for emergency ops - it's part of the job description.


Anything in the world can be part of a job description. That says nothing about whether it is reasonable or ethical.

Many, many, many industries have overnight staff to handle these issues, but likely because software engineers tend to be young, lack a union and are paid salary with no overtime compensation, it is somehow acceptable in this industry. Which is why I honestly asked that question. Why do you and the above poster find it so normal to be on pager duty when so many other industries, often with much more critical services, do not require it?


> Why do you and the above poster find it so normal to be on pager duty when so many other industries, often with much more critical services, do not require it?

False premise. Many jobs require it, including low paid ones across many industries. Many of my jobs which were not in IT required it sometimes (like working at a golf course and being called out for a jammed golf cart garage door in one case - which I clocked hours for btw).

If it's part of your job description and the salary you get covers it then I'm not seeing why you should "find a different job."


> If it's part of your job description and the salary you get covers it then I'm not seeing why you should "find a different job."

Two reasons:

1. Your salary doesn't cover night and weekend pages under labor law in most first world countries (the US is backwards in this regard, but luckily the economy is doing well enough at the moment that jobs are aplenty).

2. Quality of life. If your job requires on call rotation, interrupting your off-work hours or your sleep, and you're not getting paid enough (<$100K/year), immediately start seeking out another job. It's a seller's market.


So, in other words, find a new job if you don't make enough to justify getting paged. Right, OK.

That has nothing to do with this article, though.


Just doing my part to ensure people who read my comment aren't taken advantage of by their employer.


Real industries that run 24 hours a day and have to be up all the time run three shifts; they don't screw their day shift into having to be available at the drop of a hat to cover.

They also pay significant overtime multipliers when they do have to call somebody in outside of their scheduled hours.


How does being on pager duty not coincide with being paid to work these hours?


There is at least one big company out there, where every dev is part of what they call the "on call rotation" -- where being "on call" typically means being on pager duty -- without any special compensation for that. The only exemption is if you're working on a project that isn't yet in production. They tell you about it during their recruitment process and they communicate it very clearly: take it or leave it, there are no special cases. It sucks, but a lot of people take it anyway.


Yup. We're on call for a week at a time. Only compensation is the "privilege" of working from home on the Friday of the week we're on call. No extra pay, no comp time, nada. Luckily the number of calls is very low, maybe 1-2 per week.


This only counts if you aren't highly compensated for the unpredictable hours.


Step 0: Agree, on the condition that any alert received outside working hours becomes a priority task for someone the next day -- whether that means them fixing a bug, adjusting the alert, or investigating and explaining why the cause is extremely unlikely to recur.

This is working well for me, has improved the service for users, and has made our monitoring system much more useful. (Alerts used to be about as accurate as "Main website broken", but are now more like "microservice X is taking >10s to respond".)
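To illustrate the difference between the two kinds of alert, here is a rough Python sketch. The service names, the latency readings, and the 10-second default threshold are assumptions for illustration only; it is not how our monitoring is actually implemented.

```python
from typing import Dict, List


def specific_alerts(latencies_s: Dict[str, float],
                    threshold_s: float = 10.0) -> List[str]:
    """Return one alert per slow service, naming the service and the reading.

    latencies_s maps a (hypothetical) service name to its latest response
    time in seconds.
    """
    return [
        f"{name} is taking >{threshold_s:g}s to respond (observed {latency:.1f}s)"
        for name, latency in sorted(latencies_s.items())
        if latency > threshold_s
    ]
```

An alert phrased this way tells whoever picks it up the next day which service to look at, instead of just "Main website broken".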


Last time I was on paid pager duty there was only a slim chance of a 3 a.m. call. Therefore we had a strict rota, as it would have been unfair if someone was getting all of the free cash. This only amounted to £300 a month due to everyone wanting to be on the rota; however, I am sure that paid for my Christmas by the end of the year.

Plus the minor incidents that did happen also turned out to be good team moments, everyone would hear that 'you fixed it'. Okay you did reboot-retry on that file server that wasn't responding and everything was fine five minutes later, but a lot of knowledge went into pressing that reboot button and, as far as manager types are concerned, you saved the day. To their non-technical minds that could be voodoo wizardry so they are pleased, feather in cap given.

I also believe that doing some out-of-hours emergency support is good for one's own education: you learn to think quickly on your feet because the situation at hand is urgent. Just being placed in this position helps you gain experience of this type of problem solving. Total focus on the task in hand is easily achievable; one is not thinking about lunch or going home and only half focused.

Once you have that experience then find a new job!


> Okay you did reboot-retry on that file server that wasn't responding and everything was fine five minutes later, but a lot of knowledge went into pressing that reboot button and, as far as manager types are concerned, you saved the day.

You likely understood the consequences of your action too (and if not, you just learned! :P )


> Pager duty is essentially wage theft.

It is only reasonable to hold that stance if it goes both ways. If you consider it wage theft to expect you to work on-call hours, then you also must consider it proper work ethic to always work 40 hours a week, never taking an afternoon to... well, do anything. 40 hours, no more, no less. Don't be late in the morning, either.

If that is how you want to work, there are certainly jobs that offer it.


I don't see how it's wage theft if you were aware of it when you accepted the job.


It ranks up there with unpaid internships (which should be illegal).


Unpaid internships are a way for companies / organizations to know that the intern comes from an affluent background because poor people cannot afford to do those types of internships.

Good luck getting rid of something that keeps it all in the club.


Yep.

Personally, I'm a free-market type and so I struggle with any moral objection to these internships. But I find it amusing that internships are so popular within institutions (govt, media, big tech) full of people with politics that should find fault with them.

By contrast: when I was the work experience boy at an electrical goods maker, I would have accepted no pay (I needed it for Uni). But they made sure to pay me out of a sense of professionalism.


> Personally, I'm a free-market type and so I struggle with any moral objection to these internships. But I find it amusing that internships are so popular within institutions (govt, media, big tech) full of people with politics that should find fault with them.

Same story here. I am truly amazed at the people who talk to me about opportunity and diversity and still have unpaid internships. It is an amazingly effective filter for a certain class of individual.

> By contrast: when I was the work experience boy at an electrical goods maker, I would have accepted no pay (I needed it for Uni). But they made sure to pay me out of a sense of professionalism.

The vocational fields got it right with paid apprenticeships. Shows the value of your work increasing as your skill and knowledge does.


In the US, unpaid internships that provide "immediate advantage from the activities of the intern" _are_ illegal.


I work remotely on retainer for a company I worked for when I lived back in Oakland. I work 24/7 pager duty effectively. A lot of times I'll be summoned after working 8-10 hours to just look things up for an important client or confirm sales numbers (which always are correct). I work ~9AM-6PM EST hours but folks at the company generally work 12PM-8PM PST so there's really no good way to plan around those sort of support calls.

It wasn't always this way though. The company used to have other developers but they never replaced them when they left. The business unit I work under switched to a maintenance mode effectively where we're just upgrading existing systems for CVEs, supporting the AWS setup, and dealing with important client requests when they rarely come in.

I'd push for more money but there just isn't the budget for it and they've made that clear to me. I even had to fight to stay full time as they wanted me to work fewer days but still be on call: "we need to figure out the best way to utilize our resources" (paraphrased).

I will say it's detrimental to my health. I wake up most mornings and immediately grab my phone out of fear I missed an alert in the night. That whole bit about utilizing their resources hasn't been sitting well with me though so I'll probably move on after I finish the extra documentation of our systems they're now pushing for.

Just putting a counter point to the people saying, "You signed up for this, you should know what you were doing." It's not always that simple. I have a family now and can't just pick up and change everything at a whim.

I'll also say I get nervous going into movie theaters, etc. where I'll be disconnected for a few hours. It's just not a healthy situation at all.


> The good news is this: All issues eventually get resolved (unless you just give up and quit. Please don’t do that.)

Well, that isn't exactly true. The issue might continue based on the way work is scheduled in your project. It's amazing how the business and managers often don't schedule the "stop the problem from happening" work when it becomes apparent that the support / devops staff can fix production issues themselves. I've been there, and watching the business and your managers reject the time needed to fix the problem during the next iteration / release / sprint is soul-killing. It really doesn't bring as much business value as these new features after all.


Please increase the contrast!


After a few years in management, when it comes to paging there are two groups:

1) People that want to fix their own code. They usually don't have to be told "you have pager duty" -- they just care. They even become upset if they don't know about the page and will even install their own paging systems without your knowledge.

2) People that can't be bothered. They don't answer the phone, don't answer Slack; instead they just watch a re-run of The Mindy Project and eat ice cream while your production system is crashing.

Generally people in 1) fix things for 2). Wage theft is really from group 2), because they steal from 1). Each time I have fixed something for group 2), I either fire them personally on Monday or start an all-out campaign to get them fired. Firing is hard. But letting the entire team down just makes it easy. Ironically, if you want fewer pages just fire more of 2).


There is a third group -- which I would place myself in. I feel a strong responsibility to an environment I work in; but I also have a need for a disconnect and work/life balance. I absolutely am not ok with people in group #2, but I think group #1 is similarly unhealthy -- being constantly aware of issues can lead rapidly to burnout.

This is why you have to have an on-call rotation, with an SLA (i.e. all pages are ack'd within ten minutes) with enforcement for people who regularly miss pages, and keep the life-disruption to one or two team members. Obviously, anyone who's worked on a large software product knows you might get an escalation even if not on-call, but that's a hugely different workload than being attached to a pager and having to respond to them.
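As a rough illustration of what an ack SLA check could look like, here is a minimal Python sketch. The ten-minute figure comes from the example above; the function name and the timestamps are hypothetical and not tied to any particular paging product.

```python
from datetime import datetime, timedelta
from typing import Optional

# Assumed SLA from the example above: every page is ack'd within ten minutes.
ACK_SLA = timedelta(minutes=10)


def missed_ack_sla(paged_at: datetime, acked_at: Optional[datetime],
                   now: datetime) -> bool:
    """True if the page was ack'd late, or is still un-ack'd past the SLA."""
    if acked_at is not None:
        return acked_at - paged_at > ACK_SLA
    # Not yet acknowledged: measure against the current time instead.
    return now - paged_at > ACK_SLA
```

Counting how often this returns True per person is one way to spot the "regularly miss pages" pattern without relying on anyone's memory of a bad week.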


Every time I feel our industry has evolved beyond the "devs vs. managers" culture, something like this comes up to make me despair again.

I find it absolutely horrifying that you don't stop at the false dichotomy of dividing every dev into those two groups, with no shades of grey in between, but actually go one step further and recommend firing everyone who isn't in the first group.

Yes, people care. Yes, they want to fix their own code. They also want to teach their little kids to ride a bike without training wheels. Your system might have problems every week, but that moment when your kid gives you a big grin and says "look, daddy, I'm doing it alone!", that happens once in that kid's life. So unless your system is controlling air traffic or supplying oxygen to hospital patients or something like that, you might want to consider that there's more to people's life than their work.


You can't really expect everyone to be available on-call though. If the people in 1) want to fix their own code, why can't they do it more often?

Where I work we have a separate pager duty group, that anyone can join. Why would you? Because a) you care, and/or b) the compensation is great.

If you don't want to, that's great. No one will hold it against you. We are people after all.


I'm not sure what industry this was in, but neither reflects my experience working in Telecom and HIPAA-related environments.

We couldn't push any fixes without full UAT/validation testing. No "install own paging systems" would be tolerated on national phone/video networks or anything HIPAA related. We would've been walked out the door ASAP.

I think it really depends. We had critical SLA and support levels we had to respond to. If a backup was called and there wasn't a very specific reason the primary was not able to respond, there would be consequences.

I hated not being able to leave or do something spontaneously (especially on a weekend, for example), because we always had to be near. Of course, all that would change if they made it attractively worth my while. But almost nobody does. :)



