For the system I work on, I am on call one week out of every seven. For most of the past ten years, I have been part of organized on call rotations for the systems I have been developing. I think being on call is a logical way of taking responsibility for your work, and you also learn a lot from it. However, it is stressful and inconvenient, so you should get paid for it.
Why Developers Should Be On Call
Many systems need to be available around the clock, so somebody has to be on call to support them. In my mind, it is natural that the people developing a system also support it.
Skin in the game. As a developer, taking responsibility for your code means not just writing it, but also testing it and making sure it does what it is supposed to do, even when it is running in production. When you know you might get woken up in the middle of the night if there is a problem, you become more careful when coding and testing. In his book Antifragile, Nassim Nicholas Taleb mentions how Roman engineers had to spend some time under the bridges they built, to ensure they had done a good job.
Holistic view. Troubleshooting the system forces you to get a good understanding of how it all works. If you don’t work with support, it is easy to end up knowing some parts really well, but not seeing the whole picture of how all the parts connect and interact. Also, you get to see firsthand all the ways in which the customers are using the system.
Drives improvements. When you work on fixing problems in the system, it is immediately clear if the logging or troubleshooting tools aren’t as good as they should be. As a developer of the system, you can fix these problems as you run into them. You can improve the logging by adding more relevant information, or by removing log statements that only clog up the logs. You can write scripts and other tools that help with troubleshooting and recovery. You can remove false alarms. Over time, many such small improvements add up to a much better and more resilient system.
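To make this concrete, here is a minimal sketch of the kind of logging improvement I mean. The component, names and data are made up for illustration; the point is to log the context you will need when troubleshooting at night, rather than a bare error message.

```python
import logging
from dataclasses import dataclass

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("trade_importer")  # hypothetical component name


@dataclass
class Batch:
    id: str
    source: str
    trades: list


def process(batch: Batch) -> None:
    # Stand-in for the real work; fails on an empty batch to show the error path.
    if not batch.trades:
        raise ValueError("empty batch")


def import_trades(batch: Batch) -> None:
    # Log the identifiers you will need at 3 a.m., not just "importing data".
    logger.info("Importing batch %s (%d trades) from %s",
                batch.id, len(batch.trades), batch.source)
    try:
        process(batch)
    except Exception:
        # logger.exception logs at ERROR level and includes the stack trace.
        logger.exception("Import failed for batch %s from %s", batch.id, batch.source)
        raise


import_trades(Batch(id="B-17", source="partner-feed", trades=[{"notional": 1_000_000}]))
```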
Alarms
Phone calls and alarms. At TriOptima, we have an emergency number that can be called if there is a problem with the system. The number is forwarded to the developer on call. Over time, we have also added more and more monitoring of, for example, exceptions thrown, queue lengths, and memory, CPU and disk utilization. The monitoring generates alarms that wake up the developer on call. These days we are almost never woken up by phone calls, because we get the alarms from the monitoring before most customers notice a problem.
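Stripped down, many of these checks amount to comparing a measurement against a threshold and paging whoever is on call. The sketch below shows the idea for disk utilization; the threshold and the page_on_call hook are assumptions for illustration, not our actual monitoring setup.

```python
import shutil

DISK_ALARM_THRESHOLD = 0.90  # assumed threshold; tune per system


def page_on_call(message: str) -> None:
    """Hypothetical hook into whatever paging/alerting service is used."""
    print(f"ALARM: {message}")


def check_disk(path: str = "/") -> None:
    # shutil.disk_usage returns total, used and free bytes for the given path.
    usage = shutil.disk_usage(path)
    used_fraction = usage.used / usage.total
    if used_fraction > DISK_ALARM_THRESHOLD:
        page_on_call(f"Disk on {path} is {used_fraction:.0%} full")


if __name__ == "__main__":
    check_disk()
```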
Heartbeat jobs. For two of the major interacting systems, we run periodic heartbeat jobs (every minute and every five minutes). If they are not executed as expected, alarms are raised. These have proven to be extremely useful, because they tend to catch a whole range of problems. For example, recently the server sending us data ran out of threads. One effect of that was that our heartbeat job did not run, so we got an alarm. On other occasions, we have become aware of connectivity problems through the heartbeat jobs. In the book Release It!, Michael T. Nygard calls these synthetic transactions.
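The structure of a heartbeat job can be sketched roughly as below. The file-based handoff, names and timings are my simplification for illustration: one job periodically exercises a real code path and records that it succeeded, and the monitoring side raises an alarm if that record goes stale. Because the heartbeat runs through the whole chain, almost any failure along the way makes it late.

```python
import time
from pathlib import Path

HEARTBEAT_FILE = Path("/tmp/heartbeat_trade_feed")  # hypothetical location
MAX_AGE_SECONDS = 120  # alarm if two one-minute cycles are missed


def page_on_call(message: str) -> None:
    print(f"ALARM: {message}")  # hypothetical paging hook


def run_synthetic_transaction() -> None:
    """Placeholder for a real end-to-end check, e.g. sending a dummy record through the system."""


def heartbeat_job() -> None:
    """Scheduled every minute: exercise a real code path, then record the success."""
    run_synthetic_transaction()
    HEARTBEAT_FILE.write_text(str(time.time()))


def check_heartbeat() -> None:
    """Run by the monitoring side: alarm if the heartbeat record is missing or stale."""
    try:
        age = time.time() - float(HEARTBEAT_FILE.read_text())
    except FileNotFoundError:
        age = float("inf")
    if age > MAX_AGE_SECONDS:
        page_on_call(f"Heartbeat has not run for {age:.0f} seconds")
```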
Easy rollbacks. At work, we deploy new features and bug fixes as they are ready, which means many small deploys throughout the day. If we get woken up at night because of a problem, the solution is often to roll back the most recent deploy. This is an easy and quick operation thanks to Kubernetes. A few changes, such as database migrations, can be harder to roll back. But since we know this, and have skin in the game, we try to write them to be as easy as possible to roll back. We also make high-risk deployments in the morning, so we have the whole day to monitor the system.
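In practice a rollback like this is just a couple of kubectl commands; the small wrapper below is a hypothetical sketch (the deployment and namespace names are made up), useful mostly as a runbook entry for whoever is on call.

```python
import subprocess


def roll_back(deployment: str, namespace: str = "production") -> None:
    """Roll a Kubernetes deployment back to its previous revision and wait for it."""
    subprocess.run(
        ["kubectl", "rollout", "undo", f"deployment/{deployment}",
         "--namespace", namespace],
        check=True,
    )
    # Block until the rolled-back pods are ready, so you know it worked before logging off.
    subprocess.run(
        ["kubectl", "rollout", "status", f"deployment/{deployment}",
         "--namespace", namespace],
        check=True,
    )


if __name__ == "__main__":
    roll_back("trade-feed")  # hypothetical deployment name
```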
Improving the alarms. We are continuously tuning the alarms to make them more precise, and to weed out false alarms. One example is the queue length alarms. We started out by raising an alarm if the average length during the last 10 minutes was over a threshold. However, after fixing such a problem, the queue length alarm did not clear until the average came back down below the threshold. Not clearing it right away could mask a new problem happening. So we changed it to be raised as soon as the threshold was reached, and to clear as soon as the queue length got down to zero. This made that particular alarm more precise and responsive. Only last Friday I was tweaking another alarm: some synchronization problems can wait until regular office hours, so those no longer wake anybody up.
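The changed behavior amounts to a small state machine: raise immediately when the threshold is reached, and only clear when the queue has drained completely. Here is a minimal sketch; the threshold value and paging hook are placeholders, not our real configuration.

```python
QUEUE_ALARM_THRESHOLD = 1000  # assumed value; the real threshold depends on the queue


def page_on_call(message: str) -> None:
    print(f"ALARM: {message}")  # hypothetical paging hook


class QueueLengthAlarm:
    """Raise as soon as the threshold is reached, clear only when the queue is empty."""

    def __init__(self, threshold: int = QUEUE_ALARM_THRESHOLD) -> None:
        self.threshold = threshold
        self.raised = False

    def update(self, queue_length: int) -> None:
        if not self.raised and queue_length >= self.threshold:
            self.raised = True
            page_on_call(f"Queue length {queue_length} reached threshold {self.threshold}")
        elif self.raised and queue_length == 0:
            self.raised = False
            print("Queue length alarm cleared: queue is empty")


alarm = QueueLengthAlarm()
alarm.update(1200)  # raises the alarm
alarm.update(500)   # still above zero: alarm stays raised
alarm.update(0)     # clears the alarm
```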
Compensation and Scheduling
Pay. People on call should get paid extra for it. There is a significant impact on your life when you have to be ready to log on and troubleshoot issues at any time during a week, so you deserve to be compensated for that. I think the best system is one where you get a fixed amount just for being on call, whether there are incidents or not, and you also get paid every time you are called out. Getting time off in lieu is also a possibility.
Scheduling. When I have been on call, it has always been one week at a time, and I think that is a good duration. It is good to draw up the schedule far in advance (many months ahead); that makes it easier to plan other activities in your life. It is also really helpful to be able to swap weeks with other developers on the team. It is hard to say what the ideal cycle time for being on call is. Too often, and it is hard to live a normal life. Too seldom, and you don’t get enough practice and get rusty. Every seven or eight weeks is a good interval for me.
Escalation. There should always be an escalation path if there is a real crisis. If the whole system is down and you are not making any progress fixing the problem, you need to bring in more help. Typically a manager will be the escalation point, and they will try to get hold of whoever else can help. I also tell my colleagues that I don’t mind being called, even if I am not on call that week, if there is something I can help with. I have only been called a few times in ten years, but just knowing that you can call other team members if necessary helps lower the stress of being on call.
Onboarding. New members on the team need time to get to know the system and practice troubleshooting before they join the on call rotation. After that, we pair up with the new member for a week of on call together, before they are scheduled on their own.