Lessons Learned from Over a Decade of On-Call
In the shadows of every application, customer-facing or internal, there is a team of on-call engineers tirelessly working to ensure that the lights stay on. I have been on call for the last 12 years, from 24/7 shifts to more manageable shifts of a few hours a day.
Every organization has its own strategy for alerting and on-call. The experience of being on call differs from team to team and stack to stack, but the lessons one learns can be applied everywhere.
The What, When, Who, and Why
Setting up effective alerts is the key to a successful on-call strategy.
- The very first step in setting an alert is to understand the ‘What’: what type of event requires attention urgent enough to wake a human up from their sleep. Imagine being woken up by a louder-than-life phone call for a fault in a non-production environment.
- Next is comprehending the ‘When’: the alert should be triggered by thresholds and event conditions that strike a balance between detecting issues early and avoiding false alarms from routine fluctuations.
- Nailing the ‘Who’: identifying the right individuals and the right teams to notify, based on the severity of the alert, is what makes an effective response possible.
- Beyond these technicalities lies the strategic ‘Why’: making sense of why the event that one is being paged for is critical. Understanding why an alert was needed on the event helps with strategically resolving it and preventing a business impact. This eventually results in a system where alerts are purposefully aligned with risk mitigation. For example, a timely alert on slow response times can reveal a problem like a DDoS attack on a slow endpoint, and understanding the importance of quickly mitigating this can prevent platform downtime.
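One way to make the four questions concrete is to record them as fields on each alert definition. The following is only a minimal sketch under stated assumptions: the `AlertRule` type, its field names, the `ROUTES` table, and the notification target names are all hypothetical and do not come from any particular alerting tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlertRule:
    """Hypothetical alert definition; every field name is illustrative."""
    name: str
    environment: str        # What: only production faults should wake someone up
    threshold_ms: float     # When: latency above this counts as a breach
    sustained_samples: int  # When: consecutive breaches required (assumed >= 1)
    severity: str           # Who: decides where the notification goes
    reason: str             # Why: the business impact, shown to the responder


# Hypothetical routing table: severity -> notification target.
ROUTES = {
    "critical": "page-primary-on-call",
    "warning": "team-chat-channel",
}


def should_fire(rule: AlertRule, latencies_ms: list[float]) -> bool:
    """Fire only when the most recent samples ALL breach the threshold,
    so a single routine spike does not trigger the alert."""
    if len(latencies_ms) < rule.sustained_samples:
        return False
    recent = latencies_ms[-rule.sustained_samples:]
    return all(value > rule.threshold_ms for value in recent)


def route(rule: AlertRule) -> str:
    """Non-production faults never page a human; they go to chat instead."""
    if rule.environment != "production":
        return "team-chat-channel"
    return ROUTES.get(rule.severity, "team-chat-channel")


rule = AlertRule(
    name="slow-checkout-endpoint",
    environment="production",
    threshold_ms=500.0,
    sustained_samples=3,
    severity="critical",
    reason="Sustained slow responses can precede platform downtime.",
)
if should_fire(rule, [120.0, 640.0, 710.0, 805.0]):
    print(f"{route(rule)}: {rule.name} ({rule.reason})")
```

Requiring several consecutive breaches is one simple way to trade a little detection speed for fewer false alarms; the right window depends on the service.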
Over-alerting Is Detrimental in More Ways Than One
An on-call week brings with it an innate sense of urgency, alertness, and unrest regardless of whether alerts are triggered or the shift is completely uneventful. The worry of missing pages while taking a shower or forgetting one’s phone at home while walking the dog can be overcome by a very simple addition of a secondary on-call engineer, but overcoming the frustration from getting paged because of a false alert will need prioritization of alert management.
It is important to consider that every time the phone rings, the person on call will, at the very least, be distracted from whatever they were doing in order to acknowledge the alert. This adds to the cognitive load. Worse, if the triggered alerts are often unnecessary, then, as in the story of the shepherd boy who cried wolf too many times, responding to too many false alerts will inadvertently diminish the sense of urgency for critical ones.
Every alert that calls needs to be actionable, be clear in its messaging, and require enough judgment that it cannot be resolved by a robotic set of actions.
Importance of Looking Back
If you are getting called for the same alert regularly, or are getting called for too many alerts (hello, alert fatigue), and have been ignoring most of the triggered alerts, something is wrong with the alerting strategy.
Be it false alerts or too many alerts, improving the alerting mechanism is a continuous process. Regular retrospectives that review how many alerts were triggered and assess whether they were meaningful help with identifying and removing unnecessary alerts, fine-tuning alerts that trigger too late, and grouping alerts to reduce noise during incidents.
The retrospective process can be simple. The number of alerts can be assessed by extracting the data from the alerting tool. If the alerts are tagged and categorized (which they should be), insights related to service and severity can be gathered automatically. The next step is to triage this data and find the top culprits. Once the data is refined, decisions can be made on improving the noisy alerts: removing unnecessary, unactionable alerts, lowering alert severity to send a message instead of calling, or even reassigning alerts to the team that owns the service.
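The triage step can be as small as a few lines of scripting. The sketch below assumes the alert history has already been exported from the alerting tool as a list of records; the record fields (`name`, `service`, `severity`, `actionable`) and the sample data are hypothetical, and a real export will look different.

```python
from collections import Counter

# Hypothetical export: one record per triggered alert.
alerts = [
    {"name": "disk-usage-high", "service": "db", "severity": "critical", "actionable": False},
    {"name": "disk-usage-high", "service": "db", "severity": "critical", "actionable": False},
    {"name": "disk-usage-high", "service": "db", "severity": "critical", "actionable": True},
    {"name": "disk-usage-high", "service": "db", "severity": "critical", "actionable": False},
    {"name": "slow-response", "service": "api", "severity": "critical", "actionable": True},
]


def top_culprits(records, n=3):
    """Return the n most frequently triggered (service, alert name) pairs."""
    counts = Counter((r["service"], r["name"]) for r in records)
    return counts.most_common(n)


def unactionable_ratio(records):
    """For each alert name, the fraction of firings that needed no action."""
    total = Counter(r["name"] for r in records)
    noise = Counter(r["name"] for r in records if not r["actionable"])
    return {name: noise[name] / total[name] for name in total}


for (service, name), count in top_culprits(alerts):
    ratio = unactionable_ratio(alerts)[name]
    print(f"{service}/{name}: fired {count} times, {ratio:.0%} needed no action")
```

An alert that fires often and rarely needs action is a candidate for removal or a lower severity, while one that fires often and always needs action points at a problem worth fixing at the source.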
It is inevitable that as the business grows and the tech stack grows, the number of alerts will grow and it becomes all the more important to regularly ensure that the alerting setup is not noisy and chaotic.
All of Us Are Smarter Than Any of Us
All alerts should point to an issue that either directly impacts the customers and the business, or has the potential to do so. Hence, every alert needs to be responded to with a sense of urgency. The on-call engineer needs to be able to judge when it’s time to call in reinforcements to fight the battle at hand.
If you have stayed with an issue for too long and are stuck, remember that there is a reason we work in teams. Some issues need an extra set of eyes, while some directly indicate that they need to be escalated to a different team.
The on-call or incident response handbooks must include instructions for escalating issues to team members and inviting other concerned teams, in addition to instructions for troubleshooting the issue. Incidents are always handled by a team, but an alert that takes too long to fix is next in line to become an incident. Asking the team for help at this point is not only wise but necessary.
Team Culture
Team culture is probably the most underrated aspect of the on-call experience. Right from getting onboarded to the on-call rotations to handling the routine on-call, open communication and trust in the team are what lead to efficient incident resolution and well-informed decisions.
The trust amongst team members that, when needed, shifts and responsibilities can be traded without friction greatly reduces on-call stress. The team’s trust in the on-call engineer to handle, resolve, and escalate issues as needed helps maintain a balance between operational and velocity work and keeps everyone motivated.
Handling On-call Is Rewarding
Last but not least, one learns that on-call duty is a rewarding responsibility. Put simply, with every resolved alert and incident you essentially save the business money, and the feeling of being able to do that is very rewarding in itself. Each timely intervention into an issue prevents potential downtime, revenue loss, or customer dissatisfaction. Knowing that your actions directly contribute to the financial health and reputation of the company is quite gratifying.
An on-call duty that is well organized and rotated allows everyone on the team to solve critical issues and shine under pressure. There is no denying that on-call responsibility can be challenging, but the positive outcomes that stem from it result in significant professional growth. Being an on-call engineer gives you a deep understanding of the architecture design and the processes in place. It helps to build an aptitude for connecting symptoms with their cause and makes you adept at identifying and addressing issues.
Wrapping Up
In conclusion, a reliable on-call strategy is the backbone of every performant business. The claims a business can make about its stability and reliability are always backed by confidence in the on-call mechanism in place.
Understanding the right way of setting alerts and continually improving the ever-evolving on-call strategy results in efficient and quick resolutions of issues. Regularly conducting retros to revisit the triggered on-call alerts will result in accurate alerts that help build and maintain reliable systems while keeping the stress of the team in check.
A happy team and a good team culture benefit everyone, including the business. Trust is the secret sauce to success: trust amongst the team members, trust in the alerting system, and the business’s trust in its on-call framework. When there is trust, teamwork thrives, communication flows, and operations sail smoothly.
On-call duty may come with its challenges and disruptions, but the satisfaction derived from ensuring operational continuity, in addition to the professional growth that comes with it, makes it quite a rewarding responsibility.