Why do we need a runbook?
An alert is only a signal, the very first step; the ultimate goal is to reduce MTTR and solve the customer's problem.
Handling an actionable alert is a repetitive task, and adding a runbook for that task makes debugging easier and improves accuracy and efficiency in the triage process.
But... what exactly is a runbook?
A runbook is a series of steps and detailed instructions to solve common issues or tasks effectively.
Most of the time, we run into escalated issues or production alerts where we need to find the root cause or a solution as fast as we can. This problem-solving typically involves a quick search in the logging or reporting tools (at TipTip: Grafana, Loki, and Sentry), checking third-party logs (payment gateway, SMS provider, fraud tools), asking a coworker, or even asking another department for help (such as the Data team). These procedures are nontrivial, require experience or self-initiative, and take time. A runbook helps ensure we have an effective problem-solving process, no matter how new or experienced the person on the team is.
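To make the idea concrete, here is a minimal sketch of a runbook expressed as an ordered checklist in Python. It is only an illustration, not how TipTip stores its runbooks: the `Step` and `Runbook` classes, the alert name `payment-error-rate-high`, and the individual steps are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Step:
    """One instruction in a runbook, with the tool to check (if any)."""
    instruction: str
    tool: str = ""  # e.g. "Grafana", "Loki", "Sentry"; empty if no tool is involved


@dataclass
class Runbook:
    """An ordered list of steps for one alert."""
    alert_name: str
    steps: List[Step] = field(default_factory=list)

    def render(self) -> str:
        """Return the steps as a numbered checklist."""
        lines = [f"Runbook: {self.alert_name}"]
        for number, step in enumerate(self.steps, start=1):
            suffix = f" [{step.tool}]" if step.tool else ""
            lines.append(f"{number}. {step.instruction}{suffix}")
        return "\n".join(lines)


# Hypothetical alert and steps, for illustration only.
example = Runbook(
    alert_name="payment-error-rate-high",
    steps=[
        Step("Check the error-rate dashboard for the affected service", "Grafana"),
        Step("Search recent error logs for the failing endpoint", "Loki"),
        Step("Compare with the payment gateway's own log"),
        Step("Escalate to the owning pod/squad if the cause is still unclear"),
    ],
)

print(example.render())
```

The point of the sketch is that the order of the steps and the tool for each one are written down once, so whoever is on call follows the same path instead of rediscovering it.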
OK, I got it. So when should we use a runbook?
At TipTip, runbooks are extremely helpful for two kinds of operations:
- Incident response operations
  - a runbook for a specific alert or incident serves as documentation and ensures that knowledge from the Subject Matter Experts (i.e., engineers from the specific pod/squad) is shared
  - with detailed runbooks, there is less need for escalation, and the team can function with L1 on call
- Engineering operations
  - for example, infra maintenance, or operations that don’t have a feature yet (such as bulk user suspension)
Clear! How do I create a runbook for our team's service alerts?
At TipTip, we have our own Production Incident Runbook Template on Confluence. Just create a new page and choose the template.