The art of on-call duty

Being on-call means working/fixing live issues under high pressure and almost during midnight and/or weekend.

  • Taking strong responsibility for what we work, for what we did, for what we build
  • Uptime is an important metric, also critical to the success of our production.
  • If our product is down, we acutely feel our customer’s pain
  • Being on-call duty means we can learn a lot from it, for both technical knowledge & growing mindset.
  • Must get high paid for each on-call duty shift, because of stressful.

Table of Contents


Important points:

  • Keep production services reliable and available
  • Must have one of many dedicated teams of SREs that always take responsibility for the performance and reliability, also must have a strong background of system/infra/network and software development (because maybe need to inspect each line of code)
  • When you know you might get woken up in the middle of the night if there is a problem, you become more careful building infra, implementing components and testing.
  • Troubleshooting the system forces you to get a good understanding of how it all works
  • Being on-call means we always facing with these internal issues like:
    • troubleshooting tools aren’t as good as they should be
    • false alarms, unreliable alarms
    • logs without useful information
    • have no way to backup/restore

=> Improve our mindset for making alerting/monitoring/logging system better during live issues, driving improvements at every part of entire our system


  1. Build: init monitoring & logging system that help us detect live failures before most customers notice a problem, set up a reliable alerting system (pagerduty) to make sure we can receive alerts immediately
  2. Prepare: rotating SRE team member by shift to make sure we always have at least one guy to take care production anytime, also covered with always ready machine/LTE/phone
  3. Triage: acknowledge (ack) on alerts whenever you can, determine the urgency of the problem
  4. Fix: take whatever actions are necessary in order to resolve the issue immediately and get production back to normal, including implement hotfix or even rollback to the last working version.
  5. Report: after fixing incident, need to file ticket/report for logging which issues, what we did for solving, file a JIRA ticket to keep a track and push more and more details into this, writing a postmortem (incident postmortem is an excellent framework for learning from incidents and turning problems into progress)
  6. Follow-up & Improvement: fixed issues & logged into specific tickets do not mean we’re done with that, not yet. We must take action to follow-up after incident, how to prevent these issues in the future, improve sense of alerting, how can we backup/restore/rollback, how to make alarm more precise, responsive, and sensitive. At every step of follow-up/improvement, must write them down to central documentation.


  • If we only fix the symptoms (what we see on the surface), the problem will almost certainly return, and need fixing over, and over again
  • So the Root cause means:
    • What has caused the incident?
    • What’s the reason the system failed the way it did?
    • What’s initiating cause that led to an outage or degradation in performance
    • If we know and identify the root cause, action can be taken to fix issues permanently

In a technical way, root cause of live issues can be separated into 3 groups

  • Changes in the last deployment: changes in codebase, changes in configurations, but not be tested carefully or not aware of hidden failures
  • Exist at scale: during high traffic, during unexpected traffic, bottle-neck or facing limitations of some important components
  • 3rd party issue: hardware failure, provider/ISP failure, unreliable networking, 3rd API issues,…