Nobody Talks About the 2 AM Call Until You've Had One

The 2 AM call has a distinctive quality that no other professional experience quite replicates. It is not the content of the call that defines it, though the content is usually bad enough. It is the specific cocktail of circumstances surrounding it: the disorientation of being pulled from sleep into full professional responsibility in approximately four seconds, the particular clarity that comes with adrenaline and no coffee, and the knowledge that somewhere in the organization, something critical is broken and the number of people who can fix it is a very small group, and your number was at the top of that list. Welcome to infrastructure ownership. The view from here is excellent, assuming you can see straight.

What the 2 AM call actually tests is not your technical knowledge, though that is obviously load-bearing. What it tests is the architecture of the system you have built during business hours. A well-documented runbook is worth its weight in gold at 2 AM in a way it never quite is at 2 PM, because at 2 PM you have context, colleagues, coffee, and the option to think slowly. At 2 AM you have none of those things, and the runbook either tells you exactly what to do or it tells you that it was last updated in 2021 and references a tool that was decommissioned. The difference between those two outcomes was determined weeks or months before the incident occurred, in the quality of the operational documentation produced during calm periods. Infrastructure resilience is built in the daylight and tested in the dark.

The incident itself is rarely the most interesting part of the 2 AM experience. The most interesting part is what you discover about your environment in the process of resolving it. The dependency that wasn't in the architecture diagram. The single point of failure that everybody knew existed but nobody had documented as a risk. The monitoring alert that should have fired an hour before the call but didn't because the threshold was set too conservatively six months ago and nobody had reviewed it since. Every major incident is a guided tour of your environment's hidden assumptions, and the assumptions revealed at 2 AM are the ones that genuinely needed revealing. They just have very poor timing about it.

The organizational response to the 2 AM call matters as much as the technical response. Teams that treat incident response as a blame exercise get good at covering their tracks. Teams that treat it as a learning exercise get good at finding root causes. The post-incident review that happens the following afternoon should be a genuinely curious examination of the full causal chain, including the process gaps and monitoring gaps and documentation gaps that created the conditions for the incident, not just the immediate technical trigger. The technical trigger is usually the least interesting part. The system that made the trigger possible is where the real story lives and where the real improvements get made.

If you have not yet had a 2 AM call, that is either because your infrastructure is exceptionally well-built or because you have not yet been in the role long enough. The former is worth celebrating. The latter is a matter of time, and the time to prepare for it is not 2 AM but right now, during the quiet periods when good documentation gets written, runbooks get tested, monitoring thresholds get reviewed, and alert escalation paths get verified. The 2 AM call is coming. The only variable is whether it finds you with a working runbook or a sticky note with a phone number that may or may not still be current.

Previous
Previous

Your Documentation Is Lying to You (And So Is Everyone Else's)

Next
Next

Goodbye, Sierra. We Barely Understood You.