
What Happens to Your Infrastructure When Your Only DevOps Engineer Quits

January 20, 2026 | 4 min read | Written by Mohammad Assi

Single points of human failure are the most common infrastructure risk we see in audits. Here is what they look like when they materialise.

We see this pattern on roughly one in three audits. A company hired a DevOps engineer 18 months ago. That engineer became the single person who understood the infrastructure. Then they left. The engineering team now deploys from a checklist written by someone who is no longer at the company, and falls back on calling a phone number that may or may not get answered.

Why it happens

Infrastructure knowledge concentrates naturally. The person who built the system knows things that are never written down: why certain components are configured the way they are, what the 2am runbook actually means, which AWS console setting was changed manually and never made it into Terraform. That knowledge lives in one person's head because documenting it feels less urgent than building the next thing.

What the first week looks like

The first week after the departure is usually manageable. Nobody deploys. The team is focused on the immediate problem of coverage. But production does not pause. Bugs need to be fixed. Features have deadlines. Within ten days, someone deploys manually with incomplete knowledge of what the previous engineer would have checked.

The first incident

The incident that follows is almost always a variation of the same scenario. Someone changes something in production that triggers a failure mode the previous engineer would have known to avoid. The team debugs in an environment they only partially understand, taking three times as long as a knowledgeable engineer would. By the time it is resolved, the cost in engineering time is usually several thousand dollars. The cost in customer trust is harder to measure.

The discovery problem

The deeper problem is not the incident itself. It is what the incident reveals: the team discovers, usually for the first time, how much infrastructure knowledge was undocumented. AWS resources created manually with no infrastructure-as-code (IaC) representation. Deployment scripts that depend on environment variables that exist only on the previous engineer's laptop. Monitoring alerts configured in a system nobody else has credentials for.
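To make the first of those gaps concrete, here is a minimal sketch of how a team might list resources that exist in the account but not in Terraform. It assumes the state file uses the common JSON layout (a top-level `resources` list whose `instances` carry an `attributes.id`), and that the live resource IDs have already been exported by some other means. The function names, the sample state, and the IDs are all hypothetical.

```python
import json


def managed_ids(state_text: str) -> set:
    """Collect the `id` attribute of every managed resource in a state document."""
    state = json.loads(state_text)
    ids = set()
    for resource in state.get("resources", []):
        if resource.get("mode") != "managed":
            continue  # data sources are read-only lookups, not owned resources
        for instance in resource.get("instances", []):
            resource_id = (instance.get("attributes") or {}).get("id")
            if resource_id:
                ids.add(resource_id)
    return ids


def unmanaged(live_ids, state_text: str) -> list:
    """Return live resource IDs that have no counterpart in the state."""
    return sorted(set(live_ids) - managed_ids(state_text))


# Hypothetical sample data: one instance is tracked, two resources are not.
EXAMPLE_STATE = json.dumps({
    "version": 4,
    "resources": [
        {
            "mode": "managed",
            "type": "aws_instance",
            "name": "web",
            "instances": [{"attributes": {"id": "i-0aaa"}}],
        }
    ],
})
LIVE_IDS = ["i-0aaa", "i-0bbb", "sg-0ccc"]

if __name__ == "__main__":
    for resource_id in unmanaged(LIVE_IDS, EXAMPLE_STATE):
        print("not in Terraform:", resource_id)
```

A comparison like this only covers resource types that someone thought to export, so an empty result is a starting point rather than proof that everything is managed.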

The recovery cost

Recovering from this situation is a forensic exercise before it is an engineering one. Someone has to enumerate what exists, reverse-engineer the current state, document it, and then rebuild the operational model around that documentation. For a typical startup infrastructure, this takes three to six weeks of senior engineering time, and that time is not available for product work.
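Part of that enumeration can be scripted. As one illustration, the sketch below scans a deployment script for the environment variables it references and reports those missing from a documented list, which is one way to surface variables that lived only on a departed engineer's laptop. The regular expression, the sample script, and the variable names are assumptions for illustration. The pattern only matches conventional upper-case names, so it will miss lower-case variables and anything built dynamically.

```python
import re

# Matches $NAME and ${NAME} where NAME is a conventional upper-case identifier.
ENV_REF = re.compile(r"\$\{?([A-Z_][A-Z0-9_]*)\}?")


def referenced_vars(script_text: str) -> set:
    """Return the set of environment variable names a script refers to."""
    return set(ENV_REF.findall(script_text))


def undocumented_vars(script_text: str, documented) -> list:
    """Return referenced variables that do not appear in the documented set."""
    return sorted(referenced_vars(script_text) - set(documented))


# Hypothetical deployment script and documentation, for illustration only.
EXAMPLE_SCRIPT = '''#!/bin/sh
aws s3 sync ./build "s3://${DEPLOY_BUCKET}/" --region "$AWS_REGION"
curl -H "Authorization: Bearer $DEPLOY_TOKEN" "$HEALTHCHECK_URL"
'''
DOCUMENTED = {"AWS_REGION", "DEPLOY_BUCKET"}

if __name__ == "__main__":
    for name in undocumented_vars(EXAMPLE_SCRIPT, DOCUMENTED):
        print("referenced but undocumented:", name)
```

Run against every script in the deployment path, the output becomes a checklist of values someone has to track down before the next deploy.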