
Single points of human failure are the most common infrastructure risk we see in audits. Here is what they look like when they materialise.
We see this pattern in roughly one in three audits. A company hired a DevOps engineer 18 months ago. That engineer became the only person who understood the infrastructure. Then they left. The engineering team now deploys from a checklist written by someone who is no longer at the company and, when the checklist falls short, calls a phone number that may or may not get answered.
Infrastructure knowledge concentrates naturally. The person who built the system knows things that are never written down: why certain components are configured the way they are, what the 2am runbook actually means, which AWS console setting was changed manually and never made it into Terraform. That knowledge lives in one person's head because documenting it feels less urgent than building the next thing.
The first week after the departure is usually manageable. Nobody deploys. The team is focused on the immediate problem of coverage. But production does not pause. Bugs need to be fixed. Features have deadlines. Within ten days, someone deploys manually with incomplete knowledge of what the previous engineer would have checked.
The incident that follows is almost always a variation of the same scenario. Someone changes something in production that triggers a failure mode the previous engineer would have known to avoid. The team debugs in an environment they only partially understand, taking three times as long as a knowledgeable engineer would. By the time it is resolved, the cost in engineering time is usually several thousand dollars. The cost in customer trust is harder to measure.
The deeper problem is not the incident itself. It is what the incident reveals: the team discovers, usually for the first time, how much infrastructure knowledge was undocumented. AWS resources created manually with no IaC representation. Deployment scripts that depend on environment variables that exist only on the previous engineer's laptop. Monitoring alerts configured in a system nobody else has credentials for.
Recovering from this situation is a forensic exercise before it is an engineering one. Someone has to enumerate what exists, reverse-engineer the current state, document it, and then rebuild the operational model around that documentation. For a typical startup infrastructure, this takes three to six weeks of senior engineering time, none of which is available for product work.
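Part of the enumeration step can be mechanised. The sketch below is a minimal illustration, not a complete audit tool. It assumes you have the JSON produced by `terraform show -json` and a separately gathered list of live resource IDs (how you export that inventory is outside this sketch). It also assumes each resource exposes its cloud identifier as `id`, which holds for many provider resources but not all. The function names are hypothetical.

```python
def managed_ids(module):
    """Collect the `id` of every resource in a state module, including child modules."""
    ids = set()
    for resource in module.get("resources", []):
        # `values` holds the resource attributes; guard against it being absent or null.
        resource_id = (resource.get("values") or {}).get("id")
        if resource_id:
            ids.add(resource_id)
    for child in module.get("child_modules", []):
        ids |= managed_ids(child)
    return ids


def unmanaged(live_ids, state):
    """Return the live resource IDs that appear nowhere in the Terraform state."""
    root = (state.get("values") or {}).get("root_module", {})
    return sorted(set(live_ids) - managed_ids(root))
```

Anything this returns is a candidate for the "created manually, never made it into Terraform" list. Each entry still needs a human decision: import it, recreate it in code, or delete it.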
Infrastructure needs to be owned by a system, not a person. This means Infrastructure as Code that actually reflects production state, not aspirational Terraform that predates the last six manual changes. It means runbooks written by the person doing the deployment, not the person who built the system three years ago. It means more than one person has deployed to production.
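One way to keep Terraform honest is a scheduled check that fails loudly when code and production disagree. The sketch below assumes Terraform is installed and already initialised in the working directory, and relies on the `-detailed-exitcode` flag of `terraform plan`, which exits 0 for an empty diff, 1 for an error, and 2 for a non-empty diff. The function names and status labels are hypothetical. A non-empty diff means drift only if the checked-out code matches what was last applied.

```python
import subprocess

# Exit codes of `terraform plan -detailed-exitcode`.
PLAN_STATUS = {0: "in_sync", 1: "error", 2: "changes_pending"}


def classify_plan_exit(code):
    """Map a plan exit code to a status label; unknown codes are treated as errors."""
    return PLAN_STATUS.get(code, "error")


def check_drift(workdir):
    """Run a non-interactive plan in `workdir` and return (status, plan output)."""
    result = subprocess.run(
        ["terraform", "plan", "-detailed-exitcode", "-input=false", "-no-color"],
        cwd=workdir,
        capture_output=True,
        text=True,
    )
    return classify_plan_exit(result.returncode), result.stdout
```

Run on a schedule against the main branch, a `changes_pending` result is the signal that someone changed production by hand, and it arrives while the person who made the change is still around to explain it.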
The companies that call us after a departure are not in trouble because they made a bad hire. They are in trouble because the hire became a single point of failure that nobody noticed until it materialised. The fix is structural. The question is whether you address it before or after the departure that forces the conversation.