Cloud & Infrastructure

Infrastructure-as-code from day one: the real cost of retrofitting later

Aetherisys Consulting #infrastructure-as-code#cloud#terraform

You are starting a greenfield project. The deadline is real, the team is small, and the cloud console is right there. Clicking through to create a database, a load balancer and a couple of buckets takes ten minutes. Writing the same thing as infrastructure-as-code (IaC) takes an afternoon, plus the time to learn the state model if nobody has used it before.

So the honest question is not “is IaC good practice” — everyone knows the answer they are supposed to give. The honest question is: does it pay for itself before the project ships, and if not, how much does it cost to add later?

Why click-ops quietly costs you

Click-ops — provisioning by hand in the console — is not free even on day one. The cost is just deferred and invisible.

The first thing you lose is knowledge. Once a resource exists, the reasoning behind it does not. Why is that security group open on port 5432 to a particular CIDR? Who set the RDS backup retention to seven days? With click-ops the answer lives in someone’s memory, and that someone changes jobs.

The second thing you lose is a known-good state. Every manual change is a change nobody can see. Someone bumps an instance size during an incident at 2am and forgets to mention it. Someone toggles a flag to test a theory. Three months in, your staging and production environments have drifted apart in ways nobody can fully enumerate. This is configuration drift, and it is the reason “it works in staging” stops meaning anything.

IaC does not eliminate drift by magic. It makes drift visible and correctable: terraform plan shows you exactly what no longer matches the code, and you decide whether to push the code or absorb the change.

The retrofit is the expensive part

People assume converting to IaC later is roughly the same work as doing it now, just postponed. It is not. It is materially harder, for three reasons.

You are importing, not creating. A fresh terraform apply builds resources from nothing. Retrofitting means writing code to match resources that already exist, then running terraform import for each one so the state file adopts it without recreating it. Every resource has to be reconciled by hand — its real attributes against your HCL — and a single mismatch can mean Terraform proposing to destroy and recreate a live database.

The environment is in use. When you provisioned by hand there was nothing to break. Now there is traffic, data and customers. Getting the first plan to a clean no-op against production is slow and nerve-wracking precisely because the cost of a wrong move is no longer zero.

Nobody owns the gaps. Manual environments accumulate orphans: a test bucket, an unused IP, an IAM role from a spike. During a retrofit you have to decide, one by one, whether each is load-bearing or junk — archaeology, not engineering.

A rough rule from doing this work: retrofitting IaC onto a live, moderately complex environment costs three to five times what it would have cost to write it up front, and most of that multiplier is risk and verification, not typing.

State is the part to get right

If there is one thing to set up properly on day one, it is state. Terraform’s state file is the source of truth that maps your code to real resources. Keep it on a laptop and you will lose it; keep it unlocked and two applies will corrupt it.

The minimum viable setup is a remote backend with locking — for example, an S3 bucket with versioning plus a DynamoDB lock table:

terraform {
  backend "s3" {
    bucket         = "acme-tfstate"
    key            = "prod/terraform.tfstate"
    region         = "ap-southeast-2"
    dynamodb_table = "acme-tfstate-locks"
    encrypt        = true
  }
}

This is fifteen minutes of work at the start and a genuinely painful migration once multiple people are applying changes. Do it first.

Environments and reproducibility

The real return on IaC shows up the second time you need an environment. A new staging stack, a per-customer deployment, a region for data-residency reasons, a clean environment to reproduce a production bug — with code, that is a new variable file and an apply. With click-ops it is a day of clicking and a near-certain guarantee the new environment differs from the old one in some way you will discover at the worst possible moment.

Keep environments as the same code with different inputs, not copy-pasted directories that drift independently. Parameterise instance sizes and counts; keep topology identical. Production should differ from staging in scale, not in shape.

Bake in observability while you are there

Defining infrastructure as code is the natural moment to define its observability too. The dashboards, alarms, log groups and retention policies live in the same modules as the resources they watch. A new service ships with its alerting already wired, because the module that creates the service also creates the alarm. Observability added later, by hand, is always patchy — some services covered, some not, and no way to tell which from the outside.

What “fully yours” handover looks like

If a consultancy builds your platform, the difference between a good handover and a bad one is whether you receive a system or a mystery.

A real handover is a Git repository: every resource defined in code, environments reproducible from a clean checkout, state in a backend you control, secrets in your own secret manager, and a README that explains how to plan and apply. You can rebuild the platform without the people who built it. That is what reproducible, observable and fully yours actually means in practice — not a slide, a repo.

A bad handover is a console full of resources, a credentials document, and a phone number.

When minimal IaC is enough

None of this means every project needs hundreds of lines of HCL and a module library on day one. Fast does not mean careless, and careful does not mean heavy.

For a genuine throwaway prototype with a hard kill date, click-ops is a defensible choice — as long as everyone agrees it is throwaway and acts accordingly. For a small, stable service that will not grow, a single flat Terraform configuration with a remote backend is plenty; you do not need workspaces, modules or a pipeline.

The line worth drawing is simple. The moment a system has real users, real data, or a second environment, the cost of not having IaC starts compounding — and it is far cheaper to cross that line before launch than after.

If you are about to start a greenfield build and want it reproducible from the first commit, get in touch — we are happy to talk through what minimal looks like for your case.