Why do we codify stuff?

Infrastructure as Code (IaC) is widely agreed to be a requirement for modern application infrastructure. There are several tools that help us do this: platform-agnostic options like Terraform, Pulumi and Crossplane have heaps of providers, plugins that let them connect to a myriad of services. Each major cloud vendor also has its own options. Azure has Azure Resource Manager (ARM) templates and Bicep, AWS has CloudFormation, and GCP has Cloud Deployment Manager.

Challenges

The first challenge you encounter when working with IaC is dependencies. Resource A depends on some other resources B and C (e.g. a subnet depends on its parent virtual network). Within a single service this is usually quite trivial. Then you might have dependencies across services, which can still be manageable, but can also become a pain if provisioning resource B or C takes time, or doesn't immediately return the information you need to reference it.
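To make the dependency concrete, here is a rough sketch of the subnet example as Crossplane manifests. I'm assuming the Upbound Azure provider's schema here (field names may differ between provider versions), and the names and address ranges are made up:

```yaml
apiVersion: network.azure.upbound.io/v1beta1
kind: VirtualNetwork
metadata:
  name: example-vnet
spec:
  forProvider:
    resourceGroupName: example-rg
    location: westeurope
    addressSpace:
      - 10.0.0.0/16
---
apiVersion: network.azure.upbound.io/v1beta1
kind: Subnet
metadata:
  name: example-subnet
spec:
  forProvider:
    resourceGroupName: example-rg
    addressPrefixes:
      - 10.0.1.0/24
    # The dependency: the subnet cannot be created until the
    # virtual network it references exists.
    virtualNetworkNameRef:
      name: example-vnet
```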

The biggest challenge with IaC in my mind is when the declarative approach becomes tedious or overly complex. I mentioned this briefly in my first encounter with GitOps:

For example Linkerd generates a root certificate when installed into the cluster, and this needs to be used by Cert-Manager to create an intermediate certificate. The public key of this certificate then needs to be added back to the configuration of Linkerd.

This example is pretty specific to my GitOps setup, but it illustrates a broader issue. These kinds of interdependencies and sequencing problems seem to crop up no matter what tool or approach you use.
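To make the quoted example a little more concrete, the cert-manager half of that loop looks roughly like this (adapted from the pattern Linkerd's docs describe – note how the root certificate has to exist before any of it can be applied):

```yaml
apiVersion: cert-manager.io/v1
kind: Issuer
metadata:
  name: linkerd-trust-anchor
  namespace: linkerd
spec:
  ca:
    # This secret holds the root certificate, generated out of band.
    # Its public key also has to be fed back into Linkerd's own config.
    secretName: linkerd-trust-anchor
---
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: linkerd-identity-issuer
  namespace: linkerd
spec:
  # The intermediate certificate Linkerd uses to issue workload certs.
  secretName: linkerd-identity-issuer
  isCA: true
  commonName: identity.linkerd.cluster.local
  issuerRef:
    name: linkerd-trust-anchor
    kind: Issuer
  duration: 48h
  renewBefore: 25h
  privateKey:
    algorithm: ECDSA
  usages:
    - cert sign
    - crl sign
    - server auth
    - client auth
```

Declarative on paper, but the sequencing – generate the root, apply these manifests, feed the public key back to Linkerd – is anything but.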

This problem has been nagging at me since I started working with IaC, and I’m starting to wonder if I’m missing something fundamental about the whole concept.

Disaster recovery?

The most obvious reason for doing Infrastructure as Code is disaster recovery. Much like scaling™, it has become a favorite argument for making decisions about system architecture. It's an easy argument to make, because it can't really be argued against. This blog post puts it well:

First, it makes them look like "seasoned developers." After all, they foresee things and aspire to think through all the possibilities.

Second, it's because there's no data to argue with. They appeal to some point in the future when something might happen. But what are the chances that it will happen? Nobody knows.

Disaster recovery, much like scaling, is a question of risk management: cost of mitigation versus cost of consequence. Having your infrastructure declaratively defined should, in theory, allow you to reproduce your entire setup in the event of a disaster. But, as I mentioned earlier, this is only half true. There will always be some manual steps or configurations that need to be done along the way. Or more truthfully, I think there always should be manual steps and configuration. I'm not sure it is ever worth the time and effort to automate every single thing about your setup. It's alluring to envision a situation where one click or command can automatically set up your whole infrastructure, but realistically - how often would you really need it?

What is the real point?

In my mind, the better reason (and I’d argue the real reason) to use IaC is control and auditability. You want to be sure your infrastructure matches your desired state, and that every change is tracked and documented in version control. This isn’t just about avoiding drift; it’s about being able to see exactly who made what change and when.

To do this, you need frequent reconciliation – enforcement of your desired state. Particularly with cloud configurations, the web portal is ever-so-inviting for ClickOps (simply clicking around to make changes). If that is allowed to go on for too long, wiping out those changes could be catastrophic. Reconciliation should ideally happen continuously, but it is limited by the resources required to calculate the actual state of the infrastructure. For most IaC tools, such reconciliation must be hand-crafted using GitHub Actions or similar. Crossplane, though, is an exception – it's built around this concept, which is why I like it so much.
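To illustrate what hand-crafting it looks like: for a Terraform setup, a minimal drift check could be a scheduled GitHub Actions workflow along these lines (a sketch – backend configuration and cloud credentials are omitted):

```yaml
name: drift-detection
on:
  schedule:
    - cron: "0 6 * * *" # once a day
jobs:
  detect-drift:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
      - run: terraform init -input=false
      - name: Check for drift
        # -detailed-exitcode makes terraform exit with code 2 when the
        # actual state differs from the declared state, failing the job.
        run: terraform plan -detailed-exitcode -input=false
```

This only surfaces the drift; deciding whether to wipe the manual changes or codify them is still up to you.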

Improving the developer experience through self-service

A big bonus of IaC is its use in automating and delegating the provisioning of infrastructure. Self-service tools can be built around automation pipelines and scripts to give teams friction-free setup of new databases, storage and so on, without needing to involve a centralized platform team. Crossplane has a very neat, built-in way of doing this through Kubernetes resource manifests, but it can also be done with platform portals like Backstage combined with e.g. GitHub Actions.
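As a sketch of what that can look like with Crossplane: the platform team defines the abstraction, and a team provisions a database by committing nothing more than a claim like this (the API group, kind and parameters here are hypothetical – in practice they are defined by the platform team's XRD and Composition):

```yaml
apiVersion: platform.example.org/v1alpha1
kind: PostgreSQLInstance
metadata:
  name: orders-db
  namespace: team-orders
spec:
  parameters:
    storageGB: 20
  # Connection details land in a secret the team can mount.
  writeConnectionSecretToRef:
    name: orders-db-conn
```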

This kind of team enablement is very powerful. It gives teams the agency and autonomy to experiment more freely, and encourages a culture of innovation where they can quickly test new ideas without waiting for approvals or assistance. This is essential for achieving fast flow.

[Image: Trying to achieve fast flow. Photo by Clay Banks / Unsplash]

Automatic reconciliation versus manual (re)implementation

Again, some things just aren’t worth automating, especially core infrastructure that doesn’t change frequently. Automating every little detail can become overkill, leading to unnecessary complexity without a corresponding benefit. In those cases, it’s okay to rely on manual steps. The chances of needing to rebuild core infrastructure from scratch are low, and the risk of configuration drift is often minimal if the components are stable. Instead of sinking time and effort into automating these parts, documenting the setup and the necessary steps for reimplementation can be more than sufficient. This kind of manual documentation still serves the core purpose of IaC – ensuring that the process is repeatable and controlled – without the overhead of maintaining automation for parts of the system that are unlikely to change.

Maybe that’s what I’m suggesting here. Maybe the definition of IaC should extend beyond just code to include well-maintained manual documentation – Infrastructure as Checklists.