
Part 2: DevOps on AWS: Infrastructure as Code at Scale


In Part 1, we established an uncomfortable truth:

DevOps on AWS is not about pipelines.
It’s about designing systems that change safely under pressure.

Nowhere is this more visible, or more misunderstood, than in Infrastructure as Code (IaC).

Most teams treat IaC as “writing templates instead of clicking the console.”

That’s not IaC at scale.
That’s just scripted manual work.

At scale, Infrastructure as Code is about governance, blast radius, recovery, and trust.


Infrastructure Is Ephemeral - or It Becomes a Liability

AWS fundamentally changed how production infrastructure should be treated:

  • Servers are replaced, not repaired

  • Drift is not “normal” - it’s a defect

  • Manual changes are operational debt

  • Recovery must be faster than diagnosis

In mature AWS environments:

If infrastructure cannot be recreated from code, it is not production-ready.

This mindset shift is critical for both real systems and the DevOps Professional exam.


IaC Is a Control Plane, Not a Provisioning Tool

At scale, IaC answers questions like:

  • Who is allowed to change what?

  • How do we know what changed?

  • How fast can we undo a bad change?

  • Can we rebuild everything right now?

IaC becomes the control plane for production change, not just a deployment mechanism.

This is why “just Terraform” or “just CloudFormation” thinking fails in enterprise systems.
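Most of those questions come down to one operation: diffing declared state against actual state. Below is a minimal, tool-agnostic sketch in plain Python of the kind of plan step that CloudFormation change sets and terraform plan each perform in their own way. The plan function and the resource names are hypothetical and are not any tool's API. Planning against an empty account is the "can we rebuild everything right now?" case.

```python
from typing import Any

# Hypothetical model: logical resource name -> property dict.
State = dict[str, dict[str, Any]]

def plan(desired: State, actual: State) -> dict[str, list[str]]:
    """Compute which resources must be created, updated, or deleted
    so that `actual` converges on `desired`."""
    creates = sorted(name for name in desired if name not in actual)
    deletes = sorted(name for name in actual if name not in desired)
    updates = sorted(
        name for name in desired
        if name in actual and desired[name] != actual[name]
    )
    return {"create": creates, "update": updates, "delete": deletes}

desired = {
    "AppBucket": {"type": "bucket", "versioning": True},
    "AppQueue": {"type": "queue", "retention_days": 4},
}

# Full rebuild: plan against an empty account.
print(plan(desired, {}))
# {'create': ['AppBucket', 'AppQueue'], 'update': [], 'delete': []}

# Routine change control: only the difference is applied.
print(plan(desired, {"AppBucket": {"type": "bucket", "versioning": False},
                     "OldTopic": {"type": "topic"}}))
# {'create': ['AppQueue'], 'update': ['AppBucket'], 'delete': ['OldTopic']}
```

A full rebuild and a routine change go through the same code path. That shared path is what makes the rebuild something you can trust rather than a separate, rarely tested procedure.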


CloudFormation vs Terraform vs CDK (Reality, Not Religion)

CloudFormation

Strengths:

  • Native AWS integration

  • Deep service coverage

  • Predictable behavior under failure (failed stack operations roll back automatically)

  • First-class drift detection

Trade-offs:

  • Verbose

  • Slower iteration

  • Less expressive logic

Best used when:

  • AWS-only environments

  • Strong governance and audit requirements

  • Regulated or risk-averse systems


Terraform

Strengths:

  • Multi-cloud support

  • Strong module ecosystem

  • Declarative state management

Trade-offs:

  • State becomes a critical dependency

  • Provider bugs can cause real outages

  • Drift detection is weaker than CloudFormation's

Best used when:

  • Multi-cloud or hybrid environments

  • Platform teams managing shared infrastructure

  • Strong state discipline exists
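Much of "state discipline" comes down to never letting two applies write the same state at once. Terraform's S3 backend has traditionally relied on a DynamoDB table with conditional writes for this. The sketch below replaces that with an in-memory stand-in. LockTable, run_with_state_lock, and the state key are hypothetical names used only to show the locking pattern.

```python
import threading
import uuid
from typing import Callable, TypeVar

T = TypeVar("T")

class LockTable:
    """Hypothetical in-memory stand-in for a lock store with
    conditional ("write only if absent") semantics."""

    def __init__(self) -> None:
        self._items: dict[str, str] = {}
        self._mutex = threading.Lock()

    def put_if_absent(self, key: str, owner: str) -> bool:
        # Atomic check-and-set: fails if someone already holds the lock.
        with self._mutex:
            if key in self._items:
                return False
            self._items[key] = owner
            return True

    def delete_if_owner(self, key: str, owner: str) -> bool:
        # Only the holder may release, so a stale run cannot unlock another.
        with self._mutex:
            if self._items.get(key) != owner:
                return False
            del self._items[key]
            return True

class StateLockedError(RuntimeError):
    pass

def run_with_state_lock(table: LockTable, state_key: str, action: Callable[[], T]) -> T:
    owner = str(uuid.uuid4())
    if not table.put_if_absent(state_key, owner):
        raise StateLockedError(f"state {state_key!r} is locked by another run")
    try:
        return action()
    finally:
        table.delete_if_owner(state_key, owner)

table = LockTable()
key = "prod/network.tfstate"  # hypothetical state key

def inner() -> str:
    # A second apply against the same state while the first holds the lock.
    try:
        return run_with_state_lock(table, key, lambda: "applied")
    except StateLockedError as err:
        return f"blocked: {err}"

print(run_with_state_lock(table, key, inner))
# blocked: state 'prod/network.tfstate' is locked by another run
```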


AWS CDK

Strengths:

  • Real programming languages

  • Reusable constructs

  • Faster iteration for complex systems

Trade-offs:

  • Abstraction leaks

  • Generated templates can become opaque

  • Requires strong engineering discipline

Best used when:

  • Platform engineering teams

  • Reusable internal frameworks

  • Teams comfortable debugging generated IaC
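Real CDK constructs are classes from the CDK libraries. The plain-Python stand-in below has no CDK dependency, and its function names are hypothetical. It illustrates both sides of the trade-off: a one-line call expands into a template nobody wrote by hand. That is the reuse benefit, and also the "opaque generated template" risk. The S3 property names follow the CloudFormation AWS::S3::Bucket resource.

```python
import json

def secure_bucket(logical_id: str, *, versioned: bool = True) -> dict:
    """Hypothetical 'construct': one call expands into a CloudFormation
    bucket resource with opinionated, secure defaults baked in."""
    props = {
        "BucketEncryption": {
            "ServerSideEncryptionConfiguration": [
                {"ServerSideEncryptionByDefault": {"SSEAlgorithm": "aws:kms"}}
            ]
        },
        "PublicAccessBlockConfiguration": {
            "BlockPublicAcls": True,
            "BlockPublicPolicy": True,
            "IgnorePublicAcls": True,
            "RestrictPublicBuckets": True,
        },
    }
    if versioned:
        props["VersioningConfiguration"] = {"Status": "Enabled"}
    return {logical_id: {"Type": "AWS::S3::Bucket", "Properties": props}}

def synth(*fragments: dict) -> str:
    """Merge resource fragments into one template (a toy 'synth' step)."""
    merged: dict = {}
    for fragment in fragments:
        overlap = merged.keys() & fragment.keys()
        if overlap:
            raise ValueError(f"duplicate logical IDs: {sorted(overlap)}")
        merged.update(fragment)
    template = {"AWSTemplateFormatVersion": "2010-09-09", "Resources": merged}
    return json.dumps(template, indent=2, sort_keys=True)

print(synth(secure_bucket("LogsBucket"),
            secure_bucket("ArtifactsBucket", versioned=False)))
```

Two short calls produce dozens of lines of JSON. Debugging a failed deployment means reading that output, not the two calls.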

Key insight:
The exam doesn’t ask which tool is best.
It asks which trade-off fits the constraint.


Drift Is the Silent Production Killer

Drift happens when:

  • Engineers “hot-fix” via the console

  • Emergency changes bypass IaC

  • Permissions allow uncontrolled modification

Drift leads to:

  • Failed rollbacks

  • Inconsistent environments

  • Disaster recovery surprises

Production-grade systems enforce:

  • Drift detection

  • Drift remediation

  • Restricted write access outside IaC pipelines

In AWS terms:

  • CloudFormation drift detection

  • IAM permission boundary enforcement

  • Change pipelines as the only mutation path

If you can’t explain your infrastructure state, you don’t control it.
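In a real account you would use CloudFormation's own drift detection (for example, the DetectStackDrift API and the per-resource drift results it produces) rather than build your own. The stdlib sketch below shows only the comparison at its core: last-deployed properties against live ones. The status labels loosely borrow CloudFormation's naming (IN_SYNC, MODIFIED, DELETED; ADD, REMOVE, NOT_EQUAL). The resources and values are hypothetical.

```python
from typing import Any

def detect_drift(expected: dict[str, dict[str, Any]],
                 live: dict[str, dict[str, Any]]) -> dict[str, dict]:
    """Compare each resource's last-deployed properties with its live ones."""
    report: dict[str, dict] = {}
    for name, props in expected.items():
        if name not in live:
            report[name] = {"status": "DELETED", "diffs": []}
            continue
        actual = live[name]
        diffs = []
        for key in sorted(props.keys() | actual.keys()):
            if key not in actual:
                diffs.append((key, "REMOVE"))     # declared but gone from live
            elif key not in props:
                diffs.append((key, "ADD"))        # present live but never declared
            elif props[key] != actual[key]:
                diffs.append((key, "NOT_EQUAL"))  # value changed out of band
        report[name] = {"status": "MODIFIED" if diffs else "IN_SYNC", "diffs": diffs}
    return report

expected = {
    "WebSecurityGroup": {"IngressPorts": [443]},
    "AppBucket": {"Versioning": "Enabled"},
    "AuditQueue": {"RetentionSeconds": 1209600},
}
live = {
    # Someone "hot-fixed" SSH access through the console.
    "WebSecurityGroup": {"IngressPorts": [443, 22]},
    "AppBucket": {"Versioning": "Enabled"},
    # AuditQueue was deleted by hand.
}

for name, result in detect_drift(expected, live).items():
    print(name, result["status"], result["diffs"])
# WebSecurityGroup MODIFIED [('IngressPorts', 'NOT_EQUAL')]
# AppBucket IN_SYNC []
# AuditQueue DELETED []
```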


Multi-Account IaC: Scaling Without Chaos

At scale, AWS DevOps is multi-account by default:

  • Security account

  • Shared services

  • Dev / Test / Prod

  • Workload isolation

IaC must support:

  • Cross-account deployments

  • Environment-specific configuration

  • Centralized governance with local autonomy
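Centralized governance is often expressed as a service control policy (SCP) attached at the organization or OU level. One common use is making the pipeline the only mutation path. Below is a hypothetical guardrail, written as a Python dict in SCP JSON shape, plus a deliberately simplified evaluator to show its effect. The role name iac-pipeline-deploy is an assumption, and 111122223333 is a placeholder account ID. Real IAM evaluation has many more rules (action wildcards, per-component ARN matching, allow/deny interplay), so treat this as an illustration only.

```python
from fnmatch import fnmatchcase

# Hypothetical SCP-shaped guardrail: deny stack mutations from any
# principal other than the (assumed) pipeline deployment role.
GUARDRAIL = {
    "Version": "2012-10-17",
    "Statement": [{
        "Sid": "DenyStackChangesOutsidePipeline",
        "Effect": "Deny",
        "Action": [
            "cloudformation:CreateStack",
            "cloudformation:UpdateStack",
            "cloudformation:DeleteStack",
        ],
        "Resource": "*",
        "Condition": {
            "ArnNotLike": {"aws:PrincipalArn": "arn:aws:iam::*:role/iac-pipeline-deploy"}
        },
    }],
}

def is_denied(policy: dict, action: str, principal_arn: str) -> bool:
    """Simplified: handles only exact Action lists and an ArnNotLike
    condition on aws:PrincipalArn. fnmatchcase approximates IAM's '*'/'?'
    wildcards but, unlike IAM, lets '*' span ARN components."""
    for stmt in policy["Statement"]:
        if stmt["Effect"] != "Deny" or action not in stmt["Action"]:
            continue
        pattern = stmt["Condition"]["ArnNotLike"]["aws:PrincipalArn"]
        if not fnmatchcase(principal_arn, pattern):
            return True
    return False

pipeline = "arn:aws:iam::111122223333:role/iac-pipeline-deploy"
human = "arn:aws:iam::111122223333:role/AdminConsole"
print(is_denied(GUARDRAIL, "cloudformation:UpdateStack", pipeline))  # False
print(is_denied(GUARDRAIL, "cloudformation:UpdateStack", human))     # True
```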

Common patterns:

  • One repo per environment (simple, limited)

  • One repo per workload (scales better)

  • Central platform repo + workload repos (enterprise standard)

The goal:

Teams move fast - without breaking shared foundations.
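One way to get environment-specific configuration without forking code per environment is to keep a single typed configuration table and validate it before any deployment runs. The minimal sketch below assumes this approach. The account IDs, region, instance types, and field names are all hypothetical placeholders.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class EnvConfig:
    account_id: str
    region: str
    instance_type: str
    min_capacity: int
    requires_approval: bool

# Hypothetical values; the account IDs are placeholders.
ENVIRONMENTS = {
    "dev":  EnvConfig("111111111111", "us-east-1", "t3.small", 1, False),
    "test": EnvConfig("222222222222", "us-east-1", "t3.medium", 2, False),
    "prod": EnvConfig("333333333333", "us-east-1", "m5.large", 3, True),
}

def load_config(env: str) -> EnvConfig:
    """Same IaC code for every environment; only this config differs.
    Validation fails fast before anything is deployed."""
    try:
        cfg = ENVIRONMENTS[env]
    except KeyError:
        raise ValueError(f"unknown environment {env!r}") from None
    if len(cfg.account_id) != 12 or not cfg.account_id.isdigit():
        raise ValueError(f"{env}: account ID must be 12 digits")
    if env == "prod" and not cfg.requires_approval:
        raise ValueError("prod deployments must require approval")
    return cfg

print(load_config("prod").instance_type)  # m5.large
```

The central platform team can own the validation rules, such as "prod always requires approval". Workload teams own the values, which is one concrete form of centralized governance with local autonomy.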


Safe Change Is More Important Than Fast Change

IaC failures are not rare - they are inevitable.

Production systems design for:

  • Partial failures

  • Rollback on error

  • No-impact retries

Key principles:

  • Idempotency over cleverness

  • Small, incremental changes

  • Immutable deployments

  • Rollback plans defined before rollout

From both an exam and a real-world perspective:

Rollback speed matters more than rollout speed.
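These principles compose. An apply that skips resources already in the desired state is safe to retry, and a snapshot taken before rollout is the simplest possible rollback plan. Below is a minimal in-memory sketch, loosely analogous to CloudFormation returning a failed stack update to its previous state. apply_with_rollback is a hypothetical name, and fail_on is a test hook that simulates a failure.

```python
import copy
from typing import Any, Optional

class ApplyError(RuntimeError):
    pass

def apply_with_rollback(state: dict[str, Any], changes: dict[str, Any],
                        fail_on: Optional[str] = None) -> dict[str, Any]:
    """Apply changes resource by resource. On failure, restore the snapshot
    taken before rollout. Resources already in the desired state are
    skipped, so re-running the same change is a no-op."""
    snapshot = copy.deepcopy(state)  # rollback plan exists before rollout
    try:
        for name, props in changes.items():
            if state.get(name) == props:
                continue  # idempotent: nothing to do
            if name == fail_on:
                raise ApplyError(f"update of {name} failed")  # simulated failure
            state[name] = props
    except ApplyError:
        state.clear()
        state.update(snapshot)  # partial change undone
        raise
    return state

live = {"Api": {"version": 1}, "Worker": {"version": 1}}
change = {"Api": {"version": 2}, "Worker": {"version": 2}}

try:
    apply_with_rollback(live, change, fail_on="Worker")
except ApplyError as err:
    print("rolled back:", err, live)
# rolled back: update of Worker failed {'Api': {'version': 1}, 'Worker': {'version': 1}}

apply_with_rollback(live, change)
apply_with_rollback(live, change)  # retry after success changes nothing
print(live)
# {'Api': {'version': 2}, 'Worker': {'version': 2}}
```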


IaC and CI/CD Are Coupled - But Not the Same

IaC pipelines must:

  • Validate templates (linting, synth, plan)

  • Preview impact before execution

  • Require approval for high-risk changes

  • Automatically roll back on failure

This is why:

  • “Apply on merge” is dangerous at scale

  • Manual approvals still exist in mature systems

  • Change control ≠ lack of DevOps maturity

DevOps maturity is about controlled velocity, not blind automation.
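"Require approval for high-risk changes" can be made mechanical by classifying the previewed changes. The field names below loosely follow a CloudFormation change set's ResourceChange, where Action is Add, Modify, or Remove and Replacement is the string True, False, or Conditional. The choice of which resource types count as stateful, and the gating rules, are hypothetical policy for illustration.

```python
from dataclasses import dataclass

@dataclass
class ResourceChange:
    logical_id: str
    resource_type: str
    action: str                # "Add" | "Modify" | "Remove"
    replacement: str = "False"  # "True" | "False" | "Conditional"

# Hypothetical policy: replacing these types risks data loss.
STATEFUL_TYPES = {"AWS::RDS::DBInstance", "AWS::DynamoDB::Table", "AWS::S3::Bucket"}

def needs_approval(changes: list[ResourceChange]) -> list[str]:
    """Return reasons a human must approve; an empty list means auto-apply."""
    reasons = []
    for c in changes:
        if c.action == "Remove":
            reasons.append(f"{c.logical_id}: resource will be deleted")
        elif (c.action == "Modify" and c.replacement in ("True", "Conditional")
              and c.resource_type in STATEFUL_TYPES):
            reasons.append(f"{c.logical_id}: stateful resource may be replaced")
    return reasons

change_set = [
    ResourceChange("AppFunction", "AWS::Lambda::Function", "Modify"),
    ResourceChange("OrdersDb", "AWS::RDS::DBInstance", "Modify", "Conditional"),
]
reasons = needs_approval(change_set)
print("manual approval required" if reasons else "auto-apply", reasons)
# manual approval required ['OrdersDb: stateful resource may be replaced']
```

Low-risk changes still flow automatically. Only the changes that could destroy data stop for a human, which is controlled velocity rather than blanket gating.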


How the DevOps Professional Exam Tests IaC

The exam doesn’t ask:

“What is CloudFormation?”

It asks:

  • How do you prevent drift?

  • How do you roll back safely?

  • How do you scale changes across accounts?

  • How do you reduce blast radius?

Every IaC question is really a risk-management question.


What’s Next (Part 3)

In Part 3, we’ll dive into:

Deployment Strategies Under Failure

  • Blue/Green vs Canary vs Rolling

  • Progressive delivery on AWS

  • Feature flags vs redeployments

  • Reducing blast radius in production

We’ll connect:
Deployment patterns → real outages → exam scenarios


Final Thought

Infrastructure as Code is not about declaring resources.

It is about declaring intent, control, and recovery.

When systems change faster than humans can reason about them,
code becomes the only source of truth.

That’s not just DevOps.

That’s survival at scale.

AWS DevOps Professional – Designing, Operating & Scaling Production Systems

Part 2 of 3

This series focuses on how AWS DevOps works in real production environments - not just what services exist, but:

  • Why they’re used

  • When they fail

  • How senior engineers design for scale, resilience, and speed

This is exam-aligned, but production-first.
