Part 2: DevOps on AWS: Infrastructure as Code at Scale

In Part 1, we established an uncomfortable truth:
DevOps on AWS is not about pipelines.
It’s about designing systems that change safely under pressure.
Nowhere is this more visible, or more misunderstood, than in Infrastructure as Code (IaC).
Most teams treat IaC as “writing templates instead of clicking the console.”
That’s not IaC at scale.
That’s just scripted manual work.
At scale, Infrastructure as Code is about governance, blast radius, recovery, and trust.
Infrastructure Is Ephemeral - or It Becomes a Liability
AWS fundamentally changed how production infrastructure should be treated:
Servers are replaced, not repaired
Drift is not “normal” - it’s a defect
Manual changes are operational debt
Recovery must be faster than diagnosis
In mature AWS environments:
If infrastructure cannot be recreated from code, it is not production-ready.
This mindset shift is critical, both for real systems and for the DevOps Professional exam.
IaC Is a Control Plane, Not a Provisioning Tool
At scale, IaC answers questions like:
Who is allowed to change what?
How do we know what changed?
How fast can we undo a bad change?
Can we rebuild everything right now?
IaC becomes the control plane for production change, not just a deployment mechanism.
This is why “just Terraform” or “just CloudFormation” thinking fails in enterprise systems.
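One way to make "Can we rebuild everything right now?" concrete is to compare what is actually running against what the code declares. The sketch below is a minimal illustration, not a real inventory tool: the function names, the `InventoryReport` type, and the resource IDs are all hypothetical, and it assumes you can already produce both lists of identifiers from your own tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InventoryReport:
    managed: frozenset    # live resources that the IaC code declares
    unmanaged: frozenset  # live resources that no template declares
    missing: frozenset    # declared in code but not found live


def compare_inventory(live_ids, declared_ids):
    """Split resource identifiers into managed, unmanaged, and missing."""
    live, declared = frozenset(live_ids), frozenset(declared_ids)
    return InventoryReport(
        managed=live & declared,
        unmanaged=live - declared,
        missing=declared - live,
    )


def is_rebuildable(report):
    """True only when everything running is also described in code."""
    return not report.unmanaged


# Hypothetical identifiers, for illustration only.
report = compare_inventory(
    live_ids=["vpc-1", "sg-2", "i-3"],
    declared_ids=["vpc-1", "sg-2", "db-4"],
)
print(sorted(report.unmanaged), is_rebuildable(report))
```

Anything that lands in `unmanaged` is infrastructure you could not recreate from code today.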
CloudFormation vs Terraform vs CDK (Reality, Not Religion)
CloudFormation
Strengths:
Native AWS integration
Deep service coverage
Predictable behavior under failure
First-class drift detection
Trade-offs:
Verbose
Slower iteration
Less expressive logic
Best used when:
AWS-only environments
Strong governance and audit requirements
Regulated or risk-averse systems
Terraform
Strengths:
Multi-cloud support
Strong module ecosystem
Declarative state management
Trade-offs:
State becomes a critical dependency
Provider bugs can cause real outages
Drift detection is weaker than CloudFormation's
Best used when:
Multi-cloud or hybrid environments
Platform teams managing shared infrastructure
Strong state discipline exists
AWS CDK
Strengths:
Real programming languages
Reusable constructs
Faster iteration for complex systems
Trade-offs:
Abstraction leaks
Generated templates can become opaque
Requires strong engineering discipline
Best used when:
Platform engineering teams
Reusable internal frameworks
Teams comfortable debugging generated IaC
Key insight:
The exam doesn’t ask which tool is best.
It asks which trade-off fits the constraint.
Drift Is the Silent Production Killer
Drift happens when:
Engineers “hot-fix” via console
Emergency changes bypass IaC
Permissions allow uncontrolled modification
Drift leads to:
Failed rollbacks
Inconsistent environments
Disaster recovery surprises
Production-grade systems enforce:
Drift detection
Drift remediation
Restricted write access outside IaC pipelines
In AWS terms:
CloudFormation drift detection
IAM permissions boundary enforcement
Change pipelines as the only mutation path
If you can’t explain your infrastructure state, you don’t control it.
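As a rough sketch of what scheduled drift detection could look like, the function below drives the CloudFormation drift APIs through a client object passed in by the caller. It assumes `cfn` is a boto3 CloudFormation client (for example `boto3.client("cloudformation")`). The function name, polling defaults, and return shape are illustrative choices, and for brevity it reads only the first page of results.

```python
import time


def find_drifted_resources(cfn, stack_name, poll_seconds=5, max_polls=60):
    """Run drift detection on one stack and list its drifted resources.

    `cfn` is assumed to be a boto3 CloudFormation client.
    """
    detection_id = cfn.detect_stack_drift(StackName=stack_name)[
        "StackDriftDetectionId"
    ]

    # Detection is asynchronous, so poll until it leaves IN_PROGRESS.
    for _ in range(max_polls):
        status = cfn.describe_stack_drift_detection_status(
            StackDriftDetectionId=detection_id
        )
        if status["DetectionStatus"] != "DETECTION_IN_PROGRESS":
            break
        time.sleep(poll_seconds)
    else:
        raise TimeoutError(f"drift detection for {stack_name} did not finish")

    if status["DetectionStatus"] == "DETECTION_FAILED":
        raise RuntimeError(
            status.get("DetectionStatusReason", "drift detection failed")
        )

    if status.get("StackDriftStatus") != "DRIFTED":
        return []

    # Only the first page is read here; a fuller version would follow NextToken.
    response = cfn.describe_stack_resource_drifts(
        StackName=stack_name,
        StackResourceDriftStatusFilters=["MODIFIED", "DELETED"],
    )
    return [
        (d["LogicalResourceId"], d["ResourceType"], d["StackResourceDriftStatus"])
        for d in response["StackResourceDrifts"]
    ]
```

A non-empty result is a signal to open a remediation change through the pipeline, not to patch the resource by hand.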
Multi-Account IaC: Scaling Without Chaos
At scale, AWS DevOps is multi-account by default:
Security account
Shared services
Dev / Test / Prod
Workload isolation
IaC must support:
Cross-account deployments
Environment-specific configuration
Centralized governance with local autonomy
Common patterns:
One repo per environment (simple, limited)
One repo per workload (scales better)
Central platform repo + workload repos (enterprise standard)
The goal:
Teams move fast - without breaking shared foundations.
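"Centralized governance with local autonomy" can be sketched as a layered configuration: a platform baseline that every account inherits, a set of keys the platform team locks, and workload overrides for everything else. The example below is a simplified illustration under those assumptions. The key names, the baseline values, and `resolve_config` itself are hypothetical and not taken from any AWS service.

```python
# Hypothetical platform baseline owned by the central platform repo.
PLATFORM_BASELINE = {
    "encryption_at_rest": True,
    "log_retention_days": 365,
    "instance_type": "t3.small",
}

# Keys that workload repos may not change (an assumed governance rule).
LOCKED_KEYS = {"encryption_at_rest", "log_retention_days"}


def resolve_config(baseline, locked_keys, overrides):
    """Merge workload overrides onto the baseline, rejecting locked keys."""
    violations = sorted(
        key
        for key in overrides
        if key in locked_keys and overrides[key] != baseline.get(key)
    )
    if violations:
        raise ValueError(f"overrides touch locked keys: {violations}")
    return {**baseline, **overrides}


# A workload team sizes its own instances but inherits the guardrails.
prod = resolve_config(PLATFORM_BASELINE, LOCKED_KEYS, {"instance_type": "m5.large"})
print(prod)
```

The design choice here is to fail loudly at resolution time, so a disallowed override never reaches a deployment.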
Safe Change Is More Important Than Fast Change
IaC failures are not rare - they are inevitable.
Production systems design for:
Partial failures
Rollback on error
No-impact retries
Key principles:
Idempotency over cleverness
Small, incremental changes
Immutable deployments
Rollback plans defined before rollout
From both an exam and a real-world perspective:
Rollback speed matters more than rollout speed.
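Idempotency and rollback can both be read as properties of one small operation: compute the difference between current and desired state, then apply only that difference. The toy model below treats infrastructure state as a flat dictionary, which is a deliberate simplification. `plan_changes` and `apply_changes` are hypothetical helpers, and a value of `None` is assumed to mean "setting absent".

```python
def plan_changes(current, desired):
    """Return only the keys whose value must change to reach `desired`."""
    changes = {}
    for key in current.keys() | desired.keys():
        before, after = current.get(key), desired.get(key)
        if before != after:
            changes[key] = (before, after)
    return changes


def apply_changes(current, changes):
    """Return a new state dict; the input state is left untouched."""
    new_state = dict(current)
    for key, (_, after) in changes.items():
        if after is None:
            new_state.pop(key, None)  # None is treated as "remove this setting"
        else:
            new_state[key] = after
    return new_state


live = {"min_size": 2, "max_size": 4}
target = {"min_size": 3, "max_size": 4, "health_check": "ELB"}

deployed = apply_changes(live, plan_changes(live, target))
retry = plan_changes(deployed, target)    # a second run finds nothing to do
rollback = plan_changes(deployed, live)   # rollback is a plan back to known state
restored = apply_changes(deployed, rollback)
print(retry, restored)
```

Because the retry plan is empty, re-running after a partial failure is safe, and because rollback is just another plan, it can be written down before the rollout starts.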
IaC and CI/CD Are Coupled - But Not the Same
IaC pipelines must:
Validate templates (linting, synth, plan)
Preview impact before execution
Require approval for high-risk changes
Automatically roll back on failure
This is why:
“Apply on merge” is dangerous at scale
Manual approvals still exist in mature systems
Change control ≠ lack of DevOps maturity
DevOps maturity is about controlled velocity, not blind automation.
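A pipeline gate of this kind might look like the sketch below, which reads the JSON form of a Terraform plan (as produced by `terraform show -json` on a saved plan file) and flags any change that deletes or replaces a resource. The policy itself, "destructive actions need a human", is an assumption for illustration, and `classify_plan` is a hypothetical helper rather than part of any tool.

```python
import json


def classify_plan(plan_json):
    """Sort planned resource changes into destructive and routine."""
    plan = json.loads(plan_json)
    destructive, routine = [], []
    for resource in plan.get("resource_changes", []):
        actions = set(resource["change"]["actions"])
        if actions <= {"no-op", "read"}:
            continue  # nothing will be modified
        # A replacement shows up as both "delete" and "create".
        if "delete" in actions:
            destructive.append(resource["address"])
        else:
            routine.append(resource["address"])
    return {
        "needs_approval": bool(destructive),
        "destructive": destructive,
        "routine": routine,
    }


# Hand-written sample in the plan JSON shape; addresses are made up.
sample = json.dumps({
    "resource_changes": [
        {"address": "aws_s3_bucket.logs", "change": {"actions": ["update"]}},
        {"address": "aws_db_instance.main", "change": {"actions": ["delete", "create"]}},
        {"address": "aws_vpc.core", "change": {"actions": ["no-op"]}},
    ]
})
print(classify_plan(sample))
```

Routine changes can flow through automatically, while anything in `destructive` pauses the pipeline for review, which is one way to keep velocity controlled rather than blind.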
How the DevOps Professional Exam Tests IaC
The exam doesn’t ask:
“What is CloudFormation?”
It asks:
How do you prevent drift?
How do you roll back safely?
How do you scale changes across accounts?
How do you reduce blast radius?
Every IaC question is really a risk-management question.
What’s Next (Part 3)
In Part 3, we’ll dive into:
Deployment Strategies Under Failure
Blue/Green vs Canary vs Rolling
Progressive delivery on AWS
Feature flags vs redeployments
Reducing blast radius in production
We’ll connect:
Deployment patterns → real outages → exam scenarios
Final Thought
Infrastructure as Code is not about declaring resources.
It is about declaring intent, control, and recovery.
When systems change faster than humans can reason about them,
code becomes the only source of truth.
That’s not just DevOps.
That’s survival at scale.



