Terraform Complete Guide 2026: State, Modules, CI/CD, Multi-Account & Interview Q&A
- What is Terraform and Why Is It the Standard?
- Core Concepts: State, Plan, Apply, Destroy
- Terraform State: The Most Critical Concept
- Variables, Outputs, and Locals
- Modules: Reusable Infrastructure Components
- Workspaces vs Separate State Files
- Terraform in CI/CD: The GitOps Workflow
- Production Best Practices
- Common Terraform Errors and Fixes
- 12 Terraform Interview Questions with Expert Answers
Terraform is the industry-standard Infrastructure as Code tool for DevOps and platform engineering. I use Terraform daily in production, provisioning EKS clusters, VPCs, RDS instances, IAM roles, and CloudFront distributions across multiple AWS accounts. This guide covers Terraform from core concepts to advanced patterns used in real production environments, with the depth needed to pass senior DevOps interviews.
What is Terraform and Why Is It the Standard?
Terraform by HashiCorp lets you define cloud infrastructure as declarative HCL (HashiCorp Configuration Language) code. You describe the desired end state of your infrastructure, and Terraform figures out how to create, update, or delete resources to reach that state. This is fundamentally different from scripting (Bash, Python boto3), where you write imperative steps: Terraform handles the dependency resolution, ordering, parallelisation, and state tracking for you.
Terraform won the IaC landscape because of three things: provider ecosystem (2,000+ providers covering every cloud, SaaS tool, database, and monitoring platform), the plan/apply workflow (you see exactly what will change before it changes, so there are no surprises), and the module system (reusable, versioned infrastructure components that standardise patterns across teams). AWS CloudFormation covers AWS only. Pulumi is a code-based alternative but has a smaller community. Terraform has the largest ecosystem, most job postings, and most production deployments.
Core Concepts: State, Plan, Apply, Destroy
Terraform's workflow has four phases that you must understand deeply for interviews:
- terraform init: Initialises the working directory. Downloads provider plugins (the AWS, GCP, or Kubernetes provider binaries), initialises the backend (where state is stored), and installs any modules referenced. Must be run after any change to provider versions or backend configuration. Safe to run repeatedly; it is idempotent.
- terraform plan: Compares the desired state (your HCL code) against the current state (the state file) and the real infrastructure (by calling cloud provider APIs). Outputs a diff showing exactly which resources will be created (+), updated (~), or destroyed (-). Critically: a plan that shows no changes is your confirmation that real infrastructure matches your code. Always review the plan before applying.
- terraform apply: Executes the plan. Calls provider APIs to create, update, or delete resources. Updates the state file after each resource operation. Can be run with -auto-approve in CI/CD to skip the interactive prompt, but only after the plan has been reviewed and approved in a prior pipeline stage.
- terraform destroy: Creates a plan to destroy all resources in the state file, then applies it. Irreversible. In production, protect against accidental destroy with lifecycle { prevent_destroy = true } on critical resources (databases, S3 buckets) and by restricting who can run destroy in your CI/CD pipeline.
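To make the four commands concrete, here is a minimal sketch of a configuration they could operate on. The region, bucket name, and version constraints are placeholder assumptions, not values from a real environment.

```hcl
terraform {
  required_version = ">= 1.5.0" # assumed minimum; adjust to your toolchain

  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0" # terraform init downloads a matching 5.x provider
    }
  }
}

provider "aws" {
  region = "eu-west-1" # assumed region
}

# Hypothetical bucket; S3 bucket names must be globally unique.
resource "aws_s3_bucket" "artifacts" {
  bucket = "example-org-artifacts"

  lifecycle {
    # Any plan that would delete this bucket, including terraform destroy,
    # fails with an error instead of proceeding.
    prevent_destroy = true
  }
}
```

With this in place, terraform init installs the provider, terraform plan shows one resource to add, and terraform apply creates it and records it in state.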
Terraform State: The Most Critical Concept
Terraform state is the source of truth for what Terraform has created. It maps your HCL resource definitions to real infrastructure resource IDs. Without state, Terraform cannot know what already exists and would try to create everything from scratch on every apply, causing duplicate resources and errors.
Never store Terraform state locally in production. Local state is not shared between team members, is lost if the developer's machine is lost, and gives no locking across the team (nothing prevents two people from applying at the same time). Always use remote state with state locking.
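A remote backend with locking might look like the sketch below. It assumes an S3 bucket and a DynamoDB table that already exist; both names are hypothetical. Recent Terraform releases (1.10 onward, as far as I am aware) also support S3-native locking through use_lockfile = true, which can replace the DynamoDB table.

```hcl
terraform {
  backend "s3" {
    bucket  = "example-org-terraform-state" # hypothetical, pre-existing bucket
    key     = "production/network/terraform.tfstate"
    region  = "eu-west-1" # assumed region
    encrypt = true        # server-side encryption of the state object

    # Hypothetical DynamoDB table with a string hash key named "LockID".
    # Terraform writes a lock item here for the duration of plan/apply.
    dynamodb_table = "terraform-locks"
  }
}
```

After adding or changing a backend block, terraform init has to be run again so the backend is initialised.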
Useful state commands: terraform state list (see all tracked resources), terraform state show resource (inspect a resource's state), terraform state mv (rename a resource in state without destroying it), and terraform state rm (remove a resource from state without destroying the real resource). The separate terraform import command brings an existing resource into state. These operations require careful execution; always back up state before manipulating it.
Variables, Outputs, and Locals
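As a minimal sketch of how the three fit together: variables are the inputs to a configuration, locals are named expressions computed inside it, and outputs are the values it exposes. All names below are illustrative assumptions.

```hcl
# Input: supplied by the caller via -var, a .tfvars file, or TF_VAR_environment.
variable "environment" {
  description = "Deployment environment name."
  type        = string

  validation {
    condition     = contains(["dev", "staging", "production"], var.environment)
    error_message = "environment must be dev, staging, or production."
  }
}

# Locals: computed once, reused wherever needed inside this module.
locals {
  name_prefix = "example-${var.environment}" # hypothetical naming scheme

  common_tags = {
    Environment = var.environment
    ManagedBy   = "terraform"
  }
}

# Output: printed after apply and readable by a parent module.
output "name_prefix" {
  description = "Prefix applied to resource names."
  value       = local.name_prefix
}
```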
Modules: Reusable Infrastructure Components
Modules are Terraform's code reuse mechanism. A module is a directory of .tf files with a defined interface (input variables and output values). The same module can be called multiple times with different inputs to create similar infrastructure for different environments or teams. Well-designed modules reduce duplication, enforce standards, and make large Terraform codebases manageable.
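The sketch below shows a small module and two calls to it. The listing combines two files for brevity (marked in the comments); the module path, bucket names, and tags are hypothetical.

```hcl
# ---- File: modules/s3_bucket/main.tf (hypothetical module) ----

# Interface: inputs
variable "name" {
  type = string
}

variable "tags" {
  type    = map(string)
  default = {}
}

resource "aws_s3_bucket" "this" {
  bucket = var.name
  tags   = var.tags
}

# Interface: outputs
output "bucket_arn" {
  value = aws_s3_bucket.this.arn
}

# ---- File: main.tf in the root module ----

# The same module called twice with different inputs.
module "logs_bucket" {
  source = "./modules/s3_bucket"
  name   = "example-org-logs"
  tags   = { Purpose = "logs" }
}

module "assets_bucket" {
  source = "./modules/s3_bucket"
  name   = "example-org-assets"
  tags   = { Purpose = "assets" }
}

# Module outputs are read as module.<call name>.<output name>.
output "logs_bucket_arn" {
  value = module.logs_bucket.bucket_arn
}
```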
Workspaces vs Separate State Files
Terraform workspaces allow multiple state files within a single backend, identified by workspace name. The default workspace is called "default". You can create a "staging" and "production" workspace within the same backend bucket. Workspaces are most useful for short-lived feature environments (PR environments) where the infrastructure is identical but needs isolation.
For long-lived environments (dev, staging, production) most experienced teams use separate state file paths instead of workspaces. Separate paths give clearer isolation and a smaller blast radius, and they avoid the common mistake of applying production changes to staging (or vice versa) because the wrong workspace was selected. Use workspaces for ephemeral environments; use separate backends or state key paths for permanent environments.
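A rough illustration of the two approaches, with hypothetical names throughout. For ephemeral environments, the configuration can read the selected workspace name through terraform.workspace:

```hcl
# Ephemeral PR environments: one configuration, one workspace per pull request.
# terraform.workspace resolves to the selected workspace name, for example
# "pr-123" after `terraform workspace new pr-123`.
locals {
  env_name = terraform.workspace
}

resource "aws_s3_bucket" "preview" {
  bucket = "example-preview-${local.env_name}" # hypothetical naming scheme
}

# Long-lived environments: each root directory carries its own backend key
# instead, so there is no workspace to select by mistake. For example:
#   envs/staging/backend.tf     -> key = "staging/terraform.tfstate"
#   envs/production/backend.tf  -> key = "production/terraform.tfstate"
```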
Terraform in CI/CD: The GitOps Workflow
Running Terraform manually is only acceptable for initial bootstrapping. Production infrastructure changes must go through a CI/CD pipeline with plan review and approval gates. I run this pattern with GitHub Actions.
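One part of such a pipeline can itself be written in Terraform: the IAM role the runner assumes. What follows is not a workflow file, only a minimal sketch of that role. It assumes GitHub Actions as the CI system, an OIDC provider for GitHub already registered in the AWS account, and placeholder names (example-org/infrastructure, terraform-ci).

```hcl
# Assumption: the GitHub OIDC provider already exists in this account.
data "aws_iam_openid_connect_provider" "github" {
  url = "https://token.actions.githubusercontent.com"
}

# Trust policy: only workflows from one repository may assume the role.
data "aws_iam_policy_document" "github_trust" {
  statement {
    actions = ["sts:AssumeRoleWithWebIdentity"]

    principals {
      type        = "Federated"
      identifiers = [data.aws_iam_openid_connect_provider.github.arn]
    }

    condition {
      test     = "StringEquals"
      variable = "token.actions.githubusercontent.com:aud"
      values   = ["sts.amazonaws.com"]
    }

    # Placeholder repository; tighten the pattern to a branch or environment
    # if only some workflows should be allowed to apply.
    condition {
      test     = "StringLike"
      variable = "token.actions.githubusercontent.com:sub"
      values   = ["repo:example-org/infrastructure:*"]
    }
  }
}

resource "aws_iam_role" "terraform_ci" {
  name               = "terraform-ci" # hypothetical role name
  assume_role_policy = data.aws_iam_policy_document.github_trust.json
}
```

The role still needs permission policies attached for whatever the pipeline manages; those are omitted here.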
Production Best Practices
- Use lifecycle prevent_destroy for critical resources. Add lifecycle { prevent_destroy = true } to RDS instances, S3 state buckets, and production EKS clusters. This causes Terraform to error instead of destroying these resources, even with terraform destroy.
- Tag everything with a common_tags local. Define a locals block with standard tags (Environment, Project, ManagedBy, Owner, CostCenter) and apply them to every resource. AWS Cost Explorer, resource group searches, and security audits all depend on consistent tagging.
- Use data sources to reference existing resources. If a resource was not created by Terraform, use a data source to look it up, for example a data "aws_vpc" "existing" block with a filter on tag:Name matching prod-vpc. Never hardcode ARNs or IDs; they change between accounts and environments.
- Pin provider versions. Use version constraints like ~> 5.0 (allow 5.x but not 6.x) for providers. A major provider version bump can contain breaking changes that force resources to be destroyed and recreated.
- Separate state per service. Don't put all infrastructure in one state file. Separate by: networking (VPC, subnets), compute (EKS, node groups), data (RDS, ElastiCache), and application (Helm releases). This limits blast radius and plan runtime.
- Use sensitive = true for secret variables. Variables marked sensitive are redacted in plan and apply output, reducing the risk of secrets appearing in CI/CD logs. They are still written to the state file in plain text, so the state itself must be protected.
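Several of these practices can be sketched together. The tag values, parameter path, and security group name below are placeholders; prod-vpc is the VPC name used in the data source bullet above.

```hcl
variable "db_password" {
  type      = string
  sensitive = true # redacted in plan/apply output, but still stored in state
}

locals {
  common_tags = {
    Environment = "production"
    Project     = "example-project" # placeholder
    ManagedBy   = "terraform"
    Owner       = "platform-team"   # placeholder
    CostCenter  = "0000"            # placeholder
  }
}

# Look up a VPC that this configuration does not manage, by its Name tag.
data "aws_vpc" "existing" {
  filter {
    name   = "tag:Name"
    values = ["prod-vpc"]
  }
}

# The VPC ID comes from the data source rather than a hardcoded string.
resource "aws_security_group" "app" {
  name   = "example-app"
  vpc_id = data.aws_vpc.existing.id
  tags   = local.common_tags
}

# The secret is passed through a sensitive variable, never written in code.
resource "aws_ssm_parameter" "db_password" {
  name  = "/example/db_password"
  type  = "SecureString"
  value = var.db_password
  tags  = local.common_tags
}
```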
Common Terraform Errors and How to Fix Them
- Error acquiring state lock. Another apply is running (or a previous one crashed without releasing the lock). Run terraform force-unlock LOCK_ID with the lock ID shown in the error. Only do this if you are certain no other apply is actually running.
- Error: Resource already exists. The resource exists in AWS but not in Terraform state. Import it: terraform import aws_s3_bucket.my_bucket my-existing-bucket-name. After import, run plan to verify no unintended changes.
- Plan shows unexpected destroy of critical resource. Usually caused by renaming a resource block (Terraform sees it as destroy old + create new), moving a resource into or out of a module, or changing a force-new argument. Use terraform state mv to rename resources in state without destroying them. Use moved blocks (Terraform 1.1+) for module refactoring.
- Provider authentication errors in CI/CD. Never use long-lived IAM access keys in CI/CD. Use OIDC federation: GitHub Actions, GitLab CI, and CircleCI all support assuming an IAM role via OIDC without stored credentials. This generates temporary credentials per pipeline run.
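Two of these fixes also have configuration-driven forms that go through the normal plan and review cycle. The sketch below assumes Terraform 1.5 or later for the import block; the renamed bucket's name is a placeholder.

```hcl
# "Resource already exists": adopt the bucket declaratively. The next plan
# shows the import, and apply records the bucket in state.
import {
  to = aws_s3_bucket.my_bucket
  id = "my-existing-bucket-name"
}

resource "aws_s3_bucket" "my_bucket" {
  bucket = "my-existing-bucket-name"
}

# "Unexpected destroy" after renaming a resource block: record the rename so
# Terraform updates the state address instead of planning destroy + create.
moved {
  from = aws_s3_bucket.old_name
  to   = aws_s3_bucket.new_name
}

resource "aws_s3_bucket" "new_name" {
  bucket = "example-renamed-bucket" # placeholder
}
```

In both cases the plan output is the check: the import should show no unexpected attribute changes, and the move should show no destroy.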
12 Terraform Interview Questions with Expert Answers
terraform taint resource.name (deprecated since Terraform 0.15.2) marked a resource as "tainted" in state, causing it to be destroyed and recreated on the next apply. terraform apply -replace="resource.name" is the modern replacement: it generates a plan showing the destroy/create for that resource and applies it in one step. Use -replace when a resource is in a broken state that can't be fixed by updating its configuration: a corrupted EC2 instance, a Kubernetes node in NotReady state that won't recover, or a resource whose cloud-side state has diverged from what Terraform expects. Always prefer targeted replacement over manual resource deletion, which leaves a state entry pointing at a resource that no longer exists.

data "aws_secretsmanager_secret_version". The secret value never touches your Terraform code or variables. (3) Use Vault dynamic secrets: Terraform's Vault provider generates short-lived credentials on demand (e.g., temporary database credentials) that expire after use. For the state file itself: always encrypt with KMS, restrict S3 bucket access to only the CI/CD role and senior engineers, and consider using HCP Terraform (formerly Terraform Cloud), which encrypts state at rest.

count creates N instances indexed by integer (0, 1, 2...). for_each creates instances keyed by a string or map value. The critical difference is what happens when you remove an item from the middle: with count, removing the item at index 1 from a list of 3 shifts the item at index 2 down to index 1. Terraform then sees index 1 as changed and index 2 as removed, so it modifies or recreates a resource you never meant to touch. With for_each using stable string keys (like a map or set of strings), removing one key only destroys that specific resource. Use count only when you need simple integer-indexed identical resources (like N EC2 instances with no identity).
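The difference in resource addressing can be sketched as follows. The user names and name prefixes are illustrative, and the two resources are shown side by side only for comparison.

```hcl
variable "user_names" {
  type    = list(string)
  default = ["alice", "bob", "carol"]
}

# count: addresses are aws_iam_user.by_index[0], [1], [2].
# Removing "bob" from the list moves "carol" from index 2 to index 1, so
# Terraform plans a change to [1] and a destroy of [2].
resource "aws_iam_user" "by_index" {
  count = length(var.user_names)
  name  = "idx-${var.user_names[count.index]}"
}

# for_each: addresses are aws_iam_user.by_name["alice"], ["bob"], ["carol"].
# Removing "bob" only plans a destroy of aws_iam_user.by_name["bob"].
resource "aws_iam_user" "by_name" {
  for_each = toset(var.user_names)
  name     = "key-${each.key}"
}
```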
Use for_each for resources with distinct identities (IAM users by name, S3 buckets by purpose, security group rules by protocol). In practice, prefer for_each in almost all cases.

terraform plan -refresh-only shows differences between state and real infrastructure without proposing any configuration changes. In CI/CD, run a scheduled drift detection job daily: terraform plan -detailed-exitcode returns exit code 2 if there are changes (including drift), 0 if there are none, and 1 on error; alert on exit code 2. When drift is detected: if the manual change was intentional, codify it in Terraform and apply. If it was unintentional, run apply to revert it. Prevent drift with SCPs in AWS Organizations that block console write actions on production resources (only allow the Terraform CI/CD role to make changes).

A moved block with from = aws_s3_bucket.old_name and to = aws_s3_bucket.new_name. It is also essential when refactoring flat resources into modules or extracting modules into child modules. After applying the moved block, the entry in state is updated and you can remove the moved block in the next commit.

A provider "aws" block with alias = "production" and an assume_role block whose role_arn is "arn:aws:iam::PROD:role/terraform". Use AWS Organizations + SCPs managed by Terraform to enforce guardrails across all accounts. Use Terragrunt (a thin wrapper around Terraform) to DRY up multi-account, multi-region configurations: it handles backend configuration generation, dependency management between stacks, and parallel apply across multiple modules.

terraform validate checks the configuration for syntax errors and internal consistency: references to undefined variables, incorrect argument types, invalid resource configurations. It does not check against the cloud provider API, so it needs no credentials, but it does need an initialised working directory. In CI/CD, run terraform init -backend=false first, then terraform validate, before plan.
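For the multi-account answer above, an aliased provider that assumes a role in another account might be written like this. The account ID, region, and bucket name are placeholders, and the role must already exist and trust the calling identity.

```hcl
# Default provider: the account the pipeline's own credentials belong to.
provider "aws" {
  region = "eu-west-1" # assumed region
}

# Aliased provider: the same credentials, but assuming a role in production.
provider "aws" {
  alias  = "production"
  region = "eu-west-1"

  assume_role {
    role_arn = "arn:aws:iam::111111111111:role/terraform" # placeholder account ID
  }
}

# Resources opt in to the aliased provider explicitly.
resource "aws_s3_bucket" "prod_logs" {
  provider = aws.production
  bucket   = "example-org-prod-logs" # placeholder
}
```

Resources without a provider argument keep using the default provider, so the target account is always visible in the code.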
terraform fmt automatically formats .tf files to the canonical HCL style: consistent indentation (2 spaces) and alignment of = signs in blocks. Run terraform fmt -check -recursive in CI to fail the pipeline if any file is not formatted; this enforces a consistent code style across the team. Both are fast, free, and should be gates in every Terraform CI/CD pipeline before plan is allowed to run.

(terraform test) that lets you write .tftest.hcl files testing specific module outputs with mock providers (no real cloud resources required). (3) Integration tests: Terratest (Go library by Gruntwork) provisions real infrastructure in a test account, runs assertions, then destroys everything. Slower (minutes to hours) but catches real provider behaviour. (4) Policy-as-code: Sentinel (Terraform Cloud) or OPA/Conftest enforces organisational policies: "all S3 buckets must have versioning enabled", "no security groups can allow 0.0.0.0/0 on port 22". Run as a pre-apply gate in your pipeline.
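A .tftest.hcl file of the kind mentioned above could look like the sketch below. It assumes the module under test declares a variable named bucket_name and a resource aws_s3_bucket.this (both hypothetical), and that the Terraform version in use supports mock_provider (1.7 or later, as far as I am aware).

```hcl
# File: tests/bucket.tftest.hcl (hypothetical path)

# Replace the real AWS provider with a mock: no credentials, no real resources.
mock_provider "aws" {}

# Inputs passed to the module under test.
variables {
  bucket_name = "example-test-bucket"
}

run "bucket_uses_input_name" {
  # Plan only; nothing is applied.
  command = plan

  assert {
    condition     = aws_s3_bucket.this.bucket == var.bucket_name
    error_message = "Bucket name should match var.bucket_name."
  }
}
```

Running terraform test from the module directory picks up files under tests/ and reports each run block as passed or failed.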