// AWS Guide · DevOps & Cloud

AWS DevOps Guide 2026: EKS, IAM, VPC, IRSA & Real Interview Questions

Updated May 2026 · 24 min read · AWS · EKS · IAM · VPC · Cloud · DevOps
Dhanush R, Senior DevOps Engineer
4.5+ years of daily hands-on experience with AWS in production: EKS, IAM, VPC, RDS, S3, CloudWatch, and cost optimisation across enterprise environments. Every section here comes from real production incidents and architecture decisions.
// Table of Contents
  1. Core AWS Services Every DevOps Engineer Must Know
  2. VPC Architecture: Production Design
  3. IAM Deep Dive: Roles, Policies, and IRSA
  4. EKS in Production: AWS-Specific Components
  5. S3 Advanced: Policies, Versioning, and Lifecycle
  6. AWS Cost Optimisation Strategies
  7. CloudWatch Monitoring and Alerting
  8. Essential AWS CLI Commands
  9. 12 AWS Interview Questions with Expert Answers

AWS is the dominant cloud platform in DevOps, with the broadest service catalogue and the largest ecosystem of tools. I work with AWS daily in production: provisioning EKS clusters with Terraform, configuring IRSA for pod-level IAM authentication, designing multi-AZ VPC architectures, troubleshooting IAM permission errors at 2am, and optimising cloud costs across dozens of services. This guide covers the AWS services, patterns, and concepts that appear most in DevOps and SRE interviews in 2026.

Core AWS Services Every DevOps Engineer Must Know

AWS has over 200 services. In DevOps roles, you need deep knowledge of approximately 12 to 15 services. Here are the ones you will use daily and be interviewed on most frequently:

VPC Architecture: Production Design

Every production AWS workload should run in a properly designed VPC. A poorly designed VPC is a security vulnerability. Here is the standard multi-AZ architecture I implement for production EKS environments:

# 3-AZ production VPC architecture
Region: ap-south-1 (Mumbai)
VPC: 10.0.0.0/16

Public Subnets (one per AZ, for ALBs and NAT Gateways only):
  ap-south-1a: 10.0.0.0/24
  ap-south-1b: 10.0.1.0/24
  ap-south-1c: 10.0.2.0/24
  → Route: 0.0.0.0/0 → Internet Gateway

Private Subnets (EKS nodes, application services):
  ap-south-1a: 10.0.10.0/23 (512 addresses, 507 usable after the 5 that AWS reserves per subnet)
  ap-south-1b: 10.0.12.0/23
  ap-south-1c: 10.0.14.0/23
  → Route: 0.0.0.0/0 → NAT Gateway (in public subnet)

Database Subnets (RDS, ElastiCache, no internet access):
  ap-south-1a: 10.0.20.0/24
  ap-south-1b: 10.0.21.0/24
  ap-south-1c: 10.0.22.0/24
  → Route: local only (no NAT, no IGW)

Key design decisions: EKS worker nodes go in private subnets and should never have direct public IPs. The ALB sits in public subnets and routes to nodes in private subnets. RDS goes in database subnets with a security group that only allows inbound traffic on the database port (5432 for PostgreSQL) from the EKS node security group. The NAT Gateway provides outbound internet access for private subnet resources (pulling Docker images, calling external APIs) without inbound exposure.
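A subnet plan like this is easy to get subtly wrong (an overlapping CIDR, a block that falls outside the VPC), so it can be worth checking before it reaches Terraform. The following is a minimal sketch using only Python's standard `ipaddress` module. The CIDRs are copied from the plan above; the tier names and the function names `validate_plan` and `usable_ips` are illustrative, not part of any AWS tooling.

```python
import ipaddress

# CIDRs copied from the subnet plan above; tier names are illustrative labels.
VPC_CIDR = ipaddress.ip_network("10.0.0.0/16")
SUBNET_PLAN = {
    "public": ["10.0.0.0/24", "10.0.1.0/24", "10.0.2.0/24"],
    "private": ["10.0.10.0/23", "10.0.12.0/23", "10.0.14.0/23"],
    "database": ["10.0.20.0/24", "10.0.21.0/24", "10.0.22.0/24"],
}
# AWS reserves the first four addresses and the last address of every subnet.
AWS_RESERVED_PER_SUBNET = 5


def validate_plan(vpc, plan):
    """Return a list of problems; an empty list means the plan is consistent."""
    problems = []
    subnets = [
        (tier, ipaddress.ip_network(cidr))
        for tier, cidrs in plan.items()
        for cidr in cidrs
    ]
    # Every subnet must sit inside the VPC range.
    for tier, net in subnets:
        if not net.subnet_of(vpc):
            problems.append(f"{tier} subnet {net} is outside {vpc}")
    # No two subnets may overlap.
    for i, (tier_a, a) in enumerate(subnets):
        for tier_b, b in subnets[i + 1:]:
            if a.overlaps(b):
                problems.append(f"{a} ({tier_a}) overlaps {b} ({tier_b})")
    return problems


def usable_ips(cidr):
    """Addresses left for workloads after the AWS-reserved ones are removed."""
    return ipaddress.ip_network(cidr).num_addresses - AWS_RESERVED_PER_SUBNET


if __name__ == "__main__":
    print(validate_plan(VPC_CIDR, SUBNET_PLAN) or "plan is consistent")
    print("usable IPs per private subnet:", usable_ips("10.0.10.0/23"))
```

The usable-address count matters most for the private subnets: with the default VPC CNI each pod consumes a VPC IP address, so the /23 blocks are sized for pod density rather than node count.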

IAM Deep Dive: Roles, Policies, and IRSA

IAM is the most complex service in AWS and the source of the most production issues I have investigated. Understanding it deeply is essential for both security and operational efficiency.

IAM Policy Evaluation Logic

When an API call is made, AWS evaluates all applicable policies in this order: (1) Explicit Deny: if any policy explicitly denies the action, the request is denied immediately, regardless of any Allow. (2) Explicit Allow: if a policy explicitly allows the action, the request proceeds. (3) Implicit Deny: if there is no explicit Allow, the request is denied by default. This is why IAM is described as deny-by-default: you must explicitly grant every permission.
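The three-step order above can be sketched in a few lines of Python. This is a deliberately simplified model, not how AWS implements evaluation: it ignores principals, conditions, permission boundaries, session policies, and SCPs, and it approximates IAM wildcards with `fnmatch` patterns. The function names and sample statements are illustrative.

```python
from fnmatch import fnmatchcase


def _as_list(value):
    """IAM allows a single string or a list for Action and Resource."""
    return [value] if isinstance(value, str) else list(value)


def statement_matches(statement, action, resource):
    """True if the statement's Action and Resource patterns cover the request."""
    # IAM action names are case-insensitive; resource ARNs are compared as-is.
    action_hit = any(
        fnmatchcase(action.lower(), pattern.lower())
        for pattern in _as_list(statement["Action"])
    )
    resource_hit = any(
        fnmatchcase(resource, pattern)
        for pattern in _as_list(statement["Resource"])
    )
    return action_hit and resource_hit


def evaluate(statements, action, resource):
    """Explicit Deny wins, then explicit Allow, otherwise implicit Deny."""
    applicable = [s for s in statements if statement_matches(s, action, resource)]
    if any(s["Effect"] == "Deny" for s in applicable):
        return "ExplicitDeny"
    if any(s["Effect"] == "Allow" for s in applicable):
        return "Allow"
    return "ImplicitDeny"


# Sample statements (hypothetical): a broad Allow plus a targeted Deny.
STATEMENTS = [
    {"Effect": "Allow", "Action": "s3:*", "Resource": "*"},
    {
        "Effect": "Deny",
        "Action": "s3:DeleteObject",
        "Resource": "arn:aws:s3:::my-prod-bucket/*",
    },
]

if __name__ == "__main__":
    obj = "arn:aws:s3:::my-prod-bucket/report.csv"
    for act in ("s3:GetObject", "s3:DeleteObject", "ec2:RunInstances"):
        print(act, "->", evaluate(STATEMENTS, act, obj))
```

The point of the sketch is the ordering: the broad `s3:*` Allow never rescues a request that a Deny statement also matches.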

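# Minimum IAM policy for EKS worker nodes (the ecr:* actions let nodes pull images from ECR)
# In practice the AWS managed policies AmazonEKSWorkerNodePolicy and
# AmazonEC2ContainerRegistryReadOnly cover these permissions.
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "ec2:DescribeInstances",
        "ec2:DescribeRegions",
        "ecr:GetAuthorizationToken",
        "ecr:BatchCheckLayerAvailability",
        "ecr:BatchGetImage",
        "ecr:GetDownloadUrlForLayer"
      ],
      "Resource": "*"
    }
  ]
}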

IRSA: IAM Roles for Service Accounts

IRSA is the secure way to give individual EKS Pods access to AWS services without sharing credentials. Without IRSA, your only options are: embedding IAM access keys in the Pod spec (a major security risk), or using the EC2 instance IAM role (which gives all Pods on that node the same permissions, violating least privilege). IRSA works via OIDC federation: the EKS cluster exposes an OIDC endpoint, and IAM trusts tokens signed by that endpoint. Pods using a ServiceAccount annotated with an IAM role ARN receive a projected JWT token that AWS STS validates, then issues temporary credentials with automatic rotation.

# ServiceAccount annotation for IRSA
apiVersion: v1
kind: ServiceAccount
metadata:
  name: s3-reader-sa
  namespace: production
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::123456789012:role/s3-reader-role

# Corresponding IAM trust policy on the IAM role
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Federated": "arn:aws:iam::123456789012:oidc-provider/oidc.eks.ap-south-1.amazonaws.com/id/EXAMPLED539D4633E53DE1B716D3041E"
      },
      "Action": "sts:AssumeRoleWithWebIdentity",
      "Condition": {
        "StringEquals": {
          "oidc.eks.ap-south-1.amazonaws.com/id/EXAMPLED539D4633E53DE1B716D3041E:sub": "system:serviceaccount:production:s3-reader-sa",
          "oidc.eks.ap-south-1.amazonaws.com/id/EXAMPLED539D4633E53DE1B716D3041E:aud": "sts.amazonaws.com"
        }
      }
    }
  ]
}
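The most common IRSA mistake I see is a trust policy whose condition keys do not match the OIDC provider in the principal, or whose `sub` value names the wrong namespace or ServiceAccount. Generating the document from its four inputs avoids that. The sketch below uses only the standard library; the function name `irsa_trust_policy` is illustrative, and the account ID and issuer are placeholder values.

```python
import json


def irsa_trust_policy(account_id, oidc_issuer, namespace, service_account):
    """Build an IAM trust policy that lets one ServiceAccount assume a role."""
    # The issuer URL from the cluster starts with https://; IAM wants it without.
    issuer = oidc_issuer
    if issuer.startswith("https://"):
        issuer = issuer[len("https://"):]
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {
                    "Federated": f"arn:aws:iam::{account_id}:oidc-provider/{issuer}"
                },
                "Action": "sts:AssumeRoleWithWebIdentity",
                "Condition": {
                    "StringEquals": {
                        # Pin the role to exactly one namespace and ServiceAccount.
                        f"{issuer}:sub": (
                            f"system:serviceaccount:{namespace}:{service_account}"
                        ),
                        # Only accept tokens minted for STS.
                        f"{issuer}:aud": "sts.amazonaws.com",
                    }
                },
            }
        ],
    }


if __name__ == "__main__":
    policy = irsa_trust_policy(
        account_id="123456789012",  # placeholder account ID
        oidc_issuer=(
            "https://oidc.eks.ap-south-1.amazonaws.com"
            "/id/EXAMPLED539D4633E53DE1B716D3041E"  # placeholder issuer
        ),
        namespace="production",
        service_account="s3-reader-sa",
    )
    print(json.dumps(policy, indent=2))
```

Because the principal and both condition keys are derived from the same `issuer` string, they cannot drift apart.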

EKS in Production

Running Kubernetes on AWS EKS in production involves several AWS-specific components beyond vanilla Kubernetes:

S3 Advanced: Policies, Versioning, and Lifecycle

# S3 bucket policy: enforce encryption and deny HTTP
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DenyUnencryptedObjectUploads",
      "Effect": "Deny",
      "Principal": "*",
      "Action": "s3:PutObject",
      "Resource": "arn:aws:s3:::my-prod-bucket/*",
      "Condition": {
        "StringNotEquals": {
          "s3:x-amz-server-side-encryption": "aws:kms"
        }
      }
    },
    {
      "Sid": "DenyHTTP",
      "Effect": "Deny",
      "Principal": "*",
      "Action": "s3:*",
      "Resource": [
        "arn:aws:s3:::my-prod-bucket",
        "arn:aws:s3:::my-prod-bucket/*"
      ],
      "Condition": {"Bool": {"aws:SecureTransport": "false"}}
    }
  ]
}
S3 lifecycle best practice: Use lifecycle rules to automatically transition objects to cheaper storage classes and expire old objects. Transition objects to Standard-IA after 30 days, Glacier Instant Retrieval after 90 days, and delete after 365 days (or configure based on your data retention policy). For Terraform state buckets, enable versioning and MFA Delete to protect against accidental state deletion.
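The 30/90/365-day schedule above maps onto a single lifecycle rule. The sketch below builds that rule as a plain dictionary in the shape used by the S3 PutBucketLifecycleConfiguration API and prints it as JSON, which could then be supplied to `aws s3api put-bucket-lifecycle-configuration`. The function name and the rule ID `tiering-and-expiry` are hypothetical, and the day counts are parameters so they can follow your own retention policy.

```python
import json


def lifecycle_configuration(ia_days=30, glacier_ir_days=90, expire_days=365):
    """Build one rule: Standard-IA, then Glacier Instant Retrieval, then delete."""
    if not ia_days < glacier_ir_days < expire_days:
        raise ValueError("expected ia_days < glacier_ir_days < expire_days")
    return {
        "Rules": [
            {
                "ID": "tiering-and-expiry",  # hypothetical rule name
                "Status": "Enabled",
                "Filter": {"Prefix": ""},  # empty prefix applies to every object
                "Transitions": [
                    {"Days": ia_days, "StorageClass": "STANDARD_IA"},
                    {"Days": glacier_ir_days, "StorageClass": "GLACIER_IR"},
                ],
                "Expiration": {"Days": expire_days},
            }
        ]
    }


if __name__ == "__main__":
    print(json.dumps(lifecycle_configuration(), indent=2))
```

Validating the ordering up front catches schedules that would expire objects before their last transition.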

AWS Cost Optimisation

Cloud cost is a topic in senior DevOps and SRE interviews. Here are the strategies I have used to reduce AWS costs in production environments:

CloudWatch Monitoring and Alerting

# Create a CloudWatch alarm for EKS pod crash rate
# Fires when the average running-pod count stays below 5 for three consecutive
# 60-second periods, used here as a proxy for pods that are crash looping.
aws cloudwatch put-metric-alarm \
  --alarm-name "EKS-CrashLoopBackOff-Alert" \
  --alarm-description "Alert when pods are crash looping" \
  --namespace ContainerInsights \
  --metric-name pod_status_running \
  --dimensions Name=ClusterName,Value=prod-cluster \
  --statistic Average \
  --period 60 \
  --threshold 5 \
  --comparison-operator LessThanThreshold \
  --evaluation-periods 3 \
  --alarm-actions arn:aws:sns:ap-south-1:123456789012:ops-alerts

# Query logs from EKS pods with CloudWatch Logs Insights
fields @timestamp, @message
| filter @logStream like /production/
| filter @message like /ERROR/
| stats count(*) as errorCount by bin(5m)
| sort @timestamp desc
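The interaction between `--period`, `--threshold`, and `--evaluation-periods` is what people most often misread in an alarm definition. The sketch below is a simplified model of the alarm above, not CloudWatch's actual evaluator: it assumes one datapoint per period, ignores missing-data handling, and treats "datapoints to alarm" as equal to the evaluation periods. The function name is illustrative.

```python
def alarm_state(datapoints, threshold=5, evaluation_periods=3):
    """Model a LessThanThreshold alarm over the most recent datapoints.

    datapoints: per-period averages, oldest first (one value per 60s period).
    """
    if len(datapoints) < evaluation_periods:
        return "INSUFFICIENT_DATA"
    recent = datapoints[-evaluation_periods:]
    # Every one of the last N periods must breach for the alarm to fire.
    if all(value < threshold for value in recent):
        return "ALARM"
    return "OK"


if __name__ == "__main__":
    print(alarm_state([6, 4, 3, 2]))  # three consecutive periods below 5
    print(alarm_state([4, 3, 6]))     # the latest period recovered
```

With a 60-second period and three evaluation periods, a single bad minute never pages anyone; the running-pod count has to stay below the threshold for three minutes.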

Essential AWS CLI Commands

12 AWS Interview Questions with Expert Answers

Q1: What is the difference between a Security Group and a Network ACL in AWS?
Security Groups are stateful firewalls that operate at the EC2 instance (or ENI) level. Stateful means that if you allow inbound traffic on port 8080, the response traffic is automatically allowed outbound regardless of outbound rules. Security Groups only support Allow rules, so you cannot explicitly deny traffic from a specific IP. Network ACLs (NACLs) are stateless firewalls at the subnet level. Stateless means you must explicitly allow both the inbound traffic AND the outbound response traffic (typically on the ephemeral port range). NACLs support both Allow and Deny rules and are evaluated in rule number order: lowest number first, and the first matching rule wins. In practice: use Security Groups as your primary network control (they are more granular and stateful, making them easier to manage). Use NACLs for broad subnet-level blocking (e.g., quickly blocking a malicious IP range for a whole subnet).
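The "lowest rule number first, first match wins" behaviour of NACLs is easiest to see in a small model. This sketch is illustrative only: it covers inbound TCP rules with a source CIDR and a port range, and the rule dictionaries and the function name `nacl_decision` are assumptions made for the example rather than any AWS API shape.

```python
import ipaddress


def nacl_decision(rules, source_ip, port):
    """Return 'allow' or 'deny' for an inbound packet under first-match-wins."""
    ip = ipaddress.ip_address(source_ip)
    for rule in sorted(rules, key=lambda r: r["number"]):
        low, high = rule["ports"]
        if ip in ipaddress.ip_network(rule["cidr"]) and low <= port <= high:
            return rule["action"]  # the first matching rule decides
    return "deny"  # every NACL ends with an implicit deny-all rule


# Hypothetical rule set: block one range, then allow HTTPS from anywhere.
RULES = [
    {"number": 200, "cidr": "0.0.0.0/0", "ports": (443, 443), "action": "allow"},
    {"number": 100, "cidr": "203.0.113.0/24", "ports": (0, 65535), "action": "deny"},
]

if __name__ == "__main__":
    print(nacl_decision(RULES, "203.0.113.7", 443))   # blocked range
    print(nacl_decision(RULES, "198.51.100.9", 443))  # ordinary HTTPS client
    print(nacl_decision(RULES, "198.51.100.9", 22))   # no rule matches
```

Rule 100 is evaluated before rule 200 regardless of list order, which is why a Deny for a bad range must carry a lower number than the broad Allow.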
Q2: Explain IRSA and why it is preferred over EC2 instance profiles for EKS pod authentication.
EC2 instance profiles attach an IAM role to an EC2 node. Every pod running on that node inherits all permissions of that node role, so you cannot isolate permissions per pod. IRSA (IAM Roles for Service Accounts) uses OIDC token federation to give each Kubernetes ServiceAccount its own IAM role with its own least-privilege permissions. Each pod receives a projected ServiceAccount JWT issued under the cluster's OIDC issuer; AWS STS validates it against the IAM OIDC provider and returns temporary credentials scoped to only that ServiceAccount's IAM role. This means a compromised application pod can only access the AWS resources its role explicitly grants: not ECR, not S3, not Secrets Manager, unless those are in its policy. The credentials are also short-lived: STS sessions last one hour by default, the AWS SDK refreshes them automatically, and the kubelet rotates the projected token. For multi-tenant clusters, IRSA is considerably more secure than instance profiles.
Q3: What is the difference between RDS Multi-AZ and RDS Read Replicas?
Multi-AZ is for high availability and disaster recovery. RDS synchronously replicates every write to a standby instance in a different AZ, so a failover loses no committed data. If the primary fails, RDS automatically promotes the standby (failover typically takes 60 to 120 seconds). In a Multi-AZ DB instance deployment the standby cannot serve read traffic; it is a hot spare only, and paying for it roughly doubles the instance cost. Read Replicas are for read scaling and reporting offload. They use asynchronous replication, so there is potential replication lag. Read Replicas can (and should) serve SELECT queries from your application. The replica limit depends on the engine: 5 per instance for RDS Oracle and SQL Server, and up to 15 for Aurora and for RDS MySQL, MariaDB, and PostgreSQL (check the current limit for your engine). Read Replicas can be promoted to standalone databases for disaster recovery or migrations, but this is a manual process. Use Multi-AZ for production databases that need HA. Add Read Replicas when your primary is bottlenecked on reads.
Q4: How does AWS IAM policy evaluation work when multiple policies apply?
AWS evaluates all applicable policies together (identity-based policies, resource-based policies, permission boundaries, session policies, and SCPs), following this logic: First, check for an explicit Deny in any policy. If found, access is denied immediately, overriding all Allow statements. Second, check for an explicit Allow. If one exists (and nothing denies the request), access is permitted, provided every guardrail that applies, such as an SCP or a permission boundary, also allows the action. Third, if there is no explicit Allow, access is implicitly denied by default. A common real-world mistake: an engineer adds an explicit Allow for S3 in the identity policy, but the S3 bucket policy has an explicit Deny for that IAM role. The Deny wins. Another common issue is AWS Organizations SCPs. An SCP cannot grant permissions; it only restricts what can be allowed. If an SCP doesn't Allow an action, no identity policy in that account can Allow it either.
Q5: What is a NAT Gateway and when do you need it?
A NAT Gateway provides outbound internet access for resources in private subnets, without allowing inbound connections from the internet. EC2 instances and EKS pods in private subnets need outbound internet access to pull Docker images from ECR or Docker Hub, install OS packages (apt/yum), call external APIs, and send data to SaaS monitoring tools. Without a NAT Gateway, private subnet resources have no internet connectivity. For high availability you need one NAT Gateway per AZ: if every private subnet routes through a single NAT Gateway and its AZ fails, instances in the other AZs lose internet access too. NAT Gateways are expensive at high traffic volumes because they charge per GB processed on top of an hourly fee. For AWS service traffic, bypass the NAT Gateway with VPC endpoints: Gateway Endpoints (free) for S3 and DynamoDB, and Interface Endpoints (PrivateLink, billed per hour and per GB) for services such as ECR.
Q6: Explain S3 consistency model and when it matters.
Since December 2020, Amazon S3 provides strong read-after-write consistency for all operations: PUTs, DELETEs, and LIST operations are all strongly consistent. This means that immediately after writing an object, any subsequent GET will return the new object. After deleting an object, any subsequent GET will return a 404. After a PUT overwrite, all subsequent GETs return the new version. Before this change (the old S3 had eventual consistency for overwrites and DELETEs), engineering teams had to add retry loops, sleep statements, and consistency workarounds in their code. Those workarounds are no longer needed and should be removed from existing code. The strong consistency change applies to all AWS regions and does not require any configuration; it is the default behaviour for all S3 operations.
Q7: What is CloudFormation and how does it compare to Terraform?
CloudFormation is AWS's native Infrastructure as Code service: you define AWS resources in YAML or JSON templates and CloudFormation provisions and manages them. It is tightly integrated with AWS (new services and features usually gain CloudFormation support early, though not always on launch day) and has built-in rollback on failures. Terraform is a vendor-neutral IaC tool by HashiCorp that uses HCL and supports AWS, GCP, Azure, Kubernetes, Datadog, and 2,000+ other providers through a plugin system. The key differences in practice: Terraform has a more ergonomic language (HCL vs CloudFormation's verbose YAML), a better module system, readable plan output, and cross-cloud support. CloudFormation is deeper in AWS (StackSets for multi-account deployment, native Service Catalog integration, no state management burden). In most DevOps interviews and real-world jobs today, Terraform is the more common answer. Use CloudFormation when you need native AWS account vending (Control Tower), multi-account StackSets, or pure AWS-only environments where the broader Terraform ecosystem adds unnecessary complexity.
Q8: How do you secure secrets in an EKS environment?
Three layers of security for EKS secrets: (1) Enable envelope encryption for Kubernetes Secrets using AWS KMS. This encrypts secrets at rest in etcd with a KMS key, so even direct etcd access doesn't expose plaintext secrets. (2) Use the Secrets Store CSI Driver with the AWS provider to mount secrets from Secrets Manager or Parameter Store as files in pods. Mounted this way, the secrets never touch Kubernetes etcd; if you enable the optional sync to a Kubernetes Secret so that pods can read them as environment variables, that copy is stored in etcd. (3) For the most sensitive credentials, use Vault Agent Injector (HashiCorp Vault on EKS or HCP Vault). Vault injects an init container and, optionally, a sidecar that render secrets into a shared in-memory volume at startup, with support for dynamic secrets that auto-rotate. At minimum, never commit secrets to Git, always use IRSA for AWS service access (no static access keys), and audit secret access with CloudTrail.
Q9: What is AWS Transit Gateway and when do you need it?
Transit Gateway is a regional network transit hub that connects multiple VPCs and on-premises networks through a single gateway. Without Transit Gateway, fully connecting N VPCs requires N*(N-1)/2 peering connections, which becomes unmanageable quickly (10 VPCs = 45 peering connections, each manually configured). With Transit Gateway, every VPC connects to a single Transit Gateway, and routing is centralised. Use Transit Gateway when you have 3+ VPCs that need to communicate, when you need to connect VPCs to on-premises networks via VPN or Direct Connect, or when you are building a hub-and-spoke network architecture with a centralised egress VPC (all internet traffic routed through a single inspection VPC with a firewall appliance). Transit Gateway supports route tables that can segment traffic, so you can prevent certain VPCs from communicating while allowing others.
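The scaling argument is simple arithmetic, and a few lines of Python make the gap concrete. The function names are illustrative; the full-mesh formula is the N*(N-1)/2 from the answer above, and the Transit Gateway count assumes one attachment per VPC.

```python
def peering_connections(vpcs):
    """Peering connections needed for a full mesh of `vpcs` VPCs."""
    return vpcs * (vpcs - 1) // 2


def tgw_attachments(vpcs):
    """Attachments needed when every VPC connects to one Transit Gateway."""
    return vpcs


if __name__ == "__main__":
    for n in (3, 10, 25, 50):
        print(f"{n:>2} VPCs: {peering_connections(n):>4} peerings "
              f"vs {tgw_attachments(n):>2} TGW attachments")
```

Peering grows quadratically while attachments grow linearly, which is why the crossover arrives at only a handful of VPCs.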
Q10: How does CloudWatch differ from Prometheus for Kubernetes monitoring?
CloudWatch Container Insights collects cluster, node, pod, and container metrics from EKS using agents deployed as DaemonSets (the CloudWatch agent for metrics, Fluent Bit for logs). Metrics flow to CloudWatch, where you can create dashboards and alarms and use Logs Insights for log analysis. There is no monitoring backend to run: storage, dashboards, and alarms are fully managed. Cost is pay-per-metric and pay-per-log GB. Prometheus is an open-source pull-based metrics system that you deploy inside the cluster. It scrapes metrics from pods via /metrics endpoints and stores them in its own time-series database. PromQL (the query language) is far more expressive than CloudWatch Metrics Insights. The Prometheus ecosystem (Grafana, Alertmanager, Thanos, Cortex) provides much richer dashboarding and alerting capabilities. In practice: use CloudWatch for AWS service metrics (ELB, RDS, Lambda) and baseline cluster monitoring, and Prometheus + Grafana for detailed application metrics and custom SLO alerting. Amazon Managed Service for Prometheus (AMP) and Amazon Managed Grafana (AMG) let you run this stack without managing the infrastructure.
Q11: Explain the difference between an ALB and NLB. When would you choose each?
ALB (Application Load Balancer) operates at Layer 7 (HTTP/HTTPS). It can inspect HTTP headers, URL paths, query strings, and HTTP methods to make routing decisions. It supports content-based routing (route /api to the API service and /static to a separate target group), WebSocket, HTTP/2, gRPC, WAF integration, authentication (Cognito, OIDC), and request/response header manipulation. It is the right choice for HTTP/HTTPS microservices and Kubernetes Ingress. NLB (Network Load Balancer) operates at Layer 4 (TCP/UDP/TLS). It does not inspect application-layer content; it routes based on IP and port only. NLB supports static Elastic IPs (which ALB does not, since ALB IPs are dynamic), handles millions of requests per second at very low latency, and is the right choice for non-HTTP protocols, real-time gaming and trading systems, IoT applications, and any service that requires a static IP (for IP-based allowlisting by partners or on-premises firewalls).
Q12: How do you implement blue-green deployments on AWS?
Blue-green deployment on AWS can be implemented at several layers: (1) EKS level: use two Deployments (blue and green) with traffic shifting via ALB weighted target groups controlled by the AWS Load Balancer Controller. Argo Rollouts automates this with analysis. (2) ECS level: CodeDeploy integrates natively with ECS for blue-green with automatic rollback on CloudWatch alarm breach. (3) EC2/ASG level: CodeDeploy manages two Auto Scaling Groups and shifts traffic via ALB weighted target groups. (4) Lambda level: Lambda aliases and weighted routing allow you to shift 10% to the new version, monitor error rates, and shift 100% when confident. For EKS in production I use Argo Rollouts: define a Rollout resource with the blueGreen strategy, configure an analysis template that queries Prometheus error rate metrics, and Argo automatically promotes or rolls back based on the analysis result. This makes blue-green fully automated and observable.

Written by Dhanush R
Senior DevOps Engineer · 4.5+ Years · Bengaluru · AWS · Kubernetes · Terraform
