// AWS Guide · DevOps & Cloud
AWS DevOps Guide 2026: EKS, IAM, VPC, IRSA & Real Interview Questions
Updated May 2026
24 min read
AWS · EKS · IAM · VPC · Cloud · DevOps
Dhanush R, Senior DevOps Engineer
4.5+ years of daily hands-on experience with AWS in production: EKS, IAM, VPC, RDS, S3, CloudWatch, and cost optimisation across enterprise environments. Every section here comes from real production incidents and architecture decisions.
AWS is the dominant cloud platform in DevOps, with the broadest service catalogue and the largest ecosystem of tools. I work with AWS daily in production: provisioning EKS clusters with Terraform, configuring IRSA for pod-level IAM authentication, designing multi-AZ VPC architectures, troubleshooting IAM permission errors at 2am, and optimising cloud costs across dozens of services. This guide covers the AWS services, patterns, and concepts that appear most in DevOps and SRE interviews in 2026.
Core AWS Services Every DevOps Engineer Must Know
AWS has over 200 services. In DevOps roles, you need deep knowledge of approximately 12 to 15 of them. Here are the ones you will use daily and be interviewed on most frequently:
- EC2 (Elastic Compute Cloud): Virtual machines in the cloud and the foundation of most AWS workloads. Understand instance families: t3/t4g (burstable, cost-effective for dev), m6i (general-purpose production), c6i (compute-optimised for CPU-intensive workloads), r6i (memory-optimised for databases and caches). Understand On-Demand vs Reserved Instances (1 or 3 year commit, up to 72% discount) vs Spot Instances (up to 90% discount, but can be interrupted with 2 minutes' notice; use them for fault-tolerant batch workloads and EKS worker nodes with proper Cluster Autoscaler configuration).
- EKS (Elastic Kubernetes Service): AWS-managed Kubernetes. AWS manages the control plane (API server, etcd, controller manager, scheduler) and you pay per cluster per hour. You manage worker node groups. Use Managed Node Groups for standard workloads or Fargate for serverless pods. EKS integrates natively with IAM (IRSA), the AWS Load Balancer Controller (formerly the ALB Ingress Controller), the EBS CSI Driver, the EFS CSI Driver, and CloudWatch Container Insights.
- IAM (Identity and Access Management): Every API call in AWS goes through IAM. IAM controls who (users, roles, services) can do what (actions) on which resources under what conditions. Understanding IAM policy evaluation logic is critical: explicit Deny > explicit Allow > implicit Deny. Know the difference between identity-based policies (attached to users/roles), resource-based policies (attached to S3 buckets, SQS queues, Lambda functions), and Service Control Policies (SCPs in AWS Organizations).
- VPC (Virtual Private Cloud): Your isolated network within AWS. All production workloads run inside a VPC with private subnets. Understand subnets (public vs private), route tables, Internet Gateway, NAT Gateway, security groups (stateful, instance-level firewall), NACLs (stateless, subnet-level firewall), VPC peering, Transit Gateway, and PrivateLink.
- S3 (Simple Storage Service): Object storage with 99.999999999% (11 nines) durability. Used for Terraform state files, build artifacts, static website hosting, data lake storage, and log archives; buckets can also be replicated across regions. Understand storage classes: Standard (frequently accessed), Standard-IA (infrequent access), Glacier Instant Retrieval, and Glacier Deep Archive (lowest cost, retrieval takes hours).
- RDS (Relational Database Service): Managed SQL databases (PostgreSQL, MySQL, MariaDB, Oracle, SQL Server). AWS handles backups, patches, Multi-AZ failover, and Read Replica provisioning. Multi-AZ synchronously replicates to a standby in a different AZ, with automatic failover typically in 60-120 seconds. Read Replicas replicate asynchronously for read scaling. Aurora is the premium option: MySQL/PostgreSQL-compatible, advertised by AWS as up to 5x the throughput of standard MySQL, up to 15 Read Replicas, and storage that scales automatically to 128 TiB or more depending on engine version.
- CloudWatch: Metrics, logs, alarms, and dashboards. The primary observability platform in AWS. CloudWatch Logs Insights queries logs with a purpose-built, pipe-delimited query language. CloudWatch Container Insights gives Kubernetes cluster, node, and pod metrics. CloudWatch Alarms trigger SNS notifications or Auto Scaling actions.
- ALB (Application Load Balancer): L7 HTTP/HTTPS load balancer. Routes traffic by URL path, hostname, HTTP headers, and query strings. Native integration with EKS via the AWS Load Balancer Controller. Supports WAF integration, AWS Certificate Manager for TLS, and sticky sessions. NLB (Network Load Balancer) is L4 TCP/UDP: ultra-low latency, static IPs, for non-HTTP workloads.
- Lambda: Serverless compute that runs code in response to events. Pay only for execution time (in milliseconds). No servers to manage, automatic scaling from 0 to thousands of concurrent invocations. Use for API backends (behind API Gateway), event processors (S3 triggers, SQS consumers, DynamoDB Streams), scheduled jobs (EventBridge rules), and infrastructure automation.
- ECR (Elastic Container Registry): Private container image registry deeply integrated with EKS, ECS, and CodeBuild. Supports image scanning (basic scanning, or enhanced scanning through Amazon Inspector), lifecycle policies to auto-delete old images, cross-region replication, and immutable image tags.
VPC Architecture β Production Design
Every production AWS workload should run in a properly designed VPC. A poorly designed VPC is a security vulnerability. Here is the standard multi-AZ architecture I implement for production EKS environments:
# 3-AZ production VPC architecture
Region: ap-south-1 (Mumbai)
VPC: 10.0.0.0/16
Public Subnets (one per AZ, for ALBs and NAT Gateways only):
  ap-south-1a: 10.0.0.0/24
  ap-south-1b: 10.0.1.0/24
  ap-south-1c: 10.0.2.0/24
  → Route: 0.0.0.0/0 → Internet Gateway
Private Subnets (EKS nodes, application services):
  ap-south-1a: 10.0.10.0/23 (512 IPs)
  ap-south-1b: 10.0.12.0/23
  ap-south-1c: 10.0.14.0/23
  → Route: 0.0.0.0/0 → NAT Gateway (in public subnet)
Database Subnets (RDS, ElastiCache; no internet access):
  ap-south-1a: 10.0.20.0/24
  ap-south-1b: 10.0.21.0/24
  ap-south-1c: 10.0.22.0/24
  → Route: local only (no NAT, no IGW)
Key design decisions: EKS worker nodes go in private subnets and should never have direct public IPs. The ALB sits in public subnets and routes to nodes in private subnets. RDS goes in database subnets with a security group that only allows inbound on port 5432 from the EKS node security group. NAT Gateway provides outbound internet access for private subnet resources (pulling Docker images, calling external APIs) without inbound exposure.
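A subnet plan like this is easy to get subtly wrong (overlapping ranges, a CIDR outside the VPC), so it can help to check it mechanically before writing any Terraform. The following is a minimal sketch using only Python's standard `ipaddress` module. The function names `validate_plan` and `usable_ips` are illustrative, not part of any AWS tooling. The only AWS-specific assumption is that AWS reserves five addresses in every subnet.

```python
import ipaddress
from itertools import combinations

VPC_CIDR = "10.0.0.0/16"

# Subnet plan copied from the layout above, keyed by tier.
SUBNET_PLAN = {
    "public": ["10.0.0.0/24", "10.0.1.0/24", "10.0.2.0/24"],
    "private": ["10.0.10.0/23", "10.0.12.0/23", "10.0.14.0/23"],
    "database": ["10.0.20.0/24", "10.0.21.0/24", "10.0.22.0/24"],
}

# AWS reserves the first four addresses and the last address of every subnet.
AWS_RESERVED_PER_SUBNET = 5


def validate_plan(vpc_cidr, plan):
    """Return a list of problems; an empty list means the plan is consistent."""
    vpc = ipaddress.ip_network(vpc_cidr)
    subnets = [ipaddress.ip_network(c) for cidrs in plan.values() for c in cidrs]
    problems = []
    for net in subnets:
        if not net.subnet_of(vpc):
            problems.append(f"{net} is outside {vpc}")
    for a, b in combinations(subnets, 2):
        if a.overlaps(b):
            problems.append(f"{a} overlaps {b}")
    return problems


def usable_ips(cidr):
    """Addresses left for ENIs after the AWS-reserved ones are subtracted."""
    return ipaddress.ip_network(cidr).num_addresses - AWS_RESERVED_PER_SUBNET


if __name__ == "__main__":
    print(validate_plan(VPC_CIDR, SUBNET_PLAN))
    print(usable_ips("10.0.10.0/23"))
```

The usable-address count matters for EKS in particular: with the default VPC CNI every pod consumes an IP from the node's subnet, which is why the private subnets are sized at /23 rather than /24.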
IAM Deep Dive β Roles, Policies, and IRSA
IAM is the most complex service in AWS and the source of the most production issues I have investigated. Understanding it deeply is essential for both security and operational efficiency.
IAM Policy Evaluation Logic
When an API call is made, AWS evaluates all applicable policies in this order: (1) Check for explicit Deny: if any policy explicitly denies the action, it is denied immediately regardless of any Allow. (2) Check for explicit Allow: if any policy explicitly allows the action, it proceeds. (3) Implicit Deny: if there is no explicit Allow, the request is denied by default. This is why IAM is Deny by default: you must explicitly grant every permission.
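The three-step order can be illustrated with a toy evaluator. This is a deliberately simplified sketch, not how AWS implements evaluation: it ignores principals, conditions, SCPs, and permission boundaries, treats all statements as one flat list, and uses shell-style wildcard matching as a stand-in for IAM's `*` and `?` wildcards. The `evaluate` function and the sample statements are hypothetical.

```python
from fnmatch import fnmatchcase


def evaluate(statements, action, resource):
    """Return 'ExplicitDeny', 'Allow', or 'ImplicitDeny' for one request."""

    def matches(stmt):
        return any(fnmatchcase(action, p) for p in stmt["Action"]) and any(
            fnmatchcase(resource, p) for p in stmt["Resource"]
        )

    matching = [s for s in statements if matches(s)]
    # Step 1: any matching Deny wins, no matter how many Allows exist.
    if any(s["Effect"] == "Deny" for s in matching):
        return "ExplicitDeny"
    # Step 2: otherwise a matching Allow grants access.
    if any(s["Effect"] == "Allow" for s in matching):
        return "Allow"
    # Step 3: nothing matched, so the default applies.
    return "ImplicitDeny"


# Hypothetical statements for illustration only.
STATEMENTS = [
    {"Effect": "Allow", "Action": ["s3:*"], "Resource": ["arn:aws:s3:::my-prod-bucket/*"]},
    {"Effect": "Deny", "Action": ["s3:DeleteObject"], "Resource": ["*"]},
]

if __name__ == "__main__":
    for act in ("s3:GetObject", "s3:DeleteObject", "ec2:RunInstances"):
        print(act, evaluate(STATEMENTS, act, "arn:aws:s3:::my-prod-bucket/report.csv"))
```

Note that the order of the statements in the list does not matter: the Deny on `s3:DeleteObject` wins even though the broader Allow appears first.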
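# Illustrative IAM policy for EKS worker nodes (a subset; in practice attach the AWS managed
# policies AmazonEKSWorkerNodePolicy, AmazonEC2ContainerRegistryReadOnly, and AmazonEKS_CNI_Policy).
# The four ecr:* actions are what a node needs to pull images from ECR.
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "ec2:DescribeInstances",
        "ec2:DescribeRegions",
        "ecr:GetAuthorizationToken",
        "ecr:BatchCheckLayerAvailability",
        "ecr:BatchGetImage",
        "ecr:GetDownloadUrlForLayer"
      ],
      "Resource": "*"
    }
  ]
}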
IRSA β IAM Roles for Service Accounts
IRSA is the secure way to give individual EKS Pods access to AWS services without sharing credentials. Without a pod-level mechanism such as IRSA, the usual fallbacks are embedding IAM access keys in the Pod spec (a major security risk) or relying on the EC2 instance IAM role (which gives all Pods on that node the same permissions, violating least privilege). IRSA works via OIDC federation: the EKS cluster exposes an OIDC endpoint, and IAM trusts tokens signed by that endpoint. Pods using a ServiceAccount annotated with an IAM role ARN receive a projected JWT token that AWS STS validates before issuing temporary credentials, which are rotated automatically.
# ServiceAccount annotation for IRSA
apiVersion: v1
kind: ServiceAccount
metadata:
  name: s3-reader-sa
  namespace: production
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::123456789012:role/s3-reader-role
# Corresponding IAM trust policy on the IAM role
# (the condition key must use the same OIDC provider ID as the Federated principal)
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Federated": "arn:aws:iam::123456789012:oidc-provider/oidc.eks.ap-south-1.amazonaws.com/id/EXAMPLED539D4633E53DE1B716D3041E"
      },
      "Action": "sts:AssumeRoleWithWebIdentity",
      "Condition": {
        "StringEquals": {
          "oidc.eks.ap-south-1.amazonaws.com/id/EXAMPLED539D4633E53DE1B716D3041E:sub":
            "system:serviceaccount:production:s3-reader-sa"
        }
      }
    }
  ]
}
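Because the trust policy has a fixed shape, a common source of IRSA failures is a hand-edited condition key that no longer matches the OIDC provider. One way to avoid that is to generate the document from its four inputs. The helper below is a sketch under that assumption. `irsa_trust_policy` is a hypothetical name, and the account ID and issuer are the placeholder values used above. The `aud` condition is an optional hardening step that restricts the token audience to STS.

```python
import json


def irsa_trust_policy(account_id, oidc_issuer, namespace, service_account):
    """Build an IRSA trust policy document as a plain dict."""
    # The issuer is used without its scheme in both the ARN and the condition keys.
    prefix = "https://"
    issuer = oidc_issuer[len(prefix):] if oidc_issuer.startswith(prefix) else oidc_issuer
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {
                    "Federated": f"arn:aws:iam::{account_id}:oidc-provider/{issuer}"
                },
                "Action": "sts:AssumeRoleWithWebIdentity",
                "Condition": {
                    "StringEquals": {
                        # Only this ServiceAccount in this namespace may assume the role.
                        f"{issuer}:sub": f"system:serviceaccount:{namespace}:{service_account}",
                        f"{issuer}:aud": "sts.amazonaws.com",
                    }
                },
            }
        ],
    }


if __name__ == "__main__":
    policy = irsa_trust_policy(
        "123456789012",
        "https://oidc.eks.ap-south-1.amazonaws.com/id/EXAMPLED539D4633E53DE1B716D3041E",
        "production",
        "s3-reader-sa",
    )
    print(json.dumps(policy, indent=2))
```

The printed JSON can be passed to whatever creates the role (Terraform, CloudFormation, or the CLI); the point of the sketch is only that the provider ID appears in one place.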
EKS in Production
Running Kubernetes on AWS EKS in production involves several AWS-specific components beyond vanilla Kubernetes:
- AWS Load Balancer Controller: Provisions AWS ALBs for Kubernetes Ingress resources and NLBs for LoadBalancer Services. Replaces the legacy in-tree load balancer integration. Required for annotation-based ALB configuration (WAF integration, SSL certificates, access logs to S3).
- EBS CSI Driver: Manages EBS volumes as PersistentVolumes. Required for StatefulSets that need block storage. Volumes are AZ-specific, so set volumeBindingMode: WaitForFirstConsumer on the StorageClass to ensure the volume is created in the same AZ as the scheduled pod.
- EFS CSI Driver: Manages EFS file systems as PersistentVolumes with ReadWriteMany access mode, so multiple pods on different nodes can mount the same EFS volume simultaneously. Use for shared configuration, ML model storage, or any multi-writer workload.
- Cluster Autoscaler: Automatically adds or removes EC2 worker nodes based on pending pod demand. Scales up when pods are unschedulable due to insufficient resources, scales down when nodes are underutilised. Requires IAM permissions to call EC2 Auto Scaling APIs. Karpenter is the modern alternative: it provisions right-sized nodes directly rather than resizing pre-configured Auto Scaling Groups. (KEDA is often deployed alongside it, but KEDA scales pods on event metrics, not nodes.)
- Karpenter: Next-generation node provisioning for EKS. Typically provisions nodes in under 60 seconds (vs roughly 3-5 minutes for Cluster Autoscaler), selects a suitable instance type for each workload, and supports spot interruption handling. Strongly recommended for new EKS clusters.
S3 Advanced β Policies, Versioning, and Lifecycle
# S3 bucket policy: enforce encryption and deny HTTP
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DenyUnencryptedObjectUploads",
      "Effect": "Deny",
      "Principal": "*",
      "Action": "s3:PutObject",
      "Resource": "arn:aws:s3:::my-prod-bucket/*",
      "Condition": {
        "StringNotEquals": {
          "s3:x-amz-server-side-encryption": "aws:kms"
        }
      }
    },
    {
      "Sid": "DenyHTTP",
      "Effect": "Deny",
      "Principal": "*",
      "Action": "s3:*",
      "Resource": [
        "arn:aws:s3:::my-prod-bucket",
        "arn:aws:s3:::my-prod-bucket/*"
      ],
      "Condition": {"Bool": {"aws:SecureTransport": "false"}}
    }
  ]
}
S3 lifecycle best practice: Use lifecycle rules to automatically transition objects to cheaper storage classes and expire old objects. Transition objects to Standard-IA after 30 days, Glacier Instant Retrieval after 90 days, and delete after 365 days (or configure based on your data retention policy). For Terraform state buckets, enable versioning and MFA Delete to protect against accidental state deletion.
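The 30/90/365-day schedule described above can be expressed as a lifecycle configuration document. The sketch below only builds and sanity-checks the dictionary. It is shaped the way boto3's `put_bucket_lifecycle_configuration` expects its `LifecycleConfiguration` argument, but no AWS call is made here. The function name, rule ID, and default values are illustrative.

```python
def lifecycle_rule(rule_id, prefix="", ia_days=30, glacier_ir_days=90, expire_days=365):
    """Build one lifecycle rule: Standard -> Standard-IA -> Glacier IR -> delete."""
    # S3 does not transition objects to Standard-IA before they are 30 days old,
    # and each later step has to come after the previous one.
    if ia_days < 30:
        raise ValueError("Standard-IA transition must be at least 30 days")
    if not ia_days < glacier_ir_days < expire_days:
        raise ValueError("days must increase: IA < Glacier IR < expiration")
    return {
        "ID": rule_id,
        "Status": "Enabled",
        "Filter": {"Prefix": prefix},  # empty prefix applies the rule to the whole bucket
        "Transitions": [
            {"Days": ia_days, "StorageClass": "STANDARD_IA"},
            {"Days": glacier_ir_days, "StorageClass": "GLACIER_IR"},
        ],
        "Expiration": {"Days": expire_days},
    }


LIFECYCLE_CONFIG = {"Rules": [lifecycle_rule("archive-then-expire")]}

if __name__ == "__main__":
    import json

    print(json.dumps(LIFECYCLE_CONFIG, indent=2))
```

Adjust `expire_days` to your retention policy, and leave expiration out of rules for Terraform state buckets, where old versions are the safety net.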
AWS Cost Optimisation
Cloud cost is a recurring topic in senior DevOps and SRE interviews. Here are the strategies I have used to reduce AWS costs in production environments:
- Right-size EC2 instances: Use CloudWatch metrics and AWS Compute Optimizer to identify over-provisioned instances. A t3.large running at 5% CPU costs roughly 4x what a t3.small would at On-Demand rates, provided the workload's memory footprint fits the smaller instance.
- Spot Instances for EKS worker nodes: Mix On-Demand (for stateful/critical workloads) with Spot (for stateless workloads) using Karpenter's node provisioning. Spot can reduce EC2 costs by 60-90% for fault-tolerant workloads.
- S3 Intelligent-Tiering: Automatically moves objects between access tiers based on access patterns. There are no retrieval charges, only a small per-object monitoring and automation fee. Use for buckets where access patterns are unknown or variable.
- RDS Reserved Instances: For long-running production databases with predictable workloads, 1-year Reserved Instances save roughly 40% vs On-Demand, depending on engine, instance class, and payment option.
- NAT Gateway consolidation: Each NAT Gateway is billed per hour plus per GB processed. In multi-AZ setups, routing all outbound traffic through a single NAT Gateway in one AZ (accepting the AZ dependency and the cross-AZ data transfer charges) saves the hourly cost of the other two gateways; the per-GB processing charge stays the same. For high-volume workloads, consider a NAT instance (cheaper but requires management) or VPC endpoints (Gateway Endpoints for S3 and DynamoDB, PrivateLink Interface Endpoints for other AWS services), which bypass NAT entirely.
- Delete unused resources: Unattached EBS volumes, idle load balancers, unused Elastic IPs, old AMIs and their snapshots, and unused RDS snapshots are a common source of wasted spend. Use AWS Cost Explorer and Trusted Advisor to identify them.
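The NAT Gateway trade-off is easier to reason about with the billing components written out. The sketch below is a back-of-the-envelope model only: the function name is made up, and every rate in the example is a placeholder, not a quoted AWS price. Substitute the current figures for your region before drawing conclusions.

```python
HOURS_PER_MONTH = 730  # common approximation for monthly billing estimates


def nat_monthly_cost(gateways, gb_processed, hourly_rate, per_gb_rate,
                     cross_az_gb=0.0, cross_az_rate=0.0):
    """Estimate monthly NAT cost from its three components.

    gateways      -- number of NAT Gateways running all month
    gb_processed  -- total GB sent through NAT (same total however many gateways)
    cross_az_gb   -- GB that must cross an AZ boundary to reach a gateway
    """
    hourly = gateways * hourly_rate * HOURS_PER_MONTH
    processing = gb_processed * per_gb_rate
    cross_az = cross_az_gb * cross_az_rate
    return hourly + processing + cross_az


if __name__ == "__main__":
    # Placeholder rates chosen for round numbers, NOT real pricing.
    rates = {"hourly_rate": 0.05, "per_gb_rate": 0.05}
    per_az = nat_monthly_cost(3, 1000, **rates)
    # Hypothetical split: 600 of the 1000 GB originate outside the gateway's AZ.
    single = nat_monthly_cost(1, 1000, cross_az_gb=600, cross_az_rate=0.01, **rates)
    print(f"one per AZ: {per_az:.2f}  single gateway: {single:.2f}")
```

The model makes the shape of the decision visible: consolidation removes a fixed hourly cost but adds a charge that grows with traffic, so it favours low-traffic environments such as dev and staging.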
CloudWatch Monitoring and Alerting
# Create a CloudWatch alarm that fires when fewer than 5 pods are running for 3 minutes
# (a coarse proxy for crash loops; assumes Container Insights is publishing pod_status_running)
aws cloudwatch put-metric-alarm \
  --alarm-name "EKS-CrashLoopBackOff-Alert" \
  --alarm-description "Alert when running pod count drops, e.g. pods are crash looping" \
  --namespace ContainerInsights \
  --metric-name pod_status_running \
  --dimensions Name=ClusterName,Value=prod-cluster \
  --statistic Average \
  --period 60 \
  --threshold 5 \
  --comparison-operator LessThanThreshold \
  --evaluation-periods 3 \
  --alarm-actions arn:aws:sns:ap-south-1:123456789012:ops-alerts
# Query logs from EKS pods with CloudWatch Logs Insights
fields @timestamp, @message
| filter @logStream like /production/
| filter @message like /ERROR/
| stats count(*) as errorCount by bin(5m)
| sort @timestamp desc
Essential AWS CLI Commands
aws eks update-kubeconfig --name prod-cluster --region ap-south-1   # Update kubeconfig for EKS cluster access
aws ec2 describe-instances --filters "Name=tag:Environment,Values=production" --query 'Reservations[].Instances[].{ID:InstanceId,Type:InstanceType,State:State.Name}'   # List production EC2 instances
aws s3 cp s3://bucket/path ./local --recursive   # Download S3 objects recursively
aws iam simulate-principal-policy --policy-source-arn arn:aws:iam::123456789012:role/MyRole --action-names s3:GetObject --resource-arns 'arn:aws:s3:::mybucket/*'   # Test IAM permissions before deploying
aws logs tail /aws/eks/prod-cluster/cluster --follow   # Stream EKS control plane logs
aws cloudwatch get-metric-statistics --namespace ContainerInsights --metric-name node_cpu_utilization --dimensions Name=ClusterName,Value=prod-cluster --start-time 2026-05-01T00:00:00Z --end-time 2026-05-01T01:00:00Z --period 300 --statistics Average   # Get EKS node CPU metrics (Container Insights; time range is an example)
aws sts get-caller-identity   # Verify current AWS identity and account
12 AWS Interview Questions with Expert Answers
Q1: What is the difference between a Security Group and a Network ACL in AWS?
Security Groups are stateful firewalls that operate at the EC2 instance (or ENI) level. Stateful means: if you allow inbound traffic on port 8080, the response traffic is automatically allowed outbound regardless of outbound rules. Security Groups only support Allow rules, so you cannot explicitly deny traffic from a specific IP. Network ACLs (NACLs) are stateless firewalls at the subnet level. Stateless means: you must explicitly allow both inbound AND the response outbound traffic. NACLs support both Allow and Deny rules and are evaluated in rule number order (lower numbers first), with the first matching rule deciding the outcome. In practice: use Security Groups as your primary network control (they are more granular and stateful, making them easier to manage). Use NACLs for broad subnet-level blocking (e.g., blocking a malicious IP range at the subnet level quickly).
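The rule-number ordering is the part of NACLs that most often surprises people, so here is a small model of it. This is a sketch of the first-match behaviour only (inbound TCP, no protocol or ephemeral-port handling), and the rule format and `nacl_decision` name are invented for the illustration.

```python
import ipaddress


def nacl_decision(rules, source_ip, port):
    """Return 'allow' or 'deny' using NACL-style first-match evaluation."""
    ip = ipaddress.ip_address(source_ip)
    # Rules are checked from the lowest rule number upward; the first match decides.
    for rule in sorted(rules, key=lambda r: r["rule_number"]):
        low, high = rule["port_range"]
        if ip in ipaddress.ip_network(rule["cidr"]) and low <= port <= high:
            return rule["action"]
    return "deny"  # the implicit catch-all rule at the end of every NACL


# Hypothetical inbound rules: block one range, then allow HTTPS from anywhere.
INBOUND_RULES = [
    {"rule_number": 100, "cidr": "0.0.0.0/0", "port_range": (443, 443), "action": "allow"},
    {"rule_number": 90, "cidr": "203.0.113.0/24", "port_range": (0, 65535), "action": "deny"},
]

if __name__ == "__main__":
    print(nacl_decision(INBOUND_RULES, "203.0.113.7", 443))
    print(nacl_decision(INBOUND_RULES, "198.51.100.1", 443))
```

The deny rule only works because its number (90) is lower than the allow rule's (100). Swap the numbers and the blocked range gets through, which is why it is conventional to leave gaps between rule numbers.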
Q2: Explain IRSA and why it is preferred over EC2 instance profiles for EKS pod authentication.
EC2 instance profiles attach an IAM role to an EC2 node. Every pod running on that node inherits all permissions of that node role, so you cannot have pod-level permission isolation. IRSA (IAM Roles for Service Accounts) uses OIDC token federation to give each Kubernetes ServiceAccount its own IAM role with its own minimum-privilege permissions. The pod receives a JWT signed by the cluster's OIDC issuer, AWS STS validates it, and returns temporary credentials scoped to only that ServiceAccount's IAM role. This means a compromised application pod can only access the specific AWS resources its ServiceAccount needs (not ECR, not S3, not Secrets Manager) unless explicitly granted. The credentials are also short-lived: the kubelet rotates the projected token automatically and the AWS SDK refreshes the STS credentials (one hour by default) before they expire. It is strictly more secure than instance profiles for multi-tenant clusters.
Q3: What is the difference between RDS Multi-AZ and RDS Read Replicas?
Multi-AZ is for high availability and disaster recovery. RDS synchronously replicates every write to a standby instance in a different AZ, so a failover loses no committed data. If the primary fails, RDS automatically promotes the standby (failover typically takes 60-120 seconds). In a standard Multi-AZ DB instance deployment the standby cannot serve read traffic; it is a hot spare only, and it roughly doubles the instance cost. Read Replicas are for read scaling and reporting offload. They use asynchronous replication, so there is potential replication lag. Read Replicas can (and should) serve SELECT queries from your application. RDS for MySQL, MariaDB, and PostgreSQL supports up to 15 Read Replicas per instance (as does Aurora), while Oracle and SQL Server support up to 5. Read Replicas can be promoted to standalone databases for disaster recovery or migrations, but this is a manual process. Use Multi-AZ for production databases that need HA. Add Read Replicas when your primary is bottlenecked on reads.
Q4: How does AWS IAM policy evaluation work when multiple policies apply?
AWS evaluates all applicable policies (identity-based, resource-based, permission boundaries, session policies, SCPs) together, following this logic: First, check for an explicit Deny in any policy. If found, access is denied immediately, overriding all Allow statements. Second, check for an explicit Allow. If found (and no Deny), access is permitted, but only within the limits set by any SCPs, permission boundaries, and session policies that apply. Third, if there is no explicit Allow, access is implicitly denied by default. A common real-world mistake: an engineer adds an explicit Allow for S3 in the identity policy but the S3 bucket policy has an explicit Deny for that IAM role. The Deny wins. Another common issue: AWS Organizations SCPs. An SCP cannot grant permissions; it only restricts what can be allowed. If an SCP doesn't Allow an action, no identity policy in that account can Allow it either.
Q5: What is a NAT Gateway and when do you need it?
A NAT Gateway provides outbound internet access for resources in private subnets, without allowing inbound connections from the internet. EC2 instances and EKS pods in private subnets need outbound internet access to pull Docker images from ECR or Docker Hub, install OS packages (apt/yum), call external APIs, and send data to SaaS monitoring tools. Without a NAT Gateway, private subnet resources have no internet connectivity. You need one NAT Gateway per AZ for high availability: with a single shared gateway, if the AZ hosting it fails, private subnet instances in the other AZs lose internet access too. NAT Gateways are expensive at high traffic volumes because they charge per GB processed. For AWS service traffic, use VPC Gateway Endpoints (free, available for S3 and DynamoDB) or Interface Endpoints (PrivateLink, for ECR and most other services) to bypass the NAT Gateway entirely.
Q6: Explain S3 consistency model and when it matters.
Since December 2020, Amazon S3 provides strong read-after-write consistency for all operations: PUTs, DELETEs, and LIST operations are all strongly consistent. This means: immediately after writing an object, any subsequent GET will return the new object. After deleting an object, any subsequent GET will return a 404. After a PUT overwrite, all subsequent GETs return the new version. Before this change (the old S3 had eventual consistency for overwrites and DELETEs), engineering teams had to add retry loops, sleep statements, and consistency workarounds in their code. Those workarounds are no longer needed and should be removed from existing code. The strong consistency change applies to all AWS regions and does not require any configuration; it is the default behaviour for all S3 operations.
Q7: What is CloudFormation and how does it compare to Terraform?
CloudFormation is AWS's native Infrastructure as Code service: you define AWS resources in YAML or JSON templates and CloudFormation provisions and manages them. It is tightly integrated with AWS (new services and features are usually supported at or soon after launch, though coverage sometimes lags) and has built-in rollback on failures. Terraform is a vendor-neutral IaC tool by HashiCorp that uses HCL and supports AWS, GCP, Azure, Kubernetes, Datadog, and thousands of other providers through a plugin system. The key differences in practice: Terraform has a more ergonomic language (HCL vs CloudFormation's verbose YAML), better module system, readable plan output, and cross-cloud support. CloudFormation is deeper in AWS (StackSets for multi-account deployment, native Service Catalog integration, no state management burden). In most DevOps interviews and real-world jobs today, Terraform is the more common answer. Use CloudFormation when you need native AWS account vending (Control Tower), multi-account StackSets, or pure AWS-only environments where the broader Terraform ecosystem adds unnecessary complexity.
Q8: How do you secure secrets in an EKS environment?
Three layers of security for EKS secrets: (1) Enable envelope encryption for Kubernetes Secrets using AWS KMS. This encrypts secrets at rest in etcd with a KMS key, so even direct etcd access doesn't expose plaintext secrets. (2) Use the Secrets Store CSI Driver with the AWS provider to mount secrets from Secrets Manager or Parameter Store directly as files in pods, so the secrets never touch Kubernetes etcd. (Exposing them as environment variables requires syncing them into a Kubernetes Secret, which does store them in etcd.) (3) For the most sensitive credentials, use Vault Agent Injector (HashiCorp Vault on EKS or HCP Vault). Vault renders secrets into a shared in-memory volume in the pod using an injected init container and optional sidecar, with dynamic secrets that auto-rotate. At minimum, never commit secrets to Git, always use IRSA for AWS service access (no static access keys), and audit secret access with CloudTrail.
Q9: What is AWS Transit Gateway and when do you need it?
Transit Gateway is a regional network transit hub that connects multiple VPCs and on-premises networks through a single gateway. Without Transit Gateway, connecting N VPCs in a full mesh requires N*(N-1)/2 peering connections, which becomes unmanageable quickly (10 VPCs = 45 peering connections, each manually configured). With Transit Gateway, every VPC connects to a single Transit Gateway, and routing is centralised. Use Transit Gateway when you have 3+ VPCs that need to communicate, when you need to connect VPCs to on-premises networks via VPN or Direct Connect, or when you are building a hub-and-spoke network architecture with a centralised egress VPC (all internet traffic routed through a single inspection VPC with a firewall appliance). Transit Gateway supports route tables that can segment traffic, so you can prevent certain VPCs from communicating while allowing others.
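The scaling argument is just arithmetic, and a few lines make the gap concrete. This sketch compares the number of connections to manage in a full peering mesh against one Transit Gateway attachment per VPC; the function names are illustrative.

```python
def full_mesh_peerings(vpc_count):
    """Peering connections needed so every VPC can reach every other: N*(N-1)/2."""
    return vpc_count * (vpc_count - 1) // 2


def transit_gateway_attachments(vpc_count):
    """With a hub, each VPC needs exactly one attachment."""
    return vpc_count


if __name__ == "__main__":
    for n in (3, 10, 25, 50):
        print(f"{n:>3} VPCs: {full_mesh_peerings(n):>5} peerings "
              f"vs {transit_gateway_attachments(n):>3} attachments")
```

Peering grows quadratically while attachments grow linearly, which is why the "3+ VPCs" rule of thumb turns into a hard requirement well before a few dozen VPCs.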
Q10: How does CloudWatch differ from Prometheus for Kubernetes monitoring?
CloudWatch Container Insights collects cluster, node, pod, and container metrics from EKS using DaemonSet agents (the CloudWatch agent for metrics, Fluent Bit for logs). Metrics flow to CloudWatch where you can create dashboards, alarms, and use Logs Insights for log analysis. There is no backend infrastructure to run because the service is fully managed. Cost is pay-per-metric and pay-per-log GB. Prometheus is an open-source pull-based metrics system that you deploy inside the cluster. It scrapes metrics from pods via /metrics endpoints and stores them in its own time-series database. PromQL (the query language) is far more expressive than CloudWatch Metrics Insights. The Prometheus ecosystem (Grafana, AlertManager, Thanos, Cortex) provides much richer dashboarding and alerting capabilities. In practice: use CloudWatch for AWS service metrics (ELB, RDS, Lambda) and baseline cluster monitoring, and Prometheus + Grafana for detailed application metrics and custom SLO alerting. Amazon Managed Service for Prometheus (AMP) and Amazon Managed Grafana (AMG) let you run this stack without managing the infrastructure.
Q11: Explain the difference between an ALB and NLB. When would you choose each?
ALB (Application Load Balancer) operates at Layer 7 (HTTP/HTTPS). It can inspect HTTP headers, URL paths, query strings, and HTTP methods to make routing decisions. It supports content-based routing (route /api to the API service, /static to a separate target group), WebSocket, HTTP/2, gRPC, WAF integration, authentication (Cognito, OIDC), and request/response header manipulation. It is the right choice for all HTTP/HTTPS microservices and Kubernetes Ingress. NLB (Network Load Balancer) operates at Layer 4 (TCP/UDP/TLS). It does not inspect application-layer content and routes based on IP and port only. NLB supports static Elastic IPs (which ALB does not; ALBs have dynamic IPs), handles millions of requests per second at lower latency than an ALB, and is the right choice for non-HTTP protocols, real-time gaming and trading systems, IoT applications, and any service that requires a static IP (for IP-based allowlisting by partners or on-premises firewalls).
Q12: How do you implement blue-green deployments on AWS?
Blue-green deployment on AWS can be implemented at several layers: (1) EKS level: use two Deployments (blue and green) with traffic shifting via ALB weighted target groups controlled by the AWS Load Balancer Controller. Argo Rollouts automates this with analysis. (2) ECS level: CodeDeploy integrates natively with ECS for blue-green with automatic rollback on CloudWatch alarm breach. (3) EC2/ASG level: CodeDeploy manages two Auto Scaling Groups and shifts traffic via ALB weighted target groups. (4) Lambda level: Lambda aliases and weighted routing allow you to shift 10% to the new version, monitor error rates, and shift 100% when confident. For EKS in production I use Argo Rollouts: define a Rollout resource with the blueGreen strategy, configure an analysis template that queries Prometheus error rate metrics, and Argo automatically promotes or rolls back based on the analysis result. This makes blue-green fully automated and observable.
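The promote-or-rollback decision at the heart of automated blue-green is simple to state in code. The sketch below is not the Argo Rollouts API. It is a hypothetical gate, loosely modelled on the idea of an analysis run with a failure limit, that takes error-rate samples (as you might read them from Prometheus) and returns a verdict. The threshold values are placeholders.

```python
def analysis_verdict(error_rates, max_error_rate=0.01, failure_limit=0):
    """Decide whether the green stack should take over.

    error_rates    -- sampled error ratios for green (0.02 means 2% of requests failed)
    max_error_rate -- a sample above this counts as one failed measurement
    failure_limit  -- number of failed measurements tolerated before rolling back
    """
    if not error_rates:
        return "rollback"  # no data: refuse to promote blind
    failures = sum(1 for rate in error_rates if rate > max_error_rate)
    return "rollback" if failures > failure_limit else "promote"


if __name__ == "__main__":
    print(analysis_verdict([0.001, 0.002, 0.004]))
    print(analysis_verdict([0.001, 0.050, 0.002]))
    print(analysis_verdict([0.001, 0.050, 0.002], failure_limit=1))
```

Treating missing data as a failure is a design choice worth copying into whatever tool you use: a broken metrics query should block a promotion, not wave it through.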