Top 30 DevOps Interview Questions and Answers (2026)
This guide covers the 30 most common DevOps interview questions asked at product-based companies, tech consultancies, and startups in 2026. Each answer is written from real interview and production experience, not paraphrased documentation. Use these to prepare for Senior DevOps Engineer, Platform Engineer, and SRE roles.
☸️ Kubernetes Interview Questions
Q1. What is the difference between a liveness probe and a readiness probe?
Liveness probe – answers "Is this container still alive?" If it fails, Kubernetes restarts the container. Use it to detect deadlocks or processes that are running but not making progress.
Readiness probe – answers "Is this container ready to receive traffic?" If it fails, the pod is removed from the Service's endpoint list – no traffic is routed to it, but the container is NOT restarted. Use this for slow-starting apps or apps that need time to warm up caches.
Startup probe (third type) – holds off liveness and readiness checks until it succeeds, giving the app a configurable window to start. Essential for slow-starting apps like JVM services that take 60+ seconds to initialise.
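A minimal sketch of all three probes on one container (image, paths, and timings are illustrative):

```yaml
containers:
  - name: api
    image: api:1.0   # illustrative image
    ports:
      - containerPort: 8080
    startupProbe:              # holds off the other probes while the app boots
      httpGet: { path: /healthz, port: 8080 }
      failureThreshold: 30     # 30 x 5s = up to 150s to start
      periodSeconds: 5
    livenessProbe:             # restart the container if this fails
      httpGet: { path: /healthz, port: 8080 }
      periodSeconds: 10
      failureThreshold: 3
    readinessProbe:            # remove the pod from Service endpoints if this fails
      httpGet: { path: /ready, port: 8080 }
      periodSeconds: 5
```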
Interview tip: Always mention that misconfiguring probes (too aggressive liveness, missing readiness) is one of the most common sources of production incidents in Kubernetes.
Q2. What happens when a Pod is deleted in Kubernetes? Walk me through the full sequence.
When you run kubectl delete pod api-xyz, this sequence occurs:
- The API Server marks the pod as Terminating and sets its deletionTimestamp.
- The Endpoint Controller removes the pod from all Service endpoint lists – traffic stops routing to it.
- If a preStop lifecycle hook is defined, it runs first (e.g., a sleep to give the load balancer time to drain).
- Kubernetes sends SIGTERM to PID 1 in the container – the app should begin graceful shutdown.
- Kubernetes waits up to terminationGracePeriodSeconds (default 30s) for the container to exit.
- If still running after the grace period, Kubernetes sends SIGKILL (force kill).
- The kubelet removes the pod from the node and the API Server deletes the pod object.
Key production insight: Steps 2 and 3–4 happen in parallel. The endpoint removal can take 1–2 seconds. Without a preStop sleep of at least 2–5s, you may get connection errors from the load balancer routing to a pod that's already shutting down.
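A minimal preStop sketch covering that window (timings illustrative; the image must contain a sleep binary):

```yaml
spec:
  terminationGracePeriodSeconds: 30
  containers:
    - name: api
      image: api:1.0
      lifecycle:
        preStop:
          exec:
            command: ["sleep", "5"]   # give the LB time to stop routing here
```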
Q3. How does Kubernetes schedule a pod? What factors does the scheduler consider?
The Kubernetes scheduler runs a two-phase process for every unscheduled pod:
Phase 1 – Filtering (Predicates): Eliminates nodes that cannot run the pod. Checks include: does the node have enough CPU and memory (resources.requests)? Does it match nodeSelector labels? Are taints tolerated? Is the node in a Ready condition? Does the pod's affinity/anti-affinity match?
Phase 2 – Scoring (Priorities): Ranks the remaining feasible nodes. Scoring functions include: LeastRequestedPriority (prefer less loaded nodes), InterPodAffinityPriority (co-locate or spread pods), NodeAffinityPriority (soft preferences). The pod is placed on the highest-scoring node.
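A pod spec showing the main inputs the scheduler evaluates (labels, taint keys, and values illustrative):

```yaml
spec:
  nodeSelector:
    disktype: ssd                # filtering: node labels must match
  tolerations:
    - key: dedicated             # filtering: tolerate a matching taint
      operator: Equal
      value: batch
      effect: NoSchedule
  containers:
    - name: api
      image: api:1.0
      resources:
        requests:                # filtering: node must have this much unreserved
          cpu: 500m
          memory: 512Mi
```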
Q4. What is a DaemonSet and when do you use it instead of a Deployment?
A DaemonSet ensures exactly one pod runs on each node (or a subset of nodes matching a selector). When a new node joins the cluster, the DaemonSet automatically schedules a pod on it. When a node is removed, the pod is garbage collected.
Use cases: Log collectors (Fluentd, Filebeat), node-level monitoring agents (node-exporter, Datadog agent), network plugins (Calico, Cilium), storage daemons. Anything that needs to run on every node as a node-level service, not as an application.
vs Deployment: A Deployment scales replicas horizontally – you control how many pods run across the cluster. A DaemonSet is node-scoped – one pod per node, automatically adjusting as nodes come and go. You cannot set replicas on a DaemonSet.
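A minimal DaemonSet sketch for a node-level agent (names and image illustrative):

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-agent
spec:
  selector:
    matchLabels: { app: node-agent }
  template:
    metadata:
      labels: { app: node-agent }
    spec:
      tolerations:                     # also run on control-plane nodes
        - key: node-role.kubernetes.io/control-plane
          effect: NoSchedule
      containers:
        - name: agent
          image: node-agent:1.0        # e.g. a log shipper or metrics agent
```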
Q5. What is RBAC in Kubernetes and how does it work?
RBAC (Role-Based Access Control) is Kubernetes' mechanism for controlling who can do what to which resources. It uses four objects:
- Role – defines permissions (verbs + resources) within a namespace
- ClusterRole – the same, but cluster-wide (or for non-namespaced resources)
- RoleBinding – binds a Role to a user, group, or ServiceAccount within a namespace
- ClusterRoleBinding – binds a ClusterRole cluster-wide
Principle of least privilege: Always grant the minimum permissions needed. Audit existing permissions with kubectl auth can-i --list --as=system:serviceaccount:production:mysa.
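A minimal read-only Role and its binding, reusing the service account from the audit command above (names illustrative):

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: pod-reader
  namespace: production
rules:
  - apiGroups: [""]                  # "" = core API group
    resources: ["pods", "pods/log"]
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: pod-reader-binding
  namespace: production
subjects:
  - kind: ServiceAccount
    name: mysa
    namespace: production
roleRef:
  kind: Role
  name: pod-reader
  apiGroup: rbac.authorization.k8s.io
```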
Q6. How do you debug a pod stuck in CrashLoopBackOff?
CrashLoopBackOff means the container starts, crashes, and Kubernetes keeps retrying with exponential back-off. Follow this diagnostic sequence:
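One reasonable order, cheapest signal first (the pod name api-xyz is a placeholder):

```bash
kubectl describe pod api-xyz          # events: OOMKilled? probe failures? image pull errors?
kubectl logs api-xyz --previous       # logs from the crashed container, not the fresh restart
kubectl get pod api-xyz -o yaml | grep -A5 lastState   # exit code and termination reason
kubectl get events --sort-by=.lastTimestamp            # recent namespace events
kubectl exec -it api-xyz -- sh        # only works if the container stays up long enough
```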
Common root causes: misconfigured environment variables or secrets, insufficient memory (OOMKill), missing config files at expected paths, failed database connection at startup, liveness probe too aggressive for slow-starting apps.
Q7. What is a Horizontal Pod Autoscaler (HPA) and how does it work?
HPA automatically scales the number of pod replicas based on observed metrics. By default it uses CPU and memory (from the Metrics Server), but it can also scale on custom metrics from Prometheus via the Custom Metrics API.
Important: HPA requires resources.requests to be set on containers – without requests, the Metrics Server cannot calculate a utilisation percentage. HPA scales up fast but scales down slowly (default 5-minute stabilisation window) to avoid thrashing.
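A minimal HPA sketch targeting 70% CPU (names and bounds illustrative):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api
  minReplicas: 3
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # % of resources.requests.cpu
```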
🐳 Docker Interview Questions
Q8. What is the difference between an image and a container?
An image is a read-only, layered filesystem snapshot – it is the build artefact, like a class definition in object-oriented programming. An image is immutable once built.
A container is a running instance of an image – like an object instance. Docker adds a thin writable layer on top of the image layers for the container's runtime changes. Multiple containers can run from the same image simultaneously, each with its own writable layer. Stopping a container does not delete its writable layer; removing it does.
Q9. What are Docker networking modes and when do you use each?
- bridge (default) – containers get a private network namespace with a virtual ethernet interface connected to the docker0 bridge. Use for most single-host applications and docker-compose setups.
- host – the container shares the host's network namespace directly. No network isolation, no port mapping needed. Use for high-performance applications where network latency matters (e.g., monitoring agents).
- none – no network. The container has only a loopback interface. Use for batch jobs that don't need any network access.
- overlay – multi-host networking using VXLAN tunnels. Used by Docker Swarm to connect containers across multiple Docker hosts. Kubernetes uses its own CNI plugins instead.
- macvlan – assigns a MAC address to the container, making it appear as a physical network device. Use when containers need to be directly accessible on the LAN.
Q10. What is a multi-stage Docker build and why do you use it?
Multi-stage builds use multiple FROM statements in one Dockerfile. Each stage is a separate image. You can selectively copy artefacts from one stage to another, leaving behind everything else (build tools, compilers, test dependencies, source code).
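A sketch of the pattern for a Maven service (base image tags, paths, and the jar name are illustrative):

```dockerfile
# Stage 1: build – has the JDK, Maven, and all build dependencies
FROM maven:3.9-eclipse-temurin-17 AS builder
WORKDIR /app
COPY pom.xml .
RUN mvn dependency:go-offline        # cache dependencies in their own layer
COPY src ./src
RUN mvn package -DskipTests

# Stage 2: runtime – only the JRE and the built jar
FROM eclipse-temurin:17-jre
WORKDIR /app
COPY --from=builder /app/target/app.jar .   # jar name depends on your pom
ENTRYPOINT ["java", "-jar", "app.jar"]
```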
Result: the Maven builder image is ~700MB. The runtime image is ~80MB. The 88% size reduction means faster pulls, smaller attack surface, and lower registry costs.
Q11. What is .dockerignore and why is it important?
.dockerignore works like .gitignore but for the Docker build context – it tells Docker which files to exclude when the client sends the build context to the daemon. Without it, every docker build sends your entire project directory – including node_modules, .git, test data, and IDE files – to the daemon.
This matters for two reasons: build speed (sending 500MB of node_modules over a socket is slow) and security (source code, secrets in .env files, and credentials may end up in the image layer if copied by a COPY . . instruction).
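A typical starting point (adjust to your project; entries illustrative):

```
.git
node_modules
*.log
.env
dist
coverage
.idea
.vscode
```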
Q12. How do you pass secrets to a Docker container securely?
Never hardcode secrets in the Dockerfile, never COPY a secrets file into the image, and never pass them via ENV instructions in the Dockerfile (they appear in docker history).
Good approaches:
- Environment variables at runtime: docker run -e DB_PASSWORD=$SECRET api:latest. The secret is not baked into the image, but it is visible in the container's /proc/<PID>/environ – acceptable for most use cases.
- Docker secrets (Swarm): mounted as a tmpfs file at /run/secrets/secret_name. Not visible in inspect output.
- Build-time secrets with BuildKit: RUN --mount=type=secret,id=npmrc cat /run/secrets/npmrc > ~/.npmrc. The secret is never stored in any layer.
- In Kubernetes: mount Kubernetes Secrets as environment variables or files. Use External Secrets Operator to sync from AWS Secrets Manager or HashiCorp Vault.
⚙️ CI/CD & GitOps Interview Questions
Q13. What is the difference between Continuous Integration, Continuous Delivery, and Continuous Deployment?
Continuous Integration (CI) – developers merge code frequently (at least daily). Every merge triggers automated tests, linting, and security scans. The goal is to detect integration problems early. CI does not involve deployment.
Continuous Delivery (CD) – every change that passes CI is automatically built into a deployable artefact and deployed to a staging/QA environment. Deployment to production requires a manual approval step. This ensures production is always one click away.
Continuous Deployment – takes Continuous Delivery further: every change that passes the automated pipeline is deployed to production automatically, with no human approval. Requires very high confidence in test coverage and monitoring.
Q14. What is GitOps and how does ArgoCD implement it?
GitOps is an operational model where the desired state of infrastructure and applications is stored declaratively in Git. Changes to infrastructure are made via Git commits and pull requests – not manual kubectl apply commands. Git becomes the single source of truth.
ArgoCD implements GitOps by continuously watching a Git repository and comparing the cluster's actual state with the desired state in Git. When they diverge (new commit in Git, or manual change to cluster), ArgoCD either alerts or automatically syncs the cluster back to Git.
Benefits: audit log is Git history, rollback is git revert, no cloud credentials needed in CI pipelines, self-healing (auto-reverts manual cluster changes).
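A minimal ArgoCD Application sketch (the repo URL and paths are hypothetical):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: api
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example/gitops-config   # hypothetical config repo
    targetRevision: main
    path: apps/api/overlays/production
  destination:
    server: https://kubernetes.default.svc
    namespace: production
  syncPolicy:
    automated:
      prune: true      # delete resources removed from Git
      selfHeal: true   # revert manual cluster changes
```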
Q15. How do you handle secrets in a GitOps pipeline?
The fundamental challenge: if Git is the source of truth, secrets cannot be stored in Git in plaintext. Common solutions:
- Sealed Secrets (Bitnami) – encrypt secrets with a cluster-specific public key. The encrypted SealedSecret CRD is safe to commit to Git. The Sealed Secrets controller decrypts it in-cluster. Simple, no external dependency.
- External Secrets Operator (ESO) – define an ExternalSecret CRD pointing to AWS Secrets Manager, HashiCorp Vault, or GCP Secret Manager. ESO fetches and syncs the real secret into a Kubernetes Secret. The source of truth is the secrets manager, not Git.
- SOPS + age – encrypt secrets files in Git with SOPS. Decrypt at apply time using age keys stored in the cluster. ArgoCD has a SOPS plugin.
- Vault Agent Injector – HashiCorp Vault injects secrets as files into pod containers via a sidecar. Secrets never touch Git or Kubernetes Secrets.
Q16. How do you implement a blue-green deployment?
Blue-green deployment maintains two identical production environments – blue (the current live version) and green (the new version). You deploy to green, verify it, then switch traffic. The traffic switch is instantaneous, and rollback is equally instant: point traffic back at blue.
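The simplest Kubernetes implementation is two Deployments behind one Service, with the cutover as a selector change (labels illustrative):

```yaml
# Service initially points at the blue Deployment
apiVersion: v1
kind: Service
metadata:
  name: api
spec:
  selector:
    app: api
    version: blue     # flip to "green" to cut over
  ports:
    - port: 80
      targetPort: 8080
```

Cutover (and rollback, by swapping the value back) is then a one-line patch:

```bash
kubectl patch service api -p '{"spec":{"selector":{"app":"api","version":"green"}}}'
```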
With ArgoCD: Use Argo Rollouts (a separate controller) which has native blue-green and canary support with automatic traffic shifting and metric-based promotion/rollback.
Q17. What are GitHub Actions reusable workflows and why do you use them?
Reusable workflows let you define a workflow in one file and call it from multiple other workflows across repositories. Instead of copying the same "build Docker image, scan with Trivy, push to ECR" steps into 20 service repositories, you define it once in a shared repo and call it.
Benefits: Single place to update security scanning tool versions, consistent pipeline structure across teams, reduction in workflow duplication. Combine with organisation-level secrets for maximum DRY.
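A sketch of the two halves (the org, repo, and input names are hypothetical):

```yaml
# shared repo: .github/workflows/docker-build.yml
on:
  workflow_call:
    inputs:
      image-name:
        required: true
        type: string
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: docker build -t "${{ inputs.image-name }}" .
      # ...scan with Trivy, push to ECR
```

```yaml
# service repo: .github/workflows/ci.yml
on: push
jobs:
  build:
    uses: my-org/shared-workflows/.github/workflows/docker-build.yml@v1
    with:
      image-name: payments-api
```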
☁️ AWS Interview Questions
Q18. What is a VPC and how do public vs private subnets work?
A VPC (Virtual Private Cloud) is a logically isolated network within AWS. You define the IP range (CIDR block), create subnets, route tables, and gateways.
Public subnet – has a route to an Internet Gateway (IGW). Resources here can have public IP addresses and are directly reachable from the internet. Use for load balancers and NAT Gateways.
Private subnet – no direct internet route. Resources here reach the internet via a NAT Gateway (in a public subnet). Use for EKS nodes, RDS databases, and application servers – they can make outbound requests but are not directly reachable from the internet.
Production best practice: Never put application servers or databases in public subnets. Only load balancers go in public subnets. This is the most important AWS security design pattern.
Q19. Explain IAM roles vs IAM users. When do you use each?
IAM Users – long-term identities with permanent credentials (access key ID + secret). Use only for human users who need console access, and prefer SSO (AWS SSO/Identity Center) even then. Never create IAM users for applications or services.
IAM Roles – identities without long-term credentials. They are assumed by a principal (EC2 instance, Lambda function, EKS pod, another account, GitHub Actions via OIDC) and issue temporary credentials via STS. Use roles for everything non-human: EC2 instance profiles, Lambda execution roles, EKS node roles, IRSA for pod-level permissions.
The golden rule: if it's a machine or service, it must use a role – never a user with an access key.
Q20. What is S3 versioning and when should you enable it?
S3 versioning preserves every version of every object in a bucket. When you overwrite or delete a file, the previous version is retained. You can restore any previous version or recover accidentally deleted objects.
Enable versioning on: Terraform state buckets (critical – protects against state corruption), any bucket used for backups, and buckets containing configuration or deployment artefacts that you may need to roll back.
Cost consideration: Every version counts toward storage costs. Combine versioning with lifecycle policies to automatically transition old versions to cheaper storage classes (Glacier) or delete them after a retention period.
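A Terraform sketch of versioning plus a noncurrent-version lifecycle rule (the bucket name and retention values are hypothetical):

```hcl
resource "aws_s3_bucket_versioning" "state" {
  bucket = "my-terraform-state"          # hypothetical bucket
  versioning_configuration {
    status = "Enabled"
  }
}

resource "aws_s3_bucket_lifecycle_configuration" "state" {
  bucket = "my-terraform-state"
  rule {
    id     = "expire-old-versions"
    status = "Enabled"
    filter {}                            # apply to all objects
    noncurrent_version_transition {
      noncurrent_days = 30
      storage_class   = "GLACIER"        # cheaper storage for old versions
    }
    noncurrent_version_expiration {
      noncurrent_days = 365              # delete after the retention period
    }
  }
}
```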
Q21. What is CloudFormation and how is it different from Terraform?
CloudFormation is AWS's native IaC service. Templates are YAML or JSON. State is managed by AWS automatically. It is deeply integrated with AWS – new AWS services are usually supported at or near launch. Free to use (you pay only for the resources created).
Terraform is cloud-agnostic. It works with AWS, GCP, Azure, and hundreds of other providers in the same codebase. HCL is generally more readable than CloudFormation YAML. You manage state yourself (S3 + DynamoDB). It has a larger open-source ecosystem.
When to use each: Pure AWS shops often start with CloudFormation for the deep native integration. Multi-cloud or cloud-agnostic teams prefer Terraform, and most enterprises now standardise on it. CloudFormation's key advantage is first-party support for new AWS features – Terraform providers sometimes lag.
🏗️ Terraform & IaC Interview Questions
Q22. What is the Terraform state file and why is it critical?
The state file (terraform.tfstate) is Terraform's record of what infrastructure it manages. It maps your HCL resource definitions to real cloud resource IDs (e.g., aws_instance.web → i-0abc1234). Without the state file, Terraform cannot know what exists and would try to create everything from scratch on the next apply.
Critical risks: State files often contain sensitive values in plaintext (database passwords, private keys). Always encrypt state at rest (S3 SSE). Never commit state to Git. Use DynamoDB locking to prevent concurrent modifications. Enable S3 versioning on the state bucket to recover from accidental corruption.
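A typical remote-backend sketch implementing these protections (bucket, key, and table names are hypothetical):

```hcl
terraform {
  backend "s3" {
    bucket         = "my-terraform-state"   # versioned, SSE-encrypted bucket
    key            = "prod/network.tfstate"
    region         = "eu-west-1"
    encrypt        = true                   # encrypt state at rest
    dynamodb_table = "terraform-locks"      # prevent concurrent applies
  }
}
```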
State manipulation commands: terraform state list (see all resources), terraform state rm (remove without destroying), terraform state mv (rename/move resource in state), terraform import (import existing resources).
Q23. When would you use Ansible instead of Terraform?
Terraform and Ansible solve different problems and are often used together.
Use Terraform for: Provisioning infrastructure – creating VPCs, EC2 instances, EKS clusters, S3 buckets, databases. Terraform excels at managing the lifecycle of cloud resources (create, update, destroy). It is declarative – you describe the desired state.
Use Ansible for: Configuration management – installing software on servers, applying OS configurations, managing users and packages, deploying application configs. Ansible is procedural – you describe the steps to take. It is also used for ad-hoc automation tasks and application deployment to existing servers.
Together: Terraform provisions the EC2 instance, Ansible configures it. Terraform creates the EKS cluster, Ansible (or Helm) deploys applications onto it.
Q24. What is a Terraform provider and how does it work?
A provider is a plugin that translates Terraform's resource definitions into API calls for a specific platform. The AWS provider translates aws_instance into EC2 API calls. The Kubernetes provider translates kubernetes_deployment into Kubernetes API calls.
terraform init downloads providers from the Terraform Registry. Always pin provider versions in production – unpinned providers can break on automatic upgrades.
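A minimal pinning sketch (version constraints illustrative):

```hcl
terraform {
  required_version = ">= 1.5.0"
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"   # allow 5.x patch/minor upgrades, block 6.x
    }
  }
}
```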
🐧 Linux & Git Interview Questions
Q25. How do you find a process consuming high CPU and kill it gracefully?
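One reasonable sequence (<PID> is a placeholder):

```bash
top                               # interactive view; press P to sort by CPU
ps aux --sort=-%cpu | head -5     # top CPU consumers, scriptable
kill -15 <PID>                    # SIGTERM: ask the process to shut down cleanly
sleep 10
kill -9 <PID>                     # SIGKILL: last resort if it ignores SIGTERM
```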
Interview insight: Always try SIGTERM (15) first – it allows the application to flush buffers, close connections, and write final logs. SIGKILL (9) bypasses all of this and can cause data corruption or broken connections. This mirrors exactly how Kubernetes handles pod termination.
Q26. What is the difference between git fetch and git pull?
git fetch downloads new commits, branches, and tags from the remote but does NOT update your working branch or local files. It updates remote-tracking branches (origin/main). It is always safe – it never changes your working state.
git pull = git fetch + git merge (or git rebase if configured). It fetches and immediately integrates the remote changes into your current branch. If there are conflicts, you must resolve them.
Best practice: Use git fetch then inspect what changed (git log HEAD..origin/main or git diff HEAD origin/main) before merging. This is especially important on long-running feature branches where the remote may have diverged significantly.
Q27. How do you troubleshoot high disk usage on a Linux server?
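Narrow from filesystem to directory to file, and check for deleted-but-open files last. A typical sequence (paths illustrative):

```bash
df -h                                         # which filesystem is actually full?
df -i                                         # or is it out of inodes?
du -xh --max-depth=1 / | sort -rh | head      # biggest top-level directories
du -xh --max-depth=1 /var | sort -rh | head   # drill down into the culprit
lsof +L1                                      # deleted files still held open by processes
journalctl --vacuum-size=500M                 # common fix: trim the systemd journal
```

Deleted-but-open files are the classic trap: rm frees nothing until the process holding the file descriptor is restarted, which is why df and du can disagree.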
🎯 SRE Concepts Interview Questions
Q28. What is an SLO and how do you define one for a new service?
SLI (Service Level Indicator) – the metric you measure, e.g., "percentage of HTTP requests that return 200 within 200ms".
SLO (Service Level Objective) – your internal reliability target, e.g., "99.9% of requests succeed within 200ms, measured over a 30-day rolling window".
SLA (Service Level Agreement) – the contractual commitment to customers, usually less strict than the SLO (if your SLO is 99.9%, your SLA might be 99.5%). Violating an SLA has financial penalties.
Defining an SLO for a new service: Start by identifying the key user journeys. For each, define an SLI that measures whether the user experience was good or bad. Set the SLO conservatively at first (99% or 99.5%) – you can tighten it once you have production data. Calculate the error budget: 100% − SLO = budget for failure.
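Worked example: a 99.9% SLO over a 30-day window allows 0.1% of 30 × 24 × 60 = 43,200 minutes, i.e. roughly 43 minutes of unavailability per month. Once that budget is spent, the team prioritises reliability work over new features.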
Q29. Walk me through how you handle a production incident.
A production incident has a well-defined lifecycle:
- Detect – an alert fires (PagerDuty/OpsGenie), a Slack notification arrives, or a user reports the problem.
- Assess severity – P1 (total outage, revenue impact), P2 (degraded service, workaround exists), P3 (minor, contained). Severity determines who is engaged.
- Communicate – post in an incident channel immediately: "Investigating high error rate on payment service. Impact: checkout failures. ETA: 15 minutes." Update every 10–15 minutes even if nothing is resolved.
- Mitigate first, investigate second – restore service quickly (rollback, scale up, failover) before root-causing. Mitigation and investigation run in parallel, but mitigation wins.
- Resolve & confirm – verify metrics return to normal, check error rates, confirm with stakeholders.
- Post-mortem – a blameless writeup within 48 hours: timeline, root cause, impact, and 3–5 action items with owners and due dates.
Q30. What is toil and how do SRE teams reduce it?
Toil, as defined by Google's SRE book, is work that is manual, repetitive, automatable, tactical (reacting to problems rather than preventing them), and that scales linearly with service growth. Toil has no lasting value once completed.
Examples: manually restarting a service that crashes weekly, manually rotating credentials every 90 days, manually scaling capacity before peak traffic, running the same deployment runbook for every release.
How SRE teams reduce toil: Automate anything done more than twice. Implement automatic restarts (Kubernetes). Use auto-scaling (HPA, Cluster Autoscaler). Use secrets managers for automatic rotation. Build self-service tooling so developers don't need SRE help for routine tasks. Track the ratio of toil to project work – Google recommends keeping toil below 50% of an SRE team's time.
💡 How to Answer DevOps Interview Questions
1. State the problem/risk clearly – "The main challenge with X is..."
2. Name the mechanism – "We solve this using Y."
3. Give a real example – "In production, we handled this by..."
4. Mention the trade-off – "The downside is Z, which we balance by..."
When you blank on a question, say: "Let me think through this from first principles." This buys 5 seconds and signals structured thinking – far better than silence. Interviewers at senior level are not just testing knowledge; they are testing how you approach problems you haven't seen before.
For scenario questions ("Tell me about a time when..."), always use a real production story. If you don't have one, describe what you would do and be explicit that it's hypothetical. Interviewers respect honesty and practical thinking far more than answers that sound memorised.
DevOps Engineer with years of hands-on experience in Kubernetes, AWS, CI/CD, Terraform, and SRE practices. This platform shares real-world DevOps knowledge for interview preparation β written from production experience, not copied from documentation.