← Home πŸ—ΊοΈ Mind Map β˜• Ko-fi
// Interview Prep Β· All Levels

Top 30 DevOps Interview Questions and Answers (2026)

πŸ“… Updated April 2026 ⏱ 20 min read 🏷 Kubernetes Β· Docker Β· AWS Β· Terraform Β· CI/CD Β· SRE

This guide covers the 30 most common DevOps interview questions asked at product-based companies, tech consultancies, and startups in 2026. Each answer is written from real interview and production experience β€” not paraphrased documentation. Use these to prepare for Senior DevOps Engineer, Platform Engineer, and SRE roles.


☸️ Kubernetes Interview Questions

Kubernetes Β· Orchestration
Q1 What is the difference between a liveness probe and a readiness probe? β€Ί

Liveness probe β€” answers "Is this container still alive?" If it fails, Kubernetes restarts the container. Use it to detect deadlocks or processes that are running but not making progress.

Readiness probe β€” answers "Is this container ready to receive traffic?" If it fails, the pod is removed from the Service's endpoint list β€” no traffic is routed to it, but the container is NOT restarted. Use this for slow-starting apps or apps that need time to warm up caches.

Startup probe (third type) β€” holds off liveness (and readiness) checks until it first succeeds, giving the app up to failureThreshold Γ— periodSeconds to start. Essential for slow-starting apps like JVM services that take 60+ seconds to initialise.
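For a slow-starting JVM service, a startup probe might look like this (values illustrative; liveness and readiness checks only begin once it succeeds):

```yaml
startupProbe:
  httpGet:
    path: /healthz
    port: 8080
  periodSeconds: 10
  failureThreshold: 30   # up to 30 Γ— 10s = 300s allowed for startup
```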

livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 30
  periodSeconds: 10
  failureThreshold: 3    # restart after 3 consecutive failures
readinessProbe:
  httpGet:
    path: /ready
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 5
  failureThreshold: 3    # remove from endpoints after 3 failures

Interview tip: Always mention that misconfiguring probes (too aggressive liveness, missing readiness) is one of the most common sources of production incidents in Kubernetes.

Q2 What happens when a Pod is deleted in Kubernetes? Walk me through the full sequence. β€Ί

When you run kubectl delete pod api-xyz, this sequence occurs:

  1. The API Server marks the pod as Terminating and sets its deletionTimestamp.
  2. The Endpoint Controller removes the pod from all Service endpoint lists β€” traffic stops routing to it.
  3. If a preStop lifecycle hook is defined, it runs first (e.g., a sleep to give the load balancer time to drain).
  4. Kubernetes sends SIGTERM to PID 1 in the container β€” the app should begin graceful shutdown.
  5. Kubernetes waits up to terminationGracePeriodSeconds (default 30s) for the container to exit.
  6. If still running after the grace period, Kubernetes sends SIGKILL (force kill).
  7. The kubelet removes the pod from the node and the API Server deletes the pod object.

Key production insight: Step 2 (endpoint removal) runs in parallel with steps 3–4 (preStop hook and SIGTERM), and the endpoint removal can take 1–2 seconds to propagate. Without a preStop sleep of at least 2–5s, you may get connection errors from the load balancer routing to a pod that's already shutting down.
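A preStop sleep that covers the endpoint-propagation window can be sketched as follows (pod spec fragment, values illustrative):

```yaml
spec:
  terminationGracePeriodSeconds: 30
  containers:
    - name: api
      lifecycle:
        preStop:
          exec:
            command: ["sh", "-c", "sleep 5"]   # let endpoint removal propagate before SIGTERM
```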

Q3 How does Kubernetes schedule a pod? What factors does the scheduler consider? β€Ί

The Kubernetes scheduler runs a two-phase process for every unscheduled pod:

Phase 1 β€” Filtering (Predicates): Eliminates nodes that cannot run the pod. Checks include: does the node have enough CPU and memory (resources.requests)? Does it match nodeSelector labels? Are taints tolerated? Is the node in a Ready condition? Does the pod's affinity/anti-affinity match?

Phase 2 β€” Scoring (Priorities): Ranks remaining feasible nodes. Scoring functions include: LeastRequestedPriority (prefer less loaded nodes), InterPodAffinityPriority (co-locate or spread pods), NodeAffinityPriority (soft preferences). The pod is placed on the highest-scoring node.
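The two phases can be sketched in miniature. This is a toy model for intuition only β€” it filters on a single CPU predicate and scores with a least-requested heuristic, nothing like the real scheduler's full plugin set:

```python
def schedule(pod, nodes):
    """Toy two-phase scheduler: filter infeasible nodes, then score the rest."""
    # Phase 1 β€” filtering: drop nodes that cannot fit the pod's CPU request
    feasible = [n for n in nodes if n["free_cpu"] >= pod["cpu_request"]]
    if not feasible:
        return None  # no feasible node: the pod stays Pending
    # Phase 2 β€” scoring: least-requested heuristic prefers the emptiest node
    return max(feasible, key=lambda n: n["free_cpu"])["name"]

nodes = [{"name": "node-a", "free_cpu": 2.0}, {"name": "node-b", "free_cpu": 0.2}]
print(schedule({"cpu_request": 0.5}, nodes))  # node-a
```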

# Spread pods across availability zones
topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        app: api
Q4 What is a DaemonSet and when do you use it instead of a Deployment? β€Ί

A DaemonSet ensures exactly one pod runs on each node (or a subset of nodes matching a selector). When a new node joins the cluster, the DaemonSet automatically schedules a pod on it. When a node is removed, the pod is garbage collected.

Use cases: Log collectors (Fluentd, Filebeat), node-level monitoring agents (node-exporter, Datadog agent), network plugins (Calico, Cilium), storage daemons. Anything that needs to run on every node as a node-level service, not as an application.

vs Deployment: A Deployment scales replicas horizontally β€” you control how many pods run across the cluster. A DaemonSet is node-scoped β€” one pod per node, automatically adjusting as nodes come and go. You cannot set replicas on a DaemonSet.
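A minimal DaemonSet for a node-level agent might look like this (note there is no replicas field):

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-exporter
spec:
  selector:
    matchLabels:
      app: node-exporter
  template:
    metadata:
      labels:
        app: node-exporter
    spec:
      containers:
        - name: node-exporter
          image: prom/node-exporter:v1.7.0
```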

Q5 What is RBAC in Kubernetes and how does it work? β€Ί

RBAC (Role-Based Access Control) is Kubernetes' mechanism for controlling who can do what to which resources. It uses four objects:

  • Role β€” defines permissions (verbs + resources) within a namespace
  • ClusterRole β€” same, but cluster-wide (or for non-namespaced resources)
  • RoleBinding β€” binds a Role to a user, group, or ServiceAccount within a namespace
  • ClusterRoleBinding β€” binds a ClusterRole cluster-wide
# Grant a ServiceAccount read access to pods in one namespace
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: pod-reader
  namespace: production
rules:
  - apiGroups: [""]
    resources: ["pods", "pods/log"]
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: read-pods-binding
  namespace: production
subjects:
  - kind: ServiceAccount
    name: monitoring-agent
    namespace: production
roleRef:
  kind: Role
  name: pod-reader
  apiGroup: rbac.authorization.k8s.io

Principle of least privilege: Always grant the minimum permissions needed. Audit existing permissions with kubectl auth can-i --list --as=system:serviceaccount:production:mysa.

Q6 How do you debug a pod stuck in CrashLoopBackOff? β€Ί

CrashLoopBackOff means the container starts, crashes, and Kubernetes keeps retrying with exponential back-off. Follow this diagnostic sequence:

# Step 1: describe the pod β€” check Events section for clues
kubectl describe pod api-xyz-abc -n production

# Step 2: check current and previous container logs
kubectl logs api-xyz-abc -n production              # current attempt
kubectl logs api-xyz-abc -n production --previous   # last crashed container

# Step 3: check exit code in describe output
#   Exit 1   = application error (check logs)
#   Exit 137 = OOMKilled (increase memory limit)
#   Exit 139 = Segfault (application bug)
#   Exit 143 = SIGTERM not handled (check shutdown code)

# Step 4: override entrypoint to keep container alive for inspection
kubectl run debug --image=api:latest --command -- sleep 3600
kubectl exec -it debug -- /bin/sh

Common root causes: misconfigured environment variables or secrets, insufficient memory (OOMKill), missing config files at expected paths, failed database connection at startup, liveness probe too aggressive for slow-starting apps.
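The "BackOff" in the name is the retry delay between restart attempts. Assuming the documented behaviour (delay starting at 10s, doubling after each crash, capped at five minutes), it grows like this:

```python
def crashloop_delays(restarts, base=10, cap=300):
    """Approximate CrashLoopBackOff restart delays in seconds."""
    return [min(base * 2 ** i, cap) for i in range(restarts)]

print(crashloop_delays(7))  # [10, 20, 40, 80, 160, 300, 300]
```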

Q7 What is a Horizontal Pod Autoscaler (HPA) and how does it work? β€Ί

HPA automatically scales the number of pod replicas based on observed metrics. By default it uses CPU and memory (from the Metrics Server), but it can also scale on custom metrics from Prometheus via the Custom Metrics API.

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api
  minReplicas: 2
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 60   # scale up when avg CPU > 60%
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 70

Important: HPA requires resources.requests to be set on containers β€” without requests, the Metrics Server cannot calculate utilisation percentage. HPA scales up fast but scales down slowly (default 5-minute stabilisation window) to avoid thrashing.
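The scale-down window is tunable via the behavior field in autoscaling/v2 (values illustrative):

```yaml
behavior:
  scaleDown:
    stabilizationWindowSeconds: 300   # default 5 min; raise for even more cautious scale-down
  scaleUp:
    stabilizationWindowSeconds: 0     # scale up immediately
```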


🐳 Docker Interview Questions

Docker Β· Containers
Q8 What is the difference between an image and a container? β€Ί

An image is a read-only, layered filesystem snapshot β€” it is the build artefact, like a class definition in object-oriented programming. An image is immutable once built.

A container is a running instance of an image β€” like an object instance. Docker adds a thin writable layer on top of the image layers for the container's runtime changes. Multiple containers can run from the same image simultaneously, each with their own writable layer. Stopping a container does not delete its writable layer; removing it does.

Q9 What are Docker networking modes and when do you use each? β€Ί
  • bridge (default) β€” containers get a private network namespace with a virtual ethernet interface connected to docker0 bridge. Use for most single-host applications and docker-compose setups.
  • host β€” container shares the host's network namespace directly. No network isolation, no port mapping needed. Use for high-performance network applications where network latency matters (e.g., monitoring agents).
  • none β€” no network. Container has only a loopback interface. Use for batch jobs that don't need any network access.
  • overlay β€” multi-host network using VXLAN tunnels. Used by Docker Swarm to connect containers across multiple Docker hosts. Kubernetes uses its own CNI plugins instead.
  • macvlan β€” assigns a MAC address to the container, making it appear as a physical network device. Use when containers need to be directly accessible on the LAN.
Q10 What is a multi-stage Docker build and why do you use it? β€Ί

Multi-stage builds use multiple FROM statements in one Dockerfile. Each stage is a separate image. You can selectively copy artefacts from one stage to another, leaving behind everything else (build tools, compilers, test dependencies, source code).

# Stage 1: build β€” has Maven, JDK, source code
FROM maven:3.9-eclipse-temurin-17 AS builder
WORKDIR /build
COPY pom.xml .
RUN mvn dependency:go-offline   # cache deps separately
COPY src ./src
RUN mvn package -DskipTests

# Stage 2: runtime β€” tiny JRE, only the JAR
FROM eclipse-temurin:17-jre-alpine
WORKDIR /app
RUN adduser -D -u 1001 appuser   # Alpine uses adduser, not useradd
COPY --from=builder /build/target/api.jar app.jar
USER appuser
ENTRYPOINT ["java", "-jar", "app.jar"]

Result: the Maven builder image is ~700MB. The runtime image is ~80MB. The 88% size reduction means faster pulls, smaller attack surface, and lower registry costs.

Q11 What is .dockerignore and why is it important? β€Ί

.dockerignore works like .gitignore but for the Docker build context β€” it lists files the client excludes before sending the context to the Docker daemon. Without it, every docker build sends your entire project directory, including node_modules, .git, test data, and IDE files.

This matters for two reasons: build speed (sending 500MB of node_modules over a socket is slow) and security (source code, secrets in .env files, and credentials may end up in the image layer if copied by a COPY . . instruction).

.git
node_modules
*.log
.env
.env.*
dist
target
__pycache__
*.pyc
.DS_Store
*.test.js
coverage/
Q12 How do you pass secrets to a Docker container securely? β€Ί

Never hardcode secrets in a Dockerfile, COPY a secrets file into the image, or set them via ENV instructions in the Dockerfile β€” all of these end up visible in docker history and the image layers.

Good approaches:

  • Environment variables at runtime: docker run -e DB_PASSWORD=$SECRET api:latest. The secret is not baked into the image, but it is visible in the container's /proc/PID/environ β€” acceptable for most use cases.
  • Docker secrets (Swarm): Mounted as a tmpfs file at /run/secrets/secret_name. Not visible in inspect output.
  • Build-time secrets with BuildKit: RUN --mount=type=secret,id=npmrc cat /run/secrets/npmrc > ~/.npmrc. The secret is never stored in any layer.
  • In Kubernetes: Mount Kubernetes Secrets as environment variables or files. Use External Secrets Operator to sync from AWS Secrets Manager or HashiCorp Vault.

βš™οΈ CI/CD & GitOps Interview Questions

CI/CD Β· GitHub Actions Β· ArgoCD
Q13 What is the difference between Continuous Integration, Continuous Delivery, and Continuous Deployment? β€Ί

Continuous Integration (CI) β€” developers merge code frequently (at least daily). Every merge triggers automated tests, linting, and security scans. The goal is to detect integration problems early. CI does not involve deployment.

Continuous Delivery (CD) β€” every change that passes CI is automatically built into a deployable artefact and deployed to a staging/QA environment. Deployment to production requires a manual approval step. This ensures production is always one click away.

Continuous Deployment β€” takes Continuous Delivery one step further: every change that passes the automated pipeline is deployed to production automatically, with no human approval. Requires very high confidence in test coverage and monitoring.

Interview distinction: CI is about code integration and testing. CD (Delivery) keeps production deployable but requires human approval. CD (Deployment) removes the manual gate entirely. Most companies practice CI + Continuous Delivery, not full Continuous Deployment.
Q14 What is GitOps and how does ArgoCD implement it? β€Ί

GitOps is an operational model where the desired state of infrastructure and applications is stored declaratively in Git. Changes to infrastructure are made via Git commits and pull requests β€” not manual kubectl apply commands. Git becomes the single source of truth.

ArgoCD implements GitOps by continuously watching a Git repository and comparing the cluster's actual state with the desired state in Git. When they diverge (new commit in Git, or manual change to cluster), ArgoCD either alerts or automatically syncs the cluster back to Git.

Benefits: audit log is Git history, rollback is git revert, no cloud credentials needed in CI pipelines, self-healing (auto-reverts manual cluster changes).
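A typical ArgoCD Application enabling this self-healing loop might look like the following (repo URL and paths are hypothetical):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: api
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/myorg/k8s-manifests   # hypothetical manifests repo
    targetRevision: main
    path: apps/api
  destination:
    server: https://kubernetes.default.svc
    namespace: production
  syncPolicy:
    automated:
      prune: true
      selfHeal: true   # auto-revert manual cluster changes
```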

Q15 How do you handle secrets in a GitOps pipeline? β€Ί

The fundamental challenge: if Git is the source of truth, secrets cannot be stored in Git in plaintext. Common solutions:

  • Sealed Secrets (Bitnami) β€” encrypt secrets with a cluster-specific public key. The encrypted SealedSecret CRD is safe to commit to Git. The Sealed Secrets controller decrypts it in-cluster. Simple, no external dependency.
  • External Secrets Operator (ESO) β€” define an ExternalSecret CRD pointing to AWS Secrets Manager, HashiCorp Vault, or GCP Secret Manager. ESO fetches and syncs the real secret into a Kubernetes Secret. Source of truth is the secrets manager, not Git.
  • SOPS + age β€” encrypt secrets files in Git with SOPS. Decrypt at apply time using age keys stored in the cluster. ArgoCD has a SOPS plugin.
  • Vault Agent Injector β€” HashiCorp Vault injects secrets as files into pod containers via a sidecar. Secrets never touch Git or Kubernetes Secrets.
Q16 How do you implement a blue-green deployment? β€Ί

Blue-green deployment maintains two identical production environments β€” blue (current live) and green (new version). Traffic switch is instantaneous, and rollback is equally instant.

# Blue deployment (currently live)
kubectl apply -f deployment-blue.yaml    # version 1.0, label: slot=blue

# Deploy green (new version) without touching live traffic
kubectl apply -f deployment-green.yaml   # version 1.1, label: slot=green

# Test green directly via its own service
kubectl port-forward svc/api-green 8080:80

# Switch all traffic to green by updating the main service selector
kubectl patch service api -p '{"spec":{"selector":{"slot":"green"}}}'

# Rollback instantly if needed
kubectl patch service api -p '{"spec":{"selector":{"slot":"blue"}}}'

With ArgoCD: Use Argo Rollouts (a separate controller) which has native blue-green and canary support with automatic traffic shifting and metric-based promotion/rollback.

Q17 What are GitHub Actions reusable workflows and why do you use them? β€Ί

Reusable workflows let you define a workflow in one file and call it from multiple other workflows across repositories. Instead of copying the same "build Docker image, scan with Trivy, push to ECR" steps into 20 service repositories, you define it once in a shared repo and call it.

# caller workflow
jobs:
  build:
    uses: myorg/shared-workflows/.github/workflows/docker-build.yml@main
    with:
      image-name: payment-service
      dockerfile: ./Dockerfile
    secrets:
      ECR_REGISTRY: ${{ secrets.ECR_REGISTRY }}
      AWS_ROLE_ARN: ${{ secrets.AWS_ROLE_ARN }}

Benefits: Single place to update security scanning tool versions, consistent pipeline structure across teams, reduction in workflow duplication. Combine with organisation-level secrets for maximum DRY.
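The other half of the pattern is the callee, which declares its contract with an on: workflow_call trigger. A sketch matching the caller (inputs and secrets are illustrative):

```yaml
# callee side: docker-build.yml in myorg/shared-workflows
on:
  workflow_call:
    inputs:
      image-name:
        required: true
        type: string
      dockerfile:
        required: false
        type: string
        default: ./Dockerfile
    secrets:
      ECR_REGISTRY:
        required: true
      AWS_ROLE_ARN:
        required: true

jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: docker build -f ${{ inputs.dockerfile }} -t ${{ secrets.ECR_REGISTRY }}/${{ inputs.image-name }} .
```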


☁️ AWS Interview Questions

AWS Β· Cloud Β· EKS
Q18 What is VPC and how do public vs private subnets work? β€Ί

A VPC (Virtual Private Cloud) is a logically isolated network within AWS. You define the IP range (CIDR block), create subnets, route tables, and gateways.

Public subnet β€” has a route to an Internet Gateway (IGW). Resources here can have public IP addresses and are directly reachable from the internet. Use for load balancers and NAT Gateways.

Private subnet β€” no direct internet route. Resources here reach the internet via a NAT Gateway (in a public subnet). Use for EKS nodes, RDS databases, application servers β€” they can make outbound requests but are not directly reachable from the internet.

Production best practice: Never put application servers or databases in public subnets. Only load balancers go in public subnets. This is the most important AWS security design pattern.

Q19 Explain IAM roles vs IAM users. When do you use each? β€Ί

IAM Users β€” long-term identities with permanent credentials (access key ID + secret). Use only for human users who need console access, and prefer SSO (AWS SSO/Identity Center) even then. Never create IAM users for applications or services.

IAM Roles β€” identities without long-term credentials. They are assumed by a principal (EC2 instance, Lambda function, EKS pod, another account, GitHub Actions via OIDC) and issue temporary credentials via STS. Use roles for everything non-human: EC2 instance profiles, Lambda execution roles, EKS node roles, IRSA for pod-level permissions.

The golden rule: If it's a machine or service, it must use a role β€” never a user with an access key.
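What makes a role assumable is its trust policy. For example, the trust policy that lets EC2 instances assume a role looks like this:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": { "Service": "ec2.amazonaws.com" },
      "Action": "sts:AssumeRole"
    }
  ]
}
```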

Q20 What is S3 versioning and when should you enable it? β€Ί

S3 versioning preserves every version of every object in a bucket. When you overwrite or delete a file, the previous version is retained. You can restore any previous version or recover accidentally deleted objects.

Enable versioning on: Terraform state buckets (critical β€” protects against state corruption), any bucket used for backups, buckets containing configuration or deployment artefacts that you may need to roll back.

Cost consideration: Every version counts toward storage costs. Combine versioning with lifecycle policies to automatically transition old versions to cheaper storage classes (Glacier) or delete them after a retention period.

# Enable versioning via AWS CLI
aws s3api put-bucket-versioning \
  --bucket mycompany-terraform-state \
  --versioning-configuration Status=Enabled
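A lifecycle configuration implementing the cost control described above might look like this (retention periods are illustrative):

```json
{
  "Rules": [
    {
      "ID": "expire-old-versions",
      "Status": "Enabled",
      "Filter": {},
      "NoncurrentVersionTransitions": [
        { "NoncurrentDays": 30, "StorageClass": "GLACIER" }
      ],
      "NoncurrentVersionExpiration": { "NoncurrentDays": 365 }
    }
  ]
}
```

Applied with aws s3api put-bucket-lifecycle-configuration --bucket NAME --lifecycle-configuration file://lifecycle.json.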
Q21 What is CloudFormation and how is it different from Terraform? β€Ί

CloudFormation is AWS's native IaC service. Templates are YAML or JSON. State is managed by AWS automatically. Deeply integrated with AWS β€” supports all AWS resources on day 0. Free to use (pay only for resources created).

Terraform is cloud-agnostic. Works with AWS, GCP, Azure, and hundreds of providers in the same codebase. HCL is generally more readable than CloudFormation YAML. State is managed yourself (S3 + DynamoDB). Has a larger open-source ecosystem.

When to use each: Pure AWS shops often start with CloudFormation for deep native integration. Multi-cloud or cloud-agnostic teams prefer Terraform. Most enterprises now standardise on Terraform. CloudFormation's key advantage is that AWS ensures it supports every new service feature immediately β€” Terraform providers sometimes lag.

πŸ—‚οΈ Terraform & IaC Interview Questions

Terraform Β· IaC Β· Ansible
Q22 What is the Terraform state file and why is it critical? β€Ί

The state file (terraform.tfstate) is Terraform's record of what infrastructure it manages. It maps your HCL resource definitions to real cloud resource IDs (e.g., aws_instance.web β†’ i-0abc1234). Without the state file, Terraform cannot know what exists and would try to create everything from scratch on the next apply.

Critical risks: State files often contain sensitive values in plaintext (database passwords, private keys). Always encrypt state at rest (S3 SSE). Never commit state to Git. Use DynamoDB locking to prevent concurrent modifications. Enable S3 versioning on the state bucket to recover from accidental corruption.

State manipulation commands: terraform state list (see all resources), terraform state rm (remove without destroying), terraform state mv (rename/move resource in state), terraform import (import existing resources).
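A backend block implementing these practices might look like this (bucket and table names are hypothetical):

```hcl
terraform {
  backend "s3" {
    bucket         = "mycompany-terraform-state"      # versioned, SSE-encrypted bucket
    key            = "prod/network/terraform.tfstate"
    region         = "us-east-1"
    encrypt        = true
    dynamodb_table = "terraform-locks"                # prevents concurrent applies
  }
}
```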

Q23 When would you use Ansible instead of Terraform? β€Ί

Terraform and Ansible solve different problems and are often used together.

Use Terraform for: Provisioning infrastructure β€” creating VPCs, EC2 instances, EKS clusters, S3 buckets, databases. Terraform excels at managing the lifecycle of cloud resources (create, update, destroy). It is declarative β€” you describe desired state.

Use Ansible for: Configuration management β€” installing software on servers, applying OS configurations, managing users and packages, deploying application configs. Ansible is procedural β€” you describe the steps to take. It is also used for ad-hoc automation tasks and application deployment to existing servers.

Together: Terraform provisions the EC2 instance, Ansible configures it. Terraform creates the EKS cluster, Ansible (or Helm) deploys applications onto it.
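A minimal playbook configuring an instance that Terraform provisioned might look like this (host group and package are illustrative):

```yaml
# site.yml β€” configure web servers after Terraform creates them
- hosts: web
  become: true
  tasks:
    - name: Install nginx
      ansible.builtin.package:
        name: nginx
        state: present
    - name: Ensure nginx is running
      ansible.builtin.service:
        name: nginx
        state: started
        enabled: true
```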

Q24 What is a Terraform provider and how does it work? β€Ί

A provider is a plugin that translates Terraform's resource definitions into API calls for a specific platform. The AWS provider translates aws_instance into EC2 API calls. The Kubernetes provider translates kubernetes_deployment into Kubernetes API calls.

terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"   # allow 5.x, not 6.x
    }
    kubernetes = {
      source  = "hashicorp/kubernetes"
      version = ">= 2.20"
    }
  }
}

provider "aws" {
  region = "us-east-1"
  default_tags {
    tags = { ManagedBy = "Terraform", Environment = "production" }
  }
}

terraform init downloads providers from the Terraform Registry. Always pin provider versions in production β€” unpinned providers can break on automatic upgrades.

🐧 Linux & Git Interview Questions

Linux Β· Git Β· Shell
Q25 How do you find a process consuming high CPU and kill it gracefully? β€Ί
# Find top CPU consumers
top                               # live, press P to sort by CPU
ps aux --sort=-%cpu | head -10    # snapshot, sorted by CPU descending

# Get PID of a specific process
pgrep -f "java.*api-service"
pidof nginx

# Graceful shutdown (SIGTERM β€” app should clean up and exit)
kill -15 PID
kill -SIGTERM PID

# Verify it stopped; force kill only if SIGTERM doesn't work after ~30s
ps -p PID
kill -9 PID    # SIGKILL β€” no cleanup, immediate termination (last resort)

Interview insight: Always try SIGTERM (15) first β€” it allows the application to flush buffers, close connections, and write final logs. SIGKILL (9) bypasses all of this and can cause data corruption or broken connections. This mirrors exactly how Kubernetes handles pod termination.
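A minimal sketch of handling SIGTERM gracefully in an application (here the process signals itself to simulate kill -15):

```python
import os
import signal

stopping = False

def handle_sigterm(signum, frame):
    """Begin graceful shutdown: flush buffers, close connections, write final logs."""
    global stopping
    stopping = True

signal.signal(signal.SIGTERM, handle_sigterm)
os.kill(os.getpid(), signal.SIGTERM)  # simulate `kill -15 PID`
print(stopping)  # True β€” the handler ran instead of the default hard exit
```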

Q26 What is the difference between git fetch and git pull? β€Ί

git fetch downloads new commits, branches, and tags from the remote but does NOT update your working branch or local files. It updates remote-tracking branches (origin/main). It is always safe β€” it never changes your working state.

git pull = git fetch + git merge (or git rebase if configured). It fetches and immediately integrates the remote changes into your current branch. If there are conflicts, you must resolve them.

Best practice: Use git fetch then inspect what changed (git log HEAD..origin/main or git diff HEAD origin/main) before merging. This is especially important on long-running feature branches where the remote may have diverged significantly.

Q27 How do you troubleshoot high disk usage on a Linux server? β€Ί
# Step 1: check which filesystem is full
df -hT

# Step 2: find the largest directories
du -sh /* 2>/dev/null | sort -rh | head -10
du -sh /var/* 2>/dev/null | sort -rh | head -10

# Step 3: find the largest files (size is column 5 of ls -lh)
find / -type f -size +500M -exec ls -lh {} + 2>/dev/null | sort -k5 -rh | head -10

# Step 4: check for deleted files still held open (common cause)
lsof +L1 | head -10    # open files with link count 0 (deleted but held open)
# Fix: restart the process holding the deleted file open

# Step 5: check Docker (often the culprit)
docker system df
docker system prune -a    # remove unused images/containers (add --volumes for volumes)

# Step 6: check journal logs
journalctl --disk-usage
journalctl --vacuum-size=500M    # trim logs to 500MB

🎯 SRE Concepts Interview Questions

SRE Β· Observability Β· Incidents
Q28 What is an SLO and how do you define one for a new service? β€Ί

SLI (Service Level Indicator) β€” the metric you measure. e.g., "percentage of HTTP requests that return 200 within 200ms".

SLO (Service Level Objective) β€” your internal reliability target. e.g., "99.9% of requests succeed within 200ms, measured over a 30-day rolling window".

SLA (Service Level Agreement) β€” contractual commitment to customers, usually less strict than the SLO (if your SLO is 99.9%, your SLA might be 99.5%). Violating an SLA has financial penalties.

Defining an SLO for a new service: Start by identifying the key user journeys. For each, define an SLI that measures whether the user experience was good or bad. Set the SLO conservatively at first (99% or 99.5%) β€” you can tighten it once you have production data. Calculate the error budget: 100% - SLO = budget for failure.
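The error-budget arithmetic is worth being able to do on the spot. A quick sketch:

```python
def error_budget_minutes(slo_pct, window_days=30):
    """Allowed downtime in minutes for a given SLO over a rolling window."""
    return (1 - slo_pct / 100) * window_days * 24 * 60

# 99.9% over 30 days leaves ~43 minutes of error budget
print(round(error_budget_minutes(99.9), 1))  # 43.2
# 99.5% is far more forgiving: ~216 minutes
print(round(error_budget_minutes(99.5), 1))  # 216.0
```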

Q29 Walk me through how you handle a production incident. β€Ί

A production incident has a well-defined lifecycle:

  1. Detect β€” alert fires (PagerDuty/OpsGenie), Slack notification, or user report.
  2. Assess severity β€” P1 (total outage, revenue impact), P2 (degraded service, workaround exists), P3 (minor, contained). Severity determines who is engaged.
  3. Communicate β€” post an incident channel message immediately: "Investigating high error rate on payment service. Impact: checkout failures. ETA: 15 minutes." Update every 10–15 minutes even if nothing is resolved.
  4. Mitigate first, investigate second β€” restore service quickly (rollback, scale up, failover) before root-causing. Mitigation and investigation are parallel, but mitigation wins.
  5. Resolve & confirm β€” verify metrics return to normal, check error rates, confirm with stakeholders.
  6. Post-mortem β€” blameless writeup within 48 hours: timeline, root cause, impact, 3–5 action items with owners and due dates.
Q30 What is toil and how do SRE teams reduce it? β€Ί

Toil β€” as defined by Google's SRE book β€” is work that is manual, repetitive, automatable, tactical (reacting to problems rather than preventing them), and that scales linearly with service growth. Toil has no lasting value once completed.

Examples: manually restarting a service that crashes weekly, manually rotating credentials every 90 days, manually scaling capacity before peak traffic, running the same deployment runbook for every release.

How SRE teams reduce toil: Automate anything done more than twice. Implement automatic restarts (Kubernetes). Use auto-scaling (HPA, Cluster Autoscaler). Use secrets managers for automatic rotation. Build self-service tooling so developers don't need SRE help for routine tasks. Track the ratio of toil to project work β€” Google recommends keeping toil below 50% of an SRE team's time.


πŸ’‘ How to Answer DevOps Interview Questions

The 4-step framework for any technical answer:
1. State the problem/risk clearly β€” "The main challenge with X is..."
2. Name the mechanism β€” "We solve this using Y"
3. Give a real example β€” "In production, we handled this by..."
4. Mention the trade-off β€” "The downside is Z, which we balance by..."

When you blank on a question, say: "Let me think through this from first principles." This buys 5 seconds and signals structured thinking β€” far better than silence. Interviewers at senior level are not just testing knowledge; they are testing how you approach problems you haven't seen before.

For scenario questions ("Tell me about a time when..."), always use a real production story. If you don't have one, describe what you would do and be explicit that it's hypothetical. Interviewers respect honesty and practical thinking far more than answers that sound memorised.

About the Author
DevOps & SRE Engineer Β· Last updated: April 2026

DevOps Engineer with years of hands-on experience in Kubernetes, AWS, CI/CD, Terraform, and SRE practices. This platform shares real-world DevOps knowledge for interview preparation β€” written from production experience, not copied from documentation.
