// DevOps Guide · Beginner to Intermediate

What is DevOps? Complete Guide 2026: Culture, Tools, SDLC & Real Interview Q&A

📅 Updated May 2026 ⏰ 22 min read 🏷 DevOps · SRE · Career · Tools · Culture
Dhanush R — Senior DevOps Engineer, Bengaluru
4.5+ years of production DevOps experience across AWS EKS, Kubernetes, Terraform, and CI/CD. I built this platform to give every engineer the real knowledge needed to pass DevOps interviews and thrive in production roles.
// Table of Contents
  1. What is DevOps? The Real Definition
  2. DevOps vs SRE — What's the Difference?
  3. The DevOps Software Development Lifecycle
  4. Core DevOps Principles — CALMS Framework
  5. The DevOps Toolchain — Every Category Explained
  6. CI/CD — The Engine of DevOps
  7. Containers and Kubernetes — Why They Dominate
  8. Infrastructure as Code — The DevOps Standard
  9. Monitoring, Observability, and the Three Pillars
  10. DORA Metrics — How to Measure DevOps Performance
  11. DevOps Career Path and Skills Roadmap 2026
  12. 12 DevOps Interview Questions with Expert Answers

DevOps is not a job title, a tool, or a department. It is a culture and set of practices that unify software development (Dev) and IT operations (Ops) to deliver software faster, more reliably, and with better quality. I have worked as a Senior DevOps Engineer for 4.5 years, building and operating infrastructure that handles real production traffic across AWS EKS, Kubernetes, Terraform, and multiple CI/CD systems. This guide gives you the real understanding of DevOps — the concepts, tools, practices, and interview answers you actually need in 2026.

What is DevOps? The Real Definition

Before DevOps, software teams were organised into strict silos: developers wrote code and threw it "over the wall" to operations teams who deployed and maintained it. Developers were measured on feature velocity — how much code they shipped. Operations were measured on stability — keeping systems up. These incentives were fundamentally opposed. Developers wanted to change things fast. Operations wanted nothing to change. This friction caused slow releases (months between deployments), painful deployments (manual, error-prone, late-night affairs), blame culture (developers blamed ops for deployment failures, ops blamed devs for bad code), and poor reliability (no shared ownership of production).

DevOps emerged as a response to this dysfunction. It draws from lean manufacturing, the agile movement, and systems thinking. The core idea: development and operations teams should share tools, processes, and responsibility. Developers own their code all the way to production. Operations engineers automate rather than manually operate. Both teams care about deployment frequency, reliability, and customer experience.

The most accurate definition: DevOps is a set of cultural practices, organisational patterns, and technical capabilities that enable organisations to deliver high-quality software rapidly and reliably by breaking down silos between development and operations teams. It is measured by the four DORA metrics: deployment frequency, lead time for changes, change failure rate, and mean time to recovery.

DevOps vs SRE — What's the Difference?

DevOps is the cultural philosophy and the broad set of practices. It describes how teams work together — shared ownership, automation, fast feedback, continuous improvement. A "DevOps engineer" is a practitioner who builds the platforms, pipelines, and automation that enable development teams to deliver software reliably.

SRE (Site Reliability Engineering) was founded at Google in 2003 by Ben Treynor Sloss. He was asked to run Google's production operations, and his approach was to staff the team with software engineers; he has described SRE as "what happens when you ask a software engineer to design an operations team." SRE is Google's concrete implementation of DevOps principles. SRE defines specific mechanisms: error budgets to quantify acceptable unreliability, SLOs (Service Level Objectives) to set reliability targets, toil reduction (automating away manual operational work), blameless post-mortems, and on-call rotations with clear escalation paths.

In practice: DevOps is the philosophy; SRE is one opinionated implementation of that philosophy. Both roles exist in industry. Large companies (Google, Amazon, Meta) have formal SRE teams. Startups and mid-size companies typically have "DevOps engineers" who do similar work with less formal structure. The skills overlap significantly: Kubernetes, cloud infrastructure, CI/CD, monitoring, and incident response are core to both.

The DevOps Software Development Lifecycle

The DevOps lifecycle is often visualised as an infinity loop representing the continuous nature of the process. Unlike waterfall (linear: requirements → design → build → test → deploy → maintain) or even agile (iterative sprints), DevOps is genuinely continuous — code flows from commit to production in hours or minutes, and feedback from production flows back to developers continuously.

Core DevOps Principles — CALMS Framework

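The CALMS framework provides a structured way to assess and build DevOps culture in an organisation. It began as CAMS, coined by Damon Edwards and John Willis; Jez Humble (co-author of The DevOps Handbook) later added the L. The letters stand for Culture, Automation, Lean, Measurement, and Sharing.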

The DevOps Toolchain — Every Category Explained

The DevOps toolchain covers every phase of the software lifecycle. Here are the primary categories and the most important tools in each:

Container Orchestration
Kubernetes (dominant standard), Docker Swarm (simpler, smaller scale). Manages containers at scale across clusters of machines.
Containerisation
Docker (universal standard for building and running containers), Podman (daemonless alternative). Packages applications with all dependencies.
CI/CD
GitHub Actions (most popular), GitLab CI, Jenkins, CircleCI, ArgoCD (GitOps), Flux. Automates build, test, and deployment pipelines.
Infrastructure as Code
Terraform (multi-cloud standard), AWS CloudFormation, Pulumi, Ansible (configuration management). Defines infrastructure in version-controlled code.
Cloud Platforms
AWS (dominant), GCP, Azure. Provide compute, storage, networking, managed databases, and Kubernetes as managed services.
Monitoring & Observability
Prometheus + Grafana (open-source standard), Datadog, New Relic, CloudWatch. Metrics, logs, traces for full observability.
Source Control
Git (universal), GitHub, GitLab, Bitbucket. Version control, code review (PR/MR), branch protection, and GitOps triggers.
Secret Management
HashiCorp Vault, AWS Secrets Manager, External Secrets Operator. Securely store and distribute credentials, API keys, and certificates.
Artifact Registry
AWS ECR, Docker Hub, GitHub Container Registry, JFrog Artifactory. Stores Docker images and build artifacts securely.
Package Management
Helm (Kubernetes package manager), npm, pip, Maven. Manages application and infrastructure dependencies.

CI/CD — The Engine of DevOps

Continuous Integration and Continuous Delivery/Deployment is the technical foundation that enables DevOps practices. Without CI/CD, DevOps is just a philosophy. With CI/CD, every code commit automatically moves through build, test, security scan, and deployment stages — giving teams the ability to deploy safely at any time.

Continuous Integration means every developer's code changes are automatically built and tested when pushed. The CI system catches integration bugs within minutes — not days or weeks. The key principle: the main branch must always be in a deployable state. Every failing test on main is a production emergency that blocks everyone.

Continuous Delivery means the software is always in a deployable state and can be released to production at any time with a single button click or approval. Continuous Deployment removes even that approval — every passing build is automatically deployed to production. Elite DevOps teams deploy to production dozens or hundreds of times per day.

# A minimal but complete CI/CD pipeline (GitHub Actions)
name: CI/CD
on:
  push:
    branches: [main]
  pull_request:
    branches: [main]
jobs:
  test-and-deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npm ci
      - run: npm run lint && npm test
      - name: Build Docker image
        if: github.ref == 'refs/heads/main'
        run: docker build -t myapp:${{ github.sha }} .
      - name: Deploy to production
        if: github.ref == 'refs/heads/main'
        run: kubectl set image deployment/myapp myapp=myapp:${{ github.sha }}
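To make the difference between the three practices concrete, here is a small Python sketch of the decision logic a pipeline encodes. It is an illustration only, not how GitHub Actions works internally: the function names, the stage list, and the "delivery" / "deployment" mode strings are all made up for this example.

```python
from typing import Callable, List, Tuple

# A stage is a name plus a function that returns True on success.
Stage = Tuple[str, Callable[[], bool]]


def run_pipeline(stages: List[Stage]) -> Tuple[bool, List[str]]:
    """Run stages in order and stop at the first failure (fail fast)."""
    executed = []
    for name, step in stages:
        executed.append(name)
        if not step():
            return False, executed
    return True, executed


def should_release(build_ok: bool, mode: str, approved: bool = False) -> bool:
    """Continuous Delivery waits for a human approval; Continuous Deployment does not."""
    if not build_ok:
        return False
    if mode == "deployment":
        return True
    if mode == "delivery":
        return approved
    raise ValueError(f"unknown mode: {mode}")


# Hypothetical stages: the test stage fails, so build never runs.
stages = [("lint", lambda: True), ("test", lambda: False), ("build", lambda: True)]
ok, ran = run_pipeline(stages)
print(ok, ran)  # False ['lint', 'test']
print(should_release(True, "delivery"))    # False: green build, no approval yet
print(should_release(True, "deployment"))  # True: green build ships automatically
```

The point of the sketch is that a failing stage stops everything after it, and that the only difference between delivery and deployment is whether a human approval sits between a green build and production.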

Containers and Kubernetes — Why They Dominate

Containers solved one of the oldest problems in software deployment: "it works on my machine." By packaging an application with its dependencies (libraries, runtime, default configuration) into a single portable unit (a container image), containers make behaviour consistent from a developer's laptop through CI/CD to production. The image that passed tests in CI is the same artifact that runs in production, so any remaining differences come from outside the image: environment variables, secrets, attached services, and the host kernel.

Kubernetes became the dominant container orchestration platform because it answers all the operational questions that containers alone don't: How do you restart a crashed container automatically? How do you distribute containers across multiple servers for high availability? How do you roll out updates without downtime? How do you scale automatically when traffic spikes? Kubernetes handles all of this through a desired-state model — you declare what you want, Kubernetes makes it happen and maintains it continuously.
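The desired-state model is easier to grasp with a toy version. The sketch below is not Kubernetes code and uses no Kubernetes API; `reconcile`, `apply_actions`, and the `web-N` pod names are hypothetical. It only shows the core loop: compare what you declared with what is running, then compute the actions that close the gap.

```python
import itertools


def reconcile(desired_replicas, running, new_names):
    """Return the actions needed to move from the observed state to the desired state."""
    actions = []
    diff = desired_replicas - len(running)
    if diff > 0:
        # Too few instances: start replacements.
        for _ in range(diff):
            actions.append(("start", next(new_names)))
    elif diff < 0:
        # Too many instances: stop the surplus.
        for name in running[desired_replicas:]:
            actions.append(("stop", name))
    return actions


def apply_actions(running, actions):
    """Apply start/stop actions and return the new list of running instances."""
    result = list(running)
    for verb, name in actions:
        if verb == "start":
            result.append(name)
        else:
            result.remove(name)
    return result


names = (f"web-{i}" for i in itertools.count(1))

# Declare 3 replicas on an empty cluster.
running = apply_actions([], reconcile(3, [], names))
print(running)  # ['web-1', 'web-2', 'web-3']

# One instance crashes; the next reconcile pass replaces it.
running.remove("web-2")
running = apply_actions(running, reconcile(3, running, names))
print(running)  # ['web-1', 'web-3', 'web-4']
```

Nobody told the loop "restart web-2". It only knows the declared count is 3 and the observed count is 2, which is why the same mechanism covers crashes, scaling, and node failures.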

In 2026, Kubernetes runs the majority of production workloads at scale. AWS EKS, Google GKE, and Azure AKS are the managed services used by most organisations. Every DevOps engineer needs hands-on Kubernetes experience — it is the single most important platform skill in the field.

Infrastructure as Code — The DevOps Standard

Infrastructure as Code (IaC) means defining and managing infrastructure (servers, networks, databases, load balancers, Kubernetes clusters) through machine-readable configuration files rather than through manual processes in cloud consoles. IaC enables: version control for infrastructure (every change is a Git commit), repeatability (the same code provisions identical environments), peer review (infrastructure changes go through PR review), and disaster recovery (any environment can be rebuilt from code).

Terraform is the dominant IaC tool in 2026. It uses HCL (HashiCorp Configuration Language) to define resources on any platform that has a Terraform provider, which covers all major clouds. The plan/apply workflow shows what will change before anything is changed, which removes most deployment surprises. State management tracks what has been provisioned. Modules enable reusable infrastructure components across teams.
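The idea behind a plan step can be sketched in a few lines of Python. This is a simplified illustration, not Terraform's actual algorithm (which also handles dependencies, providers, and replacement rules); the `plan` function and the resource names are invented for the example. It compares the recorded state with the desired configuration and reports what would be created, updated, or deleted, without changing anything.

```python
def plan(current, desired):
    """Diff recorded state against desired config; nothing is changed here."""
    to_create = sorted(k for k in desired if k not in current)
    to_delete = sorted(k for k in current if k not in desired)
    to_update = sorted(
        k for k in desired if k in current and current[k] != desired[k]
    )
    return {"create": to_create, "update": to_update, "delete": to_delete}


# Hypothetical state (what exists) and configuration (what the code declares).
current = {
    "vpc": {"cidr": "10.0.0.0/16"},
    "db": {"size": "small"},
    "old_bucket": {},
}
desired = {
    "vpc": {"cidr": "10.0.0.0/16"},
    "db": {"size": "large"},
    "cluster": {"nodes": 3},
}

print(plan(current, desired))
# {'create': ['cluster'], 'update': ['db'], 'delete': ['old_bucket']}
```

Reviewing that diff in a pull request before anyone applies it is what turns an infrastructure change into something a teammate can catch mistakes in.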

IaC golden rule: If you made an infrastructure change in the AWS console, it doesn't exist. All infrastructure must be defined in Terraform and version-controlled in Git. Manual console changes are untracked and non-repeatable. Changes to Terraform-managed resources show up as drift and are reverted by the next terraform apply, while resources created by hand stay invisible to Terraform altogether. Enforce this with IAM policies that restrict write access in production to the CI/CD pipeline's Terraform role.

Monitoring, Observability, and the Three Pillars

Monitoring tells you something is wrong. Observability tells you why. A well-monitored system has dashboards showing current health. An observable system lets you ask arbitrary questions about any past or present state — even questions you didn't think to ask when you built the system.

Observability is built on three pillars: metrics, logs, and traces.

DORA Metrics — How to Measure DevOps Performance

The DORA (DevOps Research and Assessment) metrics are the industry standard for measuring the performance of DevOps organisations. They are derived from seven years of research across thousands of teams and have been validated to correlate with organisational performance and profitability.

Interview insight: Elite DevOps organisations achieve high deployment frequency AND low failure rate simultaneously — these are not in tension. Research shows that deploying more frequently, in smaller batches, with comprehensive automation actually reduces failure rate compared to infrequent large deployments. The key enabler is comprehensive automated testing and progressive delivery strategies.
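If you want to see how the four metrics fall out of ordinary deployment records, here is a rough Python sketch. The `Deployment` record, its field names, and the sample data are assumptions made for this example; real teams pull these timestamps from their CI/CD and incident tooling, and DORA's own definitions are more nuanced than this.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median
from typing import List, Optional


@dataclass
class Deployment:
    committed_at: datetime
    deployed_at: datetime
    failed: bool = False                    # caused a production incident
    restored_at: Optional[datetime] = None  # when service was restored


def dora_metrics(deploys: List[Deployment], period_days: float) -> dict:
    """Compute the four DORA metrics for a list of deployments in a period."""
    if not deploys:
        raise ValueError("need at least one deployment")
    lead_times = [d.deployed_at - d.committed_at for d in deploys]
    failures = [d for d in deploys if d.failed]
    restores = [d.restored_at - d.deployed_at for d in failures if d.restored_at]
    mttr = sum(restores, timedelta()) / len(restores) if restores else None
    return {
        "deployment_frequency_per_day": len(deploys) / period_days,
        "median_lead_time": median(lead_times),
        "change_failure_rate": len(failures) / len(deploys),
        "mean_time_to_recovery": mttr,
    }


# Made-up sample: 4 deployments over 2 days, one of which failed.
t0 = datetime(2026, 5, 1, 9, 0)
deploys = [
    Deployment(t0, t0 + timedelta(minutes=30)),
    Deployment(t0, t0 + timedelta(minutes=45)),
    Deployment(t0, t0 + timedelta(minutes=60), failed=True,
               restored_at=t0 + timedelta(minutes=80)),
    Deployment(t0, t0 + timedelta(minutes=90)),
]

for name, value in dora_metrics(deploys, period_days=2).items():
    print(name, value)
```

With this sample the team deploys twice a day, has a median lead time of 52.5 minutes, a 25% change failure rate, and a 20-minute recovery time. The change failure rate is the number to work on first.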

DevOps Career Path and Skills Roadmap 2026

The DevOps engineer role has become one of the highest-paying and most in-demand technical positions. Here is the honest skills roadmap based on what is actually required in production DevOps jobs in 2026:

The most important advice: get hands-on experience. Build a homelab (k3s on a cheap VPS, or use free tiers on AWS/GCP). Deploy a real application end-to-end. Set up Prometheus monitoring on it. Write the Terraform to provision the infrastructure. Build a GitHub Actions pipeline. The companies that interview you will test practical skills, not theoretical knowledge.

12 DevOps Interview Questions with Expert Answers

Q1: What is DevOps and how does it differ from traditional software development?
DevOps is a cultural and technical movement that breaks down the traditional silos between development and operations teams. In traditional organisations, developers write code and hand it to separate operations teams for deployment and maintenance. This creates slow release cycles, blame culture, and poor reliability. DevOps unifies these responsibilities: developers own their code all the way to production, operations engineers automate platforms rather than manually managing systems, and both teams share responsibility for reliability, deployment speed, and customer experience. The measurable outcomes are captured in the DORA metrics: elite DevOps organisations deploy multiple times per day with under one hour lead time, under 15% change failure rate, and under one hour mean time to recovery. These outcomes are simultaneously achievable — frequent small deployments with comprehensive automation are actually more reliable than infrequent large ones.
Q2: Explain the difference between Continuous Integration, Continuous Delivery, and Continuous Deployment.
Continuous Integration means every code commit triggers automated build and testing. The main branch is always in a buildable, tested state. Developers integrate their changes frequently (multiple times per day), so integration conflicts are caught immediately rather than building up over weeks. Continuous Delivery extends CI by ensuring every successful build is ready to be deployed to production at any time. Deployment to production is possible with a single action (a button click or approval), but a human makes the decision. This suits regulated industries where deployment requires a sign-off. Continuous Deployment is the final step — every build that passes all automated tests is automatically deployed to production without human intervention. Deployment frequency is limited only by how fast developers can write code. This requires extremely high confidence in the automated test suite and robust rollback mechanisms. Most organisations practise Continuous Delivery; elite organisations like Netflix and Amazon practise Continuous Deployment for many services.
Q3: What are the four DORA metrics and why do they matter?
The four DORA metrics measure software delivery performance and are backed by seven years of research across thousands of engineering teams. Deployment Frequency measures how often code is deployed to production — elite teams deploy multiple times per day. Lead Time for Changes measures the time from a code commit to it running in production — elite teams achieve under one hour. Change Failure Rate measures the percentage of deployments that cause a production incident — elite teams stay under 15%. Mean Time to Recovery measures how quickly service is restored after an incident — elite teams recover in under one hour. They matter because they are validated predictors of overall organisational performance and business outcomes. They provide a common, objective language for measuring DevOps maturity across teams. And they expose the right things to improve: a team with high change failure rate should invest in testing and progressive delivery; a team with high MTTR should invest in runbooks, rollback mechanisms, and on-call practice.
Q4: What is Infrastructure as Code and why is it essential for DevOps?
Infrastructure as Code means defining cloud infrastructure — servers, networks, databases, Kubernetes clusters — in version-controlled configuration files rather than through manual processes. Terraform is the dominant IaC tool. IaC is essential for DevOps for four reasons: repeatability (the same code provisions identical infrastructure in dev, staging, and production, eliminating "works in staging, broken in production" issues), version control (every infrastructure change is a Git commit with an author, timestamp, and review — complete audit trail), peer review (infrastructure changes go through pull requests just like code, catching mistakes before they reach production), and disaster recovery (any environment can be rebuilt from code alone in minutes after a catastrophic failure, rather than hours of manual reconstruction from memory). Without IaC, infrastructure drift is inevitable: manual changes accumulate, environments diverge, and nobody knows the true state of production.
Q5: What is the difference between DevOps and SRE?
DevOps is the broad cultural philosophy: break down silos between development and operations, automate everything, measure continuously, and share responsibility for reliability. SRE (Site Reliability Engineering) is Google's specific, opinionated implementation of DevOps principles, created in 2003 by Ben Treynor Sloss. SRE adds precise mechanisms: SLOs (quantified reliability targets), error budgets (how much unreliability is acceptable, used to balance feature velocity against reliability), toil reduction (systematically eliminating manual operational work), and structured on-call and incident response practices. In practice: SRE teams are typically found at large organisations (Google, Amazon, Uber, Airbnb) with the engineering headcount to have dedicated reliability engineers. Smaller organisations have "DevOps engineers" who perform similar functions with less formal structure. The skills overlap significantly — both require deep Kubernetes, cloud, monitoring, CI/CD, and incident response expertise.
Q6: How do containers improve application deployment compared to VMs?
Virtual machines virtualise the entire hardware stack and run a complete guest OS, kernel included, per VM. A VM image is typically several gigabytes and takes tens of seconds to boot, and a physical server runs a comparatively small number of them. Containers share the host OS kernel and only isolate the process environment using Linux namespaces and cgroups. A container image is typically tens to hundreds of megabytes, a container starts in a fraction of a second to a few seconds, and the same server can run far more containers than VMs. For deployment, containers provide several advantages: immutability (a container image is a fixed, tested artifact that is reused unchanged from CI to production), speed (fast startup enables rapid scaling and fast rolling updates), density (more application instances per server reduces infrastructure cost), and portability (the same image runs on any compatible container runtime, any Kubernetes cluster, any cloud). The trade-off: containers share the kernel, so a kernel vulnerability affects all containers on a host. VMs provide stronger isolation. For most web applications, containers' operational advantages far outweigh this security trade-off.
Q7: What is GitOps and how does it work?
GitOps uses Git as the single source of truth for both application code and infrastructure desired state. Every Kubernetes manifest, Terraform configuration, and Helm values file lives in Git. Changes to production are made exclusively through Git pull requests: no direct kubectl apply, no console changes, no imperative pipeline deployments that hold cluster credentials. A GitOps operator (ArgoCD or Flux) runs inside the Kubernetes cluster, continuously watches the manifest repository, compares it to the live cluster state, and automatically reconciles any differences. If someone manually runs kubectl and changes a deployment, the operator detects the drift and, with automatic sync and self-healing enabled, reverts it to match Git on the next reconciliation. The benefits: complete audit trail (every change is a Git commit with author and PR review), fast rollback (git revert triggers an automatic rollback), improved security (no cluster credentials in CI pipelines), and disaster recovery (any cluster can be rebuilt from Git alone). GitOps is the recommended deployment approach for Kubernetes in 2026.
Q8: What are the three pillars of observability?
The three pillars are metrics, logs, and traces. Metrics are numeric time-series measurements: CPU usage, request rate, error rate, latency percentiles. They are efficient to store and query, ideal for dashboards and alerting. Prometheus scrapes metrics and Grafana visualises them. Logs are timestamped text records of discrete events: HTTP requests received, database queries executed, errors with stack traces. Logs provide the detail needed to understand why metrics changed. Traces track a request as it flows through multiple microservices, showing which services were called, how long each step took, and where errors occurred — essential for debugging latency in distributed systems. Together, metrics tell you something is wrong, logs tell you what happened, and traces tell you exactly where in the distributed system the problem occurred. A mature observability stack correlates all three: clicking on a metric spike should show related logs and traces from that time window.
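Correlation in practice mostly comes down to sharing an identifier across signals. The sketch below uses only the Python standard library and is an assumption-laden toy: the `log_event` and `logs_for_trace` helpers and the JSON field names are invented here, and real stacks usually get trace IDs from a tracing library rather than generating them by hand.

```python
import json
import uuid


def log_event(trace_id, service, message, level="INFO"):
    """Emit one structured log line that carries the request's trace ID."""
    return json.dumps(
        {"trace_id": trace_id, "service": service, "level": level, "message": message}
    )


def logs_for_trace(lines, trace_id):
    """Return every parsed log record that belongs to one request."""
    return [rec for rec in map(json.loads, lines) if rec["trace_id"] == trace_id]


# One request flows through two hypothetical services; another request is unrelated.
trace_id = uuid.uuid4().hex
lines = [
    log_event(trace_id, "api", "request received"),
    log_event(uuid.uuid4().hex, "api", "unrelated request"),
    log_event(trace_id, "db", "query timed out", level="ERROR"),
]

for rec in logs_for_trace(lines, trace_id):
    print(rec["service"], rec["level"], rec["message"])
```

Because every log line carries the same trace ID as the trace span, jumping from a slow trace to the exact log lines for that request becomes a filter instead of a search.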
Q9: How do you handle a production outage as a DevOps engineer?
A structured incident response process: (1) Declare the incident immediately. Create an incident Slack channel, assign an incident commander to coordinate. Do not spend time investigating alone — get the right people in the room fast. (2) Communicate with stakeholders. Update the status page within 5 minutes. Internal teams and customers need to know you know about the problem. (3) Mitigate first, investigate second. Can you roll back the last deployment? Disable a feature flag? Redirect traffic? The fastest path to recovery takes priority over understanding the root cause. (4) Identify and confirm the root cause once service is restored. Check deployment history, recent changes, monitoring graphs, and error logs around the time of the incident. (5) Run a blameless post-mortem within 48 hours. Document the timeline, root cause, contributing factors, and specific action items to prevent recurrence. (6) Implement the action items. The post-mortem has zero value if the actions are not completed. Track them in Jira or equivalent and review at the next sprint planning.
Q10: What is an SLO and how do you use an error budget?
An SLO (Service Level Objective) is a target reliability commitment: "99.9% of API requests will succeed over a 30-day rolling window." An SLI (Service Level Indicator) is the actual measurement: the real success rate calculated from Prometheus metrics. The error budget is the difference: a 99.9% SLO leaves a 0.1% error budget, which over a 30-day window is 43.2 minutes of full downtime (30 × 24 × 60 × 0.001), or 0.1% of requests allowed to fail. The often-quoted 43.8 minutes is the same budget over an average calendar month of about 30.44 days. Error budgets make reliability discussions data-driven. If the budget is healthy (mostly unspent), the team has runway to take deployment risks, experiment with new features, and move fast. If the budget is nearly exhausted, deployments should slow, on-call should not be paged for minor issues, and engineering effort should focus on reliability improvements. This prevents both over-engineering (spending engineering time on reliability beyond what users need) and under-engineering (ignoring reliability until a major outage). It also makes the conversation between product teams (who want features) and platform teams (who want stability) objective: how much error budget does this risky deployment consume?
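The arithmetic is worth doing once by hand. Here is a minimal Python sketch, with function names chosen for this example rather than taken from any SRE tool, that turns an SLO into a downtime budget and shows how much of a request-based budget is left.

```python
def error_budget_minutes(slo, window_days=30):
    """Minutes of full downtime allowed by an availability SLO over the window."""
    return (1 - slo) * window_days * 24 * 60


def budget_remaining(slo, total_requests, failed_requests):
    """Fraction of the request-based error budget still unspent (negative if overspent)."""
    allowed_failures = (1 - slo) * total_requests
    if allowed_failures == 0:
        raise ValueError("a 100% SLO leaves no error budget")
    return 1 - failed_requests / allowed_failures


# 99.9% over a 30-day window.
print(round(error_budget_minutes(0.999, 30), 1))           # 43.2

# The same SLO over an average calendar month (365.25 / 12 days).
print(round(error_budget_minutes(0.999, 365.25 / 12), 1))  # 43.8

# 1,000,000 requests so far with 400 failures: 1,000 failures were allowed.
print(round(budget_remaining(0.999, 1_000_000, 400), 2))   # 0.6
```

In this made-up case 60% of the budget is still unspent, so a risky release is defensible. If the same calculation returned 0.05, the honest answer to "can we ship this now?" would be "not this week".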
Q11: What is shift-left in security (DevSecOps) and how do you implement it?
Shift-left means moving security practices earlier in the software development lifecycle — finding and fixing security issues during development (left on the timeline) rather than during production deployment or, worse, after a breach. Traditional security was a gate at the end of development: a security team reviewed code before release, slowing delivery and finding issues too late to fix cheaply. DevSecOps integrates security into every stage: pre-commit hooks (Gitleaks secret scanning before code is committed), CI pipeline SAST (Semgrep, SonarQube scanning every PR for SQL injection, XSS, hardcoded secrets), dependency scanning (Snyk, OWASP Dependency Check finding CVEs in third-party libraries), container image scanning (Trivy scanning Docker images before they are pushed to production), infrastructure security scanning (tfsec, Checkov scanning Terraform for misconfigurations), and runtime security (Falco detecting anomalous container behaviour in production). The result: security issues are caught by developers in their normal workflow, not by a security team as a deployment gate, making fixes faster, cheaper, and less disruptive.
Q12: Walk me through how you would onboard a new microservice to your DevOps platform.
A comprehensive onboarding process for a new microservice: (1) Repository setup — create a GitHub repository from the standard template (which includes CI/CD workflow, Dockerfile template, branch protection rules, and CODEOWNERS). (2) CI/CD pipeline — the template GitHub Actions workflow handles lint, test, container build, Trivy scan, ECR push, and staging deployment automatically. The developer fills in the Dockerfile and adds application-specific test commands. (3) Kubernetes manifests — create Deployment, Service, HPA, and PodDisruptionBudget manifests. Apply resource requests/limits, readiness probes, and topology spread constraints from the start. Commit to the GitOps manifest repository. ArgoCD detects the new application and deploys to staging. (4) Monitoring — create a ServiceMonitor for Prometheus scraping. Add the service to the standard Grafana dashboard template (RED metrics: rate, errors, duration). Create AlertManager rules for error rate and latency SLOs. (5) Runbook — write a basic runbook documenting how to restart the service, check its logs, and perform common operational tasks. Add it to the internal wiki. (6) On-call integration — add the service's critical alerts to PagerDuty with the owning team's escalation policy. The whole process takes 2–4 hours with a mature Internal Developer Platform.
