Kubernetes Complete Guide 2026: Architecture, YAML, RBAC & Production Patterns
- What is Kubernetes and Why Does It Exist?
- Kubernetes Architecture: Control Plane & Worker Nodes
- Core Workload Objects: Pod, Deployment, StatefulSet, DaemonSet
- Networking: Services, Ingress, and Network Policies
- Production YAML Examples with Real Annotations
- Health Probes: The Most Misunderstood Feature
- Security, RBAC, and Secrets Management
- Autoscaling: HPA, VPA, and KEDA
- Storage: PersistentVolumes and StorageClasses
- Debugging Kubernetes in Production: Systematic Approach
- Essential kubectl Command Reference
- Common Production Errors and How to Fix Them
- 15 Kubernetes Interview Questions with Expert Answers
Kubernetes (K8s) is the most important skill in modern DevOps and SRE. I have used it daily in production for 4.5 years: managing clusters on AWS EKS and Azure AKS, responding to production incidents at 2am, tuning HPA policies for high-traffic services, and designing multi-AZ high-availability architectures. This guide is written from that real-world experience, not from copying documentation.
Originally built by Google based on their internal cluster management system called Borg, Kubernetes was open-sourced in 2014 and donated to the Cloud Native Computing Foundation (CNCF). It has since become the industry standard for running containerised workloads at scale. If you are preparing for a Senior DevOps, Platform Engineer, or SRE role in 2026, a deep and practical understanding of Kubernetes is non-negotiable. This guide covers everything you need.
What is Kubernetes and Why Does It Exist?
Before Kubernetes, running Docker containers at scale exposed a fundamental gap: Docker solved packaging, but not orchestration. When you have 50 Docker containers running across 10 servers, who restarts a container that crashes? Who redistributes load when a server goes down? How do you roll out a new version of your application without downtime across all instances? How do you automatically add more containers during a traffic spike and remove them when traffic drops?
Kubernetes answers all of these questions. It is an orchestration platform: a system that manages containers across a cluster of machines. You declare what you want (three replicas of my API service, always running, with 512MB of RAM each, accessible at this DNS name) and Kubernetes makes it happen and maintains it continuously, even when servers fail, containers crash, or traffic triples unexpectedly.
The key insight is that Kubernetes is a desired-state system. You describe the desired state of your infrastructure in YAML manifest files. Kubernetes continuously monitors the actual state of the cluster and reconciles it toward the desired state. This reconciliation loop runs constantly: controllers watch the API server for changes, react to each one, and periodically re-check the resources they own. It is the architectural foundation of everything in the platform and explains why Kubernetes is self-healing by design.
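To make the desired-state idea concrete, here is a minimal sketch of the example above: three replicas, 512Mi of memory each, reachable at a stable DNS name. The names, namespace, image, and ports are placeholders I chose for illustration, not values from a real system.

```yaml
# Desired state: three replicas of a hypothetical API service.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api              # hypothetical name
  namespace: prod        # hypothetical namespace
spec:
  replicas: 3            # Kubernetes keeps exactly three Pods running
  selector:
    matchLabels:
      app: api
  template:
    metadata:
      labels:
        app: api
    spec:
      containers:
        - name: api
          image: registry.example.com/api:1.4.2   # placeholder image
          ports:
            - containerPort: 8080                  # assumed application port
          resources:
            requests:
              memory: 512Mi
            limits:
              memory: 512Mi
---
# The Service gives the replicas one stable DNS name:
# api.prod.svc.cluster.local
apiVersion: v1
kind: Service
metadata:
  name: api
  namespace: prod
spec:
  selector:
    app: api
  ports:
    - port: 80
      targetPort: 8080
```

If you delete one of the three Pods by hand, the controllers notice the gap between actual and desired state and create a replacement. That is the reconciliation loop in action.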
Kubernetes Architecture: Control Plane & Worker Nodes
A Kubernetes cluster consists of two types of infrastructure: the control plane, which manages the cluster state and makes scheduling decisions, and worker nodes, which run the actual application workloads. Understanding this architecture in depth is the first question in most Kubernetes interviews.
Control Plane Components
In a managed Kubernetes service like AWS EKS or GKE, the control plane is fully managed by the cloud provider. You pay for it but never SSH into it. In self-managed clusters (kubeadm, Rancher), you manage these components yourself:
- API Server (kube-apiserver): The single entry point for all cluster operations. Every kubectl command, every CI/CD tool, and every internal component communicates exclusively with the API server. It validates requests against RBAC policies, persists accepted changes to etcd, and notifies watching components. It is stateless and can be horizontally scaled behind a load balancer.
- etcd: A distributed, strongly consistent key-value store that holds the entire cluster state: all resource definitions, secrets, ConfigMaps, RBAC policies, and node registrations. If etcd is lost without a backup, the entire cluster state is unrecoverable. In production, run etcd on a dedicated cluster of at least 3 (preferably 5) nodes for high availability. Back it up every hour using etcdctl snapshot save.
- Scheduler (kube-scheduler): Watches for newly created Pods that have no assigned node and selects the best node for each Pod. The scheduling decision considers: resource requests and limits, node capacity, nodeSelector labels, affinity and anti-affinity rules, taints and tolerations, topology spread constraints, and custom scheduler plugins. If no node can satisfy all constraints, the Pod remains in Pending state.
- Controller Manager (kube-controller-manager): Runs a set of controller loops, each responsible for reconciling one resource type. The Deployment controller watches Deployments and manages ReplicaSets. The ReplicaSet controller watches ReplicaSets and creates or deletes Pods. The Node controller monitors node health and evicts Pods from unhealthy nodes. Each controller does exactly one job and does it continuously in a tight reconciliation loop.
- Cloud Controller Manager: In cloud-managed clusters, this component handles cloud-specific operations: provisioning LoadBalancer Services as real cloud load balancers (AWS ELBs, GCP LBs) and managing cloud node registrations. Attaching PersistentVolumes as cloud storage (EBS volumes, GCP Persistent Disks) used to live in the same in-tree cloud code and is now handled by CSI drivers. The Cloud Controller Manager separates cloud-provider logic from the core Kubernetes code.
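The scheduler inputs listed above are easier to recognise in a manifest. The following is a sketch of a Pod spec that uses most of them at once. The label values, taint key, instance types, and image are hypothetical, and in production the same fields would sit inside a Deployment's Pod template instead of a bare Pod.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: scheduling-demo            # hypothetical name
  labels:
    app: scheduling-demo
spec:
  nodeSelector:
    kubernetes.io/os: linux        # simple label match
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: node.kubernetes.io/instance-type
                operator: In
                values: ["m5.large", "m5.xlarge"]   # assumed instance types
  tolerations:
    - key: dedicated               # hypothetical taint key on the target nodes
      operator: Equal
      value: batch
      effect: NoSchedule
  topologySpreadConstraints:
    - maxSkew: 1
      topologyKey: topology.kubernetes.io/zone
      whenUnsatisfiable: DoNotSchedule   # hard constraint: stay Pending if it cannot be met
      labelSelector:
        matchLabels:
          app: scheduling-demo
  containers:
    - name: app
      image: registry.example.com/app:1.0.0   # placeholder image
      resources:
        requests:                  # the scheduler fits Pods by requests, not limits
          cpu: 250m
          memory: 256Mi
        limits:
          memory: 256Mi
```

If no node satisfies every hard rule here, the Pod stays Pending and the Events section of kubectl describe pod names the constraint that failed.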
Worker Node Components
- kubelet: The agent that runs on every worker node. It registers the node with the API server, receives Pod specifications from the API server, and uses the container runtime to start, stop, and health-check containers as directed. The kubelet reports node status (CPU, memory, disk capacity) and Pod status (running, failed, ready) back to the control plane every few seconds.
- kube-proxy: Maintains network rules (iptables or IPVS) on each node that implement the Kubernetes Service abstraction. When a Pod calls a Service's cluster IP, kube-proxy's rules transparently redirect that traffic to one of the Service's healthy endpoint Pods, on any node in the cluster. In modern high-performance clusters, kube-proxy is being replaced by eBPF-based solutions like Cilium, which achieve better performance without iptables.
- Container Runtime: The software that actually runs containers on the node. containerd is the standard container runtime in Kubernetes 1.24+ (dockershim, the kubelet's built-in Docker Engine integration, was removed in that release). containerd implements the Container Runtime Interface (CRI) that kubelet uses. CRI-O is another CRI-compliant runtime, and sandboxed runtimes such as gVisor plug in underneath the CRI runtime when stronger isolation is needed.
- CNI Plugin: The Container Network Interface plugin implements the cluster networking model: every Pod gets a unique IP address, and any Pod can communicate directly with any other Pod without NAT. Popular choices include Calico (BGP-based, good for on-prem), Cilium (eBPF-based, best performance and observability), Flannel (simple, good for small clusters), and AWS VPC CNI (native VPC IPs for EKS Pods).
Core Workload Objects
Pod: The Atomic Unit
A Pod is the smallest deployable unit in Kubernetes. It wraps one or more containers that share a network namespace (they reach each other via localhost) and can share storage volumes. In practice, most Pods contain a single application container. The sidecar pattern, where a second container handles a cross-cutting concern like log shipping, secret injection, or service mesh proxying, is the main use case for multi-container Pods.
Pods are ephemeral and are not self-healing. A bare Pod that crashes stays crashed. Never run bare Pods in production; always use a controller (Deployment, StatefulSet, DaemonSet, or Job) that will recreate the Pod if it fails. The scheduler assigns each Pod to a node once at creation time; after that, the Pod is permanently bound to that node. If the node dies, the Pod is gone, and the controller creates a replacement Pod on a healthy node.
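The sidecar pattern is easiest to see in a manifest. This is a minimal sketch in which busybox stands in for both the application and the log shipper; the names and paths are hypothetical. It is written as a bare Pod only to keep it short, so in a real cluster put the same spec inside a Deployment's Pod template.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: app-with-log-sidecar       # hypothetical name
spec:
  volumes:
    - name: logs
      emptyDir: {}                 # scratch volume shared by both containers
  containers:
    - name: app
      image: busybox:1.36          # stand-in for the real application
      # Appends a timestamp to the shared log file every 5 seconds.
      command: ["sh", "-c", "while true; do date >> /var/log/app/app.log; sleep 5; done"]
      volumeMounts:
        - name: logs
          mountPath: /var/log/app
    - name: log-shipper
      image: busybox:1.36          # stand-in for Fluentd, Filebeat, or similar
      # Follows the file written by the app container; -F retries until it exists.
      command: ["sh", "-c", "tail -F /var/log/app/app.log"]
      volumeMounts:
        - name: logs
          mountPath: /var/log/app
          readOnly: true
```

Both containers are scheduled together, share the Pod's IP, and see the same emptyDir volume, which is what makes the sidecar arrangement work.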
Deployment: For Stateless Applications
A Deployment is the standard way to run stateless applications. You declare the desired number of replicas and the container image. The Deployment controller creates a ReplicaSet, which creates and maintains the specified number of Pods. If a Pod is deleted or fails, the ReplicaSet controller creates a replacement immediately.
Deployments also manage rolling updates. When you update the image version, the Deployment creates a new ReplicaSet with the new image and gradually scales it up while scaling down the old ReplicaSet, all while maintaining the number of available Pods specified by the rolling update strategy. Rolling back is fast: the old ReplicaSet is kept in the revision history, scaled to zero, so a rollback scales it back up and scales the current one down through the same rolling process. No image rebuild or manifest change is needed.
StatefulSet: For Stateful Applications
StatefulSets provide three guarantees that Deployments do not: stable, persistent Pod identities (pod-0, pod-1, pod-2; names never change even after rescheduling), stable per-Pod PersistentVolumeClaims that remain bound to the same Pod even when it is rescheduled to a different node, and ordered, sequential startup and shutdown (pod-0 must be Running before pod-1 is started; pod-1 is stopped before pod-0 during scale-down).
Use StatefulSets for: PostgreSQL, MySQL, MongoDB, Kafka, Redis Cluster, Cassandra, Elasticsearch, and any other workload where instance identity matters. For example, a primary-replica PostgreSQL setup needs a stable pod-0 that is always the primary, with pod-1 and pod-2 as replicas that connect to pod-0. With a Deployment, Pod names are random and change on rescheduling, making this impossible.
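Here is a sketch of the pieces that produce those guarantees: a headless Service for stable per-Pod DNS names, and volumeClaimTemplates for per-Pod storage. The Secret name, StorageClass name, and sizes are assumptions, and the replication setup between primary and replicas is deliberately left out.

```yaml
# Headless Service: gives each Pod a stable DNS name such as
# postgres-0.postgres.<namespace>.svc.cluster.local
apiVersion: v1
kind: Service
metadata:
  name: postgres
spec:
  clusterIP: None
  selector:
    app: postgres
  ports:
    - port: 5432
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: postgres
spec:
  serviceName: postgres            # must match the headless Service
  replicas: 3                      # creates postgres-0, postgres-1, postgres-2 in order
  selector:
    matchLabels:
      app: postgres
  template:
    metadata:
      labels:
        app: postgres
    spec:
      containers:
        - name: postgres
          image: postgres:16
          ports:
            - containerPort: 5432
          env:
            - name: POSTGRES_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: postgres-credentials   # hypothetical Secret, created separately
                  key: password
            - name: PGDATA
              value: /var/lib/postgresql/data/pgdata
          volumeMounts:
            - name: data
              mountPath: /var/lib/postgresql/data
  volumeClaimTemplates:            # one PVC per Pod: data-postgres-0, data-postgres-1, ...
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        storageClassName: gp3      # hypothetical StorageClass name
        resources:
          requests:
            storage: 20Gi
```

If postgres-1 is rescheduled to another node, it comes back with the same name and reattaches the same data-postgres-1 claim.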
DaemonSet: One Pod Per Node
DaemonSets ensure exactly one Pod runs on every node in the cluster (or a filtered subset). When a new node joins the cluster, the DaemonSet controller automatically schedules its Pod on it. When a node is removed, its DaemonSet Pod is garbage collected. Use DaemonSets for: log collectors (Fluentd, Filebeat, Promtail), monitoring agents (Prometheus Node Exporter, Datadog Agent), network plugins (Calico, Cilium CNI agents), and security tools (Falco runtime security).
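As an illustration, here is a sketch of a DaemonSet for Prometheus Node Exporter. The namespace, image tag, and resource numbers are assumptions; check them against what you actually run.

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-exporter
  namespace: monitoring            # hypothetical namespace
spec:
  selector:
    matchLabels:
      app: node-exporter
  template:
    metadata:
      labels:
        app: node-exporter
    spec:
      tolerations:
        - operator: Exists         # tolerate every taint so the agent also runs on tainted nodes
      containers:
        - name: node-exporter
          image: quay.io/prometheus/node-exporter:v1.8.2   # assumed tag
          args:
            - --path.rootfs=/host  # read host metrics through the mounted root filesystem
          ports:
            - name: metrics
              containerPort: 9100
          resources:
            requests:
              cpu: 50m
              memory: 64Mi
            limits:
              memory: 128Mi
          volumeMounts:
            - name: rootfs
              mountPath: /host
              readOnly: true
      volumes:
        - name: rootfs
          hostPath:
            path: /
```

To target a filtered subset of nodes instead of all of them, add a nodeSelector or node affinity to the Pod template.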
Job and CronJob: For Batch Workloads
A Job runs one or more Pods to completion: it creates Pods until the specified number of successful completions is reached, then stops. Use Jobs for database migrations, data processing tasks, and one-time setup operations. A CronJob creates Jobs on a schedule using standard cron syntax. Kubernetes CronJobs replace traditional server cron entries with the reliability and observability of the Kubernetes platform.
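A minimal CronJob sketch looks like this. The job name, image, arguments, and schedule are hypothetical.

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: nightly-report             # hypothetical name
spec:
  schedule: "0 2 * * *"            # standard cron syntax: every day at 02:00
  concurrencyPolicy: Forbid        # skip a run if the previous one is still going
  successfulJobsHistoryLimit: 3
  failedJobsHistoryLimit: 3
  jobTemplate:
    spec:
      backoffLimit: 2              # retry a failed Pod at most twice
      activeDeadlineSeconds: 3600  # give up after one hour
      template:
        spec:
          restartPolicy: OnFailure # Jobs require OnFailure or Never
          containers:
            - name: report
              image: registry.example.com/report:1.0.0   # placeholder image
              args: ["--date", "yesterday"]               # hypothetical arguments
```

The jobTemplate section is an ordinary Job spec, so the same fields work for a one-off Job such as a database migration.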
Networking: Services, Ingress, and Network Policies
Kubernetes networking is built on a flat network model: every Pod in the cluster gets a unique IP address, and any Pod can reach any other Pod's IP directly, without NAT, regardless of which node each Pod runs on. This simplicity is powerful but has implications for security: by default, all Pods can talk to all other Pods. Network Policies address this.
Service Types
- ClusterIP (default): Creates a virtual IP address accessible only inside the cluster. Kubernetes DNS resolves the Service name to this IP. All Pods in the cluster can reach the Service via its name: http://my-service.my-namespace.svc.cluster.local. Use for all internal microservice communication.
- NodePort: Exposes the Service on a static port (30000-32767) on every node's IP. External traffic can reach the Service via any node IP on that port. Useful for development. Not for production: it exposes node IPs directly, bypasses cloud load balancer health checks, and uses inefficient routing.
- LoadBalancer: Provisions a cloud load balancer (AWS ELB, GCP LB, Azure LB) and routes external traffic to the Service. One load balancer per Service means this gets expensive quickly for many Services. Use Ingress instead for HTTP/HTTPS workloads.
- Ingress: An L7 HTTP/HTTPS routing layer. A single Ingress controller (NGINX Ingress, AWS ALB Ingress, Traefik, Istio) handles routing for all Services based on hostname and URL path rules. SSL termination, rate limiting, authentication, and canary routing are implemented at the Ingress layer. This is the correct architecture for most production HTTP workloads: one load balancer, many Services.
Network Policies: Zero-Trust Networking
By default, all Pods in a Kubernetes cluster can communicate with all other Pods. Network Policies are namespace-scoped firewall rules that restrict this. A best-practice production cluster has a default-deny policy that blocks all ingress and egress, then explicit allow policies for required communication paths. The CNI plugin must support Network Policies: Calico, Cilium, and Weave all do; Flannel does not.
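The default-deny plus explicit-allow approach can be sketched as three policies. The namespace, labels, and port are hypothetical, and the DNS rule assumes cluster DNS runs in kube-system.

```yaml
# 1. Default deny: selects every Pod in the namespace and allows nothing.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: prod                  # hypothetical namespace
spec:
  podSelector: {}
  policyTypes:
    - Ingress
    - Egress
---
# 2. Explicit allow: frontend Pods may call api Pods on TCP 8080.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-frontend-to-api
  namespace: prod
spec:
  podSelector:
    matchLabels:
      app: api
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: frontend
      ports:
        - protocol: TCP
          port: 8080
---
# 3. DNS egress: without this, default-deny egress breaks name resolution.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-dns-egress
  namespace: prod
spec:
  podSelector: {}
  policyTypes:
    - Egress
  egress:
    - to:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: kube-system
      ports:
        - protocol: UDP
          port: 53
        - protocol: TCP
          port: 53
```

Because egress is denied too, the frontend Pods also need an egress rule toward the api Pods before the call in policy 2 actually succeeds. Policies are additive, so each allowed path gets its own small policy.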
Production YAML Examples with Real Annotations
Production-Grade Deployment
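What follows is a sketch of what I would consider a production-grade Deployment, not a drop-in manifest. The image, ports, probe path, user ID, and resource numbers are placeholders, and the preStop hook assumes the image contains a sleep binary.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api                        # hypothetical name
  namespace: prod
  labels:
    app: api
spec:
  replicas: 3
  revisionHistoryLimit: 5          # old ReplicaSets kept for rollback
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0            # never drop below desired capacity
      maxSurge: 1                  # add one new Pod at a time
  selector:
    matchLabels:
      app: api
  template:
    metadata:
      labels:
        app: api
    spec:
      terminationGracePeriodSeconds: 45   # time allowed for graceful drain
      securityContext:
        runAsNonRoot: true
        runAsUser: 10001           # assumed non-root UID
      topologySpreadConstraints:   # spread replicas across zones
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: ScheduleAnyway
          labelSelector:
            matchLabels:
              app: api
      containers:
        - name: api
          image: registry.example.com/api:1.4.2   # pin a tag, never use latest
          ports:
            - name: http
              containerPort: 8080
          resources:
            requests:
              cpu: 250m
              memory: 512Mi
            limits:
              memory: 512Mi
          readinessProbe:          # gate traffic until the app can serve
            httpGet:
              path: /healthz/ready # hypothetical endpoint
              port: http
            periodSeconds: 5
            failureThreshold: 3
          lifecycle:
            preStop:
              exec:
                command: ["sleep", "5"]   # let endpoint removal propagate first
          securityContext:
            allowPrivilegeEscalation: false
            readOnlyRootFilesystem: true
            capabilities:
              drop: ["ALL"]
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api
  namespace: prod
spec:
  minAvailable: 2                  # node drains must leave two Pods running
  selector:
    matchLabels:
      app: api
```

The memory request equals the limit on purpose: it keeps the scheduler's view of the Pod consistent with the point at which the kernel would kill it.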
Ingress with TLS, Rate Limiting, and CORS
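The sketch below assumes the ingress-nginx controller and cert-manager are both installed. The annotations are specific to those two projects, and the hostnames, issuer name, and rate limit are hypothetical.

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: api
  namespace: prod
  annotations:
    # cert-manager: issue and renew the TLS certificate automatically
    cert-manager.io/cluster-issuer: letsencrypt-prod      # hypothetical issuer name
    # ingress-nginx: redirect HTTP to HTTPS
    nginx.ingress.kubernetes.io/ssl-redirect: "true"
    # ingress-nginx: requests per second allowed per client IP
    nginx.ingress.kubernetes.io/limit-rps: "20"
    # ingress-nginx: CORS
    nginx.ingress.kubernetes.io/enable-cors: "true"
    nginx.ingress.kubernetes.io/cors-allow-origin: "https://app.example.com"
    nginx.ingress.kubernetes.io/cors-allow-methods: "GET, POST, OPTIONS"
spec:
  ingressClassName: nginx
  tls:
    - hosts:
        - api.example.com
      secretName: api-example-com-tls   # cert-manager stores the certificate here
  rules:
    - host: api.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: api
                port:
                  number: 80
```

Other controllers (AWS Load Balancer Controller, Traefik) use their own annotation keys for the same features, so the spec carries over but the annotations do not.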
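Health Probes: The Most Misunderstood Feature
Health probe misconfiguration is the single most common cause of production incidents I have investigated over four years. There are three probe types and they have fundamentally different behaviours; confusing them creates cascading failures. A failing liveness probe restarts the container. A failing readiness probe removes the Pod from Service endpoints without restarting it. A startup probe holds off the other two until the application has finished booting.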
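Here is a sketch of all three probes on one container, tuned for a slow-starting service. It is a fragment of a Pod template, and the endpoint paths, port, and thresholds are assumptions to adapt to your application.

```yaml
# Fragment of a Pod template: spec.template.spec.containers
containers:
  - name: api
    image: registry.example.com/api:1.4.2   # placeholder image
    ports:
      - name: http
        containerPort: 8080
    startupProbe:                  # runs first; liveness and readiness wait for it
      httpGet:
        path: /healthz/live        # hypothetical endpoint
        port: http
      periodSeconds: 5
      failureThreshold: 30         # up to 30 x 5s = 150s allowed for startup
    livenessProbe:                 # failure restarts the container
      httpGet:
        path: /healthz/live
        port: http
      periodSeconds: 10
      failureThreshold: 3
    readinessProbe:                # failure only removes the Pod from Service endpoints
      httpGet:
        path: /healthz/ready       # hypothetical endpoint that also checks dependencies
        port: http
      periodSeconds: 5
      failureThreshold: 3
```

Keep dependency checks (database, downstream APIs) out of the liveness endpoint. Otherwise a database outage restarts every Pod at once, which is exactly the cascading failure described above.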
A common mistake is setting initialDelaySeconds too low on the liveness probe for slow-starting applications. A Java Spring Boot service may take 45-90 seconds to start on a cold JVM. If your liveness probe starts checking at 10 seconds and the app isn't responding yet, Kubernetes kills it, and you enter CrashLoopBackOff before the app has ever started successfully. Use a startup probe with generous failure thresholds for slow-starting apps.
Security, RBAC, and Secrets Management
Role-Based Access Control (RBAC) is the primary access control mechanism inside a Kubernetes cluster. The principle of least privilege must be applied everywhere: each ServiceAccount should only have access to the exact Kubernetes API resources it needs, and nothing more. RBAC is always a topic in senior DevOps and SRE interviews.
The RBAC model has four resource types: Role (namespace-scoped permissions), ClusterRole (cluster-wide permissions), RoleBinding (assigns a Role to a user, group, or ServiceAccount), and ClusterRoleBinding (assigns a ClusterRole cluster-wide). Start with Role/RoleBinding unless you specifically need cluster-wide access.
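A least-privilege setup can be sketched with three objects: a ServiceAccount, a Role that allows read-only access to ConfigMaps, and a RoleBinding that connects the two. All names and the namespace are hypothetical.

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: config-reader
  namespace: prod
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: configmap-reader
  namespace: prod
rules:
  - apiGroups: [""]                # "" is the core API group
    resources: ["configmaps"]
    verbs: ["get", "list", "watch"]   # read-only: no create, update, or delete
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: config-reader-binding
  namespace: prod
subjects:
  - kind: ServiceAccount
    name: config-reader
    namespace: prod
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: configmap-reader
```

You can check the result with kubectl auth can-i list configmaps -n prod --as=system:serviceaccount:prod:config-reader, which should answer yes, while the same check for secrets should answer no.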
Autoscaling: HPA, VPA, and KEDA
Three autoscaling mechanisms are in common use, each operating at a different level. HPA is built into Kubernetes; VPA and KEDA are installed as add-ons:
Horizontal Pod Autoscaler (HPA) scales the number of Pod replicas based on metrics. CPU and memory utilisation are the built-in metrics. Custom metrics (HTTP request rate, queue depth, latency) are available via the Custom Metrics API (Prometheus Adapter, KEDA). HPA works best when your application can scale horizontally and each Pod has well-calibrated resource requests.
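A minimal HPA sketch using the autoscaling/v2 API follows. The target name, replica bounds, and thresholds are assumptions, and CPU utilisation is measured against the Pod's CPU request, which is why well-calibrated requests matter.

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api
  namespace: prod
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api                      # hypothetical Deployment
  minReplicas: 3
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # percent of the CPU request, averaged over Pods
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300   # wait 5 minutes before scaling down
      policies:
        - type: Percent
          value: 50                # remove at most half the Pods per minute
          periodSeconds: 60
```

The scaleDown behaviour block is optional, but a stabilisation window avoids flapping when traffic is spiky.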
Vertical Pod Autoscaler (VPA) adjusts the CPU and memory requests of existing Pods based on historical usage. It solves the "how much resource should I request?" question automatically. VPA requires a Pod restart to apply new resource requests, so it is not suitable for stateless production services under constant load. Instead, use it in combination with Goldilocks to generate recommendations, then bake them into your deployment manifests.
KEDA (Kubernetes Event-Driven Autoscaling) extends HPA to scale on virtually any event source: Kafka consumer lag, SQS queue depth, Redis queue length, Cron schedules, Prometheus metrics, and 60+ other scalers. KEDA can scale down to zero (for cost saving) and back up again, which HPA cannot do natively.
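As a sketch of the Kafka consumer lag case, assuming KEDA is installed in the cluster: the Deployment name, broker address, topic, consumer group, and threshold below are all hypothetical.

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: orders-consumer
  namespace: prod
spec:
  scaleTargetRef:
    name: orders-consumer          # hypothetical Deployment to scale
  minReplicaCount: 0               # scale to zero when the topic is idle
  maxReplicaCount: 30
  cooldownPeriod: 300              # seconds to wait before scaling back to zero
  triggers:
    - type: kafka
      metadata:
        bootstrapServers: kafka.prod.svc.cluster.local:9092   # assumed broker address
        consumerGroup: orders-consumer
        topic: orders
        lagThreshold: "100"        # target lag per replica
```

KEDA creates and manages an HPA for the target behind the scenes, so do not define a separate HPA for the same Deployment.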
Storage: PersistentVolumes and StorageClasses
Kubernetes abstracts storage through three resource types. A PersistentVolume (PV) is a piece of storage provisioned in the cluster: an EBS volume, an NFS share, a Ceph pool. A PersistentVolumeClaim (PVC) is a request for storage by a user or workload: "I need 10GB of ReadWriteOnce storage." A StorageClass enables dynamic provisioning: when a PVC is created, the StorageClass automatically provisions a matching PV without manual admin intervention.
In production on AWS EKS, you use the EBS CSI Driver (for block storage, ReadWriteOnce: the volume can be mounted read-write by a single node at a time) and EFS CSI Driver (for shared file storage, ReadWriteMany: multiple Pods on different nodes can mount the same volume). Always use dynamic provisioning with StorageClasses rather than manually creating PVs. Set reclaimPolicy: Retain for production databases so volumes are not deleted when the PVC is deleted.
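A StorageClass and matching claim for the EBS CSI Driver can be sketched as follows. The class name, volume type, and size are assumptions.

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gp3-retain                 # hypothetical name
provisioner: ebs.csi.aws.com       # EBS CSI Driver
parameters:
  type: gp3
  encrypted: "true"
reclaimPolicy: Retain              # keep the EBS volume when the PVC is deleted
volumeBindingMode: WaitForFirstConsumer   # provision only once a Pod is scheduled
allowVolumeExpansion: true
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data
  namespace: prod
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: gp3-retain
  resources:
    requests:
      storage: 10Gi
```

With Retain, deleting the claim leaves the PV in the Released state and the EBS volume intact, so cleaning it up becomes a deliberate manual step.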
Set volumeBindingMode: WaitForFirstConsumer on your StorageClass. This delays PV provisioning until a Pod is scheduled, ensuring the volume is created in the same Availability Zone as the Pod. Without this, EBS volumes are created as soon as the PVC exists, in an arbitrary AZ, and Pods may fail to schedule because the volume is in a different AZ from the only available nodes.
Debugging Kubernetes in Production
This is the section that separates senior engineers from juniors in interviews and on-call. When a Pod is broken in production at 3am, you need a systematic, fast debugging methodology, not random kubectl commands fired in panic. Here is the exact flow I use in production incidents:
Essential kubectl Command Reference
| Command | What it does |
|---|---|
| kubectl get pods -A | List all Pods in all namespaces |
| kubectl get pods -n prod -o wide | Wide output including node and IP |
| kubectl describe pod <name> | Full details including Events; start debugging here |
| kubectl logs <pod> --previous | Logs from the last crashed container instance |
| kubectl rollout status deploy/<name> | Wait for rollout to complete |
| kubectl rollout undo deploy/<name> | Rollback to previous ReplicaSet immediately |
| kubectl scale deploy/<name> --replicas=5 | Imperatively scale a Deployment |
| kubectl set image deploy/<name> app=image:tag | Trigger a rolling update to a new image |
| kubectl apply -f manifest.yaml | Declarative create or update (idempotent) |
| kubectl delete -f manifest.yaml | Delete all resources defined in the file |
| kubectl exec -it <pod> -- bash | Interactive shell in a running container |
| kubectl cp <pod>:/path ./local | Copy files from container to local |
| kubectl auth can-i <verb> <resource> | Test RBAC permissions for current context |
| kubectl top pods -n prod | Live CPU and memory usage per Pod |
| kubectl get events --sort-by=.lastTimestamp | Cluster events sorted by time; best for incident triage |
| kubectl drain <node> --ignore-daemonsets | Safely evict all Pods before node maintenance |
| kubectl cordon <node> | Mark node unschedulable (no new Pods) |
| kubectl uncordon <node> | Return node to schedulable state |
Common Production Errors and How to Fix Them
CrashLoopBackOff
Fix: Run kubectl logs <pod> --previous to see the crash output. Check the exit code with kubectl describe pod: exit code 137 means the container was killed with SIGKILL, most often because it was OOMKilled; exit code 1 means an application error. Fix the root cause: increase the memory limit, fix the missing env var, or add a startup probe for slow-starting apps.
ImagePullBackOff
Fix: Verify the image tag exists in the registry. Check that the imagePullSecret is created and referenced in the Pod spec. Verify network connectivity from nodes to the registry. For ECR on EKS, verify the node IAM role has ecr:GetAuthorizationToken, ecr:BatchGetImage, and ecr:GetDownloadUrlForLayer permissions.
Pod stuck in Pending
Fix: Run kubectl describe pod <pending-pod> and read the Events section; the scheduler will tell you exactly why it couldn't schedule. Add more nodes, adjust resource requests, or relax affinity rules as appropriate.
OOMKilled
Fix: Check kubectl top pods for current memory usage. Increase the memory limit in the resource spec. For Java, set explicit heap limits with -Xmx that are lower than the container limit (leaving 200-300MB for JVM overhead). Investigate memory leaks in long-running Pods with heap dumps.
15 Kubernetes Interview Questions with Expert Answers
Start with kubectl logs <pod> --previous: this shows the output of the last crashed instance, which almost always contains the error. Then run kubectl describe pod to see the exit code and events. Exit code 137 means the container was killed with SIGKILL, most often because it was OOMKilled (raise the memory limit). Exit code 1 typically means an application startup error (missing config, failed database connection). Exit code 127 usually means the command or entrypoint was not found. Use a startup probe with generous failure thresholds to prevent CrashLoopBackOff during legitimate slow startups.
First, set maxUnavailable: 0 and maxSurge: 1 in the rolling update strategy so capacity never drops below 100%. Second, configure a readiness probe so new Pods only receive traffic after they are genuinely ready to serve, not just after the container starts. Third, add a preStop lifecycle hook (sleep 5) so that endpoint removal has time to propagate before the container receives SIGTERM, which lets the Pod finish in-flight requests. Fourth, set terminationGracePeriodSeconds long enough for the app to drain gracefully (typically 30-60 seconds). Fifth, configure a PodDisruptionBudget with minAvailable or maxUnavailable to prevent voluntary disruptions (node drains, cluster upgrades) from taking down more Pods than intended. Missing any one of these five conditions will cause at least brief downtime during deployments.
A PodDisruptionBudget with minAvailable: 2 prevents the drain from proceeding unless at least 2 Pods will remain healthy. Create a PDB for every production Deployment with more than one replica. Set minAvailable to at least one below the replica count, or maxUnavailable to a small number. Note: PDBs only protect against voluntary disruptions; involuntary disruptions (node hardware failure, kernel panic) can still cause downtime regardless of PDB settings.
Node failure detection and eviction timing can be tuned with --node-monitor-grace-period and --pod-eviction-timeout on the controller manager, but lowering these too aggressively can cause false evictions during transient network issues.
For example, taint GPU nodes with nvidia.com/gpu=present:NoSchedule. Only ML inference Pods with a matching toleration will schedule on GPU nodes; all other Pods are automatically repelled without needing nodeSelectors. Another use case: dedicated nodes for your ingress controller (taint with ingress=true:NoSchedule, add the toleration only to ingress Pods) so ingress processing doesn't share CPU with application workloads. Taints are also used for spot/preemptible nodes: taint them with spot=true:NoSchedule so only fault-tolerant batch workloads run on them.
First, run kubectl get endpoints <service-name>: if the endpoints list is empty, the Service selector does not match any Pod labels, or all matching Pods have failing readiness probes. Second, compare the Service's selector (kubectl describe svc <name>) with the actual Pod labels (kubectl get pods --show-labels); a single typo or missing label breaks routing. Third, if endpoints exist but traffic still fails, the issue is at the kube-proxy / iptables level or in the application itself. Test by curling the Pod IP directly (bypassing the Service): kubectl exec <pod> -- curl http://<pod-ip>:8080. If that works, the app is fine and the problem is in Service routing. kubectl port-forward svc/<name> is a quick check from your machine, but it forwards to one Pod selected by the Service, so it confirms the selector and the app and does not exercise the cluster IP path.