// Linux Guide · SRE & DevOps

Linux Complete Guide 2026: Processes, Networking, Performance & Expert Interview Q&A

Updated May 2026 · 22 min read · Linux · SRE · DevOps · Shell · Performance
Dhanush R — Senior DevOps Engineer
4.5+ years of daily Linux usage in production — EC2 instances, Kubernetes nodes, container debugging, and SRE incident response. Every command in this guide has been used in real production investigations.
// Table of Contents
  1. Linux Filesystem Hierarchy
  2. Process Management
  3. Networking Commands for DevOps
  4. Disk and Filesystem Management
  5. Log Management and Analysis
  6. Performance Analysis — The USE Method
  7. Systemd Service Management
  8. File Permissions and Users
  9. Shell Scripting for DevOps Automation
  10. 12 Linux Interview Questions with Expert Answers

Linux runs most of the modern internet. The large majority of cloud VMs, Kubernetes nodes, and web servers run Linux, and Docker containers depend on Linux kernel features. As a DevOps or SRE engineer, Linux proficiency is not optional: it is the most foundational skill in the field. I use Linux daily for cluster management, incident investigation, performance analysis, and automation scripting. This guide covers the Linux knowledge you need to work confidently in production and to answer Linux interview questions well.

Linux Filesystem Hierarchy

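The Linux filesystem follows the Filesystem Hierarchy Standard (FHS). Knowing where things live is essential for troubleshooting: configuration is under /etc, logs and other variable data under /var, user commands under /usr/bin, add-on software under /opt, temporary files under /tmp, and kernel and process information under the virtual /proc filesystem.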
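As a quick orientation aid, the sketch below prints the purpose of a handful of standard directories and checks whether each exists on the current host. The function name `fhs_purpose` and the choice of directories are illustrative, not part of the FHS itself.

```shell
# fhs_purpose DIR: print a one-line purpose for a few well-known directories.
# (Hypothetical helper; the descriptions paraphrase the FHS.)
fhs_purpose() {
  case "$1" in
    /etc)     echo "host-specific configuration files" ;;
    /var/log) echo "log files" ;;
    /proc)    echo "virtual filesystem with kernel and process information" ;;
    /tmp)     echo "temporary files" ;;
    /usr/bin) echo "most user commands" ;;
    /home)    echo "user home directories" ;;
    /opt)     echo "add-on application software packages" ;;
    *)        echo "not covered in this sketch" ;;
  esac
}

# Report whether each directory is present on this host.
for dir in /etc /var/log /proc /tmp /usr/bin /home /opt; do
  if [ -d "$dir" ]; then state="present"; else state="missing"; fi
  printf '%-10s %-8s %s\n' "$dir" "$state" "$(fhs_purpose "$dir")"
done
```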

Process Management

# View all running processes with full detail
ps aux                      # a=all users, u=user format, x=no tty
ps aux | grep nginx         # filter by process name
ps -ef --forest             # show process tree (parent-child)

# Real-time process monitoring
top                         # classic; press 1 for per-CPU view
htop                        # modern, coloured, interactive
pidstat 1 5                 # per-process CPU/IO stats every 1s, 5 times

# Find PID by name
pgrep nginx                 # returns PID(s) of matching processes
pidof nginx

# Send signals
kill -15 1234               # SIGTERM: graceful shutdown request
kill -9 1234                # SIGKILL: immediate kill, uncatchable
kill -1 1234                # SIGHUP: reload config (many daemons)
pkill nginx                 # kill by process name
killall nginx               # kill all processes named nginx

# Process priority: nice value from -20 (highest) to +19 (lowest)
nice -n 10 ./batch-job.sh   # start with lower priority
renice -n -5 -p 1234        # change priority of running process

Linux Networking Commands for DevOps

# Network interface status
ip addr show                      # show all interfaces and IPs (modern)
ip link show
ip route show                     # show routing table

# Active connections and listening ports
ss -tuln                          # TCP/UDP listening sockets (faster than netstat)
ss -tulnp                         # +process name/PID
ss -tnp state established         # established TCP connections
netstat -tulnp                    # older equivalent of ss

# Test connectivity
ping -c 4 8.8.8.8
traceroute api.company.com        # trace network path
curl -v https://api.example.com   # full HTTP request/response detail
curl -w "%{time_total}\n" -o /dev/null -s https://api.example.com   # total request time in seconds

# DNS resolution troubleshooting
dig api.company.com               # full DNS query output
dig +short api.company.com        # just the IP
nslookup api.company.com
cat /etc/resolv.conf              # check configured DNS servers

# Firewall rules (iptables / nftables)
iptables -L -n -v --line-numbers  # list all rules with counters
iptables -A INPUT -p tcp --dport 8080 -j ACCEPT
iptables -I INPUT 1 -s 10.0.0.0/8 -j ACCEPT   # insert at position 1

Disk and Filesystem Management

# Disk usage
df -h                                       # filesystem disk usage (human readable)
df -ih                                      # inode usage; full inodes = cannot create files
du -sh /var/log/*                           # size of each item in /var/log
du -sh /* 2>/dev/null | sort -rh | head -20 # find largest dirs

# Find large files quickly
find / -type f -size +1G 2>/dev/null | sort
find /var/log -name "*.log" -mtime +30 -delete   # delete logs older than 30 days

# Disk I/O performance
iostat -x 1 5                               # extended disk I/O stats every 1s
iotop                                       # real-time per-process disk I/O
lsblk                                       # list block devices and mount points

# Mount and unmount
mount /dev/sdb1 /mnt/data
umount /mnt/data
mount | grep ext4                           # show mounted ext4 filesystems

Log Management and Analysis

Effective log analysis is a core SRE skill. Linux logs live in /var/log and in the systemd journal:

# systemd journal: the modern logging system
journalctl -u nginx                       # logs for the nginx service
journalctl -u nginx --since "1 hour ago"
journalctl -u nginx -f                    # follow live (like tail -f)
journalctl -p err                         # error level and above only
journalctl --disk-usage                   # how much space journal uses

# Traditional log files
tail -f /var/log/nginx/access.log
tail -n 1000 /var/log/syslog | grep ERROR
grep -i "out of memory" /var/log/syslog   # find OOMKill events

# Parse structured logs efficiently (field numbers assume the default combined log format)
awk '{print $1}' access.log | sort | uniq -c | sort -rn | head -20   # top client IPs
awk '{print $9}' access.log | sort | uniq -c | sort -rn              # HTTP status codes
grep 'HTTP/1.1" 5' access.log | wc -l                                # count 5xx errors

Linux Performance Analysis — The USE Method

The USE method (Utilisation, Saturation, Errors) is a systematic approach to performance analysis developed by Brendan Gregg. For every resource (CPU, memory, disk, network), check utilisation (how busy is it?), saturation (is there a queue forming?), and errors (is it failing?).

# CPU: utilisation, saturation, errors
top                            # %us (user), %sy (system), %wa (iowait)
vmstat 1 5                     # r column = run queue (saturation)
mpstat -P ALL 1                # per-CPU utilisation
sar -u 1 10                    # sample CPU stats every 1s, 10 times (plain "sar -u" reads today's history)

# Memory: utilisation, saturation
free -h                        # total, used, free, buff/cache, available
cat /proc/meminfo              # detailed kernel memory breakdown
vmstat 1 | awk '{print $7}'    # si (swap in): saturation indicator
# If si/so (swap in/out) stay above 0, the system is memory-saturated

# Find memory-hungry processes
ps aux --sort=-%mem | head -10

# Check for OOMKills in kernel log
dmesg | grep -i "oom\|killed process"

# Load average interpretation
uptime                         # 1m, 5m, 15m load averages
# Load average = number of processes in R (running) or D (uninterruptible sleep)
# Load of 1.0 on 1 CPU = 100% utilised, 1.0 on 4 CPUs = 25% utilised
# Load consistently > nCPUs indicates saturation (CPU, or I/O if CPU usage is low)
nproc                          # number of CPU cores

Systemd Service Management

# Service management
systemctl status nginx         # status, recent logs, enabled state
systemctl start nginx
systemctl stop nginx
systemctl restart nginx
systemctl reload nginx         # reload config without full restart (if supported)
systemctl enable nginx         # start on boot
systemctl disable nginx        # don't start on boot
systemctl list-units --failed  # show all failed units

# Write a custom systemd service
cat > /etc/systemd/system/myapp.service << 'EOF'
[Unit]
Description=My Application
After=network.target

[Service]
Type=simple
User=appuser
WorkingDirectory=/opt/myapp
ExecStart=/opt/myapp/bin/server --port=8080
Restart=on-failure
RestartSec=5s
StandardOutput=journal
StandardError=journal

[Install]
WantedBy=multi-user.target
EOF

systemctl daemon-reload
systemctl enable --now myapp

File Permissions and Users

# Permission breakdown: -rwxr-xr-- = type + owner + group + others
# r=4, w=2, x=1; 755 = rwxr-xr-x; 644 = rw-r--r--
chmod 755 /opt/myapp/server                # owner=rwx, group=rx, others=rx
chmod 600 ~/.ssh/id_rsa                    # SSH key must be 600 (owner read/write only)
chmod -R 750 /opt/myapp                    # recursive, directories need x to enter
chown appuser:appgroup /opt/myapp
chown -R appuser:appgroup /opt/myapp/data

# Find files with dangerous permissions
find / -perm -4000 -type f 2>/dev/null     # SUID files (run as owner)
find /etc -perm -0002 -type f 2>/dev/null  # world-writable config files (dangerous)

# User management
useradd -m -s /bin/bash -G docker appuser  # create user with home and docker group
usermod -aG sudo appuser                   # add to sudo group
passwd appuser                             # set password
id appuser                                 # show UID, GID, groups
su - appuser                               # switch to user with login shell
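To make the r=4, w=2, x=1 arithmetic concrete, here is a small sketch that converts a nine-character permission string into its octal form and then checks a real file. The helper name `perm_to_octal` is hypothetical, it ignores special bits such as SUID and sticky, and `stat -c` assumes GNU coreutils.

```shell
# perm_to_octal STRING: convert e.g. "rwxr-xr--" to "754".
# Each triplet is summed as r=4, w=2, x=1. Special bits (s, t) are not handled.
perm_to_octal() {
  echo "$1" | awk '{
    out = ""
    for (i = 1; i <= 7; i += 3) {
      t = substr($0, i, 3); n = 0
      if (substr(t, 1, 1) == "r") n += 4
      if (substr(t, 2, 1) == "w") n += 2
      if (substr(t, 3, 1) == "x") n += 1
      out = out n
    }
    print out
  }'
}

perm_to_octal rwxr-xr-x   # prints 755
perm_to_octal rw-r--r--   # prints 644

# Check the result against a real file (GNU stat syntax).
f=$(mktemp)
chmod 640 "$f"
stat -c '%a %A' "$f"      # prints 640 -rw-r-----
```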

Shell Scripting for DevOps Automation

#!/bin/bash
# Production-grade bash script template
set -euo pipefail   # -e exit on error, -u error on undefined var, -o pipefail
IFS=$'\n\t'         # safer word splitting

SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
LOG_FILE="/var/log/myapp/deploy.log"

log()   { echo "[$(date '+%Y-%m-%d %H:%M:%S')] $*" | tee -a "$LOG_FILE"; }
error() { log "ERROR: $*" >&2; exit 1; }

# Check required tools
command -v kubectl &>/dev/null || error "kubectl not found"
command -v aws &>/dev/null || error "aws CLI not found"

# Use functions to organise code
deploy_service() {
  local service="${1:?Service name required}"
  local image_tag="${2:?Image tag required}"
  log "Deploying $service with tag $image_tag"
  kubectl set image "deployment/$service" \
    "$service=123456.dkr.ecr.ap-south-1.amazonaws.com/$service:$image_tag" \
    -n production
  kubectl rollout status "deployment/$service" -n production --timeout=5m \
    || error "Rollout failed for $service"
  log "Deployment of $service successful"
}

# Array iteration
SERVICES=("api" "worker" "scheduler")
for svc in "${SERVICES[@]}"; do
  deploy_service "$svc" "${IMAGE_TAG:-latest}"
done

12 Linux Interview Questions with Expert Answers

Q1: What is the difference between a process and a thread in Linux?
A process is an independent execution unit with its own memory address space, file descriptor table, and PID. It is created with fork() — which copies the parent's address space (copy-on-write in Linux). A thread is a lightweight execution unit that shares the memory address space, file descriptors, and other resources with other threads in the same process. In Linux, both are implemented as tasks and scheduled by the same kernel scheduler — the distinction is which resources they share (clone() system call controls what is shared). Practical implications: a multi-threaded application can share data between threads efficiently but must use synchronisation (mutexes, channels) to avoid data races. A crash in one thread can kill the entire process. Separate processes are more isolated — a crash doesn't affect other processes — but communication requires IPC (pipes, sockets, shared memory), which is more expensive than shared memory between threads.
Q2: What does load average mean and when is it a problem?
Load average in Linux is the average number of processes that are runnable (R: running or waiting for a CPU) or in uninterruptible sleep (D: usually waiting for I/O) over the last 1, 5, and 15 minutes. It is shown by uptime and on the first line of top. If the load is purely CPU-bound, a load average equal to the number of CPU cores means the CPUs are roughly fully utilised, and a load above the core count indicates saturation: processes are queuing for CPU time. A load of 4.0 on a 1-core system is severe. A load of 4.0 on a 16-core system is fine. Critically, load average also counts D-state processes, so high load with low CPU utilisation often indicates disk I/O saturation. Check iostat to confirm. Transient spikes are normal; a sustained load average above nCPUs for more than 5 minutes warrants investigation.
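The per-core arithmetic can be sketched in a few lines of shell. The helper name `load_per_core` is illustrative; it simply divides a load figure by a core count, and the live reading assumes a Linux /proc filesystem.

```shell
# load_per_core LOAD CORES: print LOAD / CORES with two decimals.
load_per_core() {
  awk -v load="$1" -v cores="$2" 'BEGIN { printf "%.2f\n", load / cores }'
}

# The two examples from the text.
load_per_core 4.0 16   # prints 0.25 (comfortable)
load_per_core 4.0 1    # prints 4.00 (severely saturated)

# Live value for this host, if /proc/loadavg is available.
if [ -r /proc/loadavg ]; then
  read -r one five fifteen _ < /proc/loadavg
  cores=$(nproc 2>/dev/null || getconf _NPROCESSORS_ONLN)
  echo "1m load per core: $(load_per_core "$one" "$cores")"
fi
```

A sustained value above 1.00 per core is the signal to look further; whether the cause is CPU or I/O still has to be confirmed with tools such as vmstat and iostat.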
Q3: How do you troubleshoot a server that is completely unresponsive via SSH?
Systematic approach without SSH access: (1) Check the cloud console (AWS EC2 console, GCP console). The instance status checks tell you whether the hypervisor-level health check is passing. (2) Use the EC2 Serial Console or AWS Systems Manager Session Manager (which doesn't use SSH) to get a terminal session if SSH is down. (3) Get the instance console output (AWS: Actions > Monitor and troubleshoot > Get system log). It shows kernel panic messages, OOM events, and boot errors that never make it to syslog. (4) Check the network: security group rules may be blocking SSH, or the network interface may be down. (5) If accessible via console: check dmesg | tail -50 for kernel errors, journalctl -b --no-pager | tail -100 for recent events, df -h for a full disk (a full / or /var causes all sorts of failures), free -h for memory pressure, and ps aux for runaway processes. In my experience, logs filling up /var is one of the most common causes of sudden SSH failures on otherwise healthy instances.
Q4: Explain the difference between hard links and soft links.
A hard link is a directory entry that points directly to the same inode (the on-disk structure holding the file's metadata and the pointers to its data blocks) as the original file. Both the original and the hard link are equal: there is no "original" vs "copy". Deleting one does not delete the data until all hard links pointing to that inode are removed (the link count reaches zero). Hard links cannot span filesystems (inodes are filesystem-specific) and cannot link to directories. A symbolic link (soft link) is a file that contains a path string pointing to another file. It is a separate inode and a separate file. If the target file is deleted or moved, the symlink becomes a dangling link: it points to a non-existent path. Symlinks can span filesystems, can point to directories, and can use relative or absolute paths. In DevOps practice: use symlinks to switch between versions of a binary or config file (ln -sfn /opt/app-v2.1 /opt/app). The -n flag matters when /opt/app already links to a directory, because without it ln creates the new link inside that directory. Note that ln -sf removes and recreates the link, so for a truly atomic switch, create the new link under a temporary name and rename it over the old one.
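The behaviour described above can be reproduced in a scratch directory. This is a minimal sketch: the file and directory names are made up for the demo, and `mv -T` assumes GNU coreutils.

```shell
work=$(mktemp -d)
cd "$work"

# One inode, two names, plus a symlink that stores only a path.
echo "data" > original.txt
ln original.txt hard.txt        # hard link: second name for the same inode
ln -s original.txt soft.txt     # symlink: separate file containing "original.txt"

rm original.txt
cat hard.txt                    # prints: data (the inode still has one link)
[ -e soft.txt ] || echo "soft.txt is dangling"

# Atomic version switch: build the new link under a temporary name,
# then rename it over the old one in a single rename(2) call.
mkdir app-v2.0 app-v2.1
ln -s app-v2.0 current
ln -s app-v2.1 current.tmp
mv -T current.tmp current       # -T treats "current" as a file, not a directory to move into
readlink current                # prints: app-v2.1
```

Readers of `current` see either the old target or the new one, never a missing link, which is the property a plain `ln -sf` cannot guarantee.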
Q5: What is the /proc filesystem and how is it useful for debugging?
/proc is a virtual filesystem that exposes kernel and process information as files. It contains no actual data on disk: reads call into the kernel and return live data. Most useful for debugging: /proc/<PID>/environ (environment variables of a running process, NUL-separated, so use tr '\0' '\n' < /proc/1234/environ), /proc/<PID>/cmdline (exact command and arguments), /proc/<PID>/fd (file descriptors, to see which files and sockets a process has open), /proc/<PID>/net/tcp (active TCP connections in that process's network namespace), /proc/meminfo (detailed kernel memory breakdown), /proc/loadavg (load averages), /proc/cpuinfo (CPU topology), /proc/sys (tunable kernel parameters; sysctl reads from here). In containers, /proc reflects the container's own PID namespace: /proc/1 inside a container is the container's PID 1, not the host's PID 1.
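A short sketch of reading these files for the current shell follows. The helper names `proc_cmdline` and `proc_fd_count` are hypothetical, and everything assumes a Linux host with /proc mounted.

```shell
# proc_cmdline PID: print the NUL-separated command line as space-separated words.
proc_cmdline() {
  tr '\0' ' ' < "/proc/$1/cmdline"
  echo
}

# proc_fd_count PID: count the open file descriptors of a process.
proc_fd_count() {
  ls "/proc/$1/fd" | wc -l
}

pid=$$   # inspect the current shell
echo "cmdline:  $(proc_cmdline "$pid")"
echo "open fds: $(proc_fd_count "$pid")"

# environ holds the environment as it was when the process started.
tr '\0' '\n' < "/proc/$pid/environ" | grep '^PATH='

cat /proc/loadavg
```

Reading another user's /proc/<PID>/environ or fd directory normally requires root.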
Q6: How do you identify and kill a process using a specific port?
To find what is listening on port 8080 and kill it: ss -tulnp | grep :8080 gives the PID in the output. Alternatively, lsof -i :8080 (list open files for network files) shows the process name and PID. fuser -n tcp 8080 directly outputs the PID. Then kill -15 <PID> for graceful shutdown or kill -9 <PID> for immediate kill. One-liner: fuser -k 8080/tcp finds and kills in one step. In a Kubernetes environment, you would not kill processes on the node directly — use kubectl delete pod to let Kubernetes manage the container lifecycle. Killing a container process on the node directly may cause kubelet to immediately restart the container but could leave the pod in an inconsistent state and bypass proper graceful shutdown.
Q7: What is inode exhaustion and how do you diagnose and fix it?
Every file on an ext4 or XFS filesystem consumes one inode, a data structure holding file metadata (owner, permissions, timestamps, pointers to data blocks). An ext4 filesystem has a fixed number of inodes allocated at creation time (XFS allocates inodes dynamically, so it hits this limit far less often). When inodes are exhausted, you cannot create new files even if there is plenty of disk space: you get "No space left on device" errors even though df -h shows free space. Diagnose with df -ih (the -i flag shows inode usage). The IUse% column will show 100%. Fix: find the directory with an enormous number of tiny files. Common culprits are a runaway logging process writing millions of tiny log files, a mail spool filling up, a /tmp directory with thousands of session files, or a build cache. Delete the files in batches. Long-term: on ext4, inode density is chosen at format time. mkfs.ext4 -T largefile4 allocates fewer inodes for a few large files, -T small allocates more for many small files, and -i sets the bytes-per-inode ratio directly. The ratio cannot be changed on an existing ext4 filesystem without reformatting.
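Finding the directory that holds the bulk of the inodes can be sketched as below. The helpers `count_entries` and `inode_hogs` are illustrative names; the count is an approximation, since hard-linked files share one inode, and the demo runs on a throwaway tree rather than a real filesystem.

```shell
# count_entries DIR: number of files, directories and links under DIR,
# including DIR itself. Each entry normally consumes one inode.
count_entries() {
  find "$1" -xdev 2>/dev/null | wc -l
}

# inode_hogs PATH [N]: rank the immediate subdirectories of PATH by entry count.
inode_hogs() {
  for d in "$1"/*/; do
    [ -d "$d" ] || continue
    printf '%s\t%s\n' "$(count_entries "$d")" "$d"
  done | sort -rn | head -n "${2:-10}"
}

# Demo tree: "many" holds five files, "few" holds one.
demo=$(mktemp -d)
mkdir "$demo/many" "$demo/few"
for i in 1 2 3 4 5; do : > "$demo/many/f$i"; done
: > "$demo/few/f1"

inode_hogs "$demo"   # lists .../many/ (6 entries) before .../few/ (2 entries)
```

On a real incident you would point `inode_hogs` at the mount point reported by df -ih and descend into the largest directory until the culprit is obvious.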
Q8: How does the Linux kernel OOM killer work?
When a Linux system runs out of physical memory, swap is exhausted, and the kernel cannot reclaim enough pages, the kernel's OOM (Out of Memory) killer activates. It calculates an "oom_score" for each process (higher score = more likely to be killed). On current kernels the score is based mainly on the process's memory footprint (resident memory, swap, and page tables) relative to the memory available, adjusted by the process's oom_score_adj setting (-1000 = never kill, +1000 = kill first). Older kernels also weighed factors such as process runtime and root ownership; current kernels do not. The OOM killer selects the highest-scored process and sends SIGKILL. This appears in dmesg as a line such as "Out of memory: Killed process <PID> (<name>)" (older kernels print "Kill process <PID> (<name>) score <N>"). In container environments (Docker, Kubernetes), the OOM killer is also triggered by cgroup memory limits: a container exceeding its limit gets OOMKilled even if the host has free memory. Tune oom_score_adj for critical system processes: echo -500 > /proc/<PID>/oom_score_adj.
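To see which processes the kernel currently ranks as the most likely victims, a sketch like the following reads /proc/<PID>/oom_score directly. The function name `top_oom_scores` is hypothetical, and the output depends entirely on the host it runs on.

```shell
# top_oom_scores [N]: print "score<TAB>pid<TAB>name" for the N processes
# with the highest oom_score (default 5).
top_oom_scores() {
  for f in /proc/[0-9]*/oom_score; do
    [ -r "$f" ] || continue
    pid=${f#/proc/}; pid=${pid%/oom_score}
    # A process may exit between the glob and the read, so skip failures.
    score=$(cat "$f" 2>/dev/null) || continue
    name=$(cat "/proc/$pid/comm" 2>/dev/null) || continue
    printf '%s\t%s\t%s\n' "$score" "$pid" "$name"
  done | sort -rn | head -n "${1:-5}"
}

top_oom_scores 5

# The adjustment inherited by this shell's children (0 unless something changed it).
cat /proc/self/oom_score_adj
```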
Q9: What is the difference between SIGTERM and SIGKILL?
SIGTERM (signal 15) is a polite termination request that the process can catch, handle, or ignore. When a process receives SIGTERM, it can run cleanup code: flush write buffers, close database connections, finish in-flight requests, delete temporary files, and log a shutdown message before exiting. Kubernetes sends SIGTERM to containers when a Pod is terminated, waits terminationGracePeriodSeconds (default 30s), then sends SIGKILL if the container hasn't exited. SIGKILL (signal 9) is handled entirely by the kernel and cannot be caught, ignored, or blocked by the process. The kernel terminates the process regardless of what it is doing, and no cleanup code runs. Use SIGKILL only when SIGTERM fails and the process must be killed immediately. Always try SIGTERM first (kill -15 or just kill) and only escalate to SIGKILL if the process doesn't respond within a reasonable time. Never send SIGKILL to a database process unless absolutely necessary, because it risks data corruption.
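The difference is easy to observe with two throwaway workers. In this sketch both workers install a SIGTERM trap; the first is sent SIGTERM and runs its cleanup, the second is sent SIGKILL and never gets the chance. The log file and messages are made up for the demo.

```shell
log=$(mktemp)

# Worker 1: traps SIGTERM, runs its cleanup handler, exits 0.
sh -c 'trap "echo cleanup-done; exit 0" TERM; while :; do sleep 1; done' > "$log" &
worker=$!
sleep 1                       # give the worker time to install its trap
kill -TERM "$worker"
term_status=0
wait "$worker" || term_status=$?
echo "exit status after SIGTERM: $term_status"   # 0: the handler chose the exit code

# Worker 2: same kind of trap, but SIGKILL never reaches it.
sh -c 'trap "echo never-printed" TERM; while :; do sleep 1; done' >> "$log" &
victim=$!
sleep 1
kill -KILL "$victim"
kill_status=0
wait "$victim" || kill_status=$?
echo "exit status after SIGKILL: $kill_status"   # 137 = 128 + signal 9

cat "$log"                    # contains cleanup-done only
```

The shell runs a trap only after its current foreground command returns, which is why the SIGTERM path can take up to a second here; real services should keep their handlers similarly short so they finish well inside the grace period.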
Q10: How do you optimise a slow bash script?
Profile first: use bash -x script.sh 2>trace.txt to see every command, then identify slow sections with time around suspect blocks. Common optimisations: (1) Avoid spawning subprocesses in loops, because each $(command) forks a new process. For simple string manipulation, use built-in bash parameter expansion instead of calling sed/awk/tr. (2) Reduce external command calls: replace cat file | grep pattern with grep pattern file (cat is useless here). (3) Use mapfile/readarray instead of a while read loop for reading files into arrays. (4) Run independent operations in parallel: command1 & command2 & wait. (5) For heavy processing, replace bash with Python. Bash is slow for anything involving complex logic, data structures, or large loops. (6) Cache results of expensive commands in variables rather than calling them multiple times.
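As an illustration of the first and fourth points, the sketch below derives the same values once with external commands and once with built-in parameter expansion, then runs two independent jobs in parallel. The path is just an example value.

```shell
path="/var/log/nginx/access.log"

# Subprocess versions: each $(...) forks and runs an external command.
base_slow=$(basename "$path")
dir_slow=$(dirname "$path")

# Built-in parameter expansion: same results, no fork.
base_fast=${path##*/}    # remove longest prefix ending in "/"   -> access.log
dir_fast=${path%/*}      # remove shortest suffix from last "/"  -> /var/log/nginx
ext=${base_fast##*.}     # -> log
stem=${base_fast%.*}     # -> access

echo "$base_fast $dir_fast $ext $stem"

# Independent operations in parallel: roughly 1s of wall time instead of 2s.
sleep 1 & sleep 1 & wait
```

Inside a loop over thousands of paths, the fork-free versions are where most of the time saving comes from.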
Q11: What is a zombie process and how do you handle it?
A zombie process (Z state in ps output) is a process that has completed execution but still has an entry in the process table because its parent has not yet read its exit status. When a child process exits, the kernel sends SIGCHLD to its parent and the child enters zombie state until the parent calls wait() to collect its exit code. Once the parent calls wait(), the zombie entry is removed. Zombies cannot be killed with SIGKILL because they are already dead. They consume a process table entry but no CPU or memory. A small number of zombies is harmless. Many zombies indicate a poorly written parent process that does not call wait() on its children, or a parent that is hung and never gets around to reaping them. Fix: kill the parent process. When the parent dies, init (PID 1) adopts all orphaned children and immediately calls wait() on any zombies, clearing them. In Docker containers, this is why you need a proper init process (tini, s6-overlay) as PID 1: your application process typically doesn't handle SIGCHLD and zombie reaping.
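Zombies and the parents responsible for them can be listed with ps. The sketch below also creates a short-lived zombie so there is something to see; the helper names are illustrative and the ps options assume the procps version found on most Linux distributions.

```shell
# zombie_filter: keep only lines whose third column (process state) starts with Z.
zombie_filter() { awk '$3 ~ /^Z/'; }

# list_zombies: print PID, PPID, state and command name for every zombie.
list_zombies() { ps -eo pid=,ppid=,stat=,comm= | zombie_filter; }

# Demo: the inner shell starts a child that exits at once, then replaces itself
# with "sleep 2", which never calls wait(). The child stays a zombie until
# sleep 2 exits and init reaps it.
sh -c 'sleep 0 & exec sleep 2' >/dev/null 2>&1 &
sleep 1
list_zombies      # the PPID column identifies the parent that is not reaping
wait              # let the demo parent finish
```

The PPID column is the useful part in practice: it tells you which parent to fix or restart.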
Q12: How do you investigate a sudden spike in CPU usage on a production server?
Systematic investigation: (1) top or htop to identify which process(es) are consuming CPU. Note the PID. (2) ps aux --sort=-%cpu | head -10 to confirm the top CPU consumers and get their command lines. (3) strace -p <PID> to attach to the process and see what system calls it is making. A process spinning in a system-call loop shows the same calls repeating, while one burning CPU purely in user space shows few or no system calls, which points you to perf instead. Keep strace sessions short in production because tracing slows the process down. (4) perf top -p <PID> shows which functions inside the process are consuming CPU (kernel and user space). (5) If it is a Java process: jstack <PID> for a thread dump showing which threads are running and what they are doing. (6) Check if the spike correlates with a recent deployment (check deployment timestamps vs metric spike). (7) Check application logs around the spike time for errors, slow queries, or unusual request patterns. (8) Check external load: are requests to the service unusually high? Check ALB/Ingress metrics. High CPU is most often caused by a hot code path being called unexpectedly frequently, a regex or algorithm with poor time complexity, a background job that ran at an unusual time, or a traffic spike.
