Prometheus & Grafana Complete Guide 2026: PromQL, Alerting, SLOs & Expert Interview Q&A
📅 Updated May 2026
⏰ 22 min read
🏷 Prometheus · Grafana · SRE · Observability · PromQL
Dhanush R — Senior DevOps Engineer
4.5+ years building production monitoring stacks with Prometheus, Grafana, AlertManager, and Thanos on AWS EKS. Every PromQL query and alerting pattern in this guide has been used in real production SRE workflows.
Prometheus is the de facto standard for metrics-based monitoring in cloud-native environments. Combined with Grafana for visualisation and AlertManager for notification routing, it forms the observability backbone of most production Kubernetes deployments. I have built and operated Prometheus monitoring stacks for production EKS clusters for 4.5 years: writing PromQL queries for SLO dashboards, tuning alert thresholds to reduce alert fatigue, and architecting Thanos for long-term multi-cluster metrics storage. This guide covers everything from core concepts to advanced SLO monitoring.
What is Prometheus? Architecture and Pull Model
Prometheus is an open-source monitoring system that collects metrics by pulling (scraping) HTTP endpoints exposed by targets. This is the opposite of push-based systems (StatsD, Graphite, Datadog Agent) where applications push metrics to a central collector. The pull model has a key operational advantage: Prometheus can detect whether a target is down (scrape fails) rather than just seeing silence (push stopped). If a service stops pushing metrics, a push-based system cannot distinguish between "service is healthy but not producing metrics" and "service is down."
Prometheus components: the Prometheus server scrapes targets on a configurable interval (1m by default; 15s to 30s is a common production setting), evaluates alerting and recording rules, and stores time-series data in its local TSDB (Time Series Database). Exporters expose metrics from systems that don't natively speak Prometheus: Node Exporter (Linux OS metrics), cAdvisor (container metrics), Blackbox Exporter (probing HTTP/TCP/DNS endpoints), and hundreds of community exporters. AlertManager receives alerts from Prometheus, deduplicates, groups, silences, and routes them to PagerDuty, OpsGenie, Slack, email, or webhooks. Pushgateway is the exception to the pull model: batch jobs and short-lived jobs push metrics to the gateway, which Prometheus then scrapes.
Why Prometheus won: The combination of a simple exposition format (plain-text HTTP /metrics endpoint), a dimensional data model (arbitrary key-value labels on every series), a powerful functional query language (PromQL), and native Kubernetes integration (ServiceMonitors via Prometheus Operator) made Prometheus the default choice for cloud-native observability. Most major observability platforms (Datadog, New Relic, Grafana Cloud) can now ingest metrics in the Prometheus exposition format, and several also offer PromQL or a PromQL-compatible query layer.
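To make the "simple exposition format" concrete, here is a minimal Python sketch that renders one metric family the way a /metrics endpoint would serve it. The helper name render_metric and the sample values are illustrative assumptions, and the sketch skips the label-value escaping and optional timestamps that a real client library handles.

```python
def render_metric(name, help_text, metric_type, samples):
    """Render one metric family in the Prometheus text exposition format.

    samples: list of (labels_dict, value) pairs.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} {metric_type}"]
    for labels, value in samples:
        if labels:
            # Labels are rendered as key="value" pairs inside braces.
            body = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
            lines.append(f"{name}{{{body}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


# Hypothetical sample data for illustration only.
text = render_metric(
    "http_requests_total",
    "Total HTTP requests.",
    "counter",
    [
        ({"method": "GET", "status": "200"}, 1027),
        ({"method": "GET", "status": "500"}, 3),
    ],
)
print(text)
```

Each label combination becomes its own line, which is also why every extra label value adds a time series on the server side.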
Prometheus Metric Types Explained
Prometheus has four metric types, each with specific use cases. Using the wrong type is one of the most common mistakes in custom instrumentation:
- Counter: A value that only ever increases (or resets to zero on restart). Ideal for: total HTTP requests, total errors, total bytes sent. Never use a counter for values that go up and down (like current active connections). In PromQL, always use rate() or increase() on counters, because the raw counter value is rarely meaningful. Example: http_requests_total{status="500"}
- Gauge: A value that can go up and down. Ideal for: current memory usage, number of active goroutines, current queue depth, temperature. Unlike counters, gauge values are meaningful on their own. Example: node_memory_MemAvailable_bytes
- Histogram: Samples observations into configurable buckets and provides a count and sum. Used for measuring the distribution of values: request latency, response size. Creates three kinds of time series per metric: _bucket (one series per configured boundary, plus +Inf), _count (total observations), _sum (sum of all observed values). Allows calculating percentiles with histogram_quantile(). Example: http_request_duration_seconds_bucket{le="0.1"}
- Summary: Similar to Histogram but calculates quantiles on the client side. Less flexible than Histogram for PromQL aggregation (quantiles cannot be aggregated across instances), but it can report a pre-selected quantile for a single instance without depending on bucket boundaries. Prefer Histogram in almost all cases because server-side quantile calculation allows aggregation.
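The Histogram series described above can be sketched in a few lines of Python. This is a toy model, not a client library: the class name Histogram, the bucket boundaries, and the latency samples are assumptions chosen for illustration. It shows that buckets are cumulative, so one observation increments every bucket whose upper bound is at or above the observed value.

```python
import math


class Histogram:
    """Toy histogram that tracks cumulative buckets, a count, and a sum."""

    def __init__(self, bounds):
        # The +Inf bucket is always present and equals the total count.
        self.bounds = sorted(bounds) + [math.inf]
        self.buckets = [0] * len(self.bounds)
        self.count = 0
        self.total = 0.0

    def observe(self, value):
        # Cumulative semantics: bump every bucket with le >= value.
        for i, le in enumerate(self.bounds):
            if value <= le:
                self.buckets[i] += 1
        self.count += 1
        self.total += value


h = Histogram([0.1, 0.5, 1.0])
for latency in [0.05, 0.2, 0.3, 0.7, 2.0]:
    h.observe(latency)

for le, n in zip(h.bounds, h.buckets):
    print(f'http_request_duration_seconds_bucket{{le="{le}"}} {n}')
print(f"http_request_duration_seconds_count {h.count}")
print(f"http_request_duration_seconds_sum {h.total}")
```

Because every bucket is itself a monotonically increasing counter, rate() can be applied to buckets before they are summed across instances.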
PromQL — The Query Language
PromQL is a functional query language that operates on time series data. SRE-level interviews commonly test PromQL knowledge. Here are the most important functions and patterns:
# rate() vs irate() vs increase()
# rate() - per-second average rate over the range window (smooth)
rate(http_requests_total[5m])
# irate() - instantaneous rate based on last 2 samples (spiky, best for fast-moving)
irate(http_requests_total[5m])
# increase() - total increase over the window (for dashboards showing total count)
increase(http_requests_total[1h])
# HTTP error rate as percentage
100 * sum(rate(http_requests_total{status=~"5.."}[5m]))
  / sum(rate(http_requests_total[5m]))
# 99th percentile request latency (requires histogram metric)
histogram_quantile(0.99,
  sum by (le, service) (
    rate(http_request_duration_seconds_bucket[5m])
  )
)
# CPU utilisation per node
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
# Memory usage percentage
100 * (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes))
# Pod restarts in last 30 minutes (detect CrashLoopBackOff)
increase(kube_pod_container_status_restarts_total[30m]) > 2
# Top 5 pods by CPU usage
topk(5,
  sum by (pod, namespace) (
    rate(container_cpu_usage_seconds_total{container!=""}[5m])
  )
)
# Disk space will run out in N hours (linear prediction)
predict_linear(node_filesystem_avail_bytes{mountpoint="/"}[6h], 4*3600) < 0
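The idea behind predict_linear() is an ordinary least-squares line fitted to the samples in the window and extended into the future. The Python sketch below illustrates that idea under simplifying assumptions: it projects from the timestamp of the last sample rather than from the query evaluation time, and the function name and disk-usage numbers are invented for the example.

```python
def predict_linear(samples, seconds_ahead):
    """Fit a least-squares line to (timestamp, value) samples and return the
    value predicted seconds_ahead after the last sample."""
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    cov = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    var = sum((t - mean_t) ** 2 for t, _ in samples)
    slope = cov / var
    intercept = mean_v - slope * mean_t
    return intercept + slope * (samples[-1][0] + seconds_ahead)


GIB = 1024 ** 3
# Hypothetical disk: 8 GiB free, shrinking by 1 GiB per hour, sampled hourly for 6h.
samples = [(hour * 3600, (8 - hour) * GIB) for hour in range(7)]

predicted = predict_linear(samples, 4 * 3600)
will_fill = predicted < 0  # same comparison as the alert expression
print(predicted / GIB, will_fill)
```

With 2 GiB left and a steady loss of 1 GiB per hour, the projection four hours ahead is negative, which is the condition the query compares against zero.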
Label Matching and Aggregation
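# Aggregation operators (apply rate() first when the metric is a counter)
sum(rate(http_requests_total[5m])) by (service) # per-second rate summed by service label
sum(rate(http_requests_total[5m])) without (instance) # sum dropping instance label
avg(rate(node_cpu_seconds_total{mode!="idle"}[5m])) by (instance) # average non-idle rate per node
max(kube_pod_container_resource_requests{resource="memory"}) by (namespace)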
# Label matching operators
http_requests_total{status="200"} # exact match
http_requests_total{status!="200"} # not equal
http_requests_total{status=~"2.."} # regex match (2xx)
http_requests_total{status!~"2.."} # regex not match
# Multi-label matching
http_requests_total{service="api", status=~"5..", region="ap-south-1"}
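The four matching operators can be modelled in Python to make two details explicit: regex matchers are fully anchored (so "2.." matches "200" but not "2000"), and a label that is absent behaves like an empty string. The function name and the sample series are assumptions for illustration, and Python's re module stands in for the RE2 syntax that Prometheus uses.

```python
import re


def matches(labels, matchers):
    """Return True if a series' labels satisfy every matcher.

    matchers: list of (label_name, operator, value) tuples,
    with operator one of '=', '!=', '=~', '!~'.
    """
    for name, op, value in matchers:
        actual = labels.get(name, "")  # missing label behaves like ""
        if op == "=":
            ok = actual == value
        elif op == "!=":
            ok = actual != value
        elif op == "=~":
            ok = re.fullmatch(value, actual) is not None  # fully anchored
        elif op == "!~":
            ok = re.fullmatch(value, actual) is None
        else:
            raise ValueError(f"unknown operator: {op}")
        if not ok:
            return False
    return True


series = {"service": "api", "status": "503", "region": "ap-south-1"}
selector = [("service", "=", "api"), ("status", "=~", "5.."), ("region", "=", "ap-south-1")]
print(matches(series, selector))
```

All matchers in a selector are combined with AND, which is why adding a matcher can only narrow the result set.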
Recording Rules for Query Performance
Complex PromQL expressions evaluated on large datasets can be slow, especially when used in Grafana dashboards with multiple panels all running the same expensive query. Recording rules pre-compute expensive expressions at each rule evaluation interval and store the result as a new time series. Dashboards then query the pre-computed series, which is much faster than re-evaluating the original expression on every panel refresh.
# recording-rules.yaml
groups:
  - name: http_request_rates
    interval: 30s # evaluate every 30s instead of on-demand
    rules:
      - record: job:http_requests_total:rate5m
        expr: sum by (job, status) (rate(http_requests_total[5m]))
      - record: job:http_error_rate:ratio5m
        expr: |
          sum by (job) (rate(http_requests_total{status=~"5.."}[5m]))
          /
          sum by (job) (rate(http_requests_total[5m]))
      - record: job:http_p99_latency:histogram_quantile5m
        expr: |
          histogram_quantile(0.99,
            sum by (job, le) (rate(http_request_duration_seconds_bucket[5m]))
          )
  - name: node_resource_usage
    interval: 60s
    rules:
      - record: instance:node_cpu_utilisation:rate5m
        expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
      - record: instance:node_memory_utilisation:ratio
        expr: 1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)
Alerting Rules and AlertManager
Good alerting is one of the hardest problems in SRE. Too many alerts cause alert fatigue — on-call engineers start ignoring pages. Too few alerts mean outages go undetected. The key principle: alert on symptoms, not causes. Alert on high error rate (what users experience), not on "pod restarted" (a cause that may be self-healing).
# alerting-rules.yaml: production-quality alerts
groups:
  - name: api-slos
    rules:
      # Alert on user-facing error rate (symptom)
      - alert: HighErrorRate
        expr: job:http_error_rate:ratio5m{job="api"} > 0.01
        for: 5m # must be true for 5 min before firing (avoids flaps)
        labels:
          severity: critical
          team: platform
        annotations:
          summary: "API error rate above 1% for 5 minutes"
          # $value is a ratio (0.01 = 1%), so format it as a percentage
          description: "Error rate is {{ $value | humanizePercentage }} on {{ $labels.job }}"
          runbook_url: "https://wiki.company.com/runbooks/api-high-error-rate"
      # Alert on latency SLO breach
      - alert: HighP99Latency
        expr: job:http_p99_latency:histogram_quantile5m{job="api"} > 1.0
        for: 10m
        labels:
          severity: warning
          team: platform
        annotations:
          summary: "API p99 latency above 1s"
          description: "p99 latency is {{ printf \"%.3f\" $value }}s for {{ $labels.job }}"
      # Alert when disk will fill in 4 hours
      - alert: DiskWillFillSoon
        expr: predict_linear(node_filesystem_avail_bytes{mountpoint="/"}[6h], 4*3600) < 0
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "Disk on {{ $labels.instance }} will fill within 4 hours"
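The effect of the for: clause is easiest to see as a small state machine: an alert whose expression becomes true is first pending, and it only fires once the expression has stayed true for the whole duration. The Python sketch below is a simplified model, assuming a fixed evaluation interval and expressing the for: duration as a number of evaluations; the function name is hypothetical.

```python
def alert_states(condition_by_eval, for_evals):
    """Return the alert state after each rule evaluation.

    condition_by_eval: list of booleans, one per evaluation of the expression.
    for_evals: the `for:` duration divided by the evaluation interval.
    """
    states = []
    streak = 0  # consecutive evaluations where the expression was true
    for is_true in condition_by_eval:
        streak = streak + 1 if is_true else 0
        if streak == 0:
            states.append("inactive")
        elif streak <= for_evals:
            states.append("pending")   # true, but not yet for long enough
        else:
            states.append("firing")
    return states


# A brief flap (true, true, false) never fires; a sustained breach does.
history = [True, True, False, True, True, True, True]
print(alert_states(history, for_evals=2))
```

A single false evaluation resets the pending timer, which is how for: filters out short flaps at the cost of a slower first notification.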
AlertManager Configuration
# alertmanager.yml: route alerts to the correct teams
global:
  resolve_timeout: 5m
  slack_api_url: 'https://hooks.slack.com/services/TOKEN'
route:
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 30s      # wait 30s before sending first alert in a group
  group_interval: 5m   # wait 5m before sending additional alerts in group
  repeat_interval: 4h  # resend if still firing after 4h
  receiver: 'slack-warnings'  # default receiver when no child route matches
  routes:
    - match:
        severity: critical
      receiver: 'pagerduty-critical'
      continue: true   # keep evaluating the sibling routes below after this match
    - match:
        team: platform
      receiver: 'slack-platform'
receivers:
  - name: 'slack-warnings'
    slack_configs:
      - channel: '#alerts'  # assumed channel name for the default receiver
        send_resolved: true
  - name: 'pagerduty-critical'
    pagerduty_configs:
      - routing_key: ''  # set to your PagerDuty routing key
        description: '{{ range .Alerts }}{{ .Annotations.summary }}{{ end }}'
  - name: 'slack-platform'
    slack_configs:
      - channel: '#platform-alerts'
        title: '{{ .Status | toUpper }}: {{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ "\n" }}{{ end }}'
        send_resolved: true
inhibit_rules:
  - source_match:
      severity: critical
    target_match:
      severity: warning
    equal: ['alertname', 'cluster'] # suppress warning if critical already firing
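How the route tree picks receivers can be modelled with a short Python sketch. It assumes a flat list of child routes with exact-match label conditions (no nesting, no regex matchers), and the function name is made up for the example. It reproduces the rule that sibling routes are tried top-down, the first match wins, and continue: true lets evaluation carry on to later siblings.

```python
def route_alert(labels, routes, default_receiver):
    """Return the receivers notified for an alert with the given labels.

    routes: ordered list of dicts with 'match', 'receiver', optional 'continue'.
    """
    receivers = []
    for route in routes:
        if all(labels.get(k) == v for k, v in route["match"].items()):
            receivers.append(route["receiver"])
            if not route.get("continue", False):
                break  # first match wins unless continue is set
    # The default receiver is used only when no child route matched.
    return receivers or [default_receiver]


routes = [
    {"match": {"severity": "critical"}, "receiver": "pagerduty-critical", "continue": True},
    {"match": {"team": "platform"}, "receiver": "slack-platform"},
]

print(route_alert({"severity": "critical", "team": "platform"}, routes, "slack-warnings"))
print(route_alert({"severity": "warning", "team": "data"}, routes, "slack-warnings"))
```

Note that a critical alert from a team other than platform reaches only PagerDuty in this model: continue: true does not fall back to the default receiver once a route has matched.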
Prometheus on Kubernetes — kube-prometheus-stack
The kube-prometheus-stack Helm chart is the standard way to deploy Prometheus, Grafana, AlertManager, Node Exporter, kube-state-metrics, and the Prometheus Operator on Kubernetes in a single installation. The Prometheus Operator extends Kubernetes with custom resources: ServiceMonitor (auto-discover services to scrape), PodMonitor (scrape individual pods), PrometheusRule (manage alerting and recording rules as Kubernetes resources), and Alertmanager (deploy and manage AlertManager instances declaratively).
# Install kube-prometheus-stack with Helm
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm upgrade --install kube-prometheus-stack \
  prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace \
  --set prometheus.prometheusSpec.retention=30d \
  --set prometheus.prometheusSpec.storageSpec.volumeClaimTemplate.spec.storageClassName=gp3 \
  --set prometheus.prometheusSpec.storageSpec.volumeClaimTemplate.spec.resources.requests.storage=100Gi \
  --set grafana.persistence.enabled=true \
  --set grafana.persistence.storageClassName=gp3 \
  --set grafana.persistence.size=10Gi
# ServiceMonitor: auto-discover your application's /metrics endpoint
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: api-service-monitor
  namespace: monitoring
  labels:
    release: kube-prometheus-stack # must match Prometheus selector
spec:
  selector:
    matchLabels:
      app: api-server # matches Service with this label
  namespaceSelector:
    matchNames: [production]
  endpoints:
    - port: http
      path: /metrics
      interval: 15s
      scrapeTimeout: 10s
SLOs, SLIs, and Error Budget Monitoring
Service Level Objectives (SLOs) are the cornerstone of SRE practice. An SLO is a target reliability commitment: "99.9% of requests will succeed over a 30-day rolling window." A Service Level Indicator (SLI) is the measurement: the actual success rate. The Error Budget is the amount of downtime or errors permitted before breaching the SLO: 99.9% SLO = 0.1% error budget = 43.2 minutes of allowed downtime per 30-day window (30 × 24 × 60 minutes × 0.001).
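The error budget arithmetic is simple enough to check in a few lines of Python. The helper name below is an assumption for illustration; it converts an SLO target and a window length into minutes of full downtime that the budget would cover.

```python
def allowed_downtime_minutes(slo, window_days):
    """Minutes of total outage that fit inside the error budget."""
    error_budget = 1 - slo                      # e.g. 0.001 for a 99.9% SLO
    window_minutes = window_days * 24 * 60      # 43,200 for 30 days
    return error_budget * window_minutes


three_nines = allowed_downtime_minutes(0.999, 30)
four_nines = allowed_downtime_minutes(0.9999, 30)
print(round(three_nines, 1), round(four_nines, 2))
```

Each additional nine divides the budget by ten, which is why moving from 99.9% to 99.99% is usually a far bigger engineering commitment than the numbers suggest.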
# SLO monitoring with multi-window, multi-burn-rate alerts
# The "burn rate" approach: alert earlier when burning budget faster
# Assumes recording rules job:http_error_rate:ratio1h and job:http_error_rate:ratio6h
# are defined like job:http_error_rate:ratio5m, with [1h] and [6h] windows
# Fast burn: error rate above 14x the budgeted rate (1h window, confirmed by 5m window)
- alert: ErrorBudgetBurnRateFast
  expr: |
    (
      job:http_error_rate:ratio1h{job="api"} > (14 * 0.001)
      and
      job:http_error_rate:ratio5m{job="api"} > (14 * 0.001)
    )
  labels: {severity: critical, page: "true"}
  annotations:
    summary: "Burning error budget 14x faster than target (1h window)"
# Slow burn: error rate above 2x the budgeted rate over 6h
- alert: ErrorBudgetBurnRateSlow
  expr: |
    job:http_error_rate:ratio6h{job="api"} > (2 * 0.001)
  for: 30m
  labels: {severity: warning}
  annotations:
    summary: "Elevated error rate consuming error budget (6h window)"
# Error budget remaining as a fraction of the budget (PromQL for Grafana panel)
# SLO = 99.9% (0.999), window = 30 days
1 - (
  (
    sum(increase(http_requests_total{status=~"5.."}[30d]))
    /
    sum(increase(http_requests_total[30d]))
  ) / 0.001
)
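The burn-rate logic in these rules can be restated in Python to show what the thresholds mean. This is an illustrative sketch: the function names are invented, and the inputs stand in for the values of the error-ratio recording rules. A burn rate of 1 means the budget is being spent exactly as fast as the SLO allows; the fast-burn alert requires both the long and the short window to exceed the threshold.

```python
def burn_rate(error_ratio, slo):
    """How many times faster than allowed the error budget is being spent."""
    return error_ratio / (1 - slo)


def should_page(long_window_ratio, short_window_ratio, slo, threshold):
    """Multi-window check: both windows must exceed the burn-rate threshold."""
    return (burn_rate(long_window_ratio, slo) > threshold
            and burn_rate(short_window_ratio, slo) > threshold)


def budget_remaining(errors, total, slo):
    """Fraction of the error budget still unspent over the SLO window."""
    return 1 - (errors / total) / (1 - slo)


SLO = 0.999
# 2% errors in both windows: burn rate is about 20x, above the 14x threshold.
ongoing_incident = should_page(0.02, 0.02, SLO, threshold=14)
# The 1h window is still high but the last 5m recovered: no page.
already_recovered = should_page(0.02, 0.0005, SLO, threshold=14)
# 500 failed requests out of 1,000,000 uses about half of a 0.1% budget.
remaining = budget_remaining(500, 1_000_000, SLO)
print(ongoing_incident, already_recovered, round(remaining, 3))
```

The short window is what stops the alert from firing long after an incident has ended, since the long window alone stays elevated until the bad period ages out of it.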
Grafana Dashboards and Best Practices
Grafana is the standard visualisation layer for Prometheus. Effective Grafana dashboards follow specific design principles that separate useful operational dashboards from noisy, hard-to-read ones:
- The RED method for service dashboards: Rate (requests per second), Errors (error rate), Duration (latency percentiles). Every service should have a dashboard with these three panels at the top. They answer the first three questions during an incident: how much traffic, how many errors, how slow?
- The USE method for resource dashboards: Utilisation, Saturation, Errors. For every resource (CPU, memory, disk, network), show all three. A node at 95% CPU utilisation whose run queue is consistently longer than its core count is saturated and needs investigation.
- Use template variables for cluster, namespace, and service. A single dashboard template with dropdowns is far more maintainable than dozens of hardcoded dashboards. Define variables as Prometheus label queries: label_values(kube_pod_info, namespace).
- Stat panels for SLO status: A large coloured stat panel showing "99.94% (Target: 99.9%)" with green/red colouring based on threshold is the best way to communicate SLO health at a glance.
- Store dashboards as code: Use Grafonnet (Jsonnet library) or Grafana's provisioning mechanism to store dashboards in Git. Dashboards created manually in the UI are lost when the Grafana pod is restarted without persistence.
Thanos for Long-Term Storage and Multi-Cluster
Prometheus stores data locally with a default 15-day retention. Thanos extends Prometheus for: long-term storage (upload to S3/GCS/Azure Blob), global query (query across multiple Prometheus instances as if they were one), and high availability (deduplicate data from multiple Prometheus replicas). Thanos Sidecar runs alongside each Prometheus instance, uploads TSDB blocks to object storage, and exposes a gRPC endpoint. Thanos Store Gateway serves the blocks already held in object storage. Thanos Query aggregates results from all Sidecar and Store Gateway endpoints. Thanos Compactor compacts and downsamples old data for efficient long-term storage. This architecture provides effectively unlimited retention without scaling Prometheus local storage.
Troubleshooting Common Prometheus Issues
- Target shows as DOWN in /targets: Check that the target's /metrics endpoint is accessible from the Prometheus pod's network. Check security groups or Network Policies blocking the scrape port. Run kubectl port-forward svc/myapp 8080:8080 and test manually. If the target does not appear in /targets at all, verify that the ServiceMonitor label selectors match the Service labels.
- Many-to-one matching error in PromQL ("multiple matches for labels: many-to-one matching must be explicit (group_left/group_right)"): This occurs when a binary operation joins two sets of time series and, for the labels compared with on() or ignoring(), one side has several series for a single series on the other side. Use group_left() when the left side is the "many" side, or group_right() when the right side is.
- High cardinality causing OOM: Each unique combination of label values creates a separate time series. A label with 10,000 unique values (like user_id or request_id) creates at least 10,000 time series per metric. This is a cardinality explosion. Never use high-cardinality values as Prometheus labels. Check TSDB head cardinality with the prometheus_tsdb_head_series metric. Use the /api/v1/status/tsdb endpoint to find the highest-cardinality metrics.
- Alerts not routing to correct receiver: Use AlertManager's /api/v2/alerts endpoint to see active alerts and their labels. Check the route matching with the amtool command: amtool config routes test --config.file=alertmanager.yml severity=critical team=platform. AlertManager routes are matched top-down; the first match wins unless continue: true is set.
12 Prometheus & Observability Interview Questions with Expert Answers
Q1: What is the difference between a Counter and a Gauge in Prometheus?
A Counter is a metric that only increases monotonically (or resets to zero on process restart). It represents cumulative totals: total HTTP requests served, total bytes transmitted, total errors encountered. You always use rate() or increase() on a counter in PromQL to see the rate of change, not the raw counter value. A Gauge is a metric that can go up and down freely, representing a current value: current memory usage in bytes, number of active connections, current queue depth, CPU temperature. Gauge values are meaningful on their own without any PromQL function. The most common mistake is using a Gauge for something that should be a Counter (like request count) or using a Counter for something that should be a Gauge (like active sessions). Using the wrong type causes incorrect rate calculations and misleading dashboards.
Q2: Explain how histogram_quantile works and why you need a Histogram metric for it.
A Histogram metric creates a set of counter series called buckets, each counting how many observations were at or below a boundary (le = "less than or equal to"). For example, http_request_duration_seconds_bucket{le="0.1"} counts requests that completed in 100ms or less. histogram_quantile(0.99, rate(metric_bucket[5m])) uses linear interpolation within the bucket that contains the target rank to estimate the value below which 99% of observations fall. This calculation happens on the Prometheus server using the aggregated bucket counts, which enables aggregation across multiple instances before computing the quantile. Summary metrics compute quantiles on the client side before Prometheus scrapes them, which means you cannot aggregate quantiles from multiple instances afterward. A p99 from instance A and a p99 from instance B cannot be meaningfully combined. Histograms solve this: you aggregate the bucket counts first, then compute the quantile once on the combined data. This is why Histograms are preferred over Summaries in almost all production use cases.
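The interpolation step can be sketched in Python. This is a simplified model of the idea, not the Prometheus implementation: it assumes buckets arrive sorted with cumulative counts, treats the lower edge of the first bucket as 0, and returns the highest finite boundary when the quantile lands in the +Inf bucket. The function name mirrors the PromQL one only for readability, and the bucket counts are invented.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets.

    buckets must be sorted by upper_bound and end with the +Inf bucket.
    """
    if not 0 <= q <= 1:
        raise ValueError("q must be between 0 and 1")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound) or count == prev_count:
                return prev_bound
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# Hypothetical cumulative counts, already summed across instances.
buckets = [(0.1, 50), (0.5, 90), (1.0, 99), (math.inf, 100)]
p50 = histogram_quantile(0.50, buckets)
p95 = histogram_quantile(0.95, buckets)
print(p50, round(p95, 3))
```

The estimate can never be more precise than the bucket layout: here the p95 is known only to lie between 0.5s and 1.0s, and the interpolation picks a point inside that range. Boundaries placed near the SLO threshold give the most useful answers.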
Q3: What is the difference between rate() and irate() in PromQL?
rate(metric[5m]) calculates the per-second average rate of increase over the entire 5-minute window, adjusting for counter resets and extrapolating to the window boundaries. It produces a smooth average that absorbs short spikes. irate(metric[5m]) calculates the instantaneous rate using only the last two data points within the 5-minute window. It is much more sensitive to sudden spikes but also noisier. Use rate() for dashboards and alerting, because the smoothed average is more reliable for threshold comparisons and trend detection. Use irate() when you specifically need to detect sudden spikes in real time and can tolerate noise. For SLO calculations over longer windows, always use rate(). Both functions need at least two samples inside the range window, so the window must be at least twice the scrape interval; a window of about four times the scrape interval tolerates a missed scrape. With a 15s scrape interval, [30s] is the bare minimum and [1m] to [5m] is more typical.
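The difference between the two functions, and the handling of counter resets, can be illustrated with a simplified Python model. It deliberately omits the extrapolation to window boundaries that Prometheus performs, so the numbers will not match a real server exactly; the function names and samples are assumptions for the example.

```python
def increase(samples):
    """Total counter increase across (timestamp, value) samples.

    A drop in value is treated as a counter reset: counting restarts from zero.
    """
    total = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        total += cur - prev if cur >= prev else cur
    return total


def rate(samples):
    """Average per-second increase over the whole window (no extrapolation)."""
    return increase(samples) / (samples[-1][0] - samples[0][0])


def irate(samples):
    """Per-second increase between the last two samples only."""
    return rate(samples[-2:])


# Scraped every 15s; the process restarts before t=45 and traffic spikes at the end.
window = [(0, 100), (15, 130), (30, 160), (45, 10), (60, 400)]
print(increase(window), rate(window), irate(window))
```

In this window the average rate is roughly 7.7 requests per second while the instantaneous rate is 26, which is the smoothing versus sensitivity trade-off described above.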
Q4: What is cardinality in Prometheus and why is high cardinality a problem?
Cardinality refers to the number of unique time series. Each unique combination of metric name and label values creates a separate time series stored in Prometheus's TSDB. High cardinality means an extremely large number of time series, which causes: high memory usage (each active time series requires roughly 3-4KB of RAM in the TSDB head block, varying with label sizes and series churn), slow query performance (more series to scan for each query), excessive scrape overhead, and potentially OOM crashes for the Prometheus server. Common cardinality explosions: using user_id, request_id, session_id, or URL paths with variable segments as label values, where each unique ID creates a new time series. With a million users, a user_id label creates at least a million time series for a single metric. Prevention: never use unbounded high-cardinality values as labels. Use histograms for distributions instead of individual labelled observations. Monitor cardinality with prometheus_tsdb_head_series and investigate spikes immediately.
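A back-of-the-envelope calculation shows why one unbounded label is so damaging. The Python sketch below computes the worst-case series count for a single metric as the product of the distinct values of each label, and a rough memory estimate using the upper per-series figure quoted above as an assumption. Function names and label counts are hypothetical.

```python
import math


def max_series(label_value_counts):
    """Upper bound on series for one metric: product of distinct label values."""
    return math.prod(label_value_counts.values())


def estimated_head_memory_mib(series, bytes_per_series=4096):
    """Rough head-block memory estimate; bytes_per_series is an assumption."""
    return series * bytes_per_series / (1024 * 1024)


bounded = {"method": 5, "status": 10, "instance": 20}
with_user_id = {**bounded, "user_id": 1_000_000}

print(max_series(bounded), estimated_head_memory_mib(max_series(bounded)))
print(max_series(with_user_id))
```

Real series counts are usually below this bound because not every label combination occurs, but the multiplication explains why adding user_id turns a thousand series into a number no single Prometheus server can hold.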
Q5: What is an SLO, SLI, and error budget? How do you implement them?
An SLI (Service Level Indicator) is a quantitative measurement of service behaviour — the ratio of successful requests, p99 latency, availability. An SLO (Service Level Objective) is the target value for an SLI: "99.9% of requests succeed over a 30-day rolling window." The error budget is 100% minus the SLO: 0.1% of requests can fail. This translates to approximately 43 minutes of total downtime per 30 days. Error budgets are the key SRE insight: they quantify exactly how much unreliability is acceptable, enabling data-driven decisions about risk. If the error budget is healthy (mostly unspent), teams can deploy faster and take more risk. If the error budget is nearly exhausted, teams should slow deployments and focus on reliability. Implementation in Prometheus: record the SLI as a ratio metric using recording rules, implement multi-window burn-rate alerts (alert faster when burning budget faster), and display error budget remaining on the team's SLO dashboard in Grafana.
Q6: How does AlertManager deduplication and grouping work?
AlertManager receives raw alerts from Prometheus (firing alerts are re-sent regularly while they remain active). Deduplication collapses alerts with identical label sets into one, so the same alert sent by two HA Prometheus replicas produces a single notification. Grouping combines multiple related alerts into a single notification: if 20 pods all have high error rate, group_by: ['alertname', 'service'] produces one notification saying "20 instances of HighErrorRate on service=api" instead of 20 separate pages. group_wait (default 30s) is how long AlertManager waits before sending the first notification for a new group, allowing related alerts to accumulate. group_interval (default 5m) controls how often subsequent notifications are sent when new alerts join an ongoing group. repeat_interval (default 4h) controls how often AlertManager re-sends notifications for still-firing alerts. Inhibition rules suppress lower-severity alerts when a higher-severity alert is firing for the same entity: if a node is down (critical), suppress all the pod-level alerts (warning) from that node. Silences are time-bounded mute rules: suppress all alerts matching a label set during a maintenance window.
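Grouping is essentially a "group by" over alert labels. The following Python sketch, with a hypothetical function name and invented alerts, shows how 20 pod-level alerts collapse into a single notification group when only alertname and service are used as the grouping key.

```python
from collections import defaultdict


def group_alerts(alerts, group_by):
    """Bucket alerts (label dicts) by the values of the group_by labels."""
    groups = defaultdict(list)
    for labels in alerts:
        key = tuple(labels.get(name, "") for name in group_by)
        groups[key].append(labels)
    return dict(groups)


# Twenty pods of the same service firing the same alert, plus one latency alert.
alerts = [
    {"alertname": "HighErrorRate", "service": "api", "pod": f"api-{i}"}
    for i in range(20)
]
alerts.append({"alertname": "HighP99Latency", "service": "api", "pod": "api-0"})

groups = group_alerts(alerts, ["alertname", "service"])
for key, members in groups.items():
    print(key, len(members))
```

Including a high-variance label such as pod in group_by would put every alert in its own group and bring back one notification per pod.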
Q7: What is the Prometheus Operator and what does a ServiceMonitor do?
The Prometheus Operator extends Kubernetes with custom resources that make managing Prometheus declarative and Kubernetes-native. Instead of editing Prometheus's static scrape configuration and reloading it, you create Kubernetes custom resource objects that the Operator converts to Prometheus configuration automatically. A ServiceMonitor tells Prometheus which Kubernetes Services to scrape: you specify label selectors that match Services, the port name to scrape, the path (/metrics by default), and the interval. When a new Service matching the selector is created, Prometheus automatically adds it to its scrape targets within seconds — no configuration change or reload required. This is the foundation of automatic service discovery in Kubernetes monitoring. PodMonitor does the same for individual Pods that don't have a Service. PrometheusRule manages alerting and recording rules as Kubernetes objects with full GitOps support.
Q8: How do you monitor Kubernetes cluster health with Prometheus?
kube-prometheus-stack deploys all components needed for comprehensive Kubernetes monitoring: Node Exporter (Linux OS metrics per node: CPU, memory, disk, network), cAdvisor (container-level metrics: CPU throttling, memory usage, OOMKills), kube-state-metrics (Kubernetes object state: pod phases, deployment replica counts, PVC binding status), and the Prometheus server itself with pre-built alerting rules for all these metrics. Critical Kubernetes alerts to implement, using the names from the bundled rules: KubePodCrashLooping (pod restarting repeatedly), KubePodNotReady (pod in non-Ready state for N minutes), KubeNodeNotReady (node condition), KubePersistentVolumeFillingUp (PVC usage above threshold), KubeAPIDown (API server unreachable), KubeDeploymentReplicasMismatch (desired vs available replicas mismatch). A full set of production Kubernetes alerting rules ships with the chart and originates from the kube-prometheus and kubernetes-mixin projects. Use these as your baseline and customise thresholds for your workload.
Q9: What is the difference between black-box and white-box monitoring?
White-box monitoring instruments the internals of a system — the application exposes metrics about its own behaviour: request count, error count, database query duration, cache hit rate, goroutine count, heap memory. Prometheus scraping /metrics endpoints is white-box monitoring. It gives deep insight into what the system is doing internally. Black-box monitoring tests the system from the outside, as a user would experience it: probe an HTTP endpoint and check that it returns 200, measure TLS certificate expiry, test that a TCP port is accepting connections, verify that a DNS record resolves correctly. Prometheus Blackbox Exporter implements black-box probing. Both are necessary: white-box tells you why a system is failing (internal state), black-box tells you that a system is failing from the user's perspective and is the ground truth for SLI measurement. White-box alone is insufficient — a service can have healthy internal metrics while being unreachable to users due to a network or load balancer issue.
Q10: How do you alert on symptoms instead of causes?
Alerting on causes generates noisy alerts that are rarely actionable by the person who gets paged: "pod restarted", "memory usage above 80%", "disk I/O high". These may or may not affect users and create alert fatigue. Alerting on symptoms means alerting on what users actually experience: high error rate, elevated latency, availability below SLO. These alerts tell you something is wrong for users and require immediate action. The distinction in practice: a pod restarting (cause) that self-heals in 10 seconds and causes 0.01% of requests to fail is not worth paging. A 5% error rate (symptom) persisting for 5 minutes is always worth paging regardless of whether it is caused by a restarting pod, a bad deployment, a database issue, or a network problem. Use cause-based metrics for dashboards and for context during incident investigation, but page only on symptoms. This approach, combined with proper alert thresholds and burn-rate alerting, is the foundation of sustainable on-call rotations with low alert fatigue.
Q11: What is Thanos and when do you need it?
Thanos is a set of components that extend Prometheus for long-term storage, high availability, and global query across multiple clusters. You need Thanos when: (1) You need retention beyond what local Prometheus storage can economically hold (Prometheus local storage is expensive at scale; Thanos uploads TSDB blocks to S3/GCS at a fraction of the cost for historical data). (2) You run multiple Kubernetes clusters and need a single query interface across all of them for cross-cluster dashboards and alerting. (3) You need HA for Prometheus itself: running two replicated Prometheus instances and deduplicating their data via Thanos Query. Thanos Sidecar runs alongside Prometheus, uploads completed 2-hour TSDB blocks to object storage, and exposes a gRPC StoreAPI. Thanos Query (the global query layer) fans out to all Sidecar and Store endpoints and handles deduplication. Cortex and Grafana Mimir are horizontally-scalable alternatives that are better for very high ingestion rates but more complex to operate.
Q12: How do you use recording rules and why are they important?
Recording rules pre-compute expensive or frequently-used PromQL expressions at each rule evaluation interval and store the result as a new time series with a custom name. They are important for three reasons: (1) Performance: a complex expression joining multiple metrics and aggregating over millions of time series can take seconds to evaluate on demand. Once it is pre-computed into a recorded series, the same Grafana dashboard panel loads far faster. (2) Consistency: multiple dashboards and alerts using the same recording rule always compute the same value. Without recording rules, slight variations in the PromQL expression across dashboards produce inconsistent numbers. (3) SLO accuracy: multi-window burn-rate SLO alerting relies on error-rate series for several windows, such as 5m, 30m, 1h, 6h, and 1d. Evaluating long-window expressions on demand at every alert evaluation is expensive and can time out. Best practice: use the naming convention level:metric:operations for recording rules, e.g., job:http_requests_total:rate5m, where level lists the aggregation labels that remain. Define recording rules for any expression used in more than one dashboard or any expression that takes more than 1 second to evaluate.