// Metrics, logs, traces

Observability
Study Guide

30 QUESTIONS 6 DOMAINS SRE & PLATFORM
METRICS & PROMETHEUS 5 questions
01 What is the difference between logs, metrics, and traces?
Metrics: numeric time series, cheap to aggregate, good for dashboards/alerts. Logs: discrete events with high-cardinality text, great for debugging detail. Traces: the request path across services with a latency breakdown. Use all three (the "three pillars") and correlate them via exemplars and trace IDs in logs.
02 What is a Prometheus time series and what are labels?
A time series is a metric name plus a sorted label set, mapped to timestamped samples. Labels identify dimensions (instance, job, path). High-cardinality labels (e.g. user IDs) explode storage - design label cardinality carefully.
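The "name plus sorted labels" identity can be sketched in a few lines. This is a hypothetical in-memory store, not Prometheus code; it only shows why label order does not create new series:

```python
from collections import defaultdict

# Hypothetical TinyTSDB sketch: a series is identified by its metric name
# plus a *sorted* tuple of label pairs, so {"job": "api", "path": "/login"}
# and {"path": "/login", "job": "api"} are the same series.
class TinyTSDB:
    def __init__(self):
        self.series = defaultdict(list)  # series key -> [(timestamp, value)]

    @staticmethod
    def key(name, labels):
        return (name, tuple(sorted(labels.items())))

    def append(self, name, labels, ts, value):
        self.series[self.key(name, labels)].append((ts, value))

db = TinyTSDB()
db.append("http_requests_total", {"job": "api", "path": "/login"}, 1, 10)
db.append("http_requests_total", {"path": "/login", "job": "api"}, 2, 11)
print(len(db.series))  # 1 — one series, two samples
```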
03 How does Prometheus service discovery work in Kubernetes?
Kubernetes SD discovers pods/services; relabel configs map K8s metadata to target labels. Sidecar vs centralized scraping patterns exist; kube-prometheus-stack is common. RBAC needed for API access.
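A hedged sketch of what the relabeling looks like in practice - the `prometheus.io/scrape` annotation is a common convention, not a Prometheus requirement:

```yaml
- job_name: kubernetes-pods
  kubernetes_sd_configs:
    - role: pod
  relabel_configs:
    # Scrape only pods annotated prometheus.io/scrape: "true"
    - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
      action: keep
      regex: "true"
    # Map K8s metadata into target labels
    - source_labels: [__meta_kubernetes_namespace]
      target_label: namespace
    - source_labels: [__meta_kubernetes_pod_name]
      target_label: pod
```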
04 What are counters vs gauges vs histograms?
Counter: only increases (resets on process restart) - query with rate(). Gauge: goes up and down - memory, queue depth. Histogram: buckets observations such as latency or size - compute quantiles with recording rules or Grafana, and beware bucket-times-label cardinality.
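A minimal sketch of how a Prometheus-style histogram stores data (bucket bounds are illustrative): buckets are cumulative, so every bound at or above the observed value is incremented, plus a running sum and count.

```python
# Sketch of a Prometheus-style histogram: cumulative bucket counts
# plus a running sum and total count. Not a real client library.
class Histogram:
    def __init__(self, bounds=(0.05, 0.1, 0.5, 1.0)):
        self.bounds = bounds
        self.buckets = [0] * (len(bounds) + 1)  # last slot = +Inf
        self.count = 0
        self.total = 0.0

    def observe(self, v):
        # Cumulative: every bucket whose upper bound >= v is incremented,
        # like the _bucket series with the `le` label in Prometheus.
        for i, le in enumerate(self.bounds):
            if v <= le:
                self.buckets[i] += 1
        self.buckets[-1] += 1  # +Inf bucket counts every observation
        self.count += 1
        self.total += v

h = Histogram()
for latency in (0.03, 0.2, 0.7, 2.0):
    h.observe(latency)
print(h.buckets)  # [1, 1, 2, 3, 4]
```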
05 What pitfalls exist with <code>rate()</code> and <code>irate()</code> windows?
rate() needs a range of at least two scrape intervals; too-short windows are noisy. Counter resets (e.g. pod restarts) are handled automatically by rate(). irate() uses only the last two samples, so it is more volatile; prefer rate() for alerting stability. Alignment of scrape and evaluation intervals also matters.
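The reset handling can be illustrated with a simplified sketch - real Prometheus also extrapolates to the window edges, which this deliberately skips:

```python
# Sketch of how rate() treats a counter reset: if a sample drops below
# its predecessor, the counter restarted near zero, so the new value is
# counted as fresh increase. Simplified vs. real Prometheus.
def simple_rate(samples):
    """samples: list of (timestamp_seconds, counter_value)."""
    if len(samples) < 2:
        return 0.0  # need at least two points in the range window
    increase, prev = 0.0, samples[0][1]
    for _, value in samples[1:]:
        if value < prev:       # counter reset (e.g. pod restart)
            increase += value  # counter restarted from ~0
        else:
            increase += value - prev
        prev = value
    return increase / (samples[-1][0] - samples[0][0])

# 100 -> 130, reset, 5 -> 20: total increase 30 + 5 + 15 = 50 over 90s
print(simple_rate([(0, 100), (30, 130), (60, 5), (90, 20)]))
```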
DASHBOARDS & VISUALIZATION 5 questions
06 What makes a good on-call dashboard vs a pretty wallboard?
On-call: RED/USE golden signals, clear SLO burn, links to runbooks, minimal clutter. Wallboards can be high-level KPIs. Tag dashboards by service and env; avoid duplicating the same chart 20 ways.
07 What is the RED method?
Rate (requests/sec), Errors (failed requests), Duration (latency distribution). Service-oriented complement to USE (utilization, saturation, errors) for resources.
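The three RED signals map directly to PromQL; a sketch with illustrative metric names:

```
sum(rate(http_requests_total[5m]))                                # Rate
sum(rate(http_requests_total{status=~"5.."}[5m]))                 # Errors
histogram_quantile(0.95,
  sum by (le) (rate(http_request_duration_seconds_bucket[5m])))   # Duration
```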
08 What is the USE method?
For every resource: Utilization (% time busy), Saturation (queued work), Errors (count). Great for nodes, disks, NICs - complements RED for services.
09 What are recording rules and why use them?
Precompute expensive PromQL (aggregates, histogram quantiles) on scrape path. Faster dashboards, consistent alert queries, lower query load. Trade-off: more metrics to store and manage.
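An illustrative recording rule group (rule and metric names are examples, following the `level:metric:operation` naming convention):

```yaml
groups:
  - name: api-latency
    interval: 30s
    rules:
      # Precompute p99 latency per job so dashboards and alerts query
      # one cheap series instead of re-running histogram_quantile.
      - record: job:http_request_duration_seconds:p99
        expr: |
          histogram_quantile(0.99,
            sum by (job, le) (rate(http_request_duration_seconds_bucket[5m])))
```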
10 How do you avoid alert fatigue?
Alert on symptoms (SLO burn, user-visible errors) not every cause. Require sustained breaches, dedupe/aggregate by service, on-call rotations, severity levels, regular alert audits, auto-resolve noise, runbooks mandatory.
LOGGING PIPELINES 5 questions
11 What is structured logging and why prefer JSON?
Key-value fields (level, trace_id, user) instead of free text. JSON (or logfmt) enables query filters in Loki/ELK without fragile regex. Consistent schema beats clever prose.
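A minimal structured-logging sketch using only the Python stdlib; the field names (level, msg, trace_id) are an illustrative schema, not a standard:

```python
import json
import logging

# Hedged sketch: a JSON formatter built on the stdlib logging module.
class JsonFormatter(logging.Formatter):
    def format(self, record):
        return json.dumps({
            "level": record.levelname,
            "msg": record.getMessage(),
            "logger": record.name,
            # Extra fields (e.g. trace_id) passed via `extra=` land on the record
            "trace_id": getattr(record, "trace_id", None),
        })

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
log = logging.getLogger("checkout")
log.addHandler(handler)
log.setLevel(logging.INFO)
log.info("payment accepted", extra={"trace_id": "4bf92f35"})
# emits a line like: {"level": "INFO", "msg": "payment accepted", ...}
```

Because every record is one JSON object, Loki/ELK can filter on `trace_id` or `level` without regex over free text.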
12 Compare Elasticsearch/Loki/Splunk-style logging at a high level.
ELK/OpenSearch: index-heavy, powerful full-text search, higher cost. Loki: indexes labels only, cheap at scale, pairs with Grafana. Commercial SIEMs (Splunk-style) add security analytics. Choose by query patterns, retention, and budget.
13 What is log cardinality vs volume?
Volume: bytes/sec ingested. Cardinality: unique label combinations - high cardinality (per-request IDs as labels) breaks Loki-like systems and is expensive in Elasticsearch. Sample debug logs or use traces instead.
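Why label cardinality multiplies (numbers are illustrative): series count is the product of the distinct values of every label, so one unbounded label dominates everything else.

```python
# Illustrative series-count arithmetic: each label multiplies the total.
methods, paths, statuses = 5, 200, 10
users = 100_000  # an unbounded, per-user label

print(methods * paths * statuses)          # 10000 series — manageable
print(methods * paths * statuses * users)  # 1000000000 series — explosion
```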
14 What is a sidecar vs DaemonSet log collection pattern?
Sidecar (e.g. Fluent Bit per pod): isolation and per-tenant config, but more resources. DaemonSet: one agent per node reads container logs - simpler, with shared resources. Both ship to a central sink.
15 How do you handle PII in logs for compliance?
Redact at source, tokenize fields, restrict access with RBAC, short retention for raw logs, separate security logging to SIEM, encryption at rest/transit, data classification policies, avoid logging bodies of requests by default.
DISTRIBUTED TRACING 5 questions
16 What is a trace, span, and parent span?
Trace: end-to-end request DAG. Span: one timed operation (HTTP call, DB query). Parent/child links show call tree. Context propagation (trace_id, span_id) passes across services (W3C Trace Context headers).
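Context propagation can be sketched with the W3C traceparent header format (`version-trace_id-span_id-flags`): the trace_id stays fixed across hops while each hop mints a new span_id.

```python
import secrets

# Sketch of W3C Trace Context propagation: one trace_id for the whole
# request, a fresh span_id per hop, carried in the traceparent header.
def new_traceparent():
    trace_id = secrets.token_hex(16)  # 16 bytes -> 32 hex chars
    span_id = secrets.token_hex(8)    # 8 bytes  -> 16 hex chars
    return f"00-{trace_id}-{span_id}-01"

def child_traceparent(parent):
    # Keep trace_id and flags, mint a new span_id for the child hop.
    version, trace_id, _parent_span, flags = parent.split("-")
    return f"{version}-{trace_id}-{secrets.token_hex(8)}-{flags}"

root = new_traceparent()
child = child_traceparent(root)
print(root.split("-")[1] == child.split("-")[1])  # True — same trace
```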
17 What is tail-based vs head-based sampling?
Head-based: decide sample at trace start - simple, may drop interesting rare errors. Tail-based: buffer spans, decide after seeing full trace (e.g. errors/slow) - better signal, higher memory cost (Tempo/Grafana Agent patterns).
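A minimal tail-sampling sketch (policy and span shape are illustrative): buffer every span of a trace, then keep only traces with an error or above a latency threshold.

```python
# Sketch of tail-based sampling: decide only after the full trace is
# buffered. Each trace is a list of span dicts with start/end/error.
def tail_sample(traces, latency_threshold_ms=500):
    kept = []
    for trace in traces:
        has_error = any(s["error"] for s in trace)
        duration = max(s["end_ms"] for s in trace) - min(s["start_ms"] for s in trace)
        if has_error or duration >= latency_threshold_ms:
            kept.append(trace)
    return kept

traces = [
    [{"start_ms": 0, "end_ms": 90, "error": False}],    # fast, ok -> dropped
    [{"start_ms": 0, "end_ms": 40, "error": True}],     # error -> kept
    [{"start_ms": 0, "end_ms": 300, "error": False},
     {"start_ms": 250, "end_ms": 800, "error": False}], # slow -> kept
]
print(len(tail_sample(traces)))  # 2
```

The buffering is exactly the memory cost the answer mentions: every in-flight trace's spans must be held until the decision.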
18 Name common OpenTelemetry components.
OTel SDK instruments apps; Collector receives, processes, exports to backends (Jaeger, Tempo, X-Ray); semantic conventions standardize attribute names. Vendor-neutral instrumentation reduces lock-in.
19 How do traces connect to metrics and logs?
Exemplars link histogram buckets to trace IDs. Logs include trace_id for drill-down. Unified query UI (Grafana) or jump links from alert to trace. Reduces mean time to resolution.
20 What overhead does tracing add and how do you minimize it?
CPU/memory for span creation, network export, and tail sampling buffers. Mitigate: aggressive sampling for health checks, async export, batching, eBPF-based auto-instrumentation where appropriate, avoid huge span attributes.
SLOS, ERROR BUDGETS & ALERTING 5 questions
21 Define SLI, SLO, and SLA.
SLI: a measured metric (availability, latency). SLO: a target over a window (99.9% success monthly). SLA: a contractual promise to customers with consequences - the external SLA is usually looser than the internal SLO, so you breach the SLO (and react) before you breach the SLA.
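The error-budget arithmetic behind a 99.9% SLO, with illustrative numbers:

```python
# Error budget for a 99.9% monthly SLO (all numbers illustrative).
slo = 0.999
requests = 10_000_000                   # total requests in the window
budget = round((1 - slo) * requests)    # allowed failures
errors = 3_500                          # observed failures so far

print(budget)                           # 10000
print(errors / budget)                  # 0.35 -> 35% of the budget burned
print(round(30 * 24 * 60 * (1 - slo), 1))  # 43.2 minutes of full downtime/month
```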
22 What is error budget burn and multi-window alerting?
Burn compares consumed errors to allowed budget. Multi-window (e.g. 1h + 6h) reduces flapping: short window catches sudden incidents, long window catches slow leaks. Google SRE workbook pattern.
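An illustrative multi-window burn-rate alert in the SRE Workbook style (metric names and the 14.4x factor for a 99.9% SLO are from that pattern; a shorter 5m window guards against alerting on stale spikes):

```yaml
- alert: HighErrorBudgetBurn
  expr: |
    (
      sum(rate(http_requests_errors_total[1h]))
        / sum(rate(http_requests_total[1h])) > (14.4 * 0.001)
    and
      sum(rate(http_requests_errors_total[5m]))
        / sum(rate(http_requests_total[5m])) > (14.4 * 0.001)
    )
  labels:
    severity: page
```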
23 What is the difference between black-box and white-box probes?
Black-box: synthetic checks from outside (Pingdom, canaries) - user-visible. White-box: internal metrics (queue depth, GC pauses). Need both: black-box catches DNS/LB issues white-box misses.
24 What belongs in a runbook link from an alert?
Impact, first steps, dashboards, log queries, recent deploys, escalation path, rollback/feature-flag toggles, known false positives. Goal: any on-call engineer can start mitigation without tribal knowledge.
25 How do you SLO a dependency you do not control (SaaS)?
Measure client-side outcomes (timeouts, error codes), cache/fallback strategies, track vendor status pages, define internal SLO on your handling of dependency failure, contractual SLAs where possible.
EBPF, RUM & COST 5 questions
26 What is eBPF used for in observability?
Kernel-level programs for low-overhead metrics, security auditing, network tracing without full packet capture to userspace. Tools like Pixie, Cilium Hubble, bpftrace. Requires recent kernels and careful safety review.
27 What is Real User Monitoring (RUM) vs synthetic monitoring?
RUM: actual browser/mobile sessions - Core Web Vitals, geo/device variance. Synthetic: scripted probes from fixed locations - steady baseline. Combine: synthetic for uptime, RUM for real UX.
28 What is cardinality explosion in metrics?
Too many unique label combinations (per-user, per-IP) makes TSDB storage and query cost explode. Fix: aggregate at ingest, drop labels, use logs/traces for high-cardinality debugging, recording rules to roll up.
29 How do you right-size observability spend?
Tier retention (hot/warm/cold), sampling, drop debug logs in prod, index only useful fields, use object storage for archives, chargeback by team, review top-cardinality metrics monthly, negotiate ingest pricing with commercial vendors.
30 What is continuous profiling and when is it worth it?
Samples CPU/heap periodically (Parca, Pyroscope, Datadog profiler) to find hot code paths. Worth it for latency/cost optimization of steady-state services; overhead low with sampling; less critical for tiny services.