SRE Study Guide

📐

SRE FUNDAMENTALS 5 questions

01 What is the difference between SLI, SLO, and SLA?

SLI (Service Level Indicator) — the actual metric you measure. Example: 99.2% of requests completed in under 300ms over the last 30 days.

SLO (Service Level Objective) — the internal target for that SLI. Example: 99.5% of requests must complete in under 300ms. Stricter than the SLA to give you a buffer.

SLA (Service Level Agreement) — the contractual commitment to customers, with financial penalties if breached. Example: 99.9% uptime guaranteed or credits issued.

Key hierarchy (for targets on the same metric): SLA < SLO < measured SLI, ideally. The gap between SLO and SLA is your buffer for incidents before customer commitments are breached.

02 What is an error budget and how do you use it to make decisions?

Error budget = 100% − SLO. For a 99.9% SLO, you have 0.1% — about 43.8 minutes per month of allowed downtime.

Budget healthy: dev teams ship freely; risky experiments are allowed.
Budget nearly exhausted: freeze non-critical releases; focus on reliability work.
Budget burned: no new features until reliability is restored; SRE and dev collaborate on root causes.

The error budget creates a shared, objective metric that aligns SRE and product teams — it removes the "SRE vs dev" dynamic and replaces it with a shared constraint.
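
The budget arithmetic and the three policy states above can be sketched for a request-based SLO. This is a minimal illustration, not a standard tool: the function name `error_budget_report` and the 80% "nearly exhausted" threshold are assumptions chosen for the example.

```python
def error_budget_report(slo: float, total_requests: int, failed_requests: int) -> dict:
    """Summarize error budget consumption for a request-based SLO."""
    # Error budget = 100% - SLO, expressed here as a number of allowed failed requests.
    allowed_failures = (1.0 - slo) * total_requests
    consumed = failed_requests / allowed_failures if allowed_failures else float("inf")
    if consumed >= 1.0:
        policy = "burned: no new features until reliability is restored"
    elif consumed >= 0.8:  # assumed threshold for "nearly exhausted"
        policy = "nearly exhausted: freeze non-critical releases"
    else:
        policy = "healthy: ship freely"
    return {"allowed_failures": allowed_failures, "consumed": consumed, "policy": policy}


report = error_budget_report(slo=0.999, total_requests=10_000_000, failed_requests=4_000)
print(report["policy"], f"({report['consumed']:.0%} of budget used)")
```

With a 99.9% SLO and 10 million requests, roughly 10,000 failures are allowed, so 4,000 failures leaves the budget healthy.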

03 What is toil and why does SRE care about reducing it?

Toil is manual, repetitive, automatable, tactical work tied to keeping a service running that scales linearly with service growth. It has no enduring value — doing it once produces no lasting improvement.

Examples: manually restarting a flapping service, rotating credentials by hand, updating config files on each deploy.

SRE principle: keep toil below 50% of each SRE's time. The remaining time goes to engineering work that reduces future toil and improves reliability. If toil grows unchecked, SREs become operators with no capacity to improve anything.

04 How does SRE differ from traditional DevOps or operations?

SRE is an opinionated implementation of DevOps principles using software engineering to solve operations problems. Key differences:

SLO-driven — reliability is defined and measured, not just aspirational.
Error budget — gives developers a quantified risk budget instead of a binary "allowed/not allowed."
Engineering-first — SREs spend engineering time automating away operations work; traditional ops teams often absorb growth with more manual work instead.
Engagement model — SREs can withdraw support if a service burns too much error budget, creating accountability.
Postmortems — blameless, systemic analysis is a core cultural practice, not an afterthought.

05 What are the Four Golden Signals and when do you use them?

From the Google SRE Book — if you can only monitor four things for a user-facing service:

Latency — time to serve a request. Track success and error latency separately. Watch p95/p99, not just average.
Traffic — demand on the system. RPS, messages/sec, active connections.
Errors — rate of failed requests (explicit 5xx, implicit wrong content, or policy-based SLO breaches).
Saturation — how "full" the service is. The most constrained resource: CPU, memory, thread pool, queue depth.

Use these for your initial alert layer. Complement with the RED method (Rate/Errors/Duration) for services and USE method (Utilization/Saturation/Errors) for resources.

Alert on symptoms (Golden Signals) first — they reflect user impact. Use internal metrics to diagnose root cause after you know something is wrong.
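
To see why the latency signal should be read at p95/p99 rather than the average, here is a small sketch using a nearest-rank percentile. The sample data is made up for illustration, and `percentile` is a hypothetical helper, not a library function.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample with at least p% of values at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Made-up data: 90 fast requests and 10 very slow ones.
latencies_ms = [50] * 90 + [2000] * 10

mean_ms = sum(latencies_ms) / len(latencies_ms)
print("mean:", mean_ms)
print("p50:", percentile(latencies_ms, 50))
print("p95:", percentile(latencies_ms, 95))
print("p99:", percentile(latencies_ms, 99))
```

The mean lands between the two groups and describes neither, while the tail percentiles show the latency that one in ten users actually experienced.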

🚨

INCIDENT MANAGEMENT 6 questions

06 Walk me through the full incident lifecycle from detection to close.

Detect — alert fires or user report. Acknowledge within response SLA.
Triage — assess severity (P1–P4). What is the blast radius? Is it customer-impacting?
Communicate — notify stakeholders, open incident channel, assign incident commander (IC) and communications lead.
Mitigate first — restore service before finding root cause. Rollback, failover, disable feature flag, shed load.
Investigate — metrics, logs, recent deploys, config changes. Form hypotheses, test them.
Resolve — confirm service restored via metrics and real user traffic, not just internal checks.
Postmortem — within 48–72 hours. Blameless. Timeline, impact, root cause, contributing factors, action items.

Key principle: mitigate before you investigate. A rollback now costs 15 minutes; a root cause investigation in the middle of an outage costs hours.

07 What makes a good blameless postmortem?

Blameless — focus on systems and processes, never individuals. The question is "what conditions allowed this?" not "who caused it?"
Clear timeline — when did the issue start, when was it detected, when was it mitigated?
Impact statement — users affected, duration, revenue or SLO impact.
Root cause (5 Whys) — drill past symptoms to the underlying systemic cause.
Contributing factors — what made it worse or harder to detect?
Action items — concrete, assigned, time-bounded. Prevent recurrence or reduce MTTR.
Shared widely — postmortems build org-wide institutional knowledge.

A postmortem without action items that get done is just theater. Track items in your issue tracker and review them at the next incident review.

08 What are MTTR, MTBF, and MTTD? Which is most actionable?

MTTD (Mean Time To Detect) — how long before you know something is wrong. Reduced by better alerting coverage.
MTTR (Mean Time To Recover) — average time from failure detection to service restoration. Reduced by runbooks, automation, and practiced on-call response.
MTBF (Mean Time Between Failures) — average time between the end of one failure and the start of the next. Measures underlying reliability.
MTTF (Mean Time To Failure) — average time a system operates before its first failure. Used for hardware/non-repairable components.

MTTR is usually most actionable. You often can't prevent every failure, but you can ensure you detect it fast and recover faster through better runbooks, automation, and on-call practice. MTBF requires deeper reliability engineering investments.
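
As a rough sketch, the three averages can be computed from incident records using the definitions above (detection delay, detection-to-restoration time, and the gap between one failure ending and the next starting). The `Incident` record and its fields are assumptions for illustration; timestamps are plain minutes to keep the example short.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Incident:
    started: float   # minutes since an arbitrary epoch
    detected: float
    resolved: float


def incident_metrics(incidents):
    ordered = sorted(incidents, key=lambda i: i.started)
    mttd = mean([i.detected - i.started for i in ordered])
    mttr = mean([i.resolved - i.detected for i in ordered])
    # Gap between one incident being resolved and the next one starting.
    gaps = [nxt.started - prev.resolved for prev, nxt in zip(ordered, ordered[1:])]
    mtbf = mean(gaps) if gaps else None
    return {"mttd": mttd, "mttr": mttr, "mtbf": mtbf}


history = [
    Incident(started=0.0, detected=5.0, resolved=35.0),
    Incident(started=1000.0, detected=1010.0, resolved=1030.0),
    Incident(started=3000.0, detected=3003.0, resolved=3063.0),
]
print(incident_metrics(history))
```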

09 How do you define P1/P2/P3/P4 severity levels?

P1 / SEV1 — complete service outage, critical data loss, or security breach. All hands on deck 24/7. Customer-facing impact is severe.
P2 / SEV2 — major feature broken, significant degradation, partial outage. Urgent response; may page after hours.
P3 / SEV3 — non-critical feature impaired, workaround available. Response within business hours.
P4 / SEV4 — minor cosmetic bug, no user impact. Queued for normal sprint work.

Good severity definitions include: impact scope, degradation type, and response SLA. Document them in your runbook so on-call engineers aren't guessing during an incident.

10 What is the role of an Incident Commander (IC)?

The IC owns the incident process — not the technical solution. Responsibilities:

Declare the incident and set severity.
Coordinate responders — assign roles (comms lead, subject matter experts).
Ensure the team is working the right problem, not rabbit-holing.
Make go/no-go decisions (rollback, failover, customer notification).
Manage communication to stakeholders — regular status updates.
Drive to mitigation then resolution, and hand off to postmortem.

The IC does not need to be the most technical person in the room. They need to be calm, decisive, and good at unblocking people.

11 How do you design an on-call rotation that doesn't burn out your team?

Minimum rotation size — at least 8 people for a follow-the-sun rotation, or 4+ for a single-timezone rotation. Fewer means each person is paged too often.
Alert quality over quantity — every page must be actionable. Noisy on-call causes alert fatigue and burnout faster than frequency alone.
Compensation — on-call should be compensated fairly (time off or pay). Volunteering for unpaid on-call is unsustainable.
Escalation paths — clear secondary/tertiary escalation so the primary on-call isn't alone on hard problems.
Postmortem on repeated pages — if the same alert fires repeatedly without resolution, it must be fixed or removed.
Track on-call burden — measure pages per shift, time-to-acknowledge, sleep interruptions. Make burnout visible before people quit.
Dev teams share on-call — when developers are on-call for their own services, alert quality and reliability investment both improve dramatically.

📊

AVAILABILITY & RELIABILITY 5 questions

12 What does 99.9% availability mean in practice? What about 99.99%?

99% (two nines) — ~3.65 days downtime/year. ~7.3 hours/month.
99.9% (three nines) — ~8.76 hours/year. ~43.8 minutes/month.
99.95% — ~4.38 hours/year. ~21.9 minutes/month.
99.99% (four nines) — ~52.6 minutes/year. ~4.38 minutes/month.
99.999% (five nines) — ~5.26 minutes/year. ~26 seconds/month.

Key insight: going from three nines to four nines is not "a little better" — it's 10× stricter. Each additional nine requires significant architectural investment. Most consumer services target 99.9–99.99%. Five nines is typically reserved for telco/financial critical systems.
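
The downtime figures above follow directly from the unavailable fraction. A small sketch, assuming an average year of 365.25 days and a month of one twelfth of a year (so yearly values differ slightly from figures based on a 365-day year):

```python
MINUTES_PER_YEAR = 365.25 * 24 * 60  # average year, including leap days


def allowed_downtime_minutes(availability_pct: float) -> tuple[float, float]:
    """Return (minutes per year, minutes per average month) of allowed downtime."""
    unavailable_fraction = 1 - availability_pct / 100
    per_year = unavailable_fraction * MINUTES_PER_YEAR
    return per_year, per_year / 12


for pct in (99.0, 99.9, 99.95, 99.99, 99.999):
    per_year, per_month = allowed_downtime_minutes(pct)
    print(f"{pct}%: {per_year:.1f} min/year, {per_month:.2f} min/month")
```

Each extra nine divides both numbers by ten, which is the "10× stricter" point made above.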

13 What is the difference between reliability and availability?

Availability — the fraction of time a system is operational and accessible. Usually expressed as an uptime percentage. Binary (up or down).

Reliability — the probability that a system performs its intended function without failure over a specified period. A system can be available (responding) but unreliable (returning wrong data, slow responses, partial failures).

Example: a search service that is up but returns stale results from 6 hours ago is available but unreliable. SRE cares about both — SLIs should capture reliability, not just availability.

14 How do you calculate system availability when multiple components are involved?

Serial (all must be up): A_total = A1 × A2 × A3
Three 99.9% components in series: 0.999³ ≈ 99.7% — worse than any individual component.

Parallel (any one being up is sufficient): A_total = 1 − ((1−A1) × (1−A2))
Two 99% components in parallel: 1 − (0.01 × 0.01) = 99.99%.

This is why redundancy is fundamental to high availability: parallel paths dramatically improve overall availability. It is also why a long dependency chain is risky: every serial component multiplies the overall availability by a factor below 1, so failure probabilities compound.
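
The two formulas generalize to any number of components. A minimal sketch, assuming component failures are independent (real systems often share failure domains, which makes the parallel result optimistic):

```python
from math import prod


def serial(*availabilities: float) -> float:
    """All components must be up: multiply the availabilities."""
    return prod(availabilities)


def parallel(*availabilities: float) -> float:
    """Any one component being up is enough: 1 minus the product of unavailabilities."""
    return 1 - prod(1 - a for a in availabilities)


chain = serial(0.999, 0.999, 0.999)   # three 99.9% components in series
pair = parallel(0.99, 0.99)           # two 99% components in parallel
redundant_chain = parallel(chain, chain)  # two independent copies of the whole chain

print(f"serial chain:    {chain:.6f}")
print(f"parallel pair:   {pair:.6f}")
print(f"redundant chain: {redundant_chain:.6f}")
```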

15 How do you approach multi-window, multi-burn-rate alerting for SLOs?

Simple threshold alerts are noisy — a spike can trigger without actually threatening the SLO. Multi-window burn-rate alerting (from the Google Workbook) solves this:

Burn rate = how fast you're consuming error budget vs. the allowed rate. A burn rate of 1× means you'll use exactly 100% of budget in 30 days.

Page: fast burn — 14.4× burn rate over 1h (confirmed by a 5m window). This burns 2% of the 30-day budget in 1 hour. Wake someone up.
Page: medium burn — 6× burn rate over 6h (confirmed by a 30m window). This burns 5% of the budget in 6 hours.
Ticket: slow burn — 1× burn rate over 3 days (confirmed by a 6h window). This burns 10% of the budget; create a ticket, don't page.

This approach reduces false positives while catching genuine SLO threats early enough to act.

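In Prometheus, compute the error ratio per window, e.g. rate(errors[1h]) / rate(total[1h]) (metric names here are placeholders), and alert when it exceeds burn rate × (1 − SLO), for example 14.4 × 0.001 for a 99.9% SLO. Recording rules keep the per-window ratios cheap to evaluate.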
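
The burn-rate logic itself is simple arithmetic. The sketch below models the fast-burn page condition in plain Python rather than PromQL; the function names are hypothetical, and the error ratios would come from your metrics system.

```python
def burn_rate(error_ratio: float, slo: float) -> float:
    """How many times faster than allowed the error budget is being consumed."""
    return error_ratio / (1 - slo)


def budget_consumed(rate: float, window_hours: float, period_hours: float = 30 * 24) -> float:
    """Fraction of the period's budget used if `rate` is sustained for the window."""
    return rate * window_hours / period_hours


def should_page(long_window_ratio: float, short_window_ratio: float,
                slo: float, threshold: float = 14.4) -> bool:
    """Fire only if both the long (1h) and short (5m) windows exceed the threshold."""
    return (burn_rate(long_window_ratio, slo) >= threshold
            and burn_rate(short_window_ratio, slo) >= threshold)


print(budget_consumed(14.4, window_hours=1))   # about 0.02, i.e. 2% of the monthly budget
print(should_page(0.02, 0.02, slo=0.999))      # both windows burning fast
print(should_page(0.02, 0.001, slo=0.999))     # spike already over: short window is calm
```

Requiring the short window to agree is what lets the alert reset quickly once the problem stops.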

16 What is chaos engineering and when is it appropriate?

Chaos engineering is the practice of deliberately injecting failures into a system to verify that it behaves as expected under adverse conditions. It turns implicit assumptions about resilience into explicit, verified facts.

Prerequisites before starting chaos:

You have a steady-state baseline (metrics you can compare against).
You have monitoring and alerting in place to detect injected failures.
You can contain the blast radius — start in non-prod, then limited prod traffic.

Common experiments: kill a pod, saturate CPU, inject latency into a downstream call, cause a dependency to return 500s, simulate AZ outage.

Tools: Chaos Monkey (Netflix), LitmusChaos, AWS Fault Injection Simulator, Gremlin.

Not appropriate when: you don't have observability to know if something broke, or when the system can't tolerate intentional failures (e.g., a payment system with no redundancy).

🔧

RELIABILITY PATTERNS 5 questions

17 What is a circuit breaker pattern and why does it matter?

A circuit breaker wraps calls to a downstream service and monitors failure rate. When failures exceed a threshold, the circuit opens — subsequent calls fail fast without hitting the failing service.

States:

Closed — normal operation, requests pass through.
Open — failing fast; calls rejected immediately without network round-trip.
Half-open — probe traffic sent to test if the downstream has recovered.

Without circuit breakers, a slow downstream causes upstream threads to queue waiting for responses — leading to cascading failures across the entire call chain. Circuit breakers stop the cascade.

Implementations: Resilience4j (Java), Hystrix (legacy), Istio circuit breaking at service mesh level.
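
The three states can be sketched in a few lines. This is a simplified illustration, not how any of the libraries above are implemented: it counts consecutive failures instead of tracking a failure rate, it is not thread-safe, and the class and parameter names are made up.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without contacting the downstream."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    @property
    def state(self):
        if self.opened_at is None:
            return "closed"
        if self.clock() - self.opened_at >= self.reset_timeout:
            return "half-open"  # timeout elapsed: allow a probe call
        return "open"

    def call(self, func, *args, **kwargs):
        if self.state == "open":
            raise CircuitOpenError("failing fast; downstream not called")
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            # A failed probe, or too many failures while closed, (re)opens the circuit.
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        # Success closes the circuit and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

Passing the clock in as a parameter is a design choice that makes the open-to-half-open transition testable without sleeping.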

18 Explain retry strategies — when are retries harmful?

Good retries: transient network blips, rate limit 429s with backoff, idempotent operations.

Retry best practices:

Exponential backoff — double the wait time on each retry. wait = base × 2^attempt
Jitter — add random offset to prevent retry storms (thundering herd). wait = random(0, base × 2^attempt)
Max retries + timeout — always cap. Infinite retries with no timeout = hung request.

When retries are harmful:

Non-idempotent operations (POST that creates a record) — retry = duplicate.
Downstream is already overloaded — retries amplify load, causing retry storms.
Without circuit breakers — retries keep hammering a failing service.

Always pair retries with circuit breakers. Retries without circuit breakers are a DDoS you do to yourself.
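
A minimal sketch of capped retries with exponential backoff and full jitter, following the formulas above. The function name, defaults, and the choice of which exceptions count as transient are assumptions for illustration; the wrapped call must be idempotent.

```python
import random
import time


def retry(func, max_attempts=4, base=0.1, cap=5.0,
          retry_on=(ConnectionError, TimeoutError), sleep=time.sleep):
    """Call func, retrying transient errors with full-jitter exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return func()
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # retries exhausted: surface the last error
            # wait = random(0, base * 2^attempt), capped so delays stay bounded
            sleep(random.uniform(0, min(cap, base * 2 ** attempt)))


attempts = {"count": 0}


def flaky_call():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("transient network blip")
    return "ok"


print(retry(flaky_call, sleep=lambda seconds: None))  # no real sleeping in the demo
```

Errors outside `retry_on` propagate immediately, which keeps non-transient failures from being retried at all.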

19 What is graceful degradation and how do you design for it?

Graceful degradation means the system continues serving users in a reduced capacity when a dependency fails, rather than failing completely.

Patterns:

Fallback to cache — serve stale data if live data is unavailable.
Feature flags — disable non-critical features when their dependencies degrade.
Default responses — return a useful default instead of an error (e.g., empty recommendations instead of a 500).
Static fallback — serve a static page or cached version when dynamic generation fails.
Load shedding — reject low-priority requests first when overloaded to protect core functionality.

Design principle: identify the core user journey and ensure it survives even when non-critical dependencies fail.
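
The cache-fallback and default-response patterns combine naturally. A small sketch, where `get_recommendations`, `fetch_live`, and the dict-based cache are hypothetical stand-ins for a real client and cache:

```python
def get_recommendations(user_id, fetch_live, cache):
    """Return (items, source), degrading from live data to stale cache to a default."""
    try:
        items = fetch_live(user_id)
        cache[user_id] = items       # refresh the cache on every successful call
        return items, "live"
    except Exception:
        if user_id in cache:
            return cache[user_id], "stale-cache"   # stale data beats an error page
        return [], "default"                       # empty list instead of a 500


def healthy_backend(user_id):
    return ["item-1", "item-2"]


def broken_backend(user_id):
    raise ConnectionError("recommendation service unavailable")


cache = {}
print(get_recommendations("u1", healthy_backend, cache))
print(get_recommendations("u1", broken_backend, cache))
print(get_recommendations("u2", broken_backend, cache))
```

Returning the source alongside the data lets callers and dashboards see how often the degraded paths are being served.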

20 What is the bulkhead pattern and when do you apply it?

Named after ship bulkheads — compartments that prevent one flooding section from sinking the whole ship. In software: isolate resource pools so that failures in one partition don't exhaust resources used by others.

Examples:

Separate thread pools per downstream service — a slow downstream only blocks its own pool, not every request.
Separate connection pools for critical vs. non-critical database queries.
Separate Kubernetes node pools for latency-sensitive vs. batch workloads.
Separate queues for high and low priority jobs.

Apply when: a shared resource (thread pool, DB connection pool, queue) is at risk of being fully consumed by one traffic type or one failing dependency.
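
One lightweight way to sketch a bulkhead is a concurrency limit per dependency that rejects work when full instead of queuing it. The `Bulkhead` class below is a hypothetical illustration built on the standard library's `threading.BoundedSemaphore`.

```python
import threading


class BulkheadFullError(Exception):
    """Raised when a compartment has no free slots."""


class Bulkhead:
    def __init__(self, max_concurrent: int):
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def run(self, func, *args, **kwargs):
        # Non-blocking acquire: reject immediately rather than tying up the caller.
        if not self._slots.acquire(blocking=False):
            raise BulkheadFullError("compartment is at capacity")
        try:
            return func(*args, **kwargs)
        finally:
            self._slots.release()


# One compartment per downstream, sized independently (sizes are arbitrary here).
payments_bulkhead = Bulkhead(max_concurrent=10)
search_bulkhead = Bulkhead(max_concurrent=50)

print(payments_bulkhead.run(lambda: "charged"))
```

If the payments dependency hangs and fills its ten slots, search calls still have their own fifty.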

21 How do you approach capacity planning for a high-traffic service?

Baseline — measure current resource utilization (CPU, memory, disk I/O, network, DB connections) at known traffic levels.
Identify bottleneck — which resource saturates first as load increases?
Model growth — use historical traffic trends, seasonality (holiday spikes), and planned launches to project demand.
Load test — validate the system handles projected peak with realistic traffic patterns (not just raw RPS).
Safety margin — provision to ~70% utilization at expected peak, leaving headroom for spikes and autoscaling lag.
Automate — HPA, ASGs, KEDA. Autoscaling should react before users experience degradation.
Review cadence — capacity plans decay; revisit quarterly or after major product changes.
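
The safety-margin step reduces to a short calculation once load testing has given a per-instance capacity. A sketch with made-up numbers; `instances_needed` and `projected_peak` are hypothetical helpers, and the compound-growth model is an assumption:

```python
import math


def projected_peak(current_peak_rps: float, monthly_growth: float, months: int) -> float:
    """Project peak demand assuming compound monthly growth."""
    return current_peak_rps * (1 + monthly_growth) ** months


def instances_needed(peak_rps: float, rps_per_instance: float,
                     target_utilization: float = 0.70) -> int:
    """Instances required so that each runs at or below the target utilization at peak."""
    usable_rps = rps_per_instance * target_utilization
    return math.ceil(peak_rps / usable_rps)


peak = projected_peak(current_peak_rps=8_000, monthly_growth=0.05, months=6)
print(f"projected peak: {peak:.0f} RPS")
print("instances at 70% target:", instances_needed(peak, rps_per_instance=500))
```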

🌐

DISTRIBUTED SYSTEMS 5 questions

22 Explain the CAP theorem and how it applies to real systems.

A distributed system can provide at most two of three properties at once; when a network partition occurs, it must give up either consistency or availability:

Consistency (C) — every read receives the most recent write or an error.
Availability (A) — every request receives a response (not guaranteed to be the latest data).
Partition Tolerance (P) — the system operates despite network partitions.

In practice, network partitions happen, so you choose between CP (consistency during partition, may become unavailable) or AP (available during partition, may return stale data).

CP examples: Zookeeper, etcd, HBase — prefer consistency over availability.
AP examples: DynamoDB (eventual), Cassandra, CouchDB — prefer availability over consistency.
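CA: a single-node traditional RDBMS is often labeled CA, but only because it has no partition to tolerate; once replicated across a network it must behave as CP or AP.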

23 What is the difference between horizontal and vertical scaling?

Vertical scaling (scale up) — add more CPU/RAM to an existing instance. Simple to implement; hard limits; single point of failure; may require downtime.

Horizontal scaling (scale out) — add more instances of the service. Preferred for cloud-native workloads; enables high availability; requires stateless service design.

Stateless design requirements: sessions stored externally (Redis), files on shared storage, and no local state that the service cannot afford to lose.

For databases: vertical is simpler but limited. Read replicas scale reads horizontally. Sharding scales writes but adds significant operational complexity. Choose based on actual bottleneck (reads vs. writes).

24 How do you handle distributed tracing in a microservices architecture?

Distributed tracing tracks a request as it flows across multiple services, identifying latency contributors and failure points.

Key concepts:

Trace — the full journey of a request through all services.
Span — a single operation within a trace (one service call, one DB query).
Context propagation — trace ID and span ID passed via HTTP headers (traceparent in W3C format, or X-B3-TraceId in Zipkin).

Implementation:

Instrument services with OpenTelemetry SDK — auto-instrumentation handles most HTTP/gRPC calls.
Propagate context at every service boundary — missing propagation = broken traces.
Sample strategically — 100% sampling is expensive; use head-based or tail-based sampling.

Tools: Jaeger, Zipkin, AWS X-Ray, Datadog APM, Grafana Tempo.
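
To make context propagation concrete, the sketch below parses a W3C `traceparent` header (version, 32-hex trace ID, 16-hex parent span ID, 2-hex flags) and builds the header a service would send downstream. It is a hand-rolled illustration with hypothetical function names; in practice the OpenTelemetry SDK's propagators handle this.

```python
import re
import secrets

TRACEPARENT_RE = re.compile(r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def parse_traceparent(header: str):
    """Return the header's fields as a dict, or None if it is malformed."""
    match = TRACEPARENT_RE.match(header.strip())
    if not match:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # All-zero IDs are invalid in the W3C format.
    if trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    return {"version": version, "trace_id": trace_id, "parent_id": parent_id, "flags": flags}


def child_traceparent(incoming):
    """Build the header for an outgoing call: same trace ID, new span ID."""
    context = parse_traceparent(incoming) if incoming else None
    trace_id = context["trace_id"] if context else secrets.token_hex(16)  # start a new trace
    flags = context["flags"] if context else "01"  # assumption: new traces are sampled
    return f"00-{trace_id}-{secrets.token_hex(8)}-{flags}"


incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
print(child_traceparent(incoming))
```

The trace ID stays constant across every hop while each service contributes its own span ID, which is what lets the backend stitch spans into one trace.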

25 What is eventual consistency and how do you reason about it?

Eventual consistency means that if no new writes are made, all replicas will eventually converge to the same value. There is no guarantee of how long convergence takes.

Implications for SREs:

Read-your-writes consistency is not guaranteed — a user may not see their own recent write immediately.
Stale reads can cause user-visible issues (e.g., deleted item still appears).
Conflict resolution strategy matters — last-write-wins, vector clocks, CRDTs.

When to accept it: user preferences, shopping carts, counters, social feeds — where a brief inconsistency is acceptable for the availability and performance benefit.

When to avoid it: financial transactions, inventory management, anything requiring strong read-after-write guarantees.
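
For counters, a CRDT shows how replicas can converge without coordination, where last-write-wins would silently drop one replica's increments. Below is a minimal grow-only counter (G-Counter) sketch; the class is illustrative and not taken from any particular library.

```python
class GCounter:
    """Grow-only counter: each replica increments its own slot; merge takes per-slot maxima."""

    def __init__(self, replica_id: str):
        self.replica_id = replica_id
        self.counts = {}  # replica id -> increments observed from that replica

    def increment(self, amount: int = 1) -> None:
        self.counts[self.replica_id] = self.counts.get(self.replica_id, 0) + amount

    def merge(self, other: "GCounter") -> None:
        # max() makes merging commutative, associative, and idempotent.
        for replica_id, count in other.counts.items():
            self.counts[replica_id] = max(self.counts.get(replica_id, 0), count)

    @property
    def value(self) -> int:
        return sum(self.counts.values())


replica_a, replica_b = GCounter("a"), GCounter("b")
replica_a.increment(3)   # writes accepted independently, e.g. during a partition
replica_b.increment(2)

replica_a.merge(replica_b)
replica_b.merge(replica_a)
print(replica_a.value, replica_b.value)
```

After exchanging state in either order, both replicas report the same total, and repeating a merge changes nothing.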

26 How do you design a deployment strategy for zero-downtime releases?

Rolling deploy — replace instances incrementally. Kubernetes default. Fast, but both versions run simultaneously — requires backwards-compatible API changes.

Blue/green — run two identical environments; switch traffic at the load balancer. Instant cutover; instant rollback; doubles infrastructure cost during deploy.

Canary — route a small % of traffic (1%, 5%) to the new version first. Monitor error rates, latency, business metrics. Expand gradually. Best for high-risk changes.

Feature flags — deploy code dark; enable feature separately. Decouples deployment from release. Requires flag management overhead.

Database migrations — always use expand-contract pattern: add new column (expand), migrate data, update code to use new column, remove old column (contract). Never change column type or remove columns in a single deploy.

Canary + feature flags = maximum control. Most orgs should combine both for high-risk releases.

⚙️

TOIL & AUTOMATION 4 questions

27 How do you identify and prioritize toil reduction work?

Identify toil by asking:

Is this manual and could it be automated?
Is it repetitive — do I do this same task regularly?
Is it tied to service scale — does it grow as traffic or customers grow?
Does it have no lasting value — am I in exactly the same position after I finish?

Tracking: log toil tasks and time spent for a few weeks. Spreadsheet or Jira is fine. Data makes the case to management.

Prioritization: score each task by frequency × time cost × error risk and automate the highest scores first. A manual cert rotation that happens 20 times/month and takes 1 hour each time (20 hours/month) outranks a monthly 15-minute task.
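
The prioritization rule can be turned into a simple ranking over a toil log. In this sketch the task list, the 1-to-3 error-risk scale, and the `toil_score` helper are all assumptions for illustration:

```python
def toil_score(runs_per_month: int, minutes_per_run: int, error_risk: int) -> int:
    """Frequency x time cost x error risk; higher means automate sooner."""
    return runs_per_month * minutes_per_run * error_risk


# (task, runs per month, minutes per run, error risk on an assumed 1-3 scale)
toil_log = [
    ("monthly report export", 1, 15, 1),
    ("manual cert rotation", 20, 60, 3),
    ("restart flapping service", 30, 5, 2),
]

ranked = sorted(toil_log, key=lambda task: toil_score(*task[1:]), reverse=True)
for name, runs, minutes, risk in ranked:
    print(f"{name}: score {toil_score(runs, minutes, risk)}, {runs * minutes / 60:.1f} h/month")
```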

28 What is runbook automation and how does it differ from full automation?

Runbook — documented manual steps for a known operational task. Reduces cognitive load and error rate; doesn't eliminate toil.

Runbook automation — a script or tool that executes runbook steps, triggered by a human. Human approves the action; machine executes it consistently. Good middle ground when full automation is risky.

Full automation — system detects condition and executes remediation without human involvement. Highest toil reduction; requires high confidence the automation is correct and safe.

Progression: undocumented → runbook → runbook automation → full automation. Each step reduces toil. Not everything should reach full automation — some tasks need human judgment.

29 How do you prevent automation from creating new failure modes?

Dry-run mode first — run automation in observe-only mode before giving it write permissions.
Blast radius limits — automation that restarts services should restart 1 at a time, not all at once. Rate-limit destructive operations.
Idempotency — automation should be safe to run multiple times without causing double-action.
Manual override — always provide a kill switch or pause mechanism. Automation that can't be stopped is dangerous.
Alerting on automation actions — log and alert on automated changes so humans know what the system is doing.
Test in staging — automation failures in prod are a new category of incident.

Automation is code; treat it with the same review, testing, and on-call ownership as production services.

30 How do you implement auto-remediation safely?

Auto-remediation (system detects + fixes without human) is powerful but carries risk of making a bad situation worse.

Safe implementation pattern:

High-confidence signal only — only trigger on unambiguous conditions (pod OOMKilled, disk full). Avoid triggering on ambiguous metrics.
Limit scope — restart one pod, not the whole deployment. Delete one unhealthy node, not the cluster.
Cooldown period — don't trigger the same remediation repeatedly within a window. If the auto-fix keeps firing, escalate to human.
Audit log every action — auto-remediation that acts silently is a debugging nightmare.
Alert even when it works — success notifications tell on-call "this happened and was handled; review it."
Escalate if remediation fails — if auto-fix runs N times and the problem persists, page a human immediately.
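
The cooldown, audit-log, and escalation rules above can be sketched as a small wrapper around a remediation action. Everything here is hypothetical (class name, defaults, callbacks), and a production version would also reset the attempt counter after a period of health.

```python
import time


class Remediator:
    def __init__(self, action, escalate, cooldown_s=300.0, max_attempts=3,
                 clock=time.monotonic):
        self.action = action        # e.g. restart one pod
        self.escalate = escalate    # e.g. page a human
        self.cooldown_s = cooldown_s
        self.max_attempts = max_attempts
        self.clock = clock
        self.attempts = {}          # target -> number of remediations run
        self.last_run = {}          # target -> time of the last remediation
        self.audit_log = []         # every decision is recorded, including skips

    def handle(self, target: str) -> str:
        now = self.clock()
        last = self.last_run.get(target)
        if last is not None and now - last < self.cooldown_s:
            outcome = "cooldown"    # too soon since the last attempt: do nothing
        elif self.attempts.get(target, 0) >= self.max_attempts:
            self.escalate(target)   # the fix keeps firing: hand over to a human
            outcome = "escalated"
        else:
            self.attempts[target] = self.attempts.get(target, 0) + 1
            self.last_run[target] = now
            self.action(target)
            outcome = "remediated"
        self.audit_log.append((now, target, outcome))
        return outcome
```

Scope stays limited because the action receives a single target, so one call never touches more than one pod or node.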

💬

BEHAVIORAL & CULTURE 5 questions

31 How do you balance reliability work vs. feature development?

Use the error budget as a shared, objective measure. It removes the "SRE vs dev" dynamic and replaces it with a shared constraint both teams can see.

Budget healthy → dev ships freely, risky experiments allowed.
Budget burning → pause non-critical releases, prioritize reliability fixes.
Budget exhausted → feature freeze until reliability is restored.

Frame reliability as investment: "5% of engineering time buys us the ability to ship confidently the other 95% of the time." Track and display error budget burn in a shared dashboard so it's never a surprise.

If you don't have SLOs yet, error budget conversations become subjective. The first step is always: define and measure your SLOs.

32 Tell me about a time you reduced toil or automated a manual process.

Use STAR format: Situation, Task, Action, Result.

Strong answer elements:

Quantify the toil — "2 hours per deploy, 3 deploys/week = 6 hours/week = 300+ hours/year."
Describe what you automated and the technical approach — runbook → script → CI job → fully automated.
Measure the outcome — time saved, error rate reduction, deploys per day increased.
Note secondary benefits — faster deploys → faster feature delivery, fewer human errors.

Also highlight: toil reduction often reveals design problems. "I automated the manual restart, then redesigned the service so restarts weren't needed" is a stronger story than just the automation.

33 How do you handle disagreement with a developer about a risky release?

Lead with data — show the current error budget, risk model, and historical impact from similar changes.
Propose alternatives — canary deploy, feature flag, off-hours release with a rollback plan. Don't just say no.
Make risk explicit and shared — "If this causes a P1, here is the blast radius and estimated MTTR." Ensure the developer and PM own the decision with full information.
Escalate if needed — if risk is high enough and agreement can't be reached, escalate to engineering leadership with the data.

Never block unilaterally without data — SRE role is to inform and recommend, not to be a gatekeeper without reason. The goal is a good outcome, not winning the argument.

34 How do you build a blameless postmortem culture from scratch?

Model from the top — engineering leadership must visibly celebrate blameless postmortems and never punish disclosed mistakes.
Separate the person from the system — "What conditions allowed this to happen?" replaces "Who broke it?"
Make it safe to be the cause — if engineers fear punishment, they hide information critical to root cause analysis.
Make postmortems visible and useful — share broadly; reference them in design reviews as lessons learned.
Require follow-through — a postmortem without completed action items trains people to see them as theater. Track items in your issue tracker.
Celebrate good catches — publicly recognize engineers who find systemic issues or write thorough postmortems.

Culture change is slow. The first blameless postmortem after a serious incident sets the tone. Handle it well and you'll get honest postmortems forever. Handle it badly and you'll get cover-ups.

35 How do you onboard a new service onto the SRE team's support?

SRE teams shouldn't accept on-call for a service that isn't production-ready. A production readiness review (PRR) should cover:

Observability — does the service have meaningful SLIs, dashboards, and alerts? Can we tell when it is degraded?
Runbooks — are common failure modes documented with clear remediation steps?
Capacity — is the service load tested? Is there autoscaling? What is the max traffic before degradation?
Deployment — can it be deployed and rolled back safely? Zero-downtime?
Dependencies — are upstream and downstream dependencies documented? Is there graceful degradation?
Security — secrets managed properly? Least-privilege IAM? Encryption at rest/in transit?
Error budget — SLOs defined and error budget tracking in place?

Services that don't meet the bar get a remediation plan before SRE takes on-call ownership.