Site Reliability Engineering
Details
- Full Name
-
Site Reliability Engineering (SRE)
- Also known as
-
"Operations as a software problem", Google SRE
Core Concepts:
- Operations as a software problem
-
Apply software engineering to operations work instead of manual administration.
- SLI / SLO / SLA
-
Service Level Indicators measure behavior; Objectives set internal targets; Agreements are external commitments.
- Error budget
-
100% reliability is the wrong target; the allowed unreliability (1 − SLO) is a budget spent on feature velocity and risk.
- Embrace risk
-
Reliability is balanced against the cost and speed of change, not maximized blindly.
- Eliminate toil
-
Reduce repetitive, manual, automatable operational work; cap operational load (~50%) to protect engineering time.
- Blameless postmortems
-
Learn from incidents by analyzing systems and processes rather than assigning blame.
- Monitoring & observability
-
Measure the four golden signals — latency, traffic, errors, saturation.
- Release & capacity engineering
-
Automate launches, rollouts, and capacity planning to make change safe and repeatable.
- Key Proponents
-
Ben Treynor Sloss (coined the term at Google); Betsy Beyer, Chris Jones, Jennifer Petoff, Niall Richard Murphy (editors of "Site Reliability Engineering", O’Reilly 2016; see also "The Site Reliability Workbook", 2018)
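The error-budget idea listed under Core Concepts (allowed unreliability = 1 − SLO) can be made concrete with a little arithmetic. The sketch below is illustrative only: the `SloWindow` class, its field names, and the sample numbers are assumptions made for this example, not part of any SRE tooling or of the books cited above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SloWindow:
    """One SLO evaluation window (hypothetical structure for illustration)."""
    slo_target: float   # e.g. 0.999 means 99.9% of events must be good
    good_events: int    # events that met the SLI criterion
    total_events: int   # all valid events observed in the window

    @property
    def sli(self) -> float:
        # Measured indicator: fraction of good events.
        # An empty window is treated as fully good (an assumption).
        if self.total_events == 0:
            return 1.0
        return self.good_events / self.total_events

    @property
    def budget_events(self) -> float:
        # Error budget expressed in events: (1 - SLO) * total events.
        return (1.0 - self.slo_target) * self.total_events

    @property
    def budget_remaining(self) -> float:
        # Fraction of the budget still unspent; negative once overspent.
        bad = self.total_events - self.good_events
        if self.budget_events == 0:
            # No budget at all: fine only if nothing failed.
            return 1.0 if bad == 0 else float("-inf")
        return 1.0 - bad / self.budget_events


def allowed_downtime_minutes(slo_target: float, window_days: int = 30) -> float:
    """Time-based view of the same budget: minutes of full outage tolerated."""
    return (1.0 - slo_target) * window_days * 24 * 60


# Sample numbers are made up for the example.
window = SloWindow(slo_target=0.999, good_events=999_600, total_events=1_000_000)
print(f"SLI: {window.sli:.4%}")
print(f"Budget remaining: {window.budget_remaining:.1%}")
print(f"Allowed downtime over 30 days: {allowed_downtime_minutes(0.999):.1f} min")
```

Expressing the budget both in events and in minutes is a design choice for the sketch: request-driven services usually reason in events, while availability targets are often communicated as downtime per window.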
When to Use:
-
Operating production services where reliability must be measured and managed
-
Defining SLOs and error budgets to balance reliability against feature velocity
-
Establishing on-call, incident response, and blameless postmortem practices
-
Reducing operational toil through automation
-
Distinguishing reliability ownership from generic DevOps culture
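For the toil-reduction use case, a team needs some way to tell whether operational load is above the roughly 50% cap mentioned under Core Concepts. The following is a minimal sketch under stated assumptions: the category names, the hour figures, and the helper names `operational_fraction` and `over_cap` are hypothetical, and which categories count as operational work is a team decision rather than something this document defines.

```python
OPS_LOAD_CAP = 0.5  # the ~50% cap on operational load from Core Concepts


def operational_fraction(hours_by_category: dict[str, float],
                         operational_categories: set[str]) -> float:
    """Share of tracked hours spent on categories classed as operational work."""
    total = sum(hours_by_category.values())
    if total == 0:
        return 0.0  # nothing tracked, so nothing to flag
    ops = sum(hours for category, hours in hours_by_category.items()
              if category in operational_categories)
    return ops / total


def over_cap(fraction: float, cap: float = OPS_LOAD_CAP) -> bool:
    """True when operational load exceeds the cap and engineering time is at risk."""
    return fraction > cap


# Hypothetical one-week time log for a single engineer (hours).
week = {"tickets": 12, "on_call": 10, "manual_deploys": 6, "project_work": 12}
ops_categories = {"tickets", "on_call", "manual_deploys"}

fraction = operational_fraction(week, ops_categories)
print(f"Operational load: {fraction:.0%}, over cap: {over_cap(fraction)}")
```

A result above the cap would be a signal to prioritise automation of the largest operational category, in line with the toil-elimination principle.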
Current Status:
-
The canon has held up well: the 2016 SRE Book and 2018 Workbook (SLOs, error budgets, toil, postmortems) remain authoritative, and Google publishes all three books, including "Building Secure and Reliable Systems" (2020), free at sre.google/books.
-
What has changed since the books is the organisational frame: SRE practice increasingly converges with platform engineering, with reliability capabilities embedded in internal platforms rather than delivered solely by standalone SRE teams, a shift reflected in Google’s own platform-engineering guidance.