Site Reliability Engineering: What Is It? Why Is It Important for Online Businesses?

Hell hath no fury like a customer who cannot access an online service when they want to. Customers expect online services to always be on, always be accessible, and always treat them as if no one else in the world matters more. Thank heavens, then, that these services can use site reliability engineering (SRE) to keep their customers happy, engaged, and most importantly, feeling valued.


Human beings are fickle. We like something one minute and have an intense disagreement with it the next. If we use an online service, we want it to dedicate all its attention (and CPU and threads) to us. The slightest appearance of divided attention (or performance lag) is enough to make us jump ship and look for the next online service. So many relationships between man and online services have been frayed, and eventually destroyed, only because we couldn’t bear the thought of sharing a service’s processing power with someone else. Of course, online services don’t want us to leave, but how do they make sure they don’t have to mend broken hearts? Fortunately, there is a particularly effective form of relationship therapy they can use. It’s called site reliability engineering.

What is site reliability engineering?

Site reliability engineering (SRE) is an operational model in which a dedicated team of reliability-focused engineers runs online services more reliably. These engineers work across the realm of “Anything-as-a-Service.” Their reliability-focused work does not discriminate between infrastructure, software, networking, or platforms. If something is going to affect a customer's perception of service reliability, site reliability engineers are looking after it.

Site reliability engineering, as a way of ensuring system agility, availability, and performance, attempts to bring as much of the system into a resilient, predictable, and measurable state as is possible. It supports an organization’s capability to sustain an appropriate level of reliability for its services by implementing and continually enhancing data-driven production feedback loops.

Production feedback loops, you say?

Webster defines feedback as “the transmission of evaluative or corrective information about an action, event, or process to the original or controlling source.” That’s a mouthful for something that could easily be described as “frequent communication.”

Production feedback loops, within the context of an online system, are communication mechanisms between teams and the people who constitute them. In a hyper-competitive economic landscape, riddled with more competitors than you can shake a stick at, technology teams (development and operations) that are not constantly talking to each other are hastening the end of their business.

Historically, a poorly configured feedback loop has been a significant reason for development and operations to have no sense of shared ownership. In recent times, philosophies like Agile and DevOps have alleviated some of these concerns, but old habits die hard. People have a natural tendency to fall back into their own cocoon at the first opportunity. Site reliability engineers play an intermediary role between these cocooned silos. Their effectiveness comes from their dedicated purpose of establishing and maintaining production feedback loops. As part of that responsibility, they collect, aggregate, synthesize, analyze, and report on data from production servers and ensure both development and operations teams are aware of the state the systems are in.
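To make the "collect, aggregate, and report" step a little more concrete, here is a minimal Python sketch of one way raw production samples might be rolled up into a single summary that development and operations can both read. Everything in it is an assumption made for illustration: the RequestSample record, its fields, the summarize function, and the choice of success rate and 95th-percentile latency as the reported figures do not come from any particular monitoring tool.

```python
import math
from collections import Counter
from dataclasses import dataclass


@dataclass
class RequestSample:
    """One observed request. All fields are hypothetical, for illustration."""
    server: str        # name of the production server that handled the request
    latency_ms: float  # how long the request took, in milliseconds
    ok: bool           # True if the request succeeded


def summarize(samples):
    """Aggregate raw samples into one summary shared by dev and ops."""
    if not samples:
        raise ValueError("no samples to summarize")

    total = len(samples)
    failures = [s for s in samples if not s.ok]
    latencies = sorted(s.latency_ms for s in samples)

    # Nearest-rank 95th percentile: the smallest observed latency that is
    # greater than or equal to 95% of all observations.
    p95_index = math.ceil(0.95 * total) - 1

    return {
        "requests": total,
        "success_rate": (total - len(failures)) / total,
        "p95_latency_ms": latencies[p95_index],
        "failures_by_server": dict(Counter(s.server for s in failures)),
    }
```

The point of the sketch is the shape of the loop rather than the specific metrics: both teams look at the same numbers, produced the same way, from the same production data.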

Production feedback loops rely on data, not opinion

Complete power corrupts completely, they say. Experience with systems and familiarity with them tend to do something similar: rather than corrupting us, they lead us to favor instinct over data. For effective site reliability engineering, instinct must be derived from proactive data analysis of system performance. As the environment changes and understanding of the system deepens, the measurements themselves must also be adjusted to fit the new circumstances.

Appropriate level of reliability

Many online systems are expected to “always be available,” and one of the differences between an online service that succeeds and one that fails is the effort the success stories put into “always being available.” Does this mean that their infrastructure never fails? That’s like saying a barking dog never bites. It does, and I have teeth marks to prove it.

Amazon, Google, Alibaba, Facebook: they all have outages, but the cloak of invisibility they wrap around their infrastructure's failures makes those outages go unnoticed unless they are prolonged. Herein lies their promise of providing “appropriate levels of reliability.”

To provide reliability at appropriate levels, an online service must consider the nature of its business, the users it targets, and the cost of keeping the lights on. To achieve an optimal answer for this triumvirate, site reliability engineering tracks an outage “budget.”

The outage “budget” is primarily determined by two things: the service level indicator (SLI) and the service level objective (SLO). The SLI is a question of what is measured and where, while the SLO is the range of acceptable values for the SLI within a given time period.
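As a rough illustration of how an SLI and an SLO might combine into an outage budget, the Python sketch below treats the budget as whatever the SLO leaves over: a target of 99.9% permits 0.1% unreliability in the window. This arithmetic is a common convention and an assumption here, not a formula given in this note. The function names, the request-based availability SLI, and the 30-day window are likewise hypothetical.

```python
def allowed_downtime_minutes(slo_target, window_days):
    """Minutes of full outage the SLO tolerates over the window."""
    if not 0.0 < slo_target <= 1.0:
        raise ValueError("slo_target must be in (0, 1]")
    return (1.0 - slo_target) * window_days * 24 * 60


def availability_sli(good_requests, total_requests):
    """SLI: fraction of requests served successfully."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    return good_requests / total_requests


def budget_remaining(slo_target, good_requests, total_requests):
    """Fraction of the outage budget still unspent (negative if overspent)."""
    if not 0.0 < slo_target <= 1.0:
        raise ValueError("slo_target must be in (0, 1]")
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    allowed_bad = (1.0 - slo_target) * total_requests
    actual_bad = total_requests - good_requests
    if allowed_bad == 0:
        # A 100% target leaves no budget at all: any failure overspends it.
        return 0.0 if actual_bad == 0 else float("-inf")
    return 1.0 - actual_bad / allowed_bad


if __name__ == "__main__":
    # Hypothetical service: 99.9% target over a 30-day window.
    print(allowed_downtime_minutes(0.999, 30))
    print(availability_sli(999_500, 1_000_000))
    print(budget_remaining(0.999, 999_500, 1_000_000))
```

Under these assumptions, a 99.9% target over 30 days allows roughly 43 minutes of complete outage, and a service that has failed 500 of 1,000,000 requests has spent about half of its budget. A team might use the remaining fraction to decide whether to ship risky changes or to slow down and focus on stability.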

*Note: for a detailed discussion of SLIs and SLOs, read the note ‘SLO in Site Reliability Engineering.’

Site reliability engineering is just a longer word for DevOps?

In the opinion of this analyst, SRE and DevOps rest on the same fundamental principle: the setting of proper expectations between all stakeholders to avoid surprises and gotchas. While DevOps focuses on continuous delivery all the way to deployment, SRE focuses on continuous operations at the point of customer value creation.

Our Take

Whether site reliability engineering is a fancy wrapper on DevOps or its own standalone concept, it is an integral part of an online service’s success. It's no longer just “call help desk to solve your problem” but rather “yes, we know there is a chance of a problem occurring, we know what it is, we know why it is, and before it hits our users, we will resolve it.”


Want to know more?

SLO in Site Reliability Engineering

Implement DevOps Practices That Work