What is Site Reliability Engineering?

Site reliability engineering (SRE) has already gained traction in the IT industry. It is a cloud-native software delivery model that keeps pace with modern IT operations. The site reliability engineering role is also no longer limited to IT and is expanding into several other industries. Today SRE has turned into a full-fledged IT domain that uses automation for IT operations such as performance planning, on-call monitoring, disaster response, and capacity planning.

The term SRE entered the tech lexicon through Google’s VP Benjamin Treynor Sloss and his team. The primary goal of site reliability engineering is to let software developers take ownership of production applications. This makes things as fast as possible on the developers’ end while reducing the chance that something blows up in production for the operations team.

To explain further, SRE combines infrastructure automation with continuous delivery. It is best suited to cloud-native and SaaS companies, and it moves many IT Ops responsibilities to the development team. Though it sounds similar to DevOps, SRE differs from it in ways we will discuss later.

Google describes site reliability engineers as people who “automate their way out of a job.” SRE applies software engineering methods to system administration activities, so it works as a bridge between development and IT operations. From Google’s point of view, it is a strict process that splits total time between development activities and operational / on-call activities. Google caps the time spent on operational work at 50%. Beyond this limit, it considers the system to be in ill health.
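The 50% limit on operational work can be checked with simple arithmetic. Below is a minimal Python sketch, assuming a team logs its hours in two buckets. The `TimeLog` structure, its field names, and the helper functions are hypothetical and exist only for illustration; they are not part of any Google tooling.

```python
from dataclasses import dataclass

# The 50% cap on operational work described in the text above.
OPS_CAP = 0.50


@dataclass
class TimeLog:
    """Hours logged by a team over some period (hypothetical structure)."""
    engineering_hours: float  # development and automation work
    operational_hours: float  # on-call, tickets, manual operations


def ops_fraction(log: TimeLog) -> float:
    """Return the share of total logged time spent on operational work."""
    total = log.engineering_hours + log.operational_hours
    if total <= 0:
        raise ValueError("no hours logged")
    return log.operational_hours / total


def is_overloaded(log: TimeLog, cap: float = OPS_CAP) -> bool:
    """True when operational work exceeds the cap."""
    return ops_fraction(log) > cap


if __name__ == "__main__":
    week = TimeLog(engineering_hours=30, operational_hours=50)
    print(f"ops share: {ops_fraction(week):.0%}, overloaded: {is_overloaded(week)}")
```

A team exactly at the cap is treated as healthy here, since the text only flags time spent beyond the limit.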

Site reliability engineering is also a specific approach aimed mainly at large-scale, cloud-native systems and their IT operations. Using service level objectives (SLOs), which form part of a service level agreement (SLA), the SRE model sets up productive interactions between the SRE and development teams.

The error budget is another vital concept here. It balances the pace of change in an application against its reliability. According to Google, the business must establish an availability target for a system as part of the process. Once that target is set, one minus the availability target is the error budget. For example, an availability target of 99.99% means 0.01% unavailability. This 0.01% is a budget that the development team can spend on anything they want, as long as they do not overspend it.
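The error budget arithmetic above can be sketched in a few lines of Python. The function names and the 30-day measurement window are assumptions made for illustration; the text does not prescribe a window length.

```python
def error_budget(availability_target: float) -> float:
    """Error budget as a fraction: one minus the availability target."""
    if not 0.0 < availability_target <= 1.0:
        raise ValueError("availability_target must be in (0, 1]")
    return 1.0 - availability_target


def budget_minutes(availability_target: float, window_days: int = 30) -> float:
    """Allowed downtime in minutes over a window (30 days is an assumption)."""
    window_minutes = window_days * 24 * 60
    return error_budget(availability_target) * window_minutes


if __name__ == "__main__":
    target = 0.9999  # the 99.99% example from the text
    print(f"budget fraction: {error_budget(target):.4%}")
    print(f"allowed downtime over 30 days: {budget_minutes(target):.2f} minutes")
```

With a 99.99% target, the budget works out to roughly 4.32 minutes of downtime over a 30-day window, which shows how little room each additional "nine" leaves.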

SRE is also a collaborative approach in which product developers can ensure that the designed solution meets non-functional requirements. These may include:

· Availability

· Performance

· Security

· Maintainability

· Release management, such as checking the efficiency of the software delivery pipeline.

Typical responsibilities that come under site reliability engineering are:

· Proactively monitoring application performance

· Reviewing application performance

· Handling on-call and emergency support

· Checking logging and diagnostics

· Creating and maintaining operational runbooks

· Helping with escalated support tickets

· Working on support tickets to resolve defects and other development tasks

How does SRE work?

Google’s site reliability engineering team describes it as follows:

“SRE is fundamentally doing work that has historically been done by an operations team, but using engineers with software expertise, and banking on the fact that these engineers are inherently both predisposed to and have the ability to substitute automation for human labor. We have a bunch of rules of engagement and principles for how SRE teams interact with their environment — not only the production environment, but also the development teams, the testing teams, the users, and so on. Those rules and work practices help us to keep doing primarily engineering work and not operations work.”

In other words, the activities of the SRE team are very similar to those of an operations team, which involve:

· Verifying system availability

· Checking performance

· Latency measurement

· System monitoring

· Measuring the efficiency

· Emergency response

· Change management

· Capacity planning.

Site reliability engineering automates the above IT operations using programming languages, algorithms, and data structures, which are core areas of software engineering expertise.

Moreover, the SRE model works in a balanced way, maintaining several metrics and team dynamics. In a typical SRE model:

· Product development teams start their own services and handle on-call for incidents themselves.

· When the service reaches a mature or high-traffic state, the development team asks the SRE team for support. The SRE team then takes over running the service in production.

· The product owner defines a service-level objective (SLO) based on the acceptable downtime.

· The acceptable downtime is the error budget for the service. The development team spends this error budget for various purposes, for example trying new features or improving operability. However, if service downtime exceeds the budget, the development team cannot make any further changes.
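The error budget rule in the last step above can be expressed as a simple release gate. The following Python sketch assumes downtime is tracked in minutes over a fixed window; the `ServiceWindow` class and the `can_release` function are hypothetical names used only for illustration.

```python
from dataclasses import dataclass


@dataclass
class ServiceWindow:
    """Downtime tracking for one service over one window (hypothetical)."""
    slo_target: float        # availability objective, e.g. 0.999
    window_minutes: float    # length of the measurement window
    downtime_minutes: float  # downtime observed so far in the window

    @property
    def budget_minutes(self) -> float:
        """Total downtime the SLO allows in this window."""
        return (1.0 - self.slo_target) * self.window_minutes

    @property
    def remaining_minutes(self) -> float:
        """Budget left to spend; negative means the budget is overspent."""
        return self.budget_minutes - self.downtime_minutes


def can_release(window: ServiceWindow) -> bool:
    """Allow further changes only while downtime stays within the budget."""
    return window.downtime_minutes <= window.budget_minutes


if __name__ == "__main__":
    thirty_days = 30 * 24 * 60
    healthy = ServiceWindow(0.999, thirty_days, downtime_minutes=10)
    overspent = ServiceWindow(0.999, thirty_days, downtime_minutes=50)
    print(can_release(healthy), round(healthy.remaining_minutes, 1))
    print(can_release(overspent), round(overspent.remaining_minutes, 1))
```

Keeping the gate as a pure function of measured downtime and the agreed SLO means developers and SREs can both compute the same answer, which is what keeps the arrangement honest.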

This arrangement creates a compelling dynamic in the overall process. It not only addresses operational problems rapidly but also keeps product owners honest about the required SLO.

SRE vs. DevOps

In the context of site reliability engineering, SRE vs. DevOps is an unavoidable discussion point. There are many operational similarities between the two, and many claim that SRE is a replacement for DevOps. However, there is a clear difference between the two approaches.

· Traditionally, DevOps is more about collaboration between developers and operations, and it has focused more on deployments. Site reliability engineering, by contrast, focuses more on operations and monitoring.

· Where DevOps and SRE go hand in hand, DevOps helps with the configuration, deployment, racking, etc., of servers and applications. Site reliability engineers can then handle daily operations once the setup is done.

It is not unusual for a company to use both DevOps and SRE, especially if it is not using the cloud.

What are the tools used for SRE?

There is no single, specific SRE toolset. Any organization that looks to build out an SRE function should define its own tools. Both processes and tools are vital, along with standardization and automation, for scalability, repeatability, and other reasons.

Skills required for SRE

SRE is not a line item, and finding the right skill set for it is a challenge. It is undoubtedly a high-skill activity, and SRE experts are in short supply in the market. It calls for an unusual mix of talent: in-depth technical knowledge along with customer-focused attention to SLOs and error budgets. One important point to remember is that IT operations must be considered a value center rather than a target for cost reduction. In that context, IT operations can add value to a company by maximizing revenue and avoiding downtime.

Thus, SRE demands the following of a professional:

· A high overall level of professional skill

· Experience

· Commitment

· Automation skill

· Network engineering and Unix system administration

· Good software skills

· A team is roughly a 50-50 mix of people with more of a software background and people with more of a systems engineering background, which, together with a customer-centric approach, seems to be a good combination.

SRE in the cloud startup IT Ops

Moving to cloud-based platforms and delivery models brings an array of tasks. These tasks also come with automation options and a range of DevOps approaches, which can be confusing. As the SRE model is a clear, specific set of practices with defined team dynamics, it works well for large organizations. Enterprises with a traditional setup that want to move to cloud-native IT operations can streamline the process with the help of SRE. Besides, SRE helps to bypass organizational delivery models.

SRE as a service

SRE comes with many benefits, but it demands a specialized skill set, which is expensive. Although Google runs SRE with its in-house team, SRE-as-a-service is an emerging area for many large organizations, and some capable outsourced managed service providers already offer it. The SRE-as-a-service model is, however, a little unusual next to in-house DevOps approaches.

However, considering SRE operating procedures such as the SLO, it is a cost-effective solution. The SLO and the other standard operating procedures at the heart of the SRE approach map well onto a commercial contract. These contracts are quite different from typical outsourced IT operations contracts, and a managed SRE contract includes clear terms.

How does managed SRE help in the process? An SRE provider helps the development team improve operations before the production release through a time-and-materials arrangement. Sometimes a managed SRE provider also uses tools to automate the standard IT Ops needed to run the production software.

Final thought

SRE is a great way to avoid spending a lot of time chasing bugs. Today the role of a software developer is not limited to coding. Developers also actively take part in software deployments, application monitoring, and production operations. One of the primary reasons is the availability of tools that make it extremely easy to deploy applications and monitor them.

Today, IT operations exist in most medium to large enterprises, but the type of work they do keeps changing. They are moving with the latest technologies such as cloud, containers, and PaaS. Thus, including SREs in a new release can help the team understand the changes in that release and can expedite troubleshooting of problems associated with it.