In this three-part series of articles we’re going to look at what SRE (Site Reliability Engineering), also known as Production Engineering, is; what an SRE (Site Reliability Engineer) does; the principles of SRE; and a real-world example of SRE practices in action. Then we’ll look at some other best practices for reliability, Kubernetes SLIs, metrics an SRE could monitor, and finally some common tools an SRE can have in their toolbox.
We’ll focus on SRE from a Kubernetes platform point of view, but these principles and practices can of course be applied to any systems and services an SRE is responsible for.
What is an SRE?
“SREs work to create procedures, practices, and tools that render software more reliable” — The Google SRE Handbook — https://sre.google/sre-book/table-of-contents/
An SRE (Site Reliability Engineer) is an engineer whose focus is on increasing or maintaining the reliability of systems. From a Kubernetes point of view, this concerns clusters and their associated infrastructure, as well as the systems and infrastructure that Kubernetes clusters interact with (e.g. Git repos, container registries, secrets management).
Your SRE is responsible for the full stack of system reliability, including uptime, performance, latency, incident management, outages, monitoring, change management, capacity management and more.
What does an SRE do?
“They automate their job away”
An SRE has some similarities to an Ops engineer or sysadmin. They are available (during the day or on call) to mitigate problems that occur in production systems to keep them highly available. This can include infrastructure changes, rollbacks, timed restarts and so on. When not firefighting, they are writing code (software engineering) to automate routine tasks and human labor (e.g. hydrating nodes every X days, provisioning new infrastructure, scaling nodes, node upgrades, control plane upgrades, addon upgrades) and to respond to problems automatically without human intervention.
Monitoring is a crucial piece of an SRE’s job. They will choose specific metrics to monitor and alert on, to inform them (or inform automation) that something is going wrong, or is about to, so they can respond in a timely manner.
The figure you will usually see thrown around is that an SRE aims to spend 50% of their time on Ops tasks like those mentioned above, and the other 50% on automation to reduce toil. We’ll look at toil in the next section.
Principles of SRE
Google, in their SRE handbook, lay out a number of SRE principles. Let’s look at these from a relatively high level.
Identify and Measure Key Metrics
Let’s look at what SLIs, SLOs and SLAs are.
An SLI (Service Level Indicator) is a numerical metric, measured over a time period, that indicates the service level of a specific part of a platform or service.
The Four Golden Signals that SLIs are commonly built on are:
- Latency (time taken by service to respond to request)
- Traffic (demand of service e.g. requests per second)
- Errors (percentage of failed requests e.g. 5XX) and
- Saturation (how utilized infrastructure components are, e.g. memory of a node).
A common metric we can use for our SLI could be the latency of a service.
So let’s say we expect the latency of calling a particular service to be under a threshold of 200ms. Now let’s say we want to measure the latency over a 5 minute period and we get the following:
requests equal or under 200ms: 299980
requests over 200ms: 20
total requests: 300000
So using the counting-bad-requests method, we can see our latency SLI for our service for that particular 5 minute period is 100 - (20/300000)*100 = 99.993%. By this we mean that 99.993% of requests were under our 200ms threshold. The problem with this method is that we have to define the threshold (200ms) up front for our indicator.
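A minimal sketch of the counting method in Python, using the request counts from the 5 minute window above:

```python
def latency_sli(good_requests: int, total_requests: int) -> float:
    """Return the percentage of requests that met the latency threshold."""
    if total_requests == 0:
        return 100.0  # no traffic means nothing violated the threshold
    bad_requests = total_requests - good_requests
    return 100 - (bad_requests / total_requests) * 100

# The counts from our 5 minute measurement window:
sli = latency_sli(good_requests=299980, total_requests=300000)
print(f"{sli:.3f}%")  # → 99.993%
```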
An alternative is to use a latency distribution histogram and look at the percentiles. Let’s consider a latency distribution histogram representing our service request latencies over 5 minutes, sorted from lower to higher, showing the number of requests that took x milliseconds. From this we can deduce that the 50th percentile of requests were served in 90ms or less, the 90th percentile in 140ms or less, the 95th percentile in 190ms or less, and the 99th percentile in 195ms or less.
This gives us a lot more information about the performance of our service. Half our users are being served in 90ms or less, so we can focus on why the other half are seeing longer latencies. Using our histogram we don’t need to make an upfront decision on a threshold for our SLI; instead we can look at the histogram and say that the “latency of service X valid requests over the past 5 minutes should be ≤ 200ms”.
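To make the percentile idea concrete, here is a small sketch using the nearest-rank method. The latency samples are synthetic, generated purely for illustration; in practice they would come from your monitoring system’s histogram buckets.

```python
import math
import random

def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample value such that
    at least p% of all samples are less than or equal to it."""
    ranked = sorted(samples)
    k = max(0, math.ceil(p / 100 * len(ranked)) - 1)
    return ranked[k]

# Synthetic request latencies in ms (illustration only).
random.seed(1)
latencies = [max(1.0, random.gauss(90, 40)) for _ in range(10_000)]

for p in (50, 90, 95, 99):
    print(f"p{p}: {percentile(latencies, p):.0f} ms")
```

Percentiles like p50 and p99 answer “how slow is it for the slowest X% of users”, which is far more actionable than a single average.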
SLIs have to be chosen carefully and ideally map to a user’s experience when using our service. For example, a metric counting the growing number of Kubernetes objects in a cluster might not directly influence a user’s experience, so it might not be a good candidate for an SLI, but it might be a good metric to alert on, indicating that something in the cluster is spiraling out of control and may soon cause issues. The latency of a service, in contrast, has a direct effect on the user’s experience when accessing our services.
Based on our SLI, we can now come up with our SLO (Service Level Objective). This is our internal reliability goal for this particular SLI, agreed with our Product Owner for this particular service, that is achievable and that we want to meet over a period of time (week, month, quarter, year). For our service we can say the “99.9th-percentile latency for valid user requests, averaged over a trailing five-minute time window, will be less than 200 milliseconds”.
Lastly, our SLA (Service Level Agreement) is our agreement with our customers on the availability of our service. Here we can say “99th-percentile latency for good user requests, averaged over a trailing five-minute time window, will be less than 200 milliseconds”. In other words, 99% of requests over the whole year will be under 200ms. As you can see, we are giving ourselves some buffer between our internal SLO’s 99.9% and our customer agreement of 99%. Not meeting our SLA can result in penalties such as refunds or service credits and, even worse, loss of customers.
Metrics that usually make good SLIs in our Kubernetes clusters include latency, availability, throughput and error rates.
Before deciding on the thresholds for your SLIs, it’s important to perform benchmarking at different times (time of day, time of year, e.g. Black Friday) to ensure the SLIs stand up to the range of conditions they are likely to see as part of their normal routine.
While we said in the previous section on SLIs that we want to measure key indicators which map to our customers’ satisfaction, we also want to monitor anything else that can give us early warning signs that things are about to go bad (memory, CPU, requests per second, max nodes in a node pool/auto-scaling group etc.) or that we need to provision extra capacity (available IP addresses, reserved instance types). While these other metrics may not tell us directly that things are about to go bad, they can help with automation and debugging. Your logs will also be an important part of your debugging toolkit, and some monitoring systems will let you drill down from your dashboards into the related logs. Some can also create custom metrics (e.g. status codes from logs) and monitors (e.g. errors in logs) directly from your logs.
An SRE has to be careful when creating monitors and alerting on them, as noise can be a big problem. You don’t want to get bombarded throughout the day with emails from non-production systems, and you don’t want incidents raised if, say, logs and metrics are delayed as a one-off. Consider daily digest emails for such monitors, and raise incidents (email/call/text) only when really needed.
One of the benefits of using an orchestration system like Kubernetes, and a managed Kubernetes offering like AKS, EKS, GKE etc., is that many of the tedious tasks that Operations traditionally had to perform have been automated away, e.g. application auto-scaling (HPA and VPA), node scaling (Cluster Autoscaler), control plane and node upgrades, and Pod Disruption Budgets. Add in community addons such as Velero and you get other operations tasks like backups automated too. Others, like Prometheus, give you the ability to monitor your clusters.
So what else does the SRE have to automate? Well they automate for reliability. Let’s look at an example.
Let’s say we’re using a managed cluster and its cluster upgrade process. We sometimes see cluster upgrade failures, which means the SRE gets pinged to investigate, taking them away from their automation work. The upgrade failures have been attributed to a number of reasons:
- Not enough IPs left in the cluster’s CIDR block to perform an auto-scaling group/node pool upgrade
- Unrealistic PDBs (Pod Disruption Budgets) that can never be satisfied, e.g. maxUnavailable set to 0
- Use of Kubernetes APIs that have been removed in the target version
- Pods not managed by a controller (e.g. no ReplicaSet)
Once the SRE has identified the toil (the work that takes them away from their automation), they can get to work on automating a number of pre-flight checks to run before an upgrade even starts. Only if the pre-flight checks complete successfully can an upgrade start.
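A pre-flight runner for the failure causes above could look something like the sketch below. The check functions are illustrative stand-ins (each returns a hard-coded result here); a real implementation would query the cloud provider and the Kubernetes API, e.g. via the official client library.

```python
from typing import Callable

# Illustrative stand-ins: real checks would query the cloud/cluster APIs.
def enough_free_ips() -> bool:
    return True   # e.g. count free addresses left in the cluster's CIDR block

def pdbs_allow_disruption() -> bool:
    return True   # e.g. flag any PDB whose allowed disruptions is 0

def no_removed_apis_in_use() -> bool:
    return False  # e.g. scan manifests for APIs removed in the target version

def all_pods_controller_managed() -> bool:
    return True   # e.g. flag bare pods with no owning ReplicaSet/Deployment

PREFLIGHT_CHECKS: dict[str, Callable[[], bool]] = {
    "free IPs in CIDR": enough_free_ips,
    "PDBs allow disruption": pdbs_allow_disruption,
    "no removed APIs in use": no_removed_apis_in_use,
    "pods managed by a controller": all_pods_controller_managed,
}

def run_preflight() -> list[str]:
    """Return the names of all failed checks; an empty list means safe to upgrade."""
    return [name for name, check in PREFLIGHT_CHECKS.items() if not check()]

failures = run_preflight()
if failures:
    print("Upgrade blocked:", ", ".join(failures))  # → Upgrade blocked: no removed APIs in use
else:
    print("All pre-flight checks passed, starting upgrade")
```

Wired into a pipeline, a non-empty failure list would stop the upgrade and page nobody, turning a recurring investigation into an automated gate.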
The SRE should aim to spend 50% of their time on regular Ops work and 50% automating away toil. If the regular Ops work is taking up more than 50% then more resources should be scheduled to help, until some of the toil has been automated away.
Embracing Risk/Accept Failure
So if we remember how things “used to be”: developers wanted to release new features more frequently, but Ops wanted to slow down releases to increase the reliability of their systems, which in turn caused tension.
Now, since we’ve come up with a set of reliability indicators and agreements that all the stakeholders have signed off on, we can track the reliability of new features. If new features are less reliable, we can pause the rollout of further features until the reliability targets are being met or exceeded again.
Let’s look at this with an example. Say we have a service with an uptime SLA of 99% (two 9’s), which equates to 3.65 days of downtime per year. So our service’s error budget is 3.65 days of downtime a year, or roughly 14 minutes of downtime per day. The service is deployed to a Kubernetes cluster with resource limits set but only 1 replica, and it has a start-up time of 1 minute.
Now let’s say we upgrade this service and the new version contains a slow memory leak that takes about 1 hour to hit the pod’s memory limit. When the container exceeds its memory limit, the kernel’s OOM killer terminates it (the pod shows an OOMKilled status). Our service is down, Kubernetes restarts it, and it takes 1 minute to start up and begin accepting traffic again. So our service will be down and not taking traffic for approximately 24 minutes per day.
So as long as this service is deployed, we’ll go over our error budget by around 10 minutes each day, and we’re no longer meeting our SLA.
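A quick sketch of the arithmetic, assuming (as above) exactly one OOM restart per hour and a 1 minute startup each time:

```python
SECONDS_PER_DAY = 24 * 60 * 60

def daily_error_budget_seconds(sla_percent: float) -> float:
    """Downtime allowed per day while still meeting the SLA."""
    return SECONDS_PER_DAY * (100 - sla_percent) / 100

budget_s = daily_error_budget_seconds(99.0)  # 864 s, i.e. 14.4 minutes
restarts_per_day = 24                        # the leak hits the limit hourly
downtime_s = restarts_per_day * 60           # 1 minute of startup per restart

print(f"budget:   {budget_s / 60:.1f} min/day")    # → 14.4 min/day
print(f"downtime: {downtime_s / 60:.1f} min/day")  # → 24.0 min/day
print(f"over by:  {(downtime_s - budget_s) / 60:.1f} min/day")  # → 9.6 min/day
```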
The SRE will now not agree to any more features being deployed until the reliability of this service has been brought back in line.
For argument’s sake, let’s say the SRE has a monitor on this service that alerts and creates an incident whenever there is more than 1 restart per day. The SRE will now get an incident after the second container restart and can begin to mitigate and investigate. The SRE may mitigate the downtime by scaling the service up to 2 replicas in the cluster. This gives the development team the ability to fix the memory leak and still roll out new features, but this higher availability comes at the extra compute cost of another pod running.
As you can see from this example, the higher the SLA, the higher the costs (multiple AZs, multiple regions) and also the more risk a new change brings. If we were to increase the SLA to 99.999% (five 9’s), our error budget would only be about 0.86 seconds per day, i.e. not enough time for an SRE to perform any mitigations to keep our SLA intact.
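The daily downtime budget shrinks quickly as nines are added; a short loop makes this concrete:

```python
# Daily downtime budget for increasingly strict availability targets.
for sla in (99.0, 99.9, 99.99, 99.999):
    budget_s = 86400 * (100 - sla) / 100  # 86400 seconds in a day
    print(f"{sla}% -> {budget_s:.3f} s/day of downtime budget")
```

At five 9’s the budget is under a second per day: far below the 1 minute startup time of our example service, so any restart at all blows the budget.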
Simplicity / Small Releases often
As with most practices in software engineering, small frequent changes are preferable to big drops. With small releases, an SRE may choose to roll back a change as one of their mitigations when faced with an issue.
A small change can also be canaried to a small subset of the user base, gradually increasing the audience while building confidence in its reliability.
Release Engineering / Engineering Excellence
Release Engineering concerns how developers and SREs build, deliver and release software.
In Google’s SRE handbook, they set out a number of principles for Release Engineering, including self-service for development teams to release their own features, high velocity in deciding when to roll out new changes, consistent and repeatable builds, and the policies and procedures for who can release software.
We’ll cover Release Engineering and Engineering Excellence in another article, but for now, let’s just give a high level example of a Release life cycle of a Service Feature.
Let’s say a developer is working on adding a feature to an existing service, following a typical build/test/release flow. Once all the various tests pass, we can decide, based on policy, whether this RC (release candidate) should go to production. This policy could contain any number of items: a Change Management ticket is in place for the release, various quality and security gates have passed, the current SLO for the service is still being met, the various Mean Time To X metrics are under their thresholds, and so on.
Once the policies/gates all pass, we can perform a canary deployment to production for a small subset of users and run some smoke tests. We can then check whether our SLOs are still being met for this particular version of the release and decide whether a rollback is required or whether to continue the rest of the canary deployment.
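The policy gates above could be encoded roughly as follows. This is a hypothetical sketch: the field names and gates are illustrative, not taken from any specific pipeline tool.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical release-gate policy for a release candidate (RC).
@dataclass
class ReleaseCandidate:
    change_ticket_in_place: bool
    quality_gates_passed: bool
    security_gates_passed: bool
    current_slo_met: bool
    canary_slo_met: Optional[bool] = None  # unknown until the canary has run

def may_start_canary(rc: ReleaseCandidate) -> bool:
    """All pre-release gates must pass before any traffic sees the RC."""
    return all([rc.change_ticket_in_place, rc.quality_gates_passed,
                rc.security_gates_passed, rc.current_slo_met])

def may_continue_rollout(rc: ReleaseCandidate) -> bool:
    """Continue past the canary only if its own SLO measurements are healthy."""
    return may_start_canary(rc) and rc.canary_slo_met is True

rc = ReleaseCandidate(True, True, True, True)
print(may_start_canary(rc))      # → True: canary deployment can begin
rc.canary_slo_met = False        # smoke tests show the SLO is breached
print(may_continue_rollout(rc))  # → False: roll back
```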
As you can see in the above example, it’s possible for an RC to make it all the way to production without human intervention, by having the various automation processes and pipelines in place and using policy to decide whether the change should go to production.
In the next part, we’ll continue with a real world example of SRE in practice, best practices for reliability and troubleshooting and mitigation.