In the previous section, we set the scene: what SRE is, what an SRE does, and the principles of SRE. In this section, we'll look at a real-world example of SRE in action, best practices for reliability, and troubleshooting and mitigation.
SRE Practices Example in the Real World
OK, enough with the theory: show me an example of what an SRE does in the real world!
Let's look at this real-world incident from a postmortem point of view.
A relatively large managed Kubernetes cluster running in a DEV environment, with several managed community and custom operators/controllers running.

Symptoms
- Customers' application logs and telemetry were delayed reaching our logging & telemetry system
- Customers were experiencing intermittent timeouts when performing management tasks on their clusters
- The monitoring system was firing several alerts, such as:
a. delayed logs and metrics coming across
b. latency and errors when talking to api server
c. timeout errors with some of our cluster operators that talk to the api server
- Several Namespaces with customer applications were deleted
- Control plane components' memory usage trending upwards (visible via the Managed Platform UI only)
- Control plane components restarting frequently (visible via the Managed Platform UI only)
Impact
- Non-production environment
- Internal customers could not reliably deploy their applications and promote them to upper environments (they release small and often, so this slowed production releases)
- Customers' applications were up and running, but monitoring of those apps was "blind" as logs & metrics were delayed
- While the custom operator recreated the deleted namespaces, the deployments inside those namespaces were lost
Actions to mitigate
- SRE worked with the cloud provider of the managed cluster to increase memory & CPU on the managed control plane components
- The SRE team scripted a cronjob to remove orphaned LimitRanges and ResourceQuotas
- SRE restored the deleted namespaces from the latest Velero backup
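The core of that cleanup cronjob boils down to spotting objects whose owner no longer exists. The sketch below is a hypothetical reconstruction of that logic, using plain dicts to stand in for LimitRange/ResourceQuota objects rather than a real Kubernetes client: an object is considered orphaned when none of its owner references point at a still-live custom-namespace resource.

```python
# Hypothetical sketch of the orphan-cleanup logic behind the cronjob.
# Plain dicts stand in for LimitRange/ResourceQuota objects; a real job
# would list them via kubectl or a Kubernetes client library.

def find_orphans(objects, live_owner_uids):
    """Return objects whose ownerReferences no longer point at a live owner."""
    orphans = []
    for obj in objects:
        owners = obj.get("metadata", {}).get("ownerReferences", [])
        if owners and not any(ref["uid"] in live_owner_uids for ref in owners):
            orphans.append(obj)
    return orphans

# Example: two quotas owned by custom-namespace resources, one owner since deleted.
quotas = [
    {"metadata": {"name": "quota-v1", "ownerReferences": [{"uid": "aaa"}]}},
    {"metadata": {"name": "quota-v2", "ownerReferences": [{"uid": "bbb"}]}},
]
live = {"bbb"}  # UIDs of custom-namespace resources that still exist
print([o["metadata"]["name"] for o in find_orphans(quotas, live)])  # ['quota-v1']
```

A real cronjob would then delete each orphan; keeping the detection step as a pure function makes it easy to dry-run before deleting anything.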
The root cause of this incident was eventually traced to an operator upgrade. The operator in question was responsible for creating namespaces, setting resource limits, quotas and defaults, and configuring RBAC. Each time the operator reconciled a namespace, it created a new LimitRange and ResourceQuota. The old LimitRange and ResourceQuota were not deleted; instead they became orphaned from the custom namespace CRD but were still associated with the underlying Kubernetes namespace.

Each of these LimitRange and ResourceQuota objects was being loaded into memory and applied to any new upgrades or deployments in a given namespace. Over time, the admission controller control plane component was hitting its own memory and CPU limits, causing restarts of all the control plane components, e.g. the API server.
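The failure mode just described can be illustrated with a toy simulation (the numbers are illustrative, not from the incident): a correct reconcile replaces the old LimitRange/ResourceQuota pair, while the buggy one only appends, so the object count, and with it the admission controller's memory footprint, grows without bound as reconciles accumulate.

```python
# Toy simulation of the leak: a correct reconcile replaces the old
# LimitRange/ResourceQuota pair; the buggy one only appends.

def buggy_reconcile(ns_objects):
    ns_objects.append("limitrange")
    ns_objects.append("resourcequota")   # old pair never deleted -> leak

def fixed_reconcile(ns_objects):
    ns_objects.clear()                   # remove the previous pair first
    ns_objects.extend(["limitrange", "resourcequota"])

buggy, fixed = [], []
for _ in range(1000):                    # e.g. reconciling every few minutes for weeks
    buggy_reconcile(buggy)
    fixed_reconcile(fixed)

print(len(buggy), len(fixed))  # 2000 2 -- the buggy count grows linearly per reconcile
```

This is why the incident only surfaced weeks after the upgrade: each individual reconcile looked harmless, and only the cumulative total pushed the control plane past its limits.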
Customers and applications contacting the API server were now experiencing high latency or timeouts (a latency of 60s before an error was returned).
Follow-up actions to prevent a recurrence
- Since the operator upgrade in question was several weeks old, and the incident was the result of deterioration over time, a roll forward to a fixed version was preferred over a rollback. The SRE worked with the dev team to implement the fix.
- The SRE and development teams worked on reproducing the error in a platform engineering environment by deploying similar namespaces and workloads, decreasing the reconciliation interval of the custom namespace CRD, and automating the deployment and deletion of apps until the issue could be recreated.
- The fixed version was then deployed to the platform engineering environment, with similar automation in place and monitoring over a longer period.
- The namespace deletion remained a bit of a mystery, as nothing in the custom namespace controller logs or API server logs could determine what deleted the namespaces. A lock was put in place on the namespaces via annotations and an admission controller to prevent any more deletions until a root cause could be identified.
- Moving all non-prod clusters to a paid SLA with the cloud provider, which included flexible and higher memory, CPU and monitoring.
- Working with the cloud provider to expose via an API some of the diagnostics that could only be seen in the UI.
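The deletion lock from the follow-up actions can be sketched as minimal validating-webhook deny logic. This is a hand-rolled illustration, not the team's actual controller, and the annotation key `example.com/deletion-locked` is invented for the example:

```python
# Sketch of validating-webhook logic that blocks namespace deletion when a
# lock annotation is present. The annotation key is made up for illustration.
LOCK_ANNOTATION = "example.com/deletion-locked"

def review_delete(admission_review):
    """Given an AdmissionReview-like dict for a DELETE, return the response."""
    request = admission_review["request"]
    annotations = request["oldObject"]["metadata"].get("annotations", {})
    if request["operation"] == "DELETE" and annotations.get(LOCK_ANNOTATION) == "true":
        return {"allowed": False,
                "status": {"message": "namespace is deletion-locked"}}
    return {"allowed": True}

locked_ns = {"request": {"operation": "DELETE",
                         "oldObject": {"metadata": {"annotations": {LOCK_ANNOTATION: "true"}}}}}
print(review_delete(locked_ns)["allowed"])  # False
```

The appeal of this approach is that it is reversible and cheap: once the true root cause of the deletions is found, removing the annotation (or the webhook) restores normal behaviour.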
As you can see, the SRE was involved from the start, first via internal monitoring and then via angry customers. The problem could not be attributed directly to a recent deployment, so the first mitigation to prevent further damage was to work with the managed cluster provider to increase memory and CPU of the control plane components, since these were restarting. The next mitigation was to restore the deployments in the namespaces that were deleted. After stabilizing the platform and restoring the customer deployments, investigation led to a further mitigation: a scripted cronjob to remove the orphaned LimitRanges and ResourceQuotas that had been building up over time.
Best Practices for Reliability
Testing for Reliability — Gamedays
How can you know whether your applications and platforms are reliable if you haven't tested the reliability automations you have put in place? Gamedays let you test not only the reliability of your systems and services but also the processes and procedures you have in place to manage incidents: the on-call rotation, firecall access to systems for diagnosis, troubleshooting techniques, remediation options, incident management, problem management, change management, root cause analysis, blameless postmortems, etc.
We can use Chaos Engineering in our Gamedays, as it gives us the ability to inject failures into our clusters and infrastructure. Using tools such as Chaos Monkey or Gremlin, you can artificially increase CPU and memory usage, take down nodes, kill containers, blackhole a region, cause a DNS outage, inject latency into a node or service, and much more. Using these Chaos Engineering techniques, you can test how different parts of your platform react to different failures.
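To give a flavour of what one of these experiments looks like in code, here is a small latency-injection sketch. It is hand-rolled for illustration (not Gremlin's or Chaos Monkey's actual API): a decorator that randomly delays a fraction of calls, letting you observe how your timeouts and retries behave under injected latency.

```python
import functools
import random
import time

def inject_latency(probability=0.3, delay_s=0.5, seed=None):
    """Decorator that adds delay_s of latency to a fraction of calls."""
    rng = random.Random(seed)
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            if rng.random() < probability:
                time.sleep(delay_s)          # the injected fault
            return fn(*args, **kwargs)
        return wrapper
    return decorator

@inject_latency(probability=1.0, delay_s=0.05)  # always inject; small delay for the demo
def call_api_server():
    return "ok"

start = time.monotonic()
result = call_api_server()
elapsed = time.monotonic() - start
print(result, elapsed >= 0.05)  # ok True
```

Real chaos tools do this at the network or node level rather than in-process, but the principle is the same: the fault is controlled, bounded, and easy to switch off.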
You may also simulate failures of other dependencies in your platform, e.g. Git if you are using GitOps, the artifact repository for your artifacts, datastores, secret management systems, on-prem connectivity, etc.
Troubleshooting and Mitigation
When troubleshooting issues, an SRE will need to be familiar with the platform's infrastructure: in the case of Kubernetes, the addons that are installed and the cluster components' dependencies both within and outside the cluster.
The SRE will have to ask themselves: what is the system/service doing, what should it be doing, and what is the severity of the issue? Do we need to stop the bleeding immediately, making the system/service more reliable than it currently is and/or stopping any potential cascading failures, or do we have the opportunity to look for a root cause?
Metrics, and the ability to toggle log verbosity in production, will also aid an SRE's troubleshooting. Something like Spring Boot with Micrometer lets you expose the latency and error rates of recent requests as metrics, allowing you to compare them against historical data. Monitoring tools also let you instrument your services so that traces can be collected and viewed across services.
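The "recent vs. historical" comparison boils down to keeping a rolling window of request outcomes. A minimal hand-rolled sketch of the idea (not Micrometer itself, which is a Java library) might look like this:

```python
from collections import deque

class RollingWindow:
    """Track latency and error rate over the last N requests."""
    def __init__(self, size=100):
        self.samples = deque(maxlen=size)  # (latency_ms, is_error) pairs

    def record(self, latency_ms, is_error=False):
        self.samples.append((latency_ms, is_error))

    def error_rate(self):
        if not self.samples:
            return 0.0
        return sum(1 for _, err in self.samples if err) / len(self.samples)

    def p95_latency(self):
        latencies = sorted(ms for ms, _ in self.samples)
        if not latencies:
            return 0.0
        idx = min(len(latencies) - 1, int(0.95 * len(latencies)))
        return latencies[idx]

window = RollingWindow(size=4)
for latency, err in [(20, False), (25, False), (3000, True), (30, False)]:
    window.record(latency, err)
print(window.error_rate(), window.p95_latency())  # 0.25 3000
```

In practice you would expose these values on a metrics endpoint and let your monitoring system compare them against the historical baseline, rather than eyeballing them.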
It's recommended that the SRE go in with a clean slate rather than pattern-matching on past problems, which can lead you down a rabbit hole. That said, if you haven't automated away the toil of past problems, then a readily available list of those past problems can be your friend.
As with the Gameday practices, SREs should practice troubleshooting to become more familiar with the different techniques.
As mentioned above, if the severity is high, we first want to stop the bleeding in order to maintain our SLA, or simply to bring the system back up. A rollback is the obvious choice if the incident coincides with a new feature, but other techniques include manually deleting a pod, scaling a deployment, increasing compute resources, etc.
In the next part, we'll look at what an SRE might monitor in a Kubernetes platform, the tools an SRE may use, and how SREs can track their own reliability performance over time with certain metrics.