SRE Practices for Kubernetes Platforms — Part 3

Adrian Hynes
Jan 20, 2021

Introduction

In the previous section we looked at a real-world example of SRE, best practices for reliability, and troubleshooting and mitigation. In this section we’ll look at what an SRE might monitor in a Kubernetes platform, tools an SRE may use, and how SREs can track their own reliability performance over time with certain metrics.

What should you measure and monitor as a Kubernetes Platform SRE?

SLIs
Remember that your SLIs should relate to your customers’ experience, so consider the following metrics in your cluster as potential SLI candidates:

  • API Server — Response latency distribution in seconds for each verb, dry run value, group, version, resource, subresource, scope and component — apiserver_request_duration_seconds
  • Node — Excessive packet drops
  • Node — Excessive network errors
  • Addon — Nginx Ingress Controller/App Gateway/ALB etc — Throughput, Latency, Error Rate

External-facing applications deployed onto the cluster will also need their own SLIs, which may be overseen by a different SRE. Applications written in Spring Boot can use the Spring Boot Actuator to expose metrics such as latency and error rate, which are good SLI candidates to monitor.
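To make the latency SLI idea concrete, here is a minimal Python sketch of turning histogram buckets (the shape that a metric like apiserver_request_duration_seconds is exposed in) into a "fraction of requests faster than a threshold" figure. The function name latency_sli, the bucket boundaries and the counts are all made up for illustration, and the sketch assumes the threshold coincides with one of the bucket boundaries.

```python
from typing import Dict


def latency_sli(buckets: Dict[float, int], threshold_s: float) -> float:
    """Return the fraction of requests that completed within threshold_s.

    `buckets` maps each histogram upper bound (the "le" label) to its
    cumulative request count and must contain float("inf") as the total.
    Assumption: threshold_s is exactly one of the bucket bounds.
    """
    total = buckets[float("inf")]
    if total == 0:
        # No traffic in the window; treating this as "good" is an assumption.
        return 1.0
    return buckets[threshold_s] / total


# Hypothetical cumulative counts for one verb/resource over some window.
sample_buckets = {0.1: 900, 0.5: 980, 1.0: 995, float("inf"): 1000}

sli = latency_sli(sample_buckets, 0.5)
print(f"{sli:.1%} of requests finished within 0.5s")
```

A figure like this can then be compared against an SLO target (for example, 99% of requests within 0.5s) to decide whether to alert.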

Other Metrics

At a high level, let’s look at some of the other metrics we might want to ingest into our monitoring solution. These can be used for monitors or may just help with debugging.

  • API Server CPU, Memory, Request Count
  • etcd object count
  • Host CPU, Memory, Network Errors, Packets Dropped etc
  • Kube State Metrics — K8s Objects statuses
  • Ingress Metrics — Azure Standard Load Balancer, Nginx Ingress Controller, App Gateway etc
  • IP Addresses available in cluster
  • Tracing between Applications deployed on the cluster

Try using kubectl get --raw /metrics to list the raw Prometheus-format metrics exposed by the cluster’s API server.
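The output of that command is plain text in the Prometheus exposition format, so it is easy to explore with a few lines of code. The following is a rough Python sketch of a parser for the common "name{labels} value" lines; the function name parse_metrics is my own, the sample text is a trimmed, made-up excerpt (real metrics carry more labels), and the parser deliberately ignores comments and optional timestamps.

```python
import re
from typing import Dict, Iterator, Tuple

# metric_name, optional {label="value",...} block, then the sample value
SAMPLE_LINE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
LABEL = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_metrics(text: str) -> Iterator[Tuple[str, Dict[str, str], float]]:
    """Yield (name, labels, value) for each sample line in the text."""
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = SAMPLE_LINE.match(line)
        if not match:
            continue  # skip anything this simple sketch does not understand
        name, raw_labels, value = match.groups()
        labels = dict(LABEL.findall(raw_labels or ""))
        yield name, labels, float(value)


# Made-up excerpt in the exposition format, for illustration only.
SAMPLE = '''
# HELP apiserver_request_total Counter of apiserver requests.
# TYPE apiserver_request_total counter
apiserver_request_total{code="200",verb="GET"} 1500
apiserver_request_total{code="500",verb="GET"} 12
process_cpu_seconds_total 42.5
'''

for name, labels, value in parse_metrics(SAMPLE):
    print(name, labels, value)
```

In practice your monitoring agent does this scraping for you; a snippet like this is mainly useful for ad hoc debugging.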

Monitors

Based on the above metrics, here are some sample monitors we can put in place to monitor our cluster and the applications running on it.

  • Missing logs/data over a time frame. No logs or metrics are coming across to our monitoring solution for a set amount of time. This could indicate a networking issue, a host issue, an issue with the addon that collects the logs/metrics etc
  • Various Management pods (nginx ingress controller, kube state metrics, kube proxy, metrics server etc) with a status other than Ready (e.g. Pending, CrashLoopBackOff etc)
  • Various Management Jobs failed greater than a threshold
  • Various Management pods’ containers are undergoing frequent restarts. This could indicate a memory leak, or an application that cannot deal with the failure of a dependent component (e.g. a datastore)
  • Various Management pods displaying errors (non fatal) in their logs. Let’s take ExternalDNS as an example. If ExternalDNS cannot perform CRUD operations on its DNS provider, then errors will just be displayed in the logs, so in order to pick up this issue we may need to create a monitor on some logging regex
  • Load Balancer Health Probe failures to backend nodes greater than a certain threshold over a certain time frame
  • Nodes in a NotReady condition for more than a certain threshold
  • Node Memory is over a threshold for a certain time frame
  • API Server memory exceeding threshold over a certain time frame
  • Not enough IP Addresses available to scale to another node or upgrade the cluster (via either a node by node approach or an autoscaling group/nodepool approach)
  • Autoscaling Group/Nodepool has a max nodes equal to current, so no more room for node scaling
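Several of the monitors above share the same shape: a value must stay over a threshold for a certain time frame before anyone is paged. Monitoring tools provide this out of the box, but as a rough illustration of the logic, here is a small Python sketch. The function name breached_for and the sample data are hypothetical, and the sketch assumes samples arrive ordered by timestamp (in seconds).

```python
from typing import Iterable, Optional, Tuple


def breached_for(
    samples: Iterable[Tuple[float, float]],
    threshold: float,
    min_duration_s: float,
) -> bool:
    """Return True if the value has been above `threshold` continuously
    for at least `min_duration_s`, up to and including the latest sample.

    `samples` is an iterable of (timestamp_seconds, value), oldest first.
    """
    breach_start: Optional[float] = None
    last_ts: Optional[float] = None
    for ts, value in samples:
        if value > threshold:
            if breach_start is None:
                breach_start = ts  # the breach begins here
        else:
            breach_start = None  # any dip below the threshold resets the clock
        last_ts = ts
    if breach_start is None or last_ts is None:
        return False
    return last_ts - breach_start >= min_duration_s


# Hypothetical node memory utilisation, sampled once a minute.
node_memory = [(0, 0.70), (60, 0.92), (120, 0.95), (180, 0.93), (240, 0.94), (300, 0.96)]

# Alert if memory stays above 90% for 3 minutes or more.
print(breached_for(node_memory, threshold=0.90, min_duration_s=180))
```

Requiring a sustained breach rather than a single bad sample is a common way to keep noisy, self-healing spikes from paging the on-call SRE.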

From an application point of view, you may want to monitor

  • Latency and Error Rate of your Service (if serving content)
  • Deployment Replicas Available
  • State of Job, Deployment, Pod etc
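As an example of the "Deployment Replicas Available" check, the sketch below compares the desired and available replica counts of a Deployment object, such as the JSON you would get from kubectl get deployment <name> -o json. The function name deployment_is_healthy, the min_available_ratio parameter and the sample objects are my own illustrative assumptions; only the spec.replicas and status.availableReplicas fields come from the Kubernetes API.

```python
from typing import Any, Dict


def deployment_is_healthy(
    deployment: Dict[str, Any], min_available_ratio: float = 1.0
) -> bool:
    """Check whether enough replicas of a Deployment are available.

    `deployment` is the parsed JSON of a Deployment object.
    spec.replicas defaults to 1 when unset, and status.availableReplicas
    is omitted from the output when it is zero.
    """
    desired = deployment.get("spec", {}).get("replicas", 1)
    available = deployment.get("status", {}).get("availableReplicas", 0)
    if desired == 0:
        return True  # scaled to zero on purpose; nothing to be unavailable
    return available / desired >= min_available_ratio


# Hypothetical, heavily trimmed Deployment objects.
healthy = {"spec": {"replicas": 3}, "status": {"availableReplicas": 3}}
degraded = {"spec": {"replicas": 3}, "status": {"availableReplicas": 1}}

print(deployment_is_healthy(healthy))
print(deployment_is_healthy(degraded))
```

The ratio parameter is there because some teams only want to be paged when availability drops below, say, half of the desired replicas rather than on every single pod restart.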

You may also want to monitor cluster operations such as

  • Failed Cluster Upgrades
  • Failed Node Hydration
  • Failed Addon Upgrades
  • Failed Node Scaling

This is not a full list of everything you need or should monitor in your Kubernetes cluster; it is just meant to give you some ideas.

Tools an SRE may use

Depending on your role and company, there’s a wide range of tools an SRE may use. Here are some examples.

Monitoring

Your monitoring solution should be able to collect logs and metrics from both the Kubernetes cluster and the cloud provider and aggregate them. It should be at least near real time, provide alerting and email digests, and integrate with incident management systems. Common examples include Prometheus, Grafana, ELK, Datadog and New Relic.

Incident Management and Problem Management

There are many incident management tools out there, but they should be able to integrate with monitoring and support service desk activities, on-call rotations and paging, team communications, customer communications, incident command center activities, postmortem activities and analysis, and issue tracking. Most tools you may use for incident management can also be used for problem management. Your problem management tool should let you track related incidents, the status of known problems, the RCA of problems, and whether resolutions have been performed and the problem closed. It can also be used when debugging issues where the problem is still open or has not yet been fixed. Common examples include ServiceNow, Jira Service Desk, PagerDuty and VictorOps.

Troubleshooting

  • Kubernetes tools like kubectl, helm, velero, docker
  • DNS tools like dig and nslookup
  • Networking tools like tcptraceroute, psping and others from the cloud provider (Network Watcher etc)
  • Cloud Provider CLI
  • Cluster and Management Addons Metrics and Logs
  • Cloud Provider Security Groups and Firewall Logs
  • Application Frameworks like Spring Boot and their metrics outputs etc

Automation

  • Scripting and programming: Python, Shell Scripting, Go
  • CI/CD tools: Jenkins, Concourse, CircleCI, GitHub Actions, Azure DevOps
  • Infra As Code: Terraform, CloudFormation, ARM
  • Configuration management: Ansible, Chef, Puppet
  • Testing: JMeter, Datadog Synthetics etc

SRE Metrics

An SRE, a team of SREs and indeed an SRE’s management should be interested in the following metrics to track the performance of SRE in an organisation. These can be measured per sprint, month, quarter or year. They allow the organisation to see how its SREs are performing, whether they are just doing firefighting toil, whether the org needs more SREs, how the reliability of the org as a whole is improving, and so on.

  • Time spent on Toil versus Automation: how much of an SRE’s time goes to each
  • Mean Time To Failure: the mean time between platform or application failures, measured from the start of one incident to the start of the next
  • Mean Time To Repair: the mean time between when repair/mitigation begins and when the platform or application is back up and running
  • Mean Time To Recovery: the mean time between when an incident occurs and when the platform or application is back up and running
  • Mean Time To Resolve: the mean time between an incident occurring and a permanent fix being implemented
  • Mean Time To Respond: the mean time it takes for an SRE to engage with an incident
  • Mean Time Between Failures: the mean time between recovery from one incident and the start of the next
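Most of these metrics fall out of a handful of timestamps per incident. As a rough sketch, assuming you can export when each incident started, when an SRE responded and when service recovered, a few of them could be calculated in Python as follows. The Incident class, its field names and the sample dates are all hypothetical; the functions assume at least one incident (two for the mean time between failures).

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List


@dataclass
class Incident:
    started: datetime    # when the failure began
    responded: datetime  # when an SRE engaged with the incident
    recovered: datetime  # when the platform or application was back up


def mean(deltas: List[timedelta]) -> timedelta:
    """Arithmetic mean of a non-empty list of durations."""
    return sum(deltas, timedelta()) / len(deltas)


def mean_time_to_respond(incidents: List[Incident]) -> timedelta:
    return mean([i.responded - i.started for i in incidents])


def mean_time_to_recovery(incidents: List[Incident]) -> timedelta:
    return mean([i.recovered - i.started for i in incidents])


def mean_time_between_failures(incidents: List[Incident]) -> timedelta:
    """Mean gap from one incident's recovery to the next incident's start."""
    ordered = sorted(incidents, key=lambda i: i.started)
    gaps = [nxt.started - prev.recovered for prev, nxt in zip(ordered, ordered[1:])]
    return mean(gaps)


# Made-up incident history for illustration.
incidents = [
    Incident(datetime(2021, 1, 1, 10, 0), datetime(2021, 1, 1, 10, 10), datetime(2021, 1, 1, 11, 0)),
    Incident(datetime(2021, 1, 3, 11, 0), datetime(2021, 1, 3, 11, 20), datetime(2021, 1, 3, 12, 0)),
    Incident(datetime(2021, 1, 7, 12, 0), datetime(2021, 1, 7, 12, 30), datetime(2021, 1, 7, 14, 0)),
]

print("Mean time to respond:", mean_time_to_respond(incidents))
print("Mean time to recovery:", mean_time_to_recovery(incidents))
print("Mean time between failures:", mean_time_between_failures(incidents))
```

Tracking these per sprint or quarter, rather than as one lifetime number, makes it easier to see whether reliability work is actually moving the trend.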

Conclusion

I hope you found this high-level, condensed set of articles useful on your SRE journey with Kubernetes. I would highly recommend reading Google’s SRE handbook as well as the reference material below for deeper discussions on the topics I’ve presented above.

References

Reliability Engineering Concepts — https://linuxacademy.com/cp/modules/view/id/731

Adrian Hynes

Cloud Platform Architect. Opinions and articles on Medium are my own.