SRE Practices for Kubernetes Platforms — Part 3

Jan 20, 2021

Introduction

In the previous section we looked at a real-world example of SRE, best practices for reliability, and troubleshooting and mitigation. In this section we’ll look at what an SRE might monitor in a Kubernetes platform, the tools an SRE may use, and how SREs can track their own reliability performance over time with certain metrics.

What should you measure and monitor as a Kubernetes Platform SRE?

SLIs
Remember that your SLIs should relate to your customers’ experience, so consider the following metrics in your cluster as potential candidates for SLIs:

API Server — Response latency distribution in seconds for each verb, dry run value, group, version, resource, subresource, scope and component — apiserver_request_duration_seconds
Node — Excessive packet drops
Node — Excessive network errors
Addon — Nginx Ingress Controller/App Gateway/ALB etc — Throughput, Latency, Error Rate
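As a rough illustration of turning a latency metric into an SLI, the sketch below computes the fraction of requests served within a threshold from cumulative histogram buckets, which is how Prometheus histograms such as apiserver_request_duration_seconds are exposed. The function name, the bucket bounds and the counts are all hypothetical; treat this as a sketch under those assumptions rather than a reference implementation.

```python
from typing import Mapping


def latency_sli(cumulative_buckets: Mapping[float, float], threshold_s: float) -> float:
    """Fraction of requests that completed within threshold_s seconds.

    cumulative_buckets maps each bucket upper bound ("le") to the cumulative
    number of requests observed at or below that bound. float("inf") stands
    for the "+Inf" bucket, which holds the total request count.
    """
    total = cumulative_buckets[float("inf")]
    if total == 0:
        # No traffic in the window: treating the SLI as met is a policy choice.
        return 1.0
    if threshold_s not in cumulative_buckets:
        raise ValueError("threshold must match one of the bucket bounds")
    return cumulative_buckets[threshold_s] / total


# Hypothetical bucket counts for one verb/resource over some time window.
sample = {0.1: 900.0, 0.5: 980.0, 1.0: 995.0, float("inf"): 1000.0}
print(latency_sli(sample, 0.5))  # 0.98
```

In practice the counts would be the increase of each bucket over the SLI window, not the raw counters since process start.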

External-facing applications deployed onto the cluster will also need their own SLIs, which may be overseen by a different SRE. Applications written in Spring Boot can potentially use the Spring Boot Actuator to expose metrics such as latency and error rate, which can be ideal candidates for SLIs to monitor.

Other Metrics

At a high level, let’s look at some of the other metrics we might want to ingest into our monitoring solution. These can be used for monitors or may just help with debugging.

API Server CPU, Memory, Request Count
etcd object count
Host CPU, Memory, Network Errors, Packets Dropped etc
Kube State Metrics — K8s Objects statuses
Ingress Metrics — Azure Standard Load Balancer, Nginx Ingress Controller, App Gateway etc
IP Addresses available in cluster
Tracing between Applications deployed on the cluster

Try using kubectl get --raw /metrics for a list of the raw metrics exposed by the API server
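The raw output is in the Prometheus text exposition format, so it can be explored with a few lines of code. The sketch below is a simplified parser, assuming one sample per line and ignoring details such as timestamps and exotic label escaping; parse_metrics and the sample values are hypothetical, while the metric names shown are ones the API server and its Go runtime commonly expose.

```python
import re
from typing import Dict, Iterator, Tuple

# Metric name, optional {labels}, then the value. Trailing timestamps are ignored.
_LINE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
_LABEL = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_metrics(text: str) -> Iterator[Tuple[str, Dict[str, str], float]]:
    """Yield (name, labels, value) for each sample line, skipping comments."""
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and HELP/TYPE comments
        match = _LINE.match(line)
        if match is None:
            continue  # not a sample line this simplified parser understands
        name, raw_labels, value = match.groups()
        yield name, dict(_LABEL.findall(raw_labels or "")), float(value)


# Hypothetical excerpt of `kubectl get --raw /metrics` output.
EXAMPLE = """\
# HELP apiserver_request_total Counter of apiserver requests.
# TYPE apiserver_request_total counter
apiserver_request_total{code="200",resource="pods",verb="GET"} 42
apiserver_request_total{code="500",resource="pods",verb="GET"} 3
process_resident_memory_bytes 1.5e+08
"""

server_errors = sum(
    value
    for name, labels, value in parse_metrics(EXAMPLE)
    if name == "apiserver_request_total" and labels.get("code", "").startswith("5")
)
print(server_errors)  # 3.0
```

To run it against a real cluster, the text could come from subprocess.run(["kubectl", "get", "--raw", "/metrics"], capture_output=True, text=True, check=True).stdout instead of the EXAMPLE string.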

Monitors

Based on the above metrics, here are some sample monitors we can put in place to monitor our cluster and the applications running on it:

Missing logs/data over a time frame. No logs or metrics are coming across to our monitoring solution for a set amount of time. This could indicate a networking issue, a host issue, an issue with the addon that collects the logs/metrics etc
Various Management pods (nginx ingress controller, kube state metrics, kube proxy, metrics server etc) in a status other than Ready (e.g. Pending, CrashLoopBackOff etc)
Failed Management Jobs exceeding a threshold
Various Management pods’ containers undergoing frequent restarts. This could indicate a memory leak, or an application that cannot deal with the failure of a dependent component (e.g. a datastore)
Various Management pods displaying non-fatal errors in their logs. Take ExternalDNS as an example: if ExternalDNS cannot perform CRUD operations on its DNS, the errors will only be displayed in the logs, so to pick up this issue we may need to create a monitor on a logging regex
Load Balancer health probe failures to backend nodes greater than a certain threshold over a certain time
Nodes in a NotReady condition for longer than a certain threshold
Node memory over a threshold for a certain time frame
API Server memory exceeding a threshold over a certain time frame
Not enough IP Addresses available to scale to another node or upgrade the cluster (via either a node-by-node approach or an autoscaling group/nodepool approach)
Autoscaling Group/Nodepool max nodes equal to the current node count, so there is no more room for node scaling
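Several of the monitors above share the same shape: a condition must hold continuously for some time before anyone is paged. A minimal sketch of that evaluation follows, assuming the monitoring system hands us timestamped boolean samples; the Sample type and should_alert are hypothetical names, not part of any particular tool.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Sample:
    timestamp: float  # seconds on any monotonic or epoch clock
    breaching: bool   # e.g. node is NotReady, or memory is over its threshold


def should_alert(samples: Iterable[Sample], for_seconds: float) -> bool:
    """True if the condition has been breaching continuously for at least
    for_seconds, measured up to the most recent sample."""
    breach_start: Optional[float] = None
    latest: Optional[float] = None
    for sample in sorted(samples, key=lambda s: s.timestamp):
        latest = sample.timestamp
        if sample.breaching:
            if breach_start is None:
                breach_start = sample.timestamp  # a new breach streak begins
        else:
            breach_start = None  # any healthy sample resets the streak
    if breach_start is None or latest is None:
        return False
    return latest - breach_start >= for_seconds


# Hypothetical node readiness samples taken once a minute.
history = [Sample(0, False)] + [Sample(t, True) for t in range(60, 361, 60)]
print(should_alert(history, for_seconds=300))
```

Requiring a continuous streak is one design choice among several; it avoids paging on a single flapping sample, at the cost of reacting later to a genuine outage.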

From an application point of view, you may want to monitor:

Latency and Error Rate of your Service (if serving content)
Deployment Replicas Available
State of Job, Deployment, Pod etc
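As one example of the application-level checks above, the sketch below compares desired and available replicas using the JSON that kubectl get deployments --all-namespaces -o json returns. The under_replicated helper and the deployment names are hypothetical; the fields read (spec.replicas, status.availableReplicas, metadata.name, metadata.namespace) are part of the Deployment API, and availableReplicas is omitted from the output when it is zero.

```python
import json
from typing import List, Tuple


def under_replicated(deployments_json: str) -> List[Tuple[str, str, int, int]]:
    """Return (namespace, name, desired, available) for every Deployment
    that has fewer available replicas than it asks for."""
    found = []
    for item in json.loads(deployments_json).get("items", []):
        desired = item.get("spec", {}).get("replicas", 1)
        # The API omits availableReplicas when no replica is available.
        available = item.get("status", {}).get("availableReplicas", 0)
        if available < desired:
            meta = item["metadata"]
            found.append((meta["namespace"], meta["name"], desired, available))
    return found


# Hypothetical, heavily trimmed `kubectl get deployments -o json` output.
EXAMPLE_JSON = json.dumps({
    "items": [
        {"metadata": {"namespace": "shop", "name": "cart"},
         "spec": {"replicas": 3}, "status": {"availableReplicas": 3}},
        {"metadata": {"namespace": "shop", "name": "checkout"},
         "spec": {"replicas": 2}, "status": {}},
    ]
})

print(under_replicated(EXAMPLE_JSON))
```

A monitoring tool fed by kube-state-metrics would usually express the same comparison as a query; the script form is mainly useful for ad hoc debugging.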

You may also want to monitor cluster operations such as:

Failed Cluster Upgrades
Failed Node Hydration
Failed Addon Upgrades
Failed Node Scaling

This is not a full list of everything you need or should monitor in your Kubernetes cluster; it is just meant to give you some ideas.

Tools an SRE may use

Depending on your role and company, there’s a wide range of tools an SRE may use. Here are some examples.

Monitoring

Your monitoring solution should be able to collect logs and metrics from both the Kubernetes cluster and the Cloud Provider and aggregate them. It should be at least near real time, provide alerting and email digests, and be able to integrate with Incident Management Systems. Some common examples include Prometheus, Grafana, ELK, Datadog and New Relic.

Incident Management and Problem Management

There are many incident management tools out there, but they should be able to integrate with monitoring and support Service Desk activities, On Call rotations and paging, Team Communications, Customer Communications, Incident command center activities, postmortem activities and analysis, and issue tracking. Most tools you may use for Incident Management can also be used for Problem Management. Your problem management tool will allow you to track related incidents, the status of known problems, the RCA of problems, and whether resolutions have been performed and the problem closed. It can also be used when debugging issues where the problem is still open or has not yet been fixed. Some common examples include ServiceNow, Jira Service Desk, PagerDuty and VictorOps.

Troubleshooting

Kubernetes tools like kubectl, helm, velero, docker
DNS tools like dig and nslookup
Networking tools like tcptraceroute, psping and others from the cloud provider (Network Watcher etc)
Cloud Provider CLI
Cluster and Management Addons Metrics and Logs
Cloud Provider Security Groups and Firewall Logs
Application Frameworks like Spring Boot and their metrics outputs etc

Automation

Scripting and programming: Python, Shell Scripting, Go
CI/CD tools: Jenkins, Concourse, CircleCI, GitHub Actions, Azure DevOps
Infra As Code: Terraform, CloudFormation, ARM
Configuration management: Ansible, Chef, Puppet
Testing: JMeter, Datadog Synthetics etc

SRE Metrics

An SRE, a team of SREs and indeed an SRE’s management should be interested in the following metrics to track the performance of SRE in an Organisation. These can be measured per Sprint, Month, Quarter or Year. They allow the Organisation to track how its SREs are performing: are they just doing fire-fighting toil, does the Org need more SREs, how is the reliability of the Org as a whole improving, and so on.

How much time has an SRE spent on Toil/Automation?
Mean Time To Failure: the mean time between platform or application failures, measured from the start of one incident to the start of the next
Mean Time To Repair: the mean time between when repair/mitigation begins and when the platform or application is back up and running
Mean Time To Recovery: the mean time between when an incident occurs and when the platform or application is back up and running
Mean Time To Resolve: the mean time between an incident occurring and a permanent fix being implemented
Mean Time To Respond: the mean time it takes for an SRE to engage with an incident
Mean Time Between Failures: the mean time between recovery from one incident and the start of a new incident

Source: https://www.atlassian.com/incident-management/kpis/common-metrics
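To make the definitions concrete, here is a small sketch that computes several of these means from incident records. The Incident fields and the sample timings (in minutes) are hypothetical, and the formulas follow the definitions as worded in the list above, which may differ slightly from how your incident management tool calculates them.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List


@dataclass(frozen=True)
class Incident:
    started: float       # incident begins
    acknowledged: float  # an SRE engages
    repair_began: float  # mitigation work starts
    recovered: float     # platform or application is back up and running
    resolved: float      # permanent fix is in place


def sre_metrics(incidents: List[Incident]) -> Dict[str, float]:
    """Mean respond/repair/recovery/resolve times plus mean time between
    failures. Needs at least two incidents to compute the latter."""
    if len(incidents) < 2:
        raise ValueError("need at least two incidents")
    ordered = sorted(incidents, key=lambda i: i.started)
    return {
        "respond": mean(i.acknowledged - i.started for i in ordered),
        "repair": mean(i.recovered - i.repair_began for i in ordered),
        "recovery": mean(i.recovered - i.started for i in ordered),
        "resolve": mean(i.resolved - i.started for i in ordered),
        # Gap from one incident's recovery to the next incident's start.
        "between_failures": mean(
            nxt.started - prev.recovered for prev, nxt in zip(ordered, ordered[1:])
        ),
    }


# Two hypothetical incidents, times in minutes from the start of the period.
incidents = [
    Incident(started=0.0, acknowledged=5.0, repair_began=10.0, recovered=40.0, resolved=100.0),
    Incident(started=240.0, acknowledged=255.0, repair_began=260.0, recovered=280.0, resolved=400.0),
]
print(sre_metrics(incidents))
```

Tracking these per Sprint or Quarter only works if the timestamps are recorded consistently, so it helps to agree on what counts as "started" and "recovered" before comparing periods.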

Conclusion

I hope you found this high-level, condensed set of articles useful on your SRE journey with Kubernetes. I would highly recommend reading Google’s SRE handbook as well as some of the reference materials below for deeper discussions on the topics I’ve presented above.

References

Reliability Engineering Concepts (https://linuxacademy.com/cp/modules/view/id/731)
Monitoring services and setting SLAs with Datadog (www.datadoghq.com)
Site Reliability Engineering: Measuring and Managing Reliability (www.pluralsight.com)
Google Site Reliability Engineering, Chapter 2: The Production Environment at Google, from the Viewpoint of an SRE (sre.google)
Site Reliability Engineering for Kubernetes (tammybutow.medium.com)
Kubernetes Data Collected (docs.datadoghq.com)
Kubernetes in Production: The Ultimate Guide to Monitoring Resource Metrics with Prometheus (www.replex.io)
kubernetes-apiserver, SignalFx documentation (docs.signalfx.com)
Monitoring NGINX Ingress Controller (docs.gitlab.com)
SRE Tools & Automation Course, Cloud Academy (cloudacademy.com)