SRE Practices for Kubernetes Platforms — Part 3

Introduction

In the previous section we looked at a real-world example of SRE, best practices for reliability, and troubleshooting and mitigation. In this section we’ll look at what an SRE might monitor in a Kubernetes platform, tools an SRE may use, and how SREs can track their own reliability performance over time with certain metrics.

What should you measure and monitor as a Kubernetes Platform SRE?

SLIs
Remember that your SLIs should relate to your customers’ experience, so consider the following metrics in your cluster as potential candidates for SLIs:

  • API Server — Response latency distribution in seconds for each verb, dry run value, group, version, resource, subresource, scope and component — apiserver_request_duration_seconds
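As a rough illustration of turning this histogram into an SLI, the sketch below computes the fraction of requests that finished within a latency threshold. It assumes you have already extracted the cumulative bucket counts (the `le` label values) for one verb over some window; the function name `latency_sli` and the sample numbers are hypothetical, not taken from a real cluster.

```python
from typing import Dict


def latency_sli(buckets: Dict[float, int], threshold_seconds: float) -> float:
    """Return the fraction of requests that completed within threshold_seconds.

    `buckets` maps each histogram upper bound (the `le` label) to its
    cumulative request count; float("inf") holds the total request count.
    """
    total = buckets[float("inf")]
    if total == 0:
        return 1.0  # assumption: no traffic counts as meeting the objective
    if threshold_seconds not in buckets:
        raise ValueError("threshold must match a histogram bucket boundary")
    return buckets[threshold_seconds] / total


# Hypothetical cumulative counts for GET requests over a time window.
sample_buckets = {0.1: 900, 0.5: 980, 1.0: 995, float("inf"): 1000}
sli = latency_sli(sample_buckets, 0.5)
print(f"{sli:.1%} of GET requests completed within 0.5s")
```

In practice this ratio would more likely be computed by a query in your monitoring solution; the Python version is only meant to show the arithmetic behind a latency SLI.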

External-facing applications deployed onto the cluster will also need their own SLIs, which may be overseen by a different SRE. Applications written in Spring Boot can use the Spring Boot Actuator to expose metrics such as latency and error rate, which are ideal candidates for SLIs and can then be monitored.

Other Metrics

At a high level, let’s look at some of the other metrics we might want to ingest into our monitoring solution. These can be used for monitors or may just help with debugging.

  • API Server CPU, Memory, Request Count

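Try using kubectl get --raw /metrics for a list of the raw metrics exposed by the API server. Resource usage figures from the metrics server addon are available separately, for example via kubectl top nodes.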
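The raw metrics come back in the Prometheus text format, one sample per line. The sketch below is a minimal parser for that format, assuming simple `name{labels} value` lines without trailing timestamps; `parse_metrics` is a hypothetical helper, and the sample text is trimmed to two labels for readability (real samples carry more).

```python
import re
from typing import Dict, Iterator, Tuple

# Matches lines such as: metric_name{label="value"} 42
SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)$')
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_metrics(text: str) -> Iterator[Tuple[str, Dict[str, str], float]]:
    """Yield (name, labels, value) for each sample line in the text."""
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = SAMPLE_RE.match(line)
        if match is None:
            continue  # skip anything this simple sketch does not understand
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        yield name, labels, float(value)


# Hypothetical, trimmed sample of API server output.
SAMPLE_TEXT = """\
# HELP apiserver_request_total Counter of apiserver requests.
# TYPE apiserver_request_total counter
apiserver_request_total{code="200",verb="GET"} 1500
apiserver_request_total{code="500",verb="GET"} 3
"""

samples = list(parse_metrics(SAMPLE_TEXT))
for name, labels, value in samples:
    print(name, labels, value)
```

To feed it real data you could capture the command output, for example with `subprocess.run(["kubectl", "get", "--raw", "/metrics"], capture_output=True, text=True)`, and pass the resulting `stdout` to `parse_metrics`. For anything beyond quick debugging, a maintained Prometheus client library is likely a better choice than a hand-written parser.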

Monitors

Based on the above metrics, here are some sample monitors we can put in place to monitor our cluster and the applications running on it.

  • Missing logs/data over a time frame. No logs or metrics have reached our monitoring solution for a set amount of time. This could indicate a networking issue, a host issue, or an issue with the addon that collects the logs/metrics.
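The logic behind a missing-data monitor can be sketched as follows. This assumes you can obtain, for each source (node, addon, application), the timestamp of the last log line or metric received; the names `find_silent_sources`, `node-a` and `node-b` are hypothetical.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, List


def find_silent_sources(
    last_seen: Dict[str, datetime],
    now: datetime,
    max_silence: timedelta,
) -> List[str]:
    """Return the sources that have sent no data for longer than max_silence."""
    return sorted(
        source for source, seen_at in last_seen.items()
        if now - seen_at > max_silence
    )


# Hypothetical last-seen timestamps for two nodes.
now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
last_seen = {
    "node-a": now - timedelta(minutes=2),
    "node-b": now - timedelta(minutes=20),
}
silent = find_silent_sources(last_seen, now, timedelta(minutes=10))
print("Sources with missing data:", silent)
```

Most monitoring products offer this as a built-in "no data" condition, so the code is only meant to make the rule explicit.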

From an application point of view, you may want to monitor:

  • Latency and Error Rate of your Service (if serving content)
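As a small worked example of these two signals, the sketch below computes an error rate and a nearest-rank latency percentile from a batch of request records. It assumes that 5xx status codes count as errors and that each record is a `(status_code, latency_seconds)` pair; both the record shape and the sample data are hypothetical.

```python
import math
from typing import List


def error_rate(status_codes: List[int]) -> float:
    """Return the fraction of requests that ended in a 5xx status code."""
    if not status_codes:
        return 0.0
    errors = sum(1 for code in status_codes if code >= 500)
    return errors / len(status_codes)


def percentile(values: List[float], pct: int) -> float:
    """Return the pct-th percentile using the nearest-rank method."""
    if not values:
        raise ValueError("percentile of an empty list is undefined")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical records: nine fast successes and one slow failure.
records = [(200, 0.1)] * 9 + [(503, 2.0)]
codes = [code for code, _ in records]
latencies = [latency for _, latency in records]

print(f"error rate: {error_rate(codes):.0%}")
print(f"p50 latency: {percentile(latencies, 50)}s")
print(f"p95 latency: {percentile(latencies, 95)}s")
```

A monitor would then compare these values against thresholds you choose, such as an error rate above 1% or a p95 latency above your SLO target.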

You may also want to monitor cluster operations such as:

  • Failed Cluster Upgrades

This is not a full list of everything you need or should monitor in your Kubernetes cluster; it is only meant to give you some ideas.

Tools an SRE may use

Depending on your role and company, there’s a wide range of tools an SRE may use. Here are some examples.

Monitoring

Your monitoring solution should be able to collect logs and metrics from both the Kubernetes cluster and the cloud provider and aggregate them. It should be at least near real time, provide alerting and email digests, and be able to integrate with incident management systems. Some common examples include Prometheus, Grafana, ELK, Datadog and New Relic.

Incident Management and Problem Management

There are many incident management tools out there, but they should be able to integrate with monitoring and support service desk activities, on-call rotations and paging, team communications, customer communications, incident command center activities, postmortem activities and analysis, and issue tracking. Most tools you may use for incident management can also be used for problem management. Your problem management tool will allow you to track related incidents, the status of known problems, the RCA of problems, and whether resolutions have been performed and the problem closed. It can also be used when debugging issues where the problem is still open or has not yet been fixed. Some common examples include ServiceNow, Jira Service Desk, PagerDuty and VictorOps.

Troubleshooting

  • Kubernetes tools like kubectl, helm, velero, docker

Automation

  • Scripting and programming: Python, Shell Scripting, Go

SRE Metrics

An SRE, a team of SREs and indeed SRE management should be interested in the following metrics to track the performance of SRE in an organisation. These can be measured per sprint, month, quarter or year. They allow the organisation to see how its SREs are performing: whether they are only fire fighting and doing toil, whether the org needs more SREs, and how the reliability of the org as a whole is improving.

  • How much time has an SRE spent on toil versus automation?
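One simple way to track this is to record hours per work category for a period and compute the share spent on toil. The sketch below assumes you log time under hypothetical category names such as "toil", "automation" and "project"; the numbers are made up for illustration.

```python
from typing import Dict


def toil_ratio(hours_by_category: Dict[str, float]) -> float:
    """Return the fraction of logged hours that were spent on toil."""
    total = sum(hours_by_category.values())
    if total == 0:
        return 0.0
    return hours_by_category.get("toil", 0.0) / total


# Hypothetical hours logged by one SRE during a two-week sprint.
sprint_hours = {"toil": 30.0, "automation": 20.0, "project": 10.0}
ratio = toil_ratio(sprint_hours)
print(f"Toil this sprint: {ratio:.0%}")
```

Tracking this ratio sprint over sprint shows whether automation work is actually reducing toil. Google’s SRE guidance suggests keeping toil below roughly half of an SRE’s time, which can serve as a starting threshold.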

Conclusion

I hope you found this high-level, condensed set of articles useful on your SRE journey with Kubernetes. I would highly recommend reading Google’s SRE handbook as well as the reference materials below for deeper discussions of the topics I’ve presented above.

References

Reliability Engineering Concepts — https://linuxacademy.com/cp/modules/view/id/731

Cloud Platform Architect. Opinions and articles on Medium are my own.