In this article I want to share my experience of filing a new feature request for a CNCF open source project, implementing a solution for that feature, going through the pull request process and then seeing its actual release.

The Open Source Project

Chaos Mesh is a CNCF open source sandbox project which allows you to introduce Chaos into your Kubernetes clusters in different ways via a growing set of Chaos experiments. …


In this article, we’re going to take the AKS Diagnose & Solve Problems feature (which is currently only available to view via the AKS UI) and create a Kubernetes add-on that exposes all this good health information from your AKS clusters via logs & metrics. We can then scrape these metrics using something like Prometheus or Datadog & alert when certain thresholds are exceeded.

Diagnose & Solve Problems

Let’s first look at what Diagnose & Solve Problems is, & how we access it.

Diagnose & Solve Problems is a useful feature (in preview as of writing) within the AKS UI. It gives us…


In the previous section we looked at a real-world example of SRE, best practices for reliability, and troubleshooting and mitigation. In this section we’ll look at what an SRE might monitor in a Kubernetes platform, tools an SRE may use and how SREs can track their own reliability performance over time with certain metrics.

What should you measure and monitor as a Kubernetes Platform SRE?

Remember, your SLIs should relate to your customers’ experience, so consider the following metrics in your cluster that could be potential candidates for SLIs:

  • API Server — Response latency distribution in seconds for each verb, dry run value, group, version, resource, subresource, scope and…
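As a rough illustration of how a latency distribution like the API Server one could become an SLI, here is a minimal sketch. It assumes the latency data arrives as Prometheus-style cumulative histogram buckets (an upper bound in seconds mapped to a running count). The function name `latency_sli` and the sample numbers are hypothetical and are not taken from a real cluster.

```python
from typing import Dict


def latency_sli(cumulative_buckets: Dict[float, int], threshold_s: float) -> float:
    """Return the fraction of requests served within threshold_s seconds.

    cumulative_buckets maps a bucket's upper bound (seconds) to the number of
    requests that took at most that long. The float("inf") bucket is assumed
    to hold the total request count.
    """
    total = cumulative_buckets[float("inf")]
    if total == 0:
        return 1.0  # no traffic in the window: treat the SLI as met
    if threshold_s not in cumulative_buckets:
        raise ValueError("threshold must match a bucket boundary")
    return cumulative_buckets[threshold_s] / total


# Hypothetical sample for one verb (for example GET); illustrative numbers only.
get_buckets = {0.1: 900, 0.5: 980, 1.0: 995, float("inf"): 1000}
print(latency_sli(get_buckets, 0.5))
```

The resulting ratio can then be compared against an SLO target, such as "99% of GET requests complete within 0.5 seconds". Both the target and the threshold are yours to choose.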


In the previous section, we set the scene of what SRE is, what an SRE does and the principles of SRE. In this section, we’ll look at a real world example of SRE, best practices for reliability and troubleshooting and mitigation.

SRE Practices Example in the Real World

OK, enough with the theory. Show me an example of what an SRE does in the real world!

Let’s look at this real-world incident from a postmortem point of view.


A relatively large managed Kubernetes cluster running in a DEV environment with several managed community and custom operators/controllers running.


  1. Customer’s application’s logging and telemetry delayed reaching…


In this three-part series of articles we’re going to look at what SRE (Site Reliability Engineering), or Production Engineering, is, what a Site Reliability Engineer does, and the principles of SRE. We’ll then walk through a real-world example of SRE practices, and look at some other best practices for reliability, Kubernetes SLIs, metrics and what an SRE could monitor. Finally, we’ll look at some common tools an SRE can have in their toolbox.

We’ll focus on SRE from a Kubernetes platform point of view but obviously these principles and practices can be applied to any systems…


In this article, we’re going to show how we could use a GSLB (Global Server Load Balancer) like Azure Traffic Manager or AVI to provide cross-region high availability for our AKS workloads using an active-active failover strategy.

In previous articles we’ve exposed our workloads at a regional level and load balanced traffic over our cluster nodes which were spread over availability zones in a region. So in the event of a data center going down, we usually have at least 2 more availability zones to handle our workloads.

Now if an entire region were to go down, we want…


In a recent Azure Ansible article I wrote, I came across a challenge: listing the resources in a resource group wasn’t real-time.

In that article I improvised by waiting and checking every minute until the resources eventually appeared. BUT that loop and sleep took ~20 minutes.
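To make the wait-and-check approach concrete, here is a minimal sketch of such a polling loop. The helper name `wait_for` and the fake resource list are hypothetical. In the real scenario the `check` callable would list the resource group’s contents, which is not shown here.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def wait_for(
    check: Callable[[], Optional[T]],
    interval_s: float = 60.0,
    timeout_s: float = 1800.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> T:
    """Call check() every interval_s seconds until it returns a truthy value.

    Raises TimeoutError if the next wait would run past timeout_s.
    """
    deadline = clock() + timeout_s
    while True:
        result = check()
        if result:
            return result
        if clock() + interval_s > deadline:
            raise TimeoutError("resources did not appear before the timeout")
        sleep(interval_s)


# Hypothetical stand-in for the listing call: empty twice, then populated.
responses = iter([[], [], ["vm-1", "disk-1"]])
found = wait_for(lambda: next(responses), sleep=lambda _seconds: None)
print(found)
```

The `sleep` and `clock` parameters are injectable only so the loop can be exercised without really waiting. In normal use the defaults apply.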

It turns out, the “/resources API is served from a regional cache from Azure Resource Manager, which is eventual consistent, and not real time”

This means that, to get a real-time view of the resources in our resource group, we need to hit the regional Azure management endpoint, i.e. https://<region>


In the last section we made some initial design decisions which will influence our solution.

In this section we’ll pick a database, review and insert some questions, review the APIs and a sample sequence flow, review our overall architecture and finally deploy our Kubernetes applications.


We are simply going to pick SQLite for our relational database needs, due to its small footprint, ease of use and file-based nature. We’ll mount the SQLite database on a PVC (Azure Disk) to maintain state.

Questions and Levels

New Level

Inserting a new level is straightforward:

replace into level (id, name, description) values(3, 'Level…
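For readers who want to try the same pattern from code, here is a minimal sketch using Python’s built-in `sqlite3` module. The table schema, the helper name `upsert_level` and the sample values are assumptions for illustration; only the `level` table and its `id`, `name` and `description` columns come from the statement above.

```python
import sqlite3

# In-memory database for the sketch; the article's database is a file on a PVC.
conn = sqlite3.connect(":memory:")

# Assumed schema: the column types and the primary key are guesses.
conn.execute(
    "CREATE TABLE level (id INTEGER PRIMARY KEY, name TEXT NOT NULL, description TEXT)"
)


def upsert_level(db: sqlite3.Connection, level_id: int, name: str, description: str) -> None:
    """Insert a level, or replace the existing row that has the same id."""
    with db:  # commits on success, rolls back on error
        db.execute(
            "REPLACE INTO level (id, name, description) VALUES (?, ?, ?)",
            (level_id, name, description),
        )


# Hypothetical values; running the statement twice leaves a single row.
upsert_level(conn, 3, "Example level", "first draft")
upsert_level(conn, 3, "Example level", "second draft")
rows = conn.execute("SELECT id, name, description FROM level").fetchall()
print(rows)
```

Because `REPLACE INTO` deletes any row that conflicts on the primary key before inserting, the statement can be re-run safely when a level’s text changes.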


In this series of articles, I want to show you how you can create your own custom competitive Kubernetes Training experience for your colleagues, on AKS. You could also use it as part of an interview test for potential employees.


First let’s look at the “finished” (more like a POC) product and then we’ll work back to how we got there.

Demo of finished (POC) Application


OK, let’s look at the initial goals we set for this application.

Self Service and On Demand: The ability to take part should be self-service. …


In this article, I’ll show you how easy it is to integrate Azure AD as an authentication mechanism for your React application.

We’ll use this authentication mechanism in future articles.

Here’s a simplified high-level sequence flow.


Create an AD User for testing purposes

App Registration

Search for App Registration in Azure, and create a new App Registration. Fill in the fields below. For this example we’ll leave the callback URL as localhost over HTTP.

Adrian Hynes

Cloud Platform Architect. Opinions and articles on Medium are my own.
