In this article I want to share my experience of filing a new feature request for a CNCF open source project, implementing a solution for that feature, going through the pull request process and then seeing its actual release.
Chaos Mesh is a CNCF open source sandbox project which allows you to introduce Chaos into your Kubernetes clusters in different ways via a growing set of Chaos experiments. …
In this article, we’re going to take the AKS Diagnose & Solve Problems feature (which is currently only available to view via the AKS UI) and create a Kubernetes addon that exposes all this good health information from your AKS clusters via logs & metrics. We can then scrape these metrics using something like Prometheus or Datadog & alert when certain thresholds are exceeded.
Let’s first look at what Diagnose & Solve Problems is, and how we access it.
Diagnose & Solve Problems is a useful feature (in preview at the time of writing) within the AKS UI. It gives us…
In the previous section we looked at a real-world example of SRE, best practices for reliability, and troubleshooting and mitigation. In this section we’ll look at what an SRE might monitor in a Kubernetes platform, the tools an SRE may use and how SREs can track their own reliability performance over time with certain metrics.
Remember, your SLIs should relate to your customers’ experience, so consider the following metrics in your cluster that could be potential candidates for SLIs:
In the previous section, we set the scene of what SRE is, what an SRE does and the principles of SRE. In this section, we’ll look at a real-world example of SRE, best practices for reliability, and troubleshooting and mitigation.
OK, enough with the theory, show me an example of what an SRE does in the real world!
Let’s look at this real-world incident from a postmortem point of view.
A relatively large managed Kubernetes cluster running in a DEV environment with several managed community and custom operators/controllers running.
In this three-part series of articles we’re going to have a look at what SRE (Site Reliability Engineering) or Production Engineering is, what an SRE (Site Reliability Engineer) does and the principles of SRE. We’ll then walk through a real-world example of SRE in practice, followed by some other best practices for reliability, Kubernetes SLIs, metrics and what an SRE could monitor. Finally, we’ll look at some common tools an SRE can have in their toolbox.
We’ll focus on SRE from a Kubernetes platform point of view but obviously these principles and practices can be applied to any systems…
In this article, we’re going to show how we could use a GSLB (Global Server Load Balancer) like Azure Traffic Manager or AVI to provide regional high availability for our AKS workloads using an active-active failover strategy.
In previous articles we’ve exposed our workloads at a regional level and load balanced traffic over our cluster nodes which were spread over availability zones in a region. So in the event of a data center going down, we usually have at least 2 more availability zones to handle our workloads.
Now if an entire region were to go down, we want…
In a recent Azure Ansible article I wrote (https://adrianhynes.medium.com/orchestrating-azure-resources-with-ansible-fa82f4e3dfd6), I came across a challenge: listing the resources in a resource group wasn’t real-time.
In that article I improvised by waiting and checking every minute until the resources eventually appeared. But that loop and sleep took ~20 minutes.
It turns out the “/resources API is served from a regional cache from Azure Resource Manager, which is eventual consistent, and not real time” (https://github.com/Azure/AKS/issues/1964).
This means that, in order to get a real-time view of the resources in our resource group, we need to hit the regional Azure management endpoint, i.e. https://<region>.management.azure.com/
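As a rough illustration of what hitting the regional endpoint looks like, here is a minimal sketch that only builds the request URL. The function name, the example region, the placeholder subscription ID and resource group name, and the `api-version` value are all my assumptions for illustration; check the Azure Resource Manager REST documentation for the version you should use. No request is sent here, and a real call would also need a bearer token.

```python
from urllib.parse import quote, urlencode


def regional_resources_url(region, subscription_id, resource_group,
                           api_version="2021-04-01"):
    """Build the 'list resources in a resource group' URL against the
    regional management endpoint instead of the global one.

    The api_version default is an assumption, not taken from the article.
    """
    # Regional host, following the https://<region>.management.azure.com/ pattern
    host = f"https://{region}.management.azure.com"
    # Path segments are percent-encoded in case they contain unusual characters
    path = (
        f"/subscriptions/{quote(subscription_id, safe='')}"
        f"/resourceGroups/{quote(resource_group, safe='')}"
        "/resources"
    )
    return f"{host}{path}?{urlencode({'api-version': api_version})}"


# Hypothetical values, purely for illustration
url = regional_resources_url(
    "westeurope",
    "00000000-0000-0000-0000-000000000000",
    "my-rg",
)
print(url)
```

The only difference from the usual call is the host: the path and query string stay the same, so an existing client can normally be pointed at the regional endpoint by swapping the base URL.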
In the last section we made some initial design decisions which will influence our solution.
In this section we’ll pick a database, review and insert some questions, review the APIs and a sample sequence flow, review our overall architecture and finally deploy our Kubernetes applications.
We are simply going to pick SQLite for our relational database needs, because it’s small, easy to use and file-based. We’ll mount the SQLite database onto a PVC (Azure Disk) to maintain state.
Inserting a new level is straightforward:
replace into level (id, name, description) values(3, 'Level…
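To show why `replace into` is handy here, the sketch below runs the same kind of statement twice against an in-memory SQLite database. The table schema (with `id` as the primary key) and the sample values are my assumptions for illustration; only the table and column names come from the statement above.

```python
import sqlite3

# In-memory database so the sketch leaves no file behind
conn = sqlite3.connect(":memory:")

# Assumed schema: id is the primary key, which is what makes REPLACE
# overwrite an existing row instead of adding a duplicate
conn.execute(
    "create table level (id integer primary key, name text, description text)"
)

# First run inserts the row (sample values are hypothetical)
conn.execute(
    "replace into level (id, name, description) values (?, ?, ?)",
    (3, "Level 3", "first draft"),
)

# Second run with the same id deletes the old row and inserts the new one
conn.execute(
    "replace into level (id, name, description) values (?, ?, ?)",
    (3, "Level 3", "revised description"),
)
conn.commit()

rows = conn.execute("select id, name, description from level").fetchall()
print(rows)
conn.close()
```

Because the statement can be rerun safely, the same SQL script can seed a fresh database or update an existing one mounted on the PVC.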
In this series of articles, I want to show you how you can create your own custom competitive Kubernetes Training experience for your colleagues, on AKS. You could also use it as part of an interview test for potential employees.
First let’s look at the “finished” (more like a POC) product and then we’ll work back to how we got there.
OK, let’s look at the initial goals we set for this application.
Self Service and On Demand: The ability to take part should be self-service. …
In this article, I’ll show you how easy it is to integrate Azure AD as an authentication mechanism for your React application.
We’ll use this authentication mechanism in future articles.
Here’s a simplified high level sequence flow.
Create an AD user for testing purposes.
Search for App Registration in Azure, and create a new App Registration. Fill in the fields below. For this example we’ll leave the callback URL as localhost over HTTP.
Cloud Platform Architect. Opinions and articles on Medium are my own.