Regional High Availability for AKS Workloads

Introduction

In this article, we’ll show how a GSLB (Global Server Load Balancer) such as Azure Traffic Manager or AVI can provide regional high availability for our AKS workloads using an active-active failover strategy.

In previous articles we’ve exposed our workloads at a regional level and load balanced traffic over our cluster nodes, which were spread over availability zones in a region. So if one data center goes down, we usually still have at least two more availability zones to handle our workloads.

Now if an entire region were to go down, we want to ensure that our customers are not affected.

I won’t go into SLAs (Service Level Agreements) here; we’ll leave that for another article. Let’s just say we have an SLA with our customers that our workloads’ downtime will be minimal at any point in time (e.g. no more than 0.01% per day). To achieve this, we’re going to design our architecture for availability in an active-active mode: if a region goes down, our GSLB simply flips traffic from our workloads in that region to the other.
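To put that example figure in perspective, 0.01% of a day is only 8.64 seconds. Here is a minimal sketch of the arithmetic; the function name downtime_budget_seconds is just something I made up for illustration.

```shell
# Allowed downtime, in seconds, for a given unavailability percentage.
# $1 = allowed downtime as a percentage (e.g. 0.01)
# $2 = length of the period in seconds (86400 for one day)
downtime_budget_seconds() {
  awk -v pct="$1" -v period="$2" 'BEGIN { printf "%.2f\n", period * pct / 100 }'
}

# Budget for one day at 0.01% allowed downtime.
downtime_budget_seconds 0.01 86400
```

A budget that small leaves no room for a manual recovery, which is why the flip between regions needs to be quick.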

Active-active in this case means we have the same workloads deployed to both regions, and the flip can be done automatically via health checks and DNS.
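A real GSLB does the health checking for us, but the decision it makes can be sketched roughly as below. This is only an illustration of the idea, not how Traffic Manager or AVI are implemented; is_healthy and pick_target are hypothetical names, and probing the root path over plain HTTP with a 5 second timeout is an assumption.

```shell
# Succeeds only if the URL answers within 5 seconds without an HTTP 4xx/5xx.
is_healthy() {
  curl --silent --fail --max-time 5 --output /dev/null "$1"
}

# Print the hostname that traffic should be sent to:
# the primary while it is healthy, otherwise the secondary.
# $1 = primary hostname, $2 = secondary hostname
pick_target() {
  if is_healthy "http://$1/"; then
    echo "$1"
  else
    echo "$2"
  fi
}
```

For example, pick_target appgw.northeurope.hynes.pri appgw.westeurope.hynes.pri prints the North Europe name for as long as its probe succeeds, and the West Europe name once it stops responding.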

Architecture

In this architecture, we have two AKS clusters: one in North Europe (the primary region) and one in West Europe.

Our sample application in North Europe is available via appgw.northeurope.hynes.pri, and our sample application in West Europe is available via appgw.westeurope.hynes.pri.

Our customers will hit the domain name appgw.hynes.pri, which in turn resolves to the IP address of our primary region endpoint, appgw.northeurope.hynes.pri.

Setup

As usual, we’re going to use Azure Cloud Shell. Set up your Cloud Shell environment with the following commands. We’ll install Ansible for Python 3, install the openshift Python package for Ansible, and finally add our own real-time Azure Ansible module for listing resources in a resource group (https://adrianhynes.medium.com/implementing-your-own-ansible-azure-collection-f2d4e0334502).

pip install ansible
pip install openshift
ansible-galaxy collection install community.kubernetes
git clone https://github.com/ansible-collections/azure.git
git clone https://github.com/aido123/ansible.git
cp ansible/azure_rm_resource_info_rt.py azure/plugins/modules/azure_rm_resource_info_rt.py
pip install -r azure/requirements-azure.txt
cd azure
ansible-galaxy collection build --force
ansible-galaxy collection install azure-azcollection-*.tar.gz --force

Clone my Ansible playbooks:

git clone https://github.com/aido123/ansible.git

The following playbook will set up all our Azure resources and configuration, as well as install App Gateway, the Nginx Ingress Controller and External DNS.

ansible-playbook ansible/azure_ansible_gtm.yaml --extra-vars "resource_group_name=myrsg subscription_id=ABC123 tenant_id=DEF456"

This next playbook will deploy our sample applications to each cluster.

ansible-playbook ansible/azure_ansible_gtm_sample_apps.yaml --extra-vars "resource_group_name=myrsg"

Failover Demo

Unfortunately, we can’t use Azure Traffic Manager in this demo, as it doesn’t yet support private DNS domains (and AVI is a few articles in itself). Instead, we’re going to mimic our GSLB using another private DNS zone, hynes.pri, which we’ll just flip over manually.

We can now access our sample applications via their LTM (Local Traffic Management) domain names, http://appgw.northeurope.hynes.pri and http://appgw.westeurope.hynes.pri.
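A quick way to confirm both regions are answering is to request each LTM name and print the HTTP status code. This is a small sketch that assumes you run it from somewhere that can already resolve the private hynes.pri names; check_endpoints is a hypothetical helper name.

```shell
# Print "<host> <http status>" for each hostname given.
# curl reports 000 when the host cannot be reached at all.
check_endpoints() {
  for host in "$@"; do
    code=$(curl --silent --output /dev/null --max-time 5 \
      --write-out '%{http_code}' "http://${host}/") || true
    echo "${host} ${code}"
  done
}
```

Calling check_endpoints appgw.northeurope.hynes.pri appgw.westeurope.hynes.pri should show a successful status for both regions before we start the failover.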

Our GTM (Global Traffic Management) domain name http://appgw.hynes.pri, which our customers will access, is just a CNAME alias in private DNS that points to our primary region’s LTM name, appgw.northeurope.hynes.pri.

Now let’s say North Europe goes down. We’ll simulate a GSLB switching over to West Europe by updating our “GTM” private DNS CNAME alias for appgw.hynes.pri from appgw.northeurope.hynes.pri to appgw.westeurope.hynes.pri.
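The flip can be done in the portal, or from Cloud Shell with the Azure CLI. The sketch below assumes the resource group is myrsg (as passed to the playbooks earlier) and that the alias lives as a record set named appgw in a private DNS zone named hynes.pri; check those names against what the playbook actually created. flip_gtm and show_gtm are hypothetical helper names.

```shell
# Assumed names; override them via the environment if yours differ.
RESOURCE_GROUP="${RESOURCE_GROUP:-myrsg}"
GTM_ZONE="${GTM_ZONE:-hynes.pri}"
GTM_RECORD="${GTM_RECORD:-appgw}"

# Point the "GTM" CNAME at the given regional (LTM) hostname.
# $1 = target hostname, e.g. appgw.westeurope.hynes.pri
flip_gtm() {
  az network private-dns record-set cname set-record \
    --resource-group "$RESOURCE_GROUP" \
    --zone-name "$GTM_ZONE" \
    --record-set-name "$GTM_RECORD" \
    --cname "$1"
}

# Show the record set so the new target can be confirmed.
show_gtm() {
  az network private-dns record-set cname show \
    --resource-group "$RESOURCE_GROUP" \
    --zone-name "$GTM_ZONE" \
    --name "$GTM_RECORD"
}
```

Running flip_gtm appgw.westeurope.hynes.pri followed by show_gtm performs the switch and confirms it. Clients that cached the old answer will keep using it until the record’s TTL expires, so the failover is not instantaneous for everyone.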

Conclusion

I hope you have found this article useful and that it gives you some ideas for high availability of your AKS workloads.

Cloud Platform Architect. Opinions and articles on Medium are my own.