AKS Detector — “Diagnose and Solve Problems” as a custom addon

Introduction

Diagnose & Solve Problems

Design

https://management.azure.com/subscriptions/<subscription id>/resourcegroups/<resource group name>/providers/microsoft.containerservice/managedclusters/<cluster name>/detectors/<detector name>?startTime=YYYY-MM-DD%20HH:MM&endTime=YYYY-MM-DD%20HH:MM&api-version=YYYY-MM-DD
curl -X GET -H "Authorization: Bearer TOKEN" "https://management.azure.com/subscriptions/123/resourcegroups/rsg1/providers/microsoft.containerservice/managedclusters/cluster1/detectors/node-drain-failures?startTime=2021-01-25%2021:59&endTime=2021-01-25%2021:59&api-version=2019-04-01"
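
The query string contains & characters and percent-encoded spaces, so the URL is easy to get wrong when typed by hand. The following is a minimal shell sketch that assembles it. The function name detector_url and its argument order are my own choices for illustration and are not part of any Azure tooling. The API version is passed in as an argument rather than assumed.

```shell
# Build the detector endpoint URL for one cluster and time window.
# detector_url is a hypothetical helper; every argument is a caller-supplied value.
detector_url() {
  sub="$1"; rg="$2"; cluster="$3"; detector="$4"
  start="$5"; end="$6"; api_version="$7"
  # Timestamps look like "YYYY-MM-DD HH:MM"; the space must be sent as %20.
  start_enc=$(printf '%s' "$start" | sed 's/ /%20/g')
  end_enc=$(printf '%s' "$end" | sed 's/ /%20/g')
  printf 'https://management.azure.com/subscriptions/%s/resourcegroups/%s/providers/microsoft.containerservice/managedclusters/%s/detectors/%s?startTime=%s&endTime=%s&api-version=%s\n' \
    "$sub" "$rg" "$cluster" "$detector" "$start_enc" "$end_enc" "$api_version"
}

# Example call (not executed here). Quoting the URL keeps the shell from
# treating & as a background operator:
#   TOKEN=$(az account get-access-token --query accessToken -o tsv)
#   curl -H "Authorization: Bearer $TOKEN" \
#     "$(detector_url 123 rsg1 cluster1 node-drain-failures '2021-01-25 21:00' '2021-01-25 21:59' 2019-04-01)"
```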

Test

az acr create -n reg1009855 -g rsg1 --sku Standard
az ad sp create-for-rbac --name detectorspn --skip-assignment
{
  "appId": "a5a32677-xyz",
  "displayName": "detectorspn",
  "name": "http://detectorspn",
  "password": "ABC123",
  "tenant": "TENANT-123"
}
{
  "Name": "AKS Cluster Read Only",
  "IsCustom": true,
  "Description": "Read Only access to AKS Clusters",
  "Actions": [
    "Microsoft.ContainerService/managedClusters/detectors/read"
  ],
  "NotActions": [],
  "DataActions": [],
  "NotDataActions": [],
  "AssignableScopes": [
    "/subscriptions/SUB-123"
  ]
}
az role definition create --role-definition aksread.json
az role assignment create --role "AKS Cluster Read Only" --assignee "a5a32677-xyz"
az acr update -n reg1009855 --admin-enabled true
az acr credential show -n reg1009855
kubectl create secret docker-registry regcred --docker-server reg1009855.azurecr.io --docker-username=reg1009855 --docker-password=<secret from above> --docker-email=test@test.com
apiVersion: apps/v1
kind: Deployment
metadata:
  name: detector
  labels:
    app: detector
spec:
  replicas: 1
  selector:
    matchLabels:
      app: detector
  template:
    metadata:
      labels:
        app: detector
    spec:
      imagePullSecrets:
      - name: regcred
      containers:
      - name: detector
        image: reg1009855.azurecr.io/detector:X
        ports:
        - containerPort: 2112
        env:
        - name: AZURE_TENANT_ID
          value: <AZURE_TENANT_ID>
        - name: AZURE_CLIENT_ID
          value: <AZURE_CLIENT_ID>
        - name: AZURE_CLIENT_SECRET
          value: <AZURE_CLIENT_SECRET>
        - name: AZURE_SUBSCRIPTION_ID
          value: <AZURE_SUBSCRIPTION_ID>
        - name: DETECTOR_IDS
          value: node-drain-failures,appdev,aad-issues
        - name: RESOURCE_GROUP
          value: rsg1
        - name: CLUSTER
          value: cluster1
        - name: POLL_DELAY
          value: "1800"
        - name: API_TIMEOUT
          value: "60"
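
The manifest leaves several values as placeholders. Before filling them in, a small pre-flight check can confirm that the matching variables are set in your shell. This is only a sketch: treating these seven variables as required is my assumption based on the manifest, and missing_env is a hypothetical helper name.

```shell
# Variable names copied from the env section of the Deployment manifest.
# Assumption: all of them are mandatory for the detector to start.
REQUIRED_VARS="AZURE_TENANT_ID AZURE_CLIENT_ID AZURE_CLIENT_SECRET AZURE_SUBSCRIPTION_ID DETECTOR_IDS RESOURCE_GROUP CLUSTER"

# Print the names of required variables that are unset or empty,
# separated by spaces. Prints an empty line when nothing is missing.
missing_env() {
  missing=""
  for name in $REQUIRED_VARS; do
    # Indirect lookup: read the value of the variable whose name is in $name.
    eval "value=\${$name-}"
    [ -n "$value" ] || missing="$missing $name"
  done
  printf '%s\n' "${missing# }"
}
```

Running missing_env before editing deployment.yaml and getting an empty line back suggests every placeholder has a value ready to substitute.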
kubectl apply -f deployment.yaml
kubectl logs detector-655d79d44d-x4fwv
{"detectid":"node-drain-failures","level":"debug","msg":"Success We found no obvious issues with Node Drain Failures   False null","time":"2021-01-30T14:51:05Z"}
{"detectid":"appdev","level":"debug","msg":"Success Our analysis did not find any issues in this category. Please click for recommended next steps. Recommended Documents \u003cmarkdown\u003e\n\n* [Tutorial: Using Azu... False null","time":"2021-01-30T14:51:05Z"}
kubectl exec -it detector-c4949d96c-qvsnv -- /bin/sh
# curl http://localhost:2112/metrics
# HELP detector_aad_issues Detector metric detector_aad_issues
# TYPE detector_aad_issues gauge
detector_aad_issues 2
# HELP detector_appdev Detector metric detector_appdev
# TYPE detector_appdev gauge
detector_appdev 1
# HELP detector_node_drain_failures Detector metric detector_node_drain_failures
# TYPE detector_node_drain_failures gauge
detector_node_drain_failures 1
# HELP go_gc_duration_seconds A summary of the pause duration of garbage collection cycles.
# TYPE go_gc_duration_seconds summary
go_gc_duration_seconds{quantile="0"} 1.31e-05
go_gc_duration_seconds{quantile="0.25"} 2.0901e-05
...
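
Judging by the sample output, each gauge name looks like the detector ID with a detector_ prefix and hyphens replaced by underscores. That mapping is inferred from the output above, not taken from the detector's source, so treat this shell sketch as a guess at the naming rule. The helper names metric_name and list_metric_names are hypothetical.

```shell
# Map one detector ID to the metric name seen in the /metrics output,
# e.g. node-drain-failures -> detector_node_drain_failures.
metric_name() {
  printf 'detector_%s\n' "$(printf '%s' "$1" | tr '-' '_')"
}

# Print one metric name per line for a comma-separated list of IDs,
# in the same format as the DETECTOR_IDS value in the Deployment.
list_metric_names() {
  old_ifs=$IFS
  IFS=,
  for id in $1; do
    metric_name "$id"
  done
  IFS=$old_ifs
}
```

This can help when writing alert rules or dashboards, since the expected gauge names can be listed from the same DETECTOR_IDS string used in the Deployment.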

Challenges

Conclusion

Cloud Platform Architect. Opinions and articles on medium are my own.

Adrian Hynes
