Troubleshooting Guide

Common issues users might run into when using Cluster API Provider for Azure. This list is a work in progress; feel free to open a PR to add to it if you find that useful information is missing.

Examples of troubleshooting real-world issues

No Azure resources are getting created

This is likely due to missing or invalid Azure credentials.

Check the CAPZ controller logs on the management cluster:

kubectl logs deploy/capz-controller-manager -n capz-system manager

If you see an error similar to this:

azure.BearerAuthorizer#WithAuthorization: Failed to refresh the Token for request to https://management.azure.com/subscriptions/123/providers/Microsoft.Compute/skus?%24filter=location+eq+%27eastus2%27&api-version=2019-04-01: StatusCode=401 -- Original Error: adal: Refresh request failed. Status Code = '401'. Response body: {\"error\":\"invalid_client\",\"error_description\":\"AADSTS7000215: Invalid client secret is provided.

Make sure the provided Service Principal client ID and client secret are correct and that the password has not expired.
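You can also verify the credentials outside of CAPZ by authenticating with them directly via the Azure CLI. A minimal check, assuming the placeholder values below are replaced with your own:

# try logging in with the same credentials CAPZ uses (placeholders are assumptions)
az login --service-principal -u <client-id> -p <client-secret> --tenant <tenant-id>

# inspect the service principal's credentials, including their expiry dates
az ad sp credential list --id <client-id>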

The AzureCluster infrastructure is provisioned but no virtual machines are coming up

Your Azure subscription might have no quota for the requested VM size in the specified Azure location.

Check the CAPZ controller logs on the management cluster:

kubectl logs deploy/capz-controller-manager -n capz-system manager

If you see an error similar to this:

"error"="failed to reconcile AzureMachine: failed to create virtual machine: failed to create VM capz-md-0-qkg6m in resource group capz-fkl3tp: compute.VirtualMachinesClient#CreateOrUpdate: Failure sending request: StatusCode=0 -- Original Error: autorest/azure: Service returned an error. Status=\u003cnil\u003e Code=\"OperationNotAllowed\" Message=\"Operation could not be completed as it results in exceeding approved standardDSv3Family Cores quota.

Follow these steps to request a quota increase. Alternatively, you can specify another Azure location and/or VM size during cluster creation.
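You can check your current usage against the subscription quota with the Azure CLI. A quick sketch, assuming eastus2 is the location in question:

# list compute usage against quota for a location (location is an example)
az vm list-usage --location eastus2 --output table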

A virtual machine is running but the k8s node did not join the cluster

Check the AzureMachine (or AzureMachinePool if using a MachinePool) status:

kubectl get azuremachines -o wide

If you see an output like this:

NAME                          READY   STATE
default-template-md-0-w78jt   false   Updating

This indicates that the bootstrap script has not yet succeeded. Check the AzureMachine status.conditions field for more information.
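For example, to inspect the conditions of the machine from the sample output above:

# human-readable view of the machine status and conditions
kubectl describe azuremachine default-template-md-0-w78jt

# or print just the conditions field
kubectl get azuremachine default-template-md-0-w78jt -o jsonpath='{.status.conditions}'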

Take a look at the cloud-init logs for further debugging.

One or more control plane replicas are missing

Take a look at the KubeadmControlPlane controller logs and look for any potential errors:

kubectl logs deploy/capi-kubeadm-control-plane-controller-manager -n capi-kubeadm-control-plane-system manager

In addition, make sure all pods on the workload cluster are healthy, including pods in the kube-system namespace.
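You can also check the replica status at a glance by querying the KubeadmControlPlane object on the management cluster:

kubectl get kubeadmcontrolplane -A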

Nodes are in NotReady state

Make sure you have installed a CNI on the workload cluster and that all pods on the workload cluster are in a Running state.
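If you use clusterctl, one way to fetch a kubeconfig for the workload cluster and check pod status is the following (the cluster name is a placeholder):

# retrieve the workload cluster kubeconfig from the management cluster
clusterctl get kubeconfig <workload-cluster-name> > workload.kubeconfig

# verify the CNI pods and everything else are Running
kubectl --kubeconfig workload.kubeconfig get pods -A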

Load Balancer service fails to come up

Check the cloud-controller-manager logs on the workload cluster.

If running the Azure cloud provider in-tree:

kubectl logs kube-controller-manager-<control-plane-node-name> -n kube-system

If running the Azure cloud provider out-of-tree:

kubectl logs cloud-controller-manager -n kube-system
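The Kubernetes events on the Service object usually surface the same error. For example (the service name is a placeholder):

kubectl describe service <service-name>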

Watching Kubernetes resources

To watch progression of all Cluster API resources on the management cluster you can run:

kubectl get cluster-api
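This returns a point-in-time listing; to refresh it continuously, you can for example wrap it in watch:

watch -n 5 kubectl get cluster-api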

Looking at controller logs

To check the CAPZ controller logs on the management cluster, run:

kubectl logs deploy/capz-controller-manager -n capz-system manager
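The standard kubectl flags apply here; for example, to follow the logs live and limit the initial output:

kubectl logs deploy/capz-controller-manager -n capz-system manager -f --tail=100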

Checking cloud-init logs (Ubuntu)

Cloud-init logs can provide more information on any issues that happened when running the bootstrap script.

Option 1: Using the Azure Portal

In the Azure portal, the boot diagnostics option is located in the virtual machine blade under the Support and Troubleshooting section (if boot diagnostics are enabled for the VM).

For more information, see the Azure boot diagnostics documentation.

Option 2: Using the Azure CLI

az vm boot-diagnostics get-boot-log --name MyVirtualMachine --resource-group MyResourceGroup

For more information, see the Azure CLI documentation for az vm boot-diagnostics.
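If boot diagnostics are not yet enabled for the VM, you can enable them first, reusing the same placeholder names:

az vm boot-diagnostics enable --name MyVirtualMachine --resource-group MyResourceGroup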

Option 3: With SSH

Using the SSH information provided during cluster creation (environment variable AZURE_SSH_PUBLIC_KEY_B64):

# connect to the first control plane node - capi is the default linux user created by deployment
API_SERVER=$(kubectl get azurecluster capz-cluster -o jsonpath='{.spec.controlPlaneEndpoint.host}')
ssh capi@${API_SERVER}

# list nodes
kubectl get azuremachines
NAME                               READY   STATE
capz-cluster-control-plane-2jprg   true    Succeeded
capz-cluster-control-plane-ck5wv   true    Succeeded
capz-cluster-control-plane-w4tv6   true    Succeeded
capz-cluster-md-0-s52wb            false   Failed
capz-cluster-md-0-w8xxw            true    Succeeded

# pick a node name from the output above:
node=$(kubectl get azuremachine capz-cluster-md-0-s52wb -o jsonpath='{.status.addresses[0].address}')
ssh -J capi@${API_SERVER} capi@${node}

# look at the cloud-init logs

less /var/log/cloud-init-output.log
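Beyond cloud-init-output.log, a couple of other places on the node are often worth a look; this is a general suggestion rather than CAPZ-specific guidance:

# the full cloud-init log, including per-module detail
less /var/log/cloud-init.log

# kubelet logs, in case bootstrap succeeded but the node still did not register
journalctl -u kubelet -e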

Automated log collection

As part of CI there is a log collection tool which you can also leverage to pull all the logs for machines; it dumps logs to ${PWD}/_artifacts by default. The following command works if your kubeconfig is configured for the management cluster. See the tool for more settings.

go run -tags e2e ./test/logger.go --name <workload-cluster-name> --namespace <workload-cluster-namespace>

There are also some provided scripts that can help automate a few common tasks.