Troubleshooting Guide

This guide covers common issues users might run into when using Cluster API Provider for Azure (CAPZ). The list is a work in progress; feel free to open a PR if you find that useful information is missing.

Examples of troubleshooting real-world issues

No Azure resources are getting created

This is likely due to missing or invalid Azure credentials.

Check the CAPZ controller logs on the management cluster:

kubectl logs deploy/capz-controller-manager -n capz-system manager

If you see an error similar to this:

azure.BearerAuthorizer#WithAuthorization: Failed to refresh the Token for request to https://management.azure.com/subscriptions/123/providers/Microsoft.Compute/skus?%24filter=location+eq+%27eastus2%27&api-version=2019-04-01: StatusCode=401 -- Original Error: adal: Refresh request failed. Status Code = '401'. Response body: {\"error\":\"invalid_client\",\"error_description\":\"AADSTS7000215: Invalid client secret is provided.

Make sure the provided Service Principal client ID and client secret are correct and that the client secret has not expired.
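The controller log can be long, so it may help to filter it down to lines that look like authentication failures. The following is a minimal sketch: the function name `find_auth_errors` is hypothetical, and the patterns are simply taken from the sample error above, so other failure modes may need different patterns.

```shell
# Print only log lines that look like Azure authentication failures.
# Reads log text on stdin. The patterns come from the sample error in this
# guide (an AADSTS error code, "invalid_client", or a 401 status code).
find_auth_errors() {
  grep -E 'AADSTS[0-9]+|invalid_client|StatusCode=401'
}
```

For example, pipe the controller logs into it: `kubectl logs deploy/capz-controller-manager -n capz-system manager | find_auth_errors`. No output means none of these patterns matched.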

The AzureCluster infrastructure is provisioned but no virtual machines are coming up

Your Azure subscription might have no quota for the requested VM size in the specified Azure location.

Check the CAPZ controller logs on the management cluster:

kubectl logs deploy/capz-controller-manager -n capz-system manager

If you see an error similar to this:

"error"="failed to reconcile AzureMachine: failed to create virtual machine: failed to create VM capz-md-0-qkg6m in resource group capz-fkl3tp: compute.VirtualMachinesClient#CreateOrUpdate: Failure sending request: StatusCode=0 -- Original Error: autorest/azure: Service returned an error. Status=\u003cnil\u003e Code=\"OperationNotAllowed\" Message=\"Operation could not be completed as it results in exceeding approved standardDSv3Family Cores quota.

Request a quota increase for the affected VM family in that location. Alternatively, you can specify another Azure location and/or VM size during cluster creation.
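To see which VM family ran out of quota, the family name can be pulled out of the error message. This is a sketch that assumes the message keeps the wording shown in the sample error ("exceeding approved ... Cores quota"); the function name `quota_family_from_log` is hypothetical.

```shell
# Extract the VM family name(s) from quota errors in the controller log.
# Reads log text on stdin and prints each distinct family once.
# Assumes the error wording shown in the sample above.
quota_family_from_log() {
  sed -n 's/.*exceeding approved \([A-Za-z0-9]*\) Cores quota.*/\1/p' | sort -u
}
```

For the sample error this prints `standardDSv3Family`. You can then compare current usage against the limit for that family with `az vm list-usage --location <location> -o table`.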

A virtual machine is running but the k8s node did not join the cluster

Check the AzureMachine (or AzureMachinePool if using a MachinePool) status:

kubectl get azuremachines -o wide

If you see an output like this:

NAME                                       READY   STATE
default-template-md-0-w78jt                false   Updating

This indicates that the bootstrap script has not yet succeeded. Check the AzureMachine status.conditions field for more information.
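The check above can be sketched as two small helpers: one that picks the machines that are not ready out of the table, and one that prints the conditions of a single machine. Both function names are hypothetical. The first assumes READY is the second column, as in the sample output (it would not be if you list across namespaces with `-A`); the second assumes each condition carries `type`, `status`, `reason`, and `message` fields.

```shell
# Names of machines whose READY column is not "true".
# Reads the table printed by `kubectl get azuremachines` on stdin and
# assumes READY is the second column, as in the sample output.
not_ready_machines() {
  awk 'NR > 1 && $2 != "true" { print $1 }'
}

# Print one line per condition of the given AzureMachine:
# type, status, reason, and message, separated by tabs.
machine_conditions() {
  kubectl get azuremachine "$1" \
    -o jsonpath='{range .status.conditions[*]}{.type}{"\t"}{.status}{"\t"}{.reason}{"\t"}{.message}{"\n"}{end}'
}
```

A possible way to combine them: `kubectl get azuremachines | not_ready_machines | while read -r m; do machine_conditions "$m"; done`.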

Take a look at the cloud-init logs for further debugging.

One or more control plane replicas are missing

Take a look at the KubeadmControlPlane controller logs and look for any potential errors:

kubectl logs deploy/capi-kubeadm-control-plane-controller-manager -n capi-kubeadm-control-plane-system manager

In addition, make sure all pods on the workload cluster are healthy, including pods in the kube-system namespace.

Nodes are in NotReady state

Make sure you have installed a CNI on the workload cluster and that all the pods on the workload cluster are in running state.
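One way to spot unhealthy pods quickly is to filter the pod listing for anything that is not running. This is a sketch with a hypothetical function name; it assumes the default column layout of `kubectl get pods -A` (NAMESPACE, NAME, READY, STATUS, ...) and treats `Completed` pods as healthy.

```shell
# List pods whose STATUS is neither Running nor Completed.
# Reads the table printed by `kubectl get pods -A` on stdin and prints
# "<namespace>/<name> <status>" for each such pod.
pods_not_running() {
  awk 'NR > 1 && $4 != "Running" && $4 != "Completed" { print $1 "/" $2, $4 }'
}
```

Run it against the workload cluster, for example `kubectl get pods -A --kubeconfig <workload-cluster-kubeconfig> | pods_not_running`. An empty result means every pod is Running or Completed.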

Load Balancer service fails to come up

Check the cloud-controller-manager logs on the workload cluster.

If running the Azure cloud provider in-tree:

kubectl logs kube-controller-manager-<control-plane-node-name> -n kube-system 

If running the Azure cloud provider out-of-tree:

kubectl logs cloud-controller-manager -n kube-system 

Watching Kubernetes resources

To watch progression of all Cluster API resources on the management cluster you can run:

kubectl get cluster-api

Looking at controller logs

To check the CAPZ controller logs on the management cluster, run:

kubectl logs deploy/capz-controller-manager -n capz-system manager

Checking cloud-init logs (Ubuntu)

Cloud-init logs can provide more information on any issues that happened when running the bootstrap script.

Option 1: Using the Azure Portal

Located in the virtual machine blade, the boot diagnostics option is under the Support and Troubleshooting section in the Azure portal.

For more information, see the Azure documentation on boot diagnostics.

Option 2: Using the Azure CLI

az vm boot-diagnostics get-boot-log --name MyVirtualMachine --resource-group MyResourceGroup

For more information, see the az vm boot-diagnostics reference in the Azure CLI documentation.

Option 3: With SSH

Using the SSH key provided during cluster creation (environment variable AZURE_SSH_PUBLIC_KEY_B64):

# connect to the first control plane node; capi is the default Linux user created by the deployment
API_SERVER=$(kubectl get azurecluster capz-cluster -o jsonpath='{.spec.controlPlaneEndpoint.host}')
ssh capi@${API_SERVER}
# list nodes
kubectl get azuremachines
NAME                               READY   STATE
capz-cluster-control-plane-2jprg   true    Succeeded
capz-cluster-control-plane-ck5wv   true    Succeeded
capz-cluster-control-plane-w4tv6   true    Succeeded
capz-cluster-md-0-s52wb            false   Failed
capz-cluster-md-0-w8xxw            true    Succeeded
# pick a node name from the output above:
node=$(kubectl get azuremachine capz-cluster-md-0-s52wb -o jsonpath='{.status.addresses[0].address}')
# jump through the control plane node to reach the selected node
ssh -J capi@${API_SERVER} capi@${node}
# look at the cloud-init logs

less /var/log/cloud-init-output.log
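The cloud-init output can be long, so a quick filter for suspicious lines may be a useful first pass before reading the whole file. This is a sketch: the function name `cloud_init_errors` and the keyword list are assumptions, not a complete list of failure signatures.

```shell
# Show lines in the cloud-init output log that mention an error, a failure,
# or a Python traceback, prefixed with their line numbers.
# Takes an optional log path; defaults to the path used in this guide.
cloud_init_errors() {
  grep -n -i -E 'error|fail|traceback' "${1:-/var/log/cloud-init-output.log}"
}
```

Run `cloud_init_errors` on the node, then open the log with `less` and jump to the reported line numbers for context.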

Automated log collection

CI uses a log collection script that you can also run to pull the logs for all machines. By default it writes them to ${PWD}/_artifacts:

./hack/log/log-dump.sh

There are also some provided scripts that can help automate a few common tasks.