Developing Cluster API Provider Azure


Setting up

Base requirements

  1. Install Go
    • Get the latest patch version for Go v1.16.
  2. Install jq
    • brew install jq on macOS.
    • sudo apt install jq on Windows + WSL2.
    • sudo apt install jq on Ubuntu Linux.
  3. Install gettext package
    • brew install gettext && brew link --force gettext on macOS.
    • sudo apt install gettext on Windows + WSL2.
    • sudo apt install gettext on Ubuntu Linux.
  4. Install KIND
    • GO111MODULE="on" go get
  5. Install Kustomize
  6. Install Python 3.x or 2.7.x, if neither is already installed.
  7. Install make.
    • brew install make on macOS.
    • sudo apt install make on Windows + WSL2.
    • sudo apt install make on Linux.
  8. Install timeout
    • brew install coreutils on macOS.
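
Before going further, it can help to confirm that the tools above are on your PATH. The sketch below is one way to do that. The missing_tools helper is hypothetical (it is not part of the repository), and the binary names in the example call are assumptions: envsubst is the binary installed by gettext, and timeout comes from coreutils.

```shell
# Print every tool from the argument list that is not on PATH, one per line.
# Prints nothing when all tools are found.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Example call (binary names are assumptions based on the list above):
#   missing_tools go jq envsubst kind kustomize make timeout
```

An empty result means every listed tool was found; any name printed still needs to be installed.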

When developing on Windows, we suggest setting up the project on Windows + WSL2 and checking out the source on the WSL file system for better results.

Get the source

go get -d
cd "$(go env GOPATH)/src/"

Get familiar with basic concepts

This provider is modeled after the upstream Cluster API project. To get familiar with Cluster API resources, concepts and conventions (such as CAPI and CAPZ), refer to the Cluster API Book.

Dev manifest files

Part of running cluster-api-provider-azure is generating manifests to run. Generating dev manifests allows you to test dev images instead of the default releases.

Dev images

Container registry

Any public container registry can be leveraged for storing cluster-api-provider-azure container images.


Change some code!

Modules and dependencies

This repository uses Go Modules to track and vendor dependencies.

To pin a new dependency:

  • Run go get <repository>@<version>.
  • (Optional) Add a replace statement in go.mod.

Makefile targets and scripts are offered to work with go modules:

  • make verify-modules checks whether go module files are out of date.
  • make modules runs go mod tidy to ensure proper vendoring.
  • hack/ checks that the Go version and environment variables are properly set.

Setting up the environment

Your environment must have the Azure credentials as outlined in the getting started prerequisites section.
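
A quick pre-flight check can catch a missing credential before any cluster is created. This sketch assumes the required variables are the four named later in this guide (AZURE_SUBSCRIPTION_ID, AZURE_TENANT_ID, AZURE_CLIENT_ID, and AZURE_CLIENT_SECRET). missing_azure_vars is a hypothetical helper, not a script in the repository.

```shell
# Print the name of each Azure credential variable that is unset or empty.
# Prints nothing when all four are set.
missing_azure_vars() {
  for name in AZURE_SUBSCRIPTION_ID AZURE_TENANT_ID AZURE_CLIENT_ID AZURE_CLIENT_SECRET; do
    # Indirect lookup; eval is safe here because the names come from the fixed list above.
    eval "value=\${$name:-}"
    [ -n "$value" ] || printf '%s\n' "$name"
  done
}

# Example: missing_azure_vars
```

If the function prints any names, set those variables before continuing.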

Tilt Requirements

Install Tilt:

  • brew install tilt-dev/tap/tilt on macOS or Linux
  • scoop bucket add tilt-dev & scoop install tilt on Windows
  • For alternatives, follow the installation instructions for macOS, Linux, or Windows.

After the installation is done, verify that you have installed it correctly with: tilt version

Install Helm:

  • brew install helm on macOS
  • choco install kubernetes-helm on Windows
  • Follow the install instructions on Linux

Helm is required to set up Tilt successfully.

Using Tilt

Both of the Tilt setups below will get you started developing CAPZ in a local kind cluster. The main difference is the number of components you will build from source and the scope of the changes you’d like to make. If you only want to make changes in CAPZ, then follow CAPZ instructions. This will save you from having to build all of the images for CAPI, which can take a while. If the scope of your development will span both CAPZ and CAPI, then follow the CAPI and CAPZ instructions.

Tilt for dev in CAPZ

If you want to develop in CAPZ and get a local development cluster working quickly, this is the path for you.

From the root of the CAPZ repository and after configuring the environment variables, you can run the following to generate your tilt-settings.json file:

cat <<EOF > tilt-settings.json
{
  "kustomize_substitutions": {
      "AZURE_SUBSCRIPTION_ID_B64": "$(echo "${AZURE_SUBSCRIPTION_ID}" | tr -d '\n' | base64 | tr -d '\n')",
      "AZURE_TENANT_ID_B64": "$(echo "${AZURE_TENANT_ID}" | tr -d '\n' | base64 | tr -d '\n')",
      "AZURE_CLIENT_SECRET_B64": "$(echo "${AZURE_CLIENT_SECRET}" | tr -d '\n' | base64 | tr -d '\n')",
      "AZURE_CLIENT_ID_B64": "$(echo "${AZURE_CLIENT_ID}" | tr -d '\n' | base64 | tr -d '\n')"
  }
}
EOF

To build a kind cluster and start Tilt, just run:

make tilt-up

By default, the Cluster API components deployed by Tilt have experimental features turned off. If you would like to enable these features, add extra_args as specified in The Cluster API Book.

Once your kind management cluster is up and running, you can deploy a workload cluster.

To tear down the kind cluster built by the command above, just run:

make kind-reset

Tilt for dev in both CAPZ and CAPI

If you want to develop in both CAPI and CAPZ at the same time, then this is the path for you.

To use Tilt for a simplified development workflow, follow the instructions in the cluster-api repo. The instructions will walk you through cloning the Cluster API (CAPI) repository and configuring Tilt to use kind to deploy the cluster api management components.

NOTE: You may wish to check out the version of CAPI that matches the version used in CAPZ.

Note that tilt up will be run from the cluster-api repository directory and the tilt-settings.json file will point back to the cluster-api-provider-azure repository directory. Any changes you make to the source code in the cluster-api or cluster-api-provider-azure repositories will automatically be redeployed to the kind cluster.

After you have cloned both repositories, your folder structure should look like:

|-- src/cluster-api-provider-azure
|-- src/cluster-api (run `tilt up` here)
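
If tilt up cannot find the provider, the folder layout is a good first thing to check. The sketch below tests that both repositories sit side by side under a common parent directory, as in the tree above. check_layout is a hypothetical helper, not part of either repository.

```shell
# Succeed only if both repositories are cloned side by side under the given directory.
# $1: the parent directory (src in the tree above).
check_layout() {
  [ -d "$1/cluster-api" ] && [ -d "$1/cluster-api-provider-azure" ]
}

# Example: check_layout "$HOME/src" && echo "layout looks right"
```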

After configuring the environment variables, run the following to generate your tilt-settings.json file:

cat <<EOF > tilt-settings.json
{
  "default_registry": "${REGISTRY}",
  "provider_repos": ["../cluster-api-provider-azure"],
  "enable_providers": ["azure", "docker", "kubeadm-bootstrap", "kubeadm-control-plane"],
  "kustomize_substitutions": {
      "AZURE_SUBSCRIPTION_ID_B64": "$(echo "${AZURE_SUBSCRIPTION_ID}" | tr -d '\n' | base64 | tr -d '\n')",
      "AZURE_TENANT_ID_B64": "$(echo "${AZURE_TENANT_ID}" | tr -d '\n' | base64 | tr -d '\n')",
      "AZURE_CLIENT_SECRET_B64": "$(echo "${AZURE_CLIENT_SECRET}" | tr -d '\n' | base64 | tr -d '\n')",
      "AZURE_CLIENT_ID_B64": "$(echo "${AZURE_CLIENT_ID}" | tr -d '\n' | base64 | tr -d '\n')"
  }
}
EOF

$REGISTRY should be in the format <dockerhub-username>.
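
Both tilt-settings.json snippets encode each credential the same way: the first tr -d '\n' drops the newline that echo appends, and the second removes any line wrapping that base64 may add to long output. A minimal sketch of the same idea as a helper follows; b64 is a hypothetical name, not part of the repository.

```shell
# Base64-encode the first argument as a single line.
# printf '%s' writes the value without a trailing newline, so nothing extra is encoded.
b64() {
  printf '%s' "$1" | base64 | tr -d '\n'
}

# Example: b64 "${AZURE_TENANT_ID}"
```

For comparison, echo hello | base64 yields aGVsbG8K because the trailing newline is encoded too, while b64 hello yields aGVsbG8=.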

The cluster-api management components that are deployed are configured at the /config folder of each repository respectively. Making changes to those files will trigger a redeploy of the management cluster components.

Deploying a workload cluster

⚠️ Note that when developing with tilt as described above, some clusterctl commands won’t work. Specifically, clusterctl config and clusterctl generate may fail. These commands expect specific releases of CAPI and CAPZ to be installed, but the tilt environment dynamically updates and installs these components from your local code. clusterctl get kubeconfig will still work, however.

After your kind management cluster is up and running with Tilt, you can deploy a workload cluster by opening the tilt web UI and clicking the clockwise arrow icon ⟳ on a resource listed, such as “aks-aad,” “ipv6,” or “windows.”

Deploying a workload cluster from the Tilt UI is also called a flavor cluster deployment. Note that each time a flavor is deployed, it deploys a new workload cluster in addition to the existing ones. All the workload clusters must be manually deleted by the user. Please refer to Running flavor clusters as a tilt resource to learn more about this.

Or you can configure workload cluster settings and deploy a workload cluster with the following command:

make create-workload-cluster

To delete the cluster:

make delete-workload-cluster

Check out the troubleshooting guide for common errors you might run into.

Viewing Telemetry

The CAPZ controller emits tracing and metrics data. When run in Tilt, the KinD management cluster is provisioned with development deployments of OpenTelemetry for collecting distributed traces, Jaeger for viewing traces, and Prometheus for scraping and visualizing metrics.

The OpenTelemetry, Jaeger, and Prometheus deployments are for development purposes only. These illustrate the hooks for tracing and metrics, but lack the robustness of production cluster deployments. For example, the Jaeger “all-in-one” component only keeps traces in memory, not in a persistent store.

To view traces in the Jaeger interface, wait until the Tilt cluster is fully initialized. Then open the Tilt web interface, select the “traces: jaeger-all-in-one” resource, and click “View traces” near the top of the screen. Or visit http://localhost:16686/ in your browser.

To view traces in App Insights, follow the tracing documentation before running make tilt-up. Then open the Azure Portal in your browser. Find the App Insights resource you specified in AZURE_INSTRUMENTATION_KEY, choose “Transaction search” on the left, and click “Refresh” to see recent trace data.

To view metrics in the Prometheus interface, open the Tilt web interface, select the “metrics: prometheus-operator” resource, and click “View metrics” near the top of the screen. Or visit http://localhost:9090/ in your browser.

Manual Testing

Creating a dev cluster

The steps below are provided in a convenient script in hack/. Be sure to set AZURE_CLIENT_ID, AZURE_CLIENT_SECRET, AZURE_SUBSCRIPTION_ID, and AZURE_TENANT_ID before running. Optionally, you can override the different cluster configuration variables. For example, to override the workload cluster name:

CLUSTER_NAME=<my-capz-cluster-name> ./hack/

NOTE: CLUSTER_NAME can only include letters, numbers, and hyphens and can’t be longer than 44 characters.
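
The naming rule in the note can be checked before running the script. The sketch below encodes only the two constraints stated above (allowed characters and the 44-character limit) and also rejects an empty name, which is an assumption. valid_cluster_name is a hypothetical helper, and the real tooling may enforce further rules.

```shell
# Return 0 if the name uses only letters, digits, and hyphens
# and is between 1 and 44 characters long.
valid_cluster_name() {
  case "$1" in
    '' | *[!A-Za-z0-9-]*) return 1 ;;  # empty, or contains a disallowed character
  esac
  [ "${#1}" -le 44 ]
}

# Example: valid_cluster_name "$CLUSTER_NAME" || echo "invalid CLUSTER_NAME" >&2
```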

Building and pushing dev images
  1. To build images with custom tags, run make docker-build as follows:

    export REGISTRY="<container-registry>"
    export MANAGER_IMAGE_TAG="<image-tag>" # optional - defaults to `dev`.
    PULL_POLICY=IfNotPresent make docker-build
  2. (optional) Push your docker images:

    2.1. Login to your container registry using docker login.

    e.g., docker login

    2.2. Push to your custom image registry:


    NOTE: make create-cluster will fetch the manager image locally and load it onto the kind cluster if it is present.

Customizing the cluster deployment

Here is a list of required configuration parameters (the full list is available in templates/cluster-template.yaml):

# Cluster settings.
export CLUSTER_NAME="capz-cluster"

# Azure settings.
export AZURE_LOCATION="southcentralus"
export AZURE_SUBSCRIPTION_ID_B64="$(echo -n "$AZURE_SUBSCRIPTION_ID" | base64 | tr -d '\n')"
export AZURE_TENANT_ID_B64="$(echo -n "$AZURE_TENANT_ID" | base64 | tr -d '\n')"
export AZURE_CLIENT_ID_B64="$(echo -n "$AZURE_CLIENT_ID" | base64 | tr -d '\n')"
export AZURE_CLIENT_SECRET_B64="$(echo -n "$AZURE_CLIENT_SECRET" | base64 | tr -d '\n')"

# Machine settings.
export AZURE_NODE_MACHINE_TYPE="Standard_D2s_v3"
export KUBERNETES_VERSION="v1.22.1"

# Identity secret.
export AZURE_CLUSTER_IDENTITY_SECRET_NAME="cluster-identity-secret" 
export CLUSTER_IDENTITY_NAME="cluster-identity" 

# Generate SSH key.
# If you want to provide your own key, skip this step and set AZURE_SSH_PUBLIC_KEY_B64 to the base64-encoded contents of your existing public key file.
SSH_KEY_FILE="${SSH_KEY_FILE:-.sshkey}" # key path; the .sshkey default is an assumption, any path works
rm -f "${SSH_KEY_FILE}" 2>/dev/null
ssh-keygen -t rsa -b 2048 -f "${SSH_KEY_FILE}" -N '' 1>/dev/null
echo "Machine SSH key generated in ${SSH_KEY_FILE}"
# On Linux, the SSH key must be base64 encoded because it is set through the Azure API.
# Windows doesn't support setting SSH keys that way, so cloudbase-init sets the key instead, which doesn't require base64.
export AZURE_SSH_PUBLIC_KEY_B64=$(cat "${SSH_KEY_FILE}.pub" | base64 | tr -d '\r\n')
export AZURE_SSH_PUBLIC_KEY=$(cat "${SSH_KEY_FILE}.pub" | tr -d '\r\n')
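
If you supply your own key instead of generating one, it may be worth confirming that the encoded value decodes back to the plain one. The sketch below does that round trip. b64_matches is a hypothetical helper, and it assumes a base64 binary that accepts --decode (GNU coreutils and macOS both do).

```shell
# Succeed if the base64 string in $2 decodes to exactly the text in $1.
# Trailing newlines in the decoded output are ignored by the command substitution.
b64_matches() {
  [ "$(printf '%s' "$2" | base64 --decode)" = "$1" ]
}

# Example:
#   b64_matches "$AZURE_SSH_PUBLIC_KEY" "$AZURE_SSH_PUBLIC_KEY_B64" || echo "key mismatch" >&2
```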

⚠️ Please note that the generated templates include default values, so you must either use clusterctl to create the cluster or use envsubst to replace these values.

Creating the cluster

⚠️ Make sure you followed the previous two steps to build the dev image and set the required environment variables before proceeding.

Ensure dev environment has been reset:

make clean kind-reset

Create the cluster:

make create-cluster

Check out the troubleshooting guide for common errors you might run into.

Instrumenting Telemetry

Telemetry is the key to operational transparency. We strive to provide insight into the internal behavior of the system through observable traces and metrics.

Distributed Tracing

Distributed tracing provides a hierarchical view of how and why an event occurred. CAPZ is instrumented to trace each controller reconcile loop. When the reconcile loop begins, a trace span begins and is stored in the loop’s context.Context. As the context is passed on to the functions below, new spans are created, each tied to its parent by the parent span ID. The spans form a hierarchical representation of the activities in the controller.

These spans can also be propagated across service boundaries. The span context can be passed on through metadata such as HTTP headers. Propagating span context creates a distributed, causal relationship between services and functions.

For tracing, we use OpenTelemetry.

Here is an example of starting a span at the beginning of a controller reconcile:

ctx, logger, done := tele.StartSpanWithLogger(ctx, "controllers.AzureMachineReconciler.Reconcile",
   tele.KVP("namespace", req.Namespace),
   tele.KVP("name", req.Name),
   tele.KVP("kind", "AzureMachine"),
)
defer done()

The code above creates a context with a new span stored in the context.Context value bag. If a span already exists in the ctx argument, the new span takes on the parent ID of the existing span; otherwise, the new span becomes a “root span”, one that does not have a parent. The span is also created with labels, or tags, which provide metadata about the span and can be used for queries in many distributed tracing systems.

It also creates a logger that logs messages both to the span and to STDOUT. The span is not returned directly; instead, the final done value closes it. This is a simple zero-argument function (func()) that should be called as appropriate. Most likely, this should be done in a defer, as shown in the code sample above, to ensure that the span is closed at the end of your function or scope.

Consider adding tracing if your func accepts a context.


Metrics

Metrics provide quantitative data about the operations of the controller. This includes cumulative data like counters, single numerical values like gauges, and distributions of counts or samples like histograms and summaries.

In CAPZ we expose metrics using the Prometheus client. The Kubebuilder project provides a guide for metrics and for exposing new ones.

Submitting PRs and testing

Pull requests and issues are highly encouraged! If you’re interested in submitting PRs to the project, please be sure to run some initial checks prior to submission:

make lint # Runs a suite of quick scripts to check code structure
make test # Runs tests on the Go code

Executing unit tests

make test executes the project’s unit tests. These tests do not stand up a Kubernetes cluster, nor do they have external dependencies.

Automated Testing


Mocks

Mocks for the services tests are generated using GoMock.

To generate the mocks you can run

make generate-go

E2E Testing



You can optionally set the following variables:

  • E2E_CONF_FILE: The path of the E2E configuration file. Default: ${GOPATH}/src/
  • SKIP_CLEANUP: Set to true if you do not want the bootstrap and workload clusters to be cleaned up after running E2E tests. Default: false
  • SKIP_CREATE_MGMT_CLUSTER: Skip management cluster creation. If you skip management cluster creation, you must specify KUBECONFIG and SKIP_CLEANUP. Default: false
  • LOCAL_ONLY: Use the Kind local registry and run the subset of tests which don’t require a remotely pushed controller image. Default: true
  • REGISTRY: Registry to push the controller image to.
  • CLUSTER_NAME: Name of an existing workload cluster. Must be set to run specs against an existing workload cluster. Use in conjunction with SKIP_CREATE_MGMT_CLUSTER, GINKGO_FOCUS, CLUSTER_NAMESPACE, and KUBECONFIG. You must specify only one e2e spec to run against with GINKGO_FOCUS, such as export GINKGO_FOCUS=Creating.a.VMSS.cluster.with.a.single.control.plane.node.
  • CLUSTER_NAMESPACE: Namespace of an existing workload cluster. Must be set to run specs against an existing workload cluster. Use in conjunction with SKIP_CREATE_MGMT_CLUSTER, GINKGO_FOCUS, CLUSTER_NAME, and KUBECONFIG. You must specify only one e2e spec to run against with GINKGO_FOCUS, such as export GINKGO_FOCUS=Creating.a.VMSS.cluster.with.a.single.control.plane.node.
  • KUBECONFIG: Used with SKIP_CREATE_MGMT_CLUSTER set to true. Location of the kubeconfig for the management cluster you would like to use. Use kind get kubeconfig --name capz-e2e > kubeconfig.capz-e2e to get the capz e2e kind cluster config. Default: ~/.kube/config
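
The example GINKGO_FOCUS value appears to be a spec description with each space replaced by a dot. Ginkgo treats the focus value as a regular expression, where a dot matches any character, so the value needs no shell quoting. Assuming that reading is right, a focus value can be derived from a spec description as sketched below; focus_pattern is a hypothetical helper, not part of the repository.

```shell
# Turn a spec description into a focus pattern by replacing each space with ".".
focus_pattern() {
  printf '%s' "$1" | tr ' ' '.'
}

# Example:
#   export GINKGO_FOCUS="$(focus_pattern 'Creating a VMSS cluster with a single control plane node')"
```

Because a dot matches any character, the pattern can also match other specs whose descriptions differ only at those positions, so check that exactly one spec is selected.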

You can also customize the configuration of the CAPZ cluster created by the E2E tests (except for CLUSTER_NAME, AZURE_RESOURCE_GROUP, AZURE_VNET_NAME, CONTROL_PLANE_MACHINE_COUNT, and WORKER_MACHINE_COUNT, since they are generated by individual test cases). See Customizing the cluster deployment for more details.

Conformance Testing

To run the Kubernetes Conformance test suite locally, you can run


Optional settings are:

  • WINDOWS: Run conformance against Windows nodes. Default: false
  • CONFORMANCE_NODES: Number of parallel ginkgo nodes to run. Default: 1

With the following environment variables defined, you can build a CAPZ cluster from the HEAD of Kubernetes main branch or release branch, and run the Conformance test suite against it. This is not enabled for Windows currently.

  • KUBERNETES_VERSION:
    • latest - extract Kubernetes version from (main’s HEAD)
    • latest-1.21 - extract Kubernetes version from (release branch’s HEAD)

With the following environment variables defined, CAPZ runs ./scripts/ as part of ./scripts/, which allows developers to build Kubernetes from source and run the Kubernetes Conformance test suite against a CAPZ cluster based on the custom build:

  • AZURE_STORAGE_ACCOUNT: Your Azure storage account name
  • AZURE_STORAGE_KEY: Your Azure storage key
  • JOB_NAME: test (an environment variable used by CI, can be any non-empty string)
  • REGISTRY: Your Registry

Running custom test suites on CAPZ clusters

To run a custom test suite on a CAPZ cluster locally, set AZURE_CLIENT_ID, AZURE_CLIENT_SECRET, AZURE_SUBSCRIPTION_ID, AZURE_TENANT_ID and run:

./scripts/ bash -c "cd ${GOPATH}/src/ && make e2e"

You can optionally set the following variables:

  • SKIP_CLEANUP: Skip deleting the cluster after the tests finish running.
  • KUBECONFIG: Provide your existing cluster kubeconfig filepath. If no kubeconfig is provided, ./kubeconfig will be used.
  • USE_CI_ARTIFACTS: Use a CI version of Kubernetes, i.e. not a released version (e.g. v1.19.0-alpha.1.426+0926c9c47677e9).
  • CI_VERSION: Provide a custom CI version of Kubernetes. By default, the latest master commit will be used.
  • TEST_CCM: Build a cluster that uses custom versions of the Azure cloud-provider cloud-controller-manager and node-controller-manager images.
  • EXP_MACHINE_POOL: Use Machine Pool for worker machines.
  • TEST_WINDOWS: Build a cluster that has Windows worker nodes.
  • REGISTRY: Registry to push any custom k8s images or cloud provider images built.
  • CLUSTER_TEMPLATE: Use a custom cluster template. By default, the script will choose the appropriate cluster template based on existing environment variables.

You can also customize the configuration of the CAPZ cluster (assuming that SKIP_CREATE_WORKLOAD_CLUSTER is not set). See Customizing the cluster deployment for more details.