Developing Cluster API Provider Azure


Setting up

Base requirements

  1. Install go
    • Get the latest patch version for go v1.16.
  2. Install jq
    • brew install jq on macOS.
    • chocolatey install jq on Windows.
    • sudo apt install jq on Ubuntu Linux.
  3. Install gettext package
    • brew install gettext && brew link --force gettext on macOS.
    • install instructions on Windows.
    • sudo apt install gettext on Ubuntu Linux.
  4. Install KIND
    • GO111MODULE="on" go get
  5. Install Kustomize
    • brew install kustomize on macOS.
    • choco install kustomize on Windows.
    • install instructions on Linux
  6. Install Python 3.x or 2.7.x, if neither is already installed.
  7. Install make.
  8. Install timeout
    • brew install coreutils on macOS.

Get the source

go get -d
cd "$(go env GOPATH)/src/"

Get familiar with basic concepts

This provider is modeled after the upstream Cluster API project. To get familiar with Cluster API resources, concepts and conventions (such as CAPI and CAPZ), refer to the Cluster API Book.

Dev manifest files

Part of running cluster-api-provider-azure is generating manifests to run. Generating dev manifests allows you to test dev images instead of the default releases.

Dev images

Container registry

Any public container registry can be leveraged for storing cluster-api-provider-azure container images.


Change some code!

Modules and dependencies

This repositories uses Go Modules to track and vendor dependencies.

To pin a new dependency:

  • Run go get <repository>@<version>.
  • (Optional) Add a replace statement in go.mod.

Makefile targets and scripts are offered to work with go modules:

  • make verify-modules checks whether go module files are out of date.
  • make modules runs go mod tidy to ensure proper vendoring.
  • hack/ checks that the Go version and environment variables are properly set.

Setting up the environment

Your environment must have the Azure credentials as outlined in the getting started prerequisites section.

Using Tilt

Both of the Tilt setups below will get you started developing CAPZ in a local kind cluster. The main difference is the number of components you will build from source and the scope of the changes you’d like to make. If you only want to make changes in CAPZ, then follow CAPZ instructions. This will save you from having to build all of the images for CAPI, which can take a while. If the scope of your development will span both CAPZ and CAPI, then follow the CAPI and CAPZ instructions.

Tilt for dev in CAPZ

If you want to develop in CAPZ and get a local development cluster working quickly, this is the path for you.

From the root of the CAPZ repository and after configuring the environment variables, you can run the following to generate your tilt-settings.json file:

cat <<EOF > tilt-settings.json
  "kustomize_substitutions": {
      "AZURE_SUBSCRIPTION_ID_B64": "$(echo "${AZURE_SUBSCRIPTION_ID}" | tr -d '\n' | base64 | tr -d '\n')",
      "AZURE_TENANT_ID_B64": "$(echo "${AZURE_TENANT_ID}" | tr -d '\n' | base64 | tr -d '\n')",
      "AZURE_CLIENT_SECRET_B64": "$(echo "${AZURE_CLIENT_SECRET}" | tr -d '\n' | base64 | tr -d '\n')",
      "AZURE_CLIENT_ID_B64": "$(echo "${AZURE_CLIENT_ID}" | tr -d '\n' | base64 | tr -d '\n')",

To build a kind cluster and start Tilt, just run:

make tilt-up

By default, the Cluster API components deployed by Tilt have experimental features turned off. If you would like to enable these features, add extra_args as specified in The Cluster API Book.

Once your kind management cluster is up and running, you can deploy a workload cluster.

You can also deploy a flavor cluster as a local tilt resource.

To tear down the kind cluster built by the command above, just run:

make kind-reset

Tilt for dev in both CAPZ and CAPI

If you want to develop in both CAPI and CAPZ at the same time, then this is the path for you.

To use Tilt for a simplified development workflow, follow the instructions in the cluster-api repo. The instructions will walk you through cloning the Cluster API (CAPI) repository and configuring Tilt to use kind to deploy the cluster api management components.

you may wish to checkout out the correct version of CAPI to match the version used in CAPZ

Note that tilt up will be run from the cluster-api repository directory and the tilt-settings.json file will point back to the cluster-api-provider-azure repository directory. Any changes you make to the source code in cluster-api or cluster-api-provider-azure repositories will automatically redeployed to the kind cluster.

After you have cloned both repositories, your folder structure should look like:

|-- src/cluster-api-provider-azure
|-- src/cluster-api (run `tilt up` here)

After configuring the environment variables, run the following to generate your tilt-settings.json file:

cat <<EOF > tilt-settings.json
  "default_registry": "${REGISTRY}",
  "provider_repos": ["../cluster-api-provider-azure"],
  "enable_providers": ["azure", "docker", "kubeadm-bootstrap", "kubeadm-control-plane"],
  "kustomize_substitutions": {
      "AZURE_SUBSCRIPTION_ID_B64": "$(echo "${AZURE_SUBSCRIPTION_ID}" | tr -d '\n' | base64 | tr -d '\n')",
      "AZURE_TENANT_ID_B64": "$(echo "${AZURE_TENANT_ID}" | tr -d '\n' | base64 | tr -d '\n')",
      "AZURE_CLIENT_SECRET_B64": "$(echo "${AZURE_CLIENT_SECRET}" | tr -d '\n' | base64 | tr -d '\n')",
      "AZURE_CLIENT_ID_B64": "$(echo "${AZURE_CLIENT_ID}" | tr -d '\n' | base64 | tr -d '\n')",

$REGISTRY should be in the format<dockerhub-username>

The cluster-api management components that are deployed are configured at the /config folder of each repository respectively. Making changes to those files will trigger a redeploy of the management cluster components.

Deploying a workload cluster

After your kind management cluster is up and running with Tilt, you can configure workload cluster settings and deploy a workload cluster with the following:

make create-workload-cluster

To delete the cluster:

make delete-workload-cluster

Check out the troubleshooting guide for common errors you might run into.

Viewing Telemetry

The CAPZ controller emits tracing and metrics data. When run in Tilt, the KinD cluster is provisioned with a development deployment of Jaeger, for distributed tracing, and Prometheus for metrics scraping and visualization.

The Jaeger and Prometheus deployments are for development purposes only. These illustrate the hooks for tracing and metrics, but lack the robustness of production cluster deployments. For example, Jaeger in “all-in-one” mode with only in-memory persistence of traces.

After the Tilt cluster has been initialized, to view distributed traces in Jaeger open a browser to http://localhost:8080.

To view metrics, run kubectl port-forward -n capz-system prometheus-prometheus-0 9090 and open http://localhost:9090 to see the Prometheus UI.

Manual Testing

Creating a dev cluster

The steps below are provided in a convenient script in hack/ Be sure to set AZURE_CLIENT_ID, AZURE_CLIENT_SECRET, AZURE_SUBSCRIPTION_ID, and AZURE_TENANT_ID before running. Optionally, you can override the different cluster configuration variables. For example, to override the workload cluster name:

CLUSTER_NAME=<my-capz-cluster-name> ./hack/

NOTE: CLUSTER_NAME can only include letters, numbers, and hyphens and can’t be longer than 44 characters.

Building and pushing dev images
  1. To build images with custom tags, run the make docker-build as follows:

    export REGISTRY="<container-registry>"
    export MANAGER_IMAGE_TAG="<image-tag>" # optional - defaults to `dev`.
    PULL_POLICY=IfNotPresent make docker-build
  2. (optional) Push your docker images:

    2.1. Login to your container registry using docker login.

    e.g., docker login

    2.2. Push to your custom image registry:

    REGISTRY="<container-registry>" MANAGER_IMAGE_TAG="<image-tag>" make docker-push

    NOTE: make create-cluster will fetch the manager image locally and load it onto the kind cluster if it is present.

Customizing the cluster deployment

Here is a list of required configuration parameters (the full list is available in templates/cluster-template.yaml):

# Cluster settings.
export CLUSTER_NAME="capz-cluster"

# Azure settings.
export AZURE_LOCATION="southcentralus"
export AZURE_SUBSCRIPTION_ID_B64="$(echo -n "$AZURE_SUBSCRIPTION_ID" | base64 | tr -d '\n')"
export AZURE_TENANT_ID_B64="$(echo -n "$AZURE_TENANT_ID" | base64 | tr -d '\n')"
export AZURE_CLIENT_ID_B64="$(echo -n "$AZURE_CLIENT_ID" | base64 | tr -d '\n')"
export AZURE_CLIENT_SECRET_B64="$(echo -n "$AZURE_CLIENT_SECRET" | base64 | tr -d '\n')"

# Machine settings.
export AZURE_NODE_MACHINE_TYPE="Standard_D2s_v3"
export KUBERNETES_VERSION="v1.19.7"

# Generate SSH key.
# If you want to provide your own key, skip this step and set AZURE_SSH_PUBLIC_KEY_B64 to your existing file.
rm -f "${SSH_KEY_FILE}" 2>/dev/null
ssh-keygen -t rsa -b 2048 -f "${SSH_KEY_FILE}" -N '' 1>/dev/null
echo "Machine SSH key generated in ${SSH_KEY_FILE}"
# For Linux the ssh key needs to be b64 encoded because we use the azure api to set it
# Windows doesn't support setting ssh keys so we use cloudbase-init to set which doesn't require base64 
export AZURE_SSH_PUBLIC_KEY_B64=$(cat "${SSH_KEY_FILE}.pub" | base64 | tr -d '\r\n')
export AZURE_SSH_PUBLIC_KEY=$(cat "${SSH_KEY_FILE}.pub" | tr -d '\r\n')

⚠️ Please note the generated templates include default values and therefore require the use of clusterctl to create the cluster or the use of envsubst to replace these values

Creating the cluster

⚠️ Make sure you followed the previous two steps to build the dev image and set the required environment variables before proceding.

Ensure dev environment has been reset:

make clean kind-reset

Create the cluster:

make create-cluster

Check out the troubleshooting guide for common errors you might run into.

Instrumenting Telemetry

Telemetry is the key to operational transparency. We strive to provide insight into the internal behavior of the system through observable traces and metrics.

Distributed Tracing

Distributed tracing provides a hierarchical view of how and why an event occurred. CAPZ is instrumented to trace each controller reconcile loop. When the reconcile loop begins, a trace span begins and is stored in loop context.Context. As the context is passed on to functions below, new spans are created, tied to the parent span by the parent span ID. The spans form a hierarchical representation of the activities in the controller.

These spans can also be propagated across service boundaries. The span context can be passed on through metadata such as HTTP headers. By propagating span context, it creates a distributed, causal relationship between services and functions.

For tracing, we use OpenTelemetry.

Here is an example of staring a span in the beginning of a controller reconcile.

ctx, span := tele.Tracer().Start(ctx, "controllers.AzureMachineReconciler.Reconcile",
        label.String("namespace", req.Namespace),
        label.String("name", req.Name),
        label.String("kind", "AzureMachine"),
defer span.End()

The code above creates a context with a new span stored in the context.Context value bag. If a span already existed in the ctx arguement, then the new span would take on the parentID of the existing span, otherwise the new span becomes a “root span”, one that does not have a parent. The span is also created with labels, or tags, which provide metadata about the span and can be used to query in many distributed tracing systems.

You should consider adding tracing if your func accepts a context.


Metrics provide quantitative data about the operations of the controller. This includes cumulative data like counters, single numerical values like guages, and distributions of counts / samples like histograms & summaries.

In CAPZ we expose metrics using the Prometheus client. The Kubebuilder project provides a guide for metrics and for exposing new ones.

Submitting PRs and testing

Pull requests and issues are highly encouraged! If you’re interested in submitting PRs to the project, please be sure to run some initial checks prior to submission:

make lint # Runs a suite of quick scripts to check code structure
make test # Runs tests on the Go code

Executing unit tests

make test executes the project’s unit tests. These tests do not stand up a Kubernetes cluster, nor do they have external dependencies.

Automated Testing


Mocks for the services tests are generated using GoMock.

To generate the mocks you can run

make generate-go

E2E Testing



You can optionally set the following variables:

E2E_CONF_FILEThe path of the E2E configuration file.${GOPATH}/src/
SKIP_CLEANUPSet to true if you do not want the bootstrap and workload clusters to be cleaned up after running E2E tests.false
SKIP_CREATE_MGMT_CLUSTERSkip management cluster creation. If skipping managment cluster creation you must specify KUBECONFIG and SKIP_CLEANUPfalse
LOCAL_ONLYUse Kind local registry and run the subset of tests which don’t require a remotely pushed controller image.true
REGISTRYRegistry to push the controller
CLUSTER_NAMEName of an existing workload cluster. Will run specs against existing workload cluster. Use in conjunction with SKIP_CREATE_MGMT_CLUSTER, GINKGO_FOCUS and KUBECONFIG. Must specify only one e2e spec to run against with GINKGO_FOCUS such as export GINKO_FOCUS=Creating.a.VMSS.cluster.with.a.single.control.plane.node.
KUBECONFIGUsed with SKIP_CREATE_MGMT_CLUSTER set to true. Location of kubeconfig for the management cluster you would like to use. Use kind get kubeconfig --name capz-e2e > kubeconfig.capz-e2e to get the capz e2e kind cluster config‘~/.kube/config’

You can also customize the configuration of the CAPZ cluster created by the E2E tests (except for CLUSTER_NAME, AZURE_RESOURCE_GROUP, AZURE_VNET_NAME, CONTROL_PLANE_MACHINE_COUNT, and WORKER_MACHINE_COUNT, since they are generated by individual test cases). See Customizing the cluster deployment for more details.

Conformance Testing

To run the Kubernetes Conformance test suite locally, you can run


With the following environment variables defined, you can build a CAPZ cluster from the HEAD of Kubernetes main branch or release branch, and run the Conformance test suite against it:

Environment VariableValue
KUBERNETES_VERSIONlatest - extract Kubernetes version from (main’s HEAD)
latest-1.21 - extract Kubernetes version from (release branch’s HEAD)

With the following environment variables defined, CAPZ runs ./scripts/ as part of ./scripts/, which allows developers to build Kubernetes from source and run the Kubernetes Conformance test suite against a CAPZ cluster based on the custom build:

Environment VariableValue
AZURE_STORAGE_ACCOUNTYour Azure storage account name
AZURE_STORAGE_KEYYour Azure storage key
JOB_NAMEtest (an enviroment variable used by CI, can be any non-empty string)
REGISTRYYour Registry

Running custom test suites on CAPZ clusters

To run a custom test suite on a CAPZ cluster locally, set AZURE_CLIENT_ID, AZURE_CLIENT_SECRET, AZURE_SUBSCRIPTION_ID, AZURE_TENANT_ID and run:

./scripts/ bash -c "cd ${GOPATH}/src/ && make e2e"

You can optionally set the following variables:

SKIP_CLEANUPSkip deleting the cluster after the tests finish running.
KUBECONFIGProvide your existing cluster kubeconfig filepath. If no kubeconfig is provided, ./kubeconfig will be used.
USE_CI_ARTIFACTSUse a CI version of Kubernetes, ie. not a released version (eg. v1.19.0-alpha.1.426+0926c9c47677e9)
CI_VERSIONProvide a custom CI version of Kubernetes. By default, the latest master commit will be used.
TEST_CCMBuild a cluster that uses custom versions of the Azure cloud-provider cloud-controller-manager and node-controller-manager images
EXP_MACHINE_POOLUse Machine Pool for worker machines.
REGISTRYRegistry to push any custom k8s images or cloud provider images built.

You can also customize the configuration of the CAPZ cluster (assuming that SKIP_CREATE_WORKLOAD_CLUSTER is not set). See Customizing the cluster deployment for more details.