GPU-enabled clusters


With CAPZ you can create GPU-enabled Kubernetes clusters on Microsoft Azure.

Before you begin, be aware that:

To deploy a cluster with support for GPU nodes, use the nvidia-gpu flavor.

An example GPU cluster

Let’s create a CAPZ cluster with an N-series node and run a GPU-powered vector calculation.

Generate an nvidia-gpu cluster template

Use the clusterctl config cluster command to generate a manifest that defines your GPU-enabled workload cluster.

Remember to use the nvidia-gpu flavor with N-series nodes.

AZURE_LOCATION=southcentralus \
clusterctl config cluster azure-gpu \
  --kubernetes-version=v1.19.7 \
  --worker-machine-count=1 \
  --flavor=nvidia-gpu > azure-gpu-cluster.yaml

Create the cluster

Apply the manifest from the previous step to your management cluster to have CAPZ create a workload cluster:

$ kubectl apply -f azure-gpu-cluster.yaml created created created created created created created

Wait until the cluster and nodes are finished provisioning. The GPU nodes may take several minutes to provision, since each one must install drivers and supporting software.

$ kubectl get cluster azure-gpu
azure-gpu   Provisioned
$ kubectl get machines
NAME                             PROVIDERID                                                                                                                                     PHASE     VERSION
azure-gpu-control-plane-t94nm    azure:////subscriptions/<subscription_id>/resourceGroups/azure-gpu/providers/Microsoft.Compute/virtualMachines/azure-gpu-control-plane-nnb57   Running   v1.19.7
azure-gpu-md-0-f6b88dd78-vmkph   azure:////subscriptions/<subscription_id>/resourceGroups/azure-gpu/providers/Microsoft.Compute/virtualMachines/azure-gpu-md-0-gcc8v            Running   v1.19.7

You can run these commands against the workload cluster to verify that the NVIDIA device plugin has initialized and the resource is available:

$ clusterctl get kubeconfig azure-gpu > azure-gpu-cluster.conf
$ export KUBECONFIG=azure-gpu-cluster.conf
$ kubectl -n kube-system get po | grep nvidia
kube-system   nvidia-device-plugin-daemonset-d5dn6                    1/1     Running   0          16m
$ kubectl get nodes
NAME                            STATUS   ROLES    AGE   VERSION
azure-gpu-control-plane-nnb57   Ready    master   42m   v1.19.7
azure-gpu-md-0-gcc8v            Ready    <none>   38m   v1.19.7
$ kubectl get node azure-gpu-md-0-gcc8v -o jsonpath={.status.allocatable} | jq
  "attachable-volumes-azure-disk": "12",
  "cpu": "6",
  "ephemeral-storage": "119716326407",
  "hugepages-1Gi": "0",
  "hugepages-2Mi": "0",
  "memory": "115312060Ki",
  "": "1",
  "pods": "110"

Run a test app

Let’s create a pod manifest for the cuda-vector-add example from the Kubernetes documentation and deploy it:

$ cat > cuda-vector-add.yaml << EOF
apiVersion: v1
kind: Pod
  name: cuda-vector-add
  restartPolicy: OnFailure
    - name: cuda-vector-add
      image: ""
 1 # requesting 1 GPU
$ kubectl apply -f cuda-vector-add.yaml

The container will download, run, and perform a CUDA calculation with the GPU.

$ kubectl get po cuda-vector-add
cuda-vector-add   0/1     Completed   0          91s
$ kubectl logs cuda-vector-add
[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory

If you see output like the above, your GPU cluster is working!