Edouard Topin
vks kubernetes vcf-9 platform-engineering vmware

Deploying your first VKS cluster on VCF 9: An architect's guide

VKS is not TKG renamed. Architecture, consumption paths, annotated YAML, day-2 ops, and real limitations — the architect's guide to VCF 9.

2 min read

I’m often asked if VKS is just TKG with a new name. Short answer: no. Slightly longer answer: that’s what this article is about.

The goal here isn’t to walk through Broadcom slides. It’s to decode how the CNCF ecosystem has been wired into vSphere, identify the architectural decisions that commit you for years, and provide a reference YAML manifest you can adapt to your context on day one.

Assumed audience: you know Kubernetes, you’ve already operated vSphere, and you want to understand the VKS model before committing — or before convincing your team to commit.

What is VKS in 2026

The story starts with Tanzu Kubernetes Grid. TKG was a separate project, deployed on vSphere but independent of it — its own lifecycle, versions, and operational complexity.

TKG Service marked a first integration: Kubernetes provisioned from the vSphere Supervisor, with common abstractions. But the seams were still visible.

VKS is the next step. The rename isn’t cosmetic: it codifies functional expansion. VKS is now a first-class Supervisor Service, upgradable independently of the Supervisor itself. That single change reshapes the lifecycle model.

Tanzu Kubernetes Grid → TKG Service → VKS (2025–)

Positioning in VCF 9. VKS is the native Kubernetes runtime of the platform. It’s CNCF-conformant, meaning your standard Kubernetes manifests work without modification. The VKr distribution (vSphere Kubernetes release) is a signed and versioned OVA, the conceptual equivalent of an AMI for your control plane and nodes.

Against the competition. OpenShift brings more opinions and integrated operational tooling, at the cost of stronger lock-in. Rancher excels at multi-cloud deployments and heterogeneous clusters. EKS Anywhere targets teams already in the AWS ecosystem. VKS targets organizations that have already invested in vSphere and want Kubernetes without leaving vCenter governance. None of these is better in the absolute: the trade-offs differ, and the right choice depends on context.

The feature worth remembering: VKS is upgradable without touching the Supervisor. For an organization managing vSphere update cycles constrained by change windows, that’s a concrete operational argument.

Architecture under the hood

Understanding VKS means understanding how seven components fit together. We’ll walk through them quickly, then revisit the two that really matter.

Supervisor: the control plane embedded in vCenter. Its Kubernetes API server runs in control plane VMs on the ESXi hosts of the designated vSphere cluster. It’s the entry point for all CAPI operations.

vSphere Namespace: the resource and security boundary. Each namespace carries CPU, RAM, storage, and cluster count quotas. It’s the isolation unit between teams or projects.

Cluster API (CAPI): the declarative engine. VKS relies on CAPI to provision and reconcile workload clusters. That’s why the YAML manifests look like standard CAPI: because they are.

VKr (vSphere Kubernetes release): the VMware-signed Kubernetes “build”. Each VKr is a versioned OVA containing the Kubernetes distribution with VMware-backported patches. You choose the VKr version like you’d choose an AMI version.

Cloud Native Storage (CNS): the layer that materializes a PersistentVolumeClaim into a VMDK automatically provisioned on a datastore compliant with the chosen storage policy. Transparent to workloads, and visible to the storage admin as a regular vSphere disk.

Antrea: the default CNI. NodePortLocal, NetworkPolicies, optional encryption of inter-node pod traffic (IPsec or WireGuard in upstream Antrea). Native integration with NSX for environments that need it.

Pinniped: authentication. Delegates to vCenter SSO or an external OIDC provider. kubectl tokens are aligned with enterprise identities without a custom layer.

Two components deserve attention.

The Supervisor is a single point of failure. It hosts the Kubernetes API server that orchestrates all VKS clusters on the vSphere cluster. If it breaks, every cluster on that vSphere cluster is impacted at the control plane level: workloads keep running, but no Kubernetes operation that goes through the Supervisor succeeds. The Supervisor must run on a vSphere HA topology. That’s not a recommendation, it’s an operational requirement.

VKr fundamentally changes lifecycle. Historically, upgrading Kubernetes under vSphere often meant upgrading the platform. Not with VKr. The Supervisor can host multiple Kubernetes versions simultaneously. A cluster in v1.30 and another in v1.32 can coexist on the same Supervisor. Upgrading one VKS cluster doesn’t touch the others. That decoupling gives VKS its primary operational argument.

Prerequisites

What the official docs don’t always make crystal clear.

Supervisor enabled on a vSphere cluster — with NSX or VDS networking depending on context. This is the foundation without which VKS doesn’t exist.

Storage policies configured for namespaces. Without a policy attached to the namespace, no PVC can be provisioned by workloads.

Content Library synchronized for VKr images. VKr OVAs are downloaded to a Content Library subscribed from the Broadcom repository. Initial sync can take time depending on bandwidth.

IP pools sized for control plane and worker nodes. Each node gets a fixed IP. Plan generously: each VKS cluster consumes at minimum one IP per node plus one IP for the control plane VIP.

Namespace quotas calibrated — CPU, memory, storage, cluster count. Quotas too tight block provisioning. Quotas too loose eliminate governance. Calibrate by team profile.

Load Balancer available — NSX Advanced LB (Avi) or equivalent for LoadBalancer type services. Without an LB, application services remain exposed only as NodePort.
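
A quick way to sanity-check these prerequisites before creating the namespace is to put the numbers in a small script. The sketch below is a planning aid, not a VKS API: the ClusterPlan class, the check_namespace helper, the two IPs of headroom, and the 8 GiB per node are all assumptions to replace with your own values. It applies the rule above (one IP per node plus one for the control plane VIP) and compares total node memory with the namespace memory quota.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClusterPlan:
    """Sizing inputs for one planned cluster (hypothetical planning model)."""
    name: str
    control_plane_nodes: int
    worker_nodes: int
    memory_gib_per_node: int  # assumption: read this from your VM class definition

    @property
    def node_count(self) -> int:
        return self.control_plane_nodes + self.worker_nodes

    def min_ip_count(self) -> int:
        # One IP per node plus one for the control plane VIP.
        return self.node_count + 1

    def memory_gib(self) -> int:
        return self.node_count * self.memory_gib_per_node


def check_namespace(plans, ip_pool_size, memory_quota_gib, ip_headroom=2):
    """Return a list of problems; an empty list means the plans fit."""
    problems = []
    # Headroom is an assumption: spare addresses for nodes added during scaling or upgrades.
    ips_needed = sum(p.min_ip_count() for p in plans) + ip_headroom
    if ips_needed > ip_pool_size:
        problems.append(f"IP pool too small: need {ips_needed}, have {ip_pool_size}")
    memory_needed = sum(p.memory_gib() for p in plans)
    if memory_needed > memory_quota_gib:
        problems.append(
            f"memory quota too small: need {memory_needed} GiB, have {memory_quota_gib} GiB"
        )
    return problems


plans = [
    ClusterPlan("production-cluster-01", control_plane_nodes=3, worker_nodes=3, memory_gib_per_node=8),
    ClusterPlan("staging-cluster-01", control_plane_nodes=1, worker_nodes=2, memory_gib_per_node=8),
]
problems = check_namespace(plans, ip_pool_size=16, memory_quota_gib=64)
print(problems)
```

With these example numbers the IP pool is sufficient (13 addresses needed out of 16) but the memory quota is not (72 GiB needed for 64 GiB allowed), which is the kind of mismatch that otherwise shows up as a stuck provisioning.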

The three consumption paths

VKS offers three ways to provision a cluster. They’re not equivalent — each targets a specific user profile.

LCI (Local Consumption Interface)

For whom: vSphere ops, exploration, POCs, teams without GitOps culture.

Strengths: guided graphical interface, visual YAML generation, zero kubectl required for first clusters. LCI is itself a Supervisor Service, installable from the Broadcom Support Portal (My Downloads → Free Downloads → vSphere Supervisor Services → Local Consumption Interface).

Limitations: no native GitOps, manual workflow, poor auditability. Acceptable for a POC, insufficient for multi-team production.

kubectl + YAML

For whom: platform engineers, GitOps pipelines, teams with Kubernetes maturity.

Strengths: fully declarative, versionable in Git, CI/CD friendly. The YAML manifest is standard CAPI with VMware extensions, so a Kubernetes engineer recognizes it immediately.

Limitations: CAPI learning curve if the team doesn’t know it. Connection to the Supervisor requires the kubectl vsphere plugin. Package it in team toolboxes from the start.

VCF Automation (VCFA)

For whom: internal cloud providers, multi-tenant environments, organizations with strong governance over resource allocation.

Strengths: complete self-service, integrated governance, quotas by organization and project, service catalog. The platform team publishes templates, app teams consume without vCenter access.

Limitations: significant configuration overhead, requires the full VCF Automation stack. Don’t deploy VCFA just for the first VKS cluster. Industrialization via Terraform and VCFA deserves a dedicated article.

Recommended approach. LCI for discovery and POCs. kubectl + YAML for production single-tenant or small platform teams with GitOps culture. VCF Automation for multi-tenant with formal governance. Don’t over-engineer from day one: start with kubectl, migrate to VCFA when demand justifies it.

LCI walkthrough in 6 steps

The fastest path to a first cluster. Based on William Lam’s walkthrough, referenced at the end.

  1. Install the LCI service as a Supervisor Service from the Broadcom Support Portal. The manifest is in Free Downloads → vSphere Supervisor Services. Installation is from vSphere UI → Workload Management → Supervisor Services.

  2. Access LCI from vSphere UI (dedicated tab in Workload Management) or as a standalone interface at the service URL. The interface auto-connects to the Supervisor of the current vSphere cluster.

  3. Guided creation — select target namespace, cluster class (small/medium/large), VKr version, and node count. LCI proposes validated combinations and warns if namespace quotas are insufficient.

  4. Generate YAML — LCI exports the CAPI manifest matching your selections. This YAML is exactly what you’d write by hand. It’s also the quickest way to understand CAPI structure without starting from a blank page.

  5. Apply via kubectl or VCF CLI. The manifest can be versioned in Git at this point to bootstrap a GitOps repo.

  6. Download kubeconfig from LCI or via kubectl once the cluster is provisioned. The full kubectl sequence, from Supervisor login to kubeconfig retrieval:

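# Connect to the Supervisor.
# --insecure-skip-tls-verify skips certificate validation: acceptable in a lab, not in production.
kubectl vsphere login --server=supervisor.example.com --insecure-skip-tls-verify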

# Switch to target namespace
kubectl config use-context platform-team

# Apply the manifest
kubectl apply -f cluster.yaml

# Verify provisioning
kubectl get clusters -n platform-team

# Fetch workload cluster kubeconfig
kubectl vsphere login --server=supervisor.example.com \
  --tanzu-kubernetes-cluster-namespace platform-team \
  --tanzu-kubernetes-cluster-name production-cluster-01

The YAML manifest decoded

The CAPI manifest for a VKS cluster is brief. Every line has a reason.

apiVersion: cluster.x-k8s.io/v1beta1
kind: Cluster
metadata:
  name: production-cluster-01
  namespace: platform-team       # Target vSphere Namespace
spec:
  clusterNetwork:
    services:
      cidrBlocks: ["10.96.0.0/12"]      # Kubernetes services IP range
    pods:
      cidrBlocks: ["192.168.0.0/16"]    # Pod IP range
    serviceDomain: cluster.local
  topology:
    class: tanzukubernetescluster       # ClusterClass installed by Supervisor
    version: v1.32.0+vmware.1           # VKr version to use
    controlPlane:
      replicas: 3                       # HA mandatory — never 1 in prod
    workers:
      machineDeployments:
        - class: node-pool
          name: workers
          replicas: 3
    variables:
      - name: vmClass
        value: guaranteed-medium        # VM class for nodes
      - name: storageClass
        value: vsan-default-storage-policy
      - name: defaultStorageClass
        value: vsan-default-storage-policy

apiVersion and kind. We use Cluster API v1beta1 — the same upstream as any other CAPI deployment. VKS doesn’t reinvent the spec, it specializes it via ClusterClasses and variables. A Kubernetes engineer immediately recognizes the structure.

clusterNetwork. CIDR ranges must align with the organization’s global addressing plan. In multi-environment (dev/staging/prod), use disjoint ranges to avoid collisions if clusters ever need to communicate via VPN or peering. Never leave defaults if you operate multiple clusters.
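
Overlaps are easy to miss by eye once several clusters exist. As an illustration, the following sketch uses Python’s standard ipaddress module to flag overlapping ranges. The range names and the staging and datacenter values are hypothetical; only the two prod ranges come from the manifest above.

```python
from ipaddress import ip_network
from itertools import combinations


def find_overlaps(ranges):
    """Return the pairs of named CIDR ranges that overlap."""
    parsed = {name: ip_network(cidr) for name, cidr in ranges.items()}
    return [
        (name_a, name_b)
        for (name_a, net_a), (name_b, net_b) in combinations(parsed.items(), 2)
        if net_a.overlaps(net_b)
    ]


ranges = {
    "prod/services": "10.96.0.0/12",
    "prod/pods": "192.168.0.0/16",
    "staging/services": "10.112.0.0/12",   # hypothetical, disjoint from prod
    "staging/pods": "192.168.128.0/17",    # hypothetical, deliberately inside prod/pods
    "corp/datacenter": "10.0.0.0/13",      # hypothetical existing network
}

overlaps = find_overlaps(ranges)
for name_a, name_b in overlaps:
    print(f"overlap: {name_a} and {name_b}")
```

Run against the addressing plan in CI, a check like this catches a collision before two clusters ever need to be peered.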

topology.class. The tanzukubernetescluster reference points to the ClusterClass installed by the Supervisor. That’s what differentiates a VKS cluster from a generic CAPI cluster: the ClusterClass embeds all vSphere integration logic (machine templates, bootstrap, CNI). Don’t modify this value.

topology.version. The +vmware.1 suffix is VMware’s notation. It identifies the exact VKr to deploy, with its backported patches. Always use a version available in the synchronized Content Library — provisioning fails silently if the OVA isn’t present locally.
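
Because a missing image doesn’t surface as an obvious error, it can be worth checking the version string before applying the manifest. A minimal sketch, assuming the version follows the v<major>.<minor>.<patch>+<build> shape used above and that you have exported the list of synchronized versions yourself. The synced list, parse_version, and is_available are hypothetical names, not part of any VKS tooling.

```python
import re
from typing import NamedTuple

# Assumed shape: v<major>.<minor>.<patch>+<build>, e.g. v1.32.0+vmware.1
VERSION_RE = re.compile(r"v(\d+)\.(\d+)\.(\d+)\+([0-9A-Za-z.-]+)")


class VkrVersion(NamedTuple):
    major: int
    minor: int
    patch: int
    build: str  # vendor suffix such as "vmware.1"


def parse_version(text: str) -> VkrVersion:
    match = VERSION_RE.fullmatch(text.strip())
    if match is None:
        raise ValueError(f"unexpected version format: {text!r}")
    major, minor, patch, build = match.groups()
    return VkrVersion(int(major), int(minor), int(patch), build)


def is_available(requested: str, synced_versions) -> bool:
    """True when the exact requested build is among the synchronized versions."""
    wanted = parse_version(requested)
    return any(parse_version(v) == wanted for v in synced_versions)


# Hypothetical list of versions present in the synchronized Content Library.
synced = ["v1.30.1+vmware.1", "v1.32.0+vmware.1"]

print(is_available("v1.32.0+vmware.1", synced))
print(is_available("v1.31.4+vmware.1", synced))
```

The comparison includes the build suffix on purpose: two builds of the same upstream version are different images.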

controlPlane.replicas. 3 minimum, always. etcd needs a majority of its members (the quorum) to accept writes: with 3 members the quorum is 2, so the cluster tolerates the loss of one. With a single control plane node, losing its ESXi host renders the cluster inoperable. With 3, you lose a host and your cluster keeps running normally.
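
The arithmetic behind that rule is short enough to write down. This sketch only restates the majority rule used by etcd (the quorum is more than half of the members); the function names are illustrative.

```python
def quorum(members: int) -> int:
    """Smallest number of members that forms a majority."""
    return members // 2 + 1


def failures_tolerated(members: int) -> int:
    """How many members can be lost while a majority remains."""
    return members - quorum(members)


for members in (1, 2, 3, 4, 5):
    print(
        f"{members} member(s): quorum {quorum(members)}, "
        f"tolerates {failures_tolerated(members)} failure(s)"
    )
```

An even member count adds no tolerance: 4 members still survive only one failure, which is why 3 and 5 are the usual choices.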

vmClass. The VM class determines CPU and memory reservations for nodes. guaranteed-* classes fully reserve their CPU and memory, which keeps performance predictable. best-effort-* classes reserve nothing: acceptable for dev/test, to be avoided in production. guaranteed-medium (2 vCPU / 8 GB in the default VM class set) is a reasonable starting point for a standard app cluster; move up to guaranteed-large (4 vCPU / 16 GB) if nodes need more headroom.

storageClass. The vSphere storage policy used for all PersistentVolumes in the cluster. This policy must exist in vSphere and be attached to the target namespace. In vSAN ESA, the RAID-6 policy with encryption is a good production default.

Day-2 operations

The checklist doesn’t stop at provisioning. What happens after determines if the cluster lasts.

Persistent Volumes. CNS (Cloud Native Storage) materializes each PVC into a VMDK automatically provisioned on the datastore covered by the cluster’s storage policy. From the workload view: a standard Kubernetes PVC. From the storage admin view: a VMDK visible in vCenter, snapshotable, replicable with standard vSphere tools. Both worlds are reconciled without custom abstraction.

LoadBalancer Services. Native integration with NSX Advanced LB (Avi) allocates an IP from a dedicated pool for each LoadBalancer service. The IP pool is configured in the vSphere namespace. No MetalLB to deploy, no BGP configuration to maintain — integration is transparent to developers.

Ingress. VKS doesn’t deploy an Ingress controller by default. The choice belongs to the platform team: Contour for NSX-integrated environments, NGINX for teams who already know it. Deploy via Helm or as a VKS package depending on version. External-DNS can be layered in to automate DNS registration against Microsoft DNS or AWS Route 53.

Observability. The standard stack works: Prometheus Operator + Grafana + Loki deployed in the cluster, with logs and metrics forwarded to VCF Operations for Logs to correlate vSphere infra and Kubernetes workloads. A dedicated Broadcom article (April 2026) details VKS + VCF Operations integration.

External-DNS. Automating DNS record registration from Kubernetes is supported via standard External-DNS charts with the appropriate provider for your organization’s DNS. Configure with a dedicated service account, not admin credentials.

Service mesh. Istio is available as a standard VKS package since version 3.4 and deploys without custom modifications. For teams without service mesh culture, don’t force it: Antrea NetworkPolicies plus Antrea’s inter-node traffic encryption cover most network security needs without the operational complexity of a full service mesh.

Async upgrade: the underestimated feature

Historically, upgrading Kubernetes under vSphere meant upgrading the platform. Not in VCF 9.

Supervisor ↔ VKS version decoupling. The Supervisor hosts multiple VKr versions simultaneously. A v1.30 cluster and a v1.32 cluster can coexist on the same Supervisor, in the same vSphere cluster. Workload cluster Kubernetes versions are no longer tied to the vCenter lifecycle.

Sync vs async registration. VKS supports two modes for registering new VKr versions. Sync mode (default) auto-downloads new VKrs into the subscribed Content Library. Async mode lets you manually control which versions are available, which is essential for environments with internal validation or change management constraints. In air-gapped mode, the VKr must be imported manually via the offline VKr relocation procedure documented by Broadcom.

In practice. The Kubernetes release cadence (upstream ships a minor release roughly every four months, with patch releases about monthly) can now be decoupled from the vSphere update cycle (quarterly or semi-annual depending on organization). Security teams can impose a Kubernetes patch SLA without blocking on infrastructure cycles.

Gotchas and real limitations

What marketing slides don’t say.

VKS is not vanilla Kubernetes

The VKr distribution ships with VMware-backported patches that aren't always documented in detail. Some behaviors can diverge slightly from upstream. Before filing a Kubernetes bug, verify the behavior is reproducible on a pure upstream cluster. The actual version (with patches) is visible in the VKr release notes on Broadcom TechDocs.

Antrea NodePortLocal limitations

NodePortLocal lets an external load balancer reach pods directly by mapping a dedicated node port to each pod port, but it only applies to ClusterIP and LoadBalancer Services. NodePort and ExternalName Services don't benefit from NodePortLocal. If your app relies on NodePort patterns for service discovery, validate behavior before migration.

Third-party CNIs: limited support

Antrea is the recommended CNI and the only one fully supported by VMware. Calico can work but requires specific configuration, especially with F5 or NSX network overlays. Cilium is not officially supported. If your organization has a different CNI standard, validate the support matrix with Broadcom before committing.

Supervisor dependency: blast radius

The Supervisor is a single point of failure for all VKS clusters on the same vSphere cluster. If the Supervisor degrades, all Kubernetes control plane operations are impacted: workloads keep running, but kubectl requests that go through it fail. vSphere HA protects the Supervisor VMs, but the Supervisor itself remains a critical dependency and should be treated as such in your resilience plan.

No version downgrade

Once a cluster is upgraded to a higher VKr version, there's no path back, neither for the full cluster nor for individual nodes. This means upgrades must be prepared seriously: validate in staging, read the target VKr release notes, and plan for application (not infra) rollback if regressions are detected post-upgrade.

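
In a GitOps flow, one way to avoid discovering this the hard way is a pre-merge check on topology.version. The sketch below is an example built on assumptions, not a VKS feature: it compares the upstream part of two version strings, rejects a lower target, and also rejects skipping a minor version, which is the upstream Kubernetes upgrade rule (verify the upgrade paths supported for your VKr in its release notes). The function names are hypothetical.

```python
import re


def upstream_version(text: str):
    """Extract (major, minor, patch) from a string such as 'v1.32.0+vmware.1'."""
    match = re.match(r"v(\d+)\.(\d+)\.(\d+)", text.strip())
    if match is None:
        raise ValueError(f"unexpected version format: {text!r}")
    return tuple(int(part) for part in match.groups())


def check_upgrade(current: str, target: str):
    """Return None when the change looks acceptable, otherwise a reason string.

    Only the upstream version is compared; the vendor build suffix is ignored.
    """
    cur = upstream_version(current)
    tgt = upstream_version(target)
    if tgt < cur:
        return f"downgrade not possible: {current} -> {target}"
    # Assumption: minor versions must be upgraded one at a time, as upstream requires.
    if tgt[0] == cur[0] and tgt[1] - cur[1] > 1:
        return f"cannot skip minor versions: {current} -> {target}"
    return None


print(check_upgrade("v1.31.4+vmware.1", "v1.32.0+vmware.1"))
print(check_upgrade("v1.32.0+vmware.1", "v1.30.1+vmware.1"))
print(check_upgrade("v1.30.1+vmware.1", "v1.32.0+vmware.1"))
```

A check like this belongs in the pipeline that validates the manifest, so that a rejected change never reaches the Supervisor.
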
Licensing cost: clarify upfront

VKS is included in VCF licensing, but the consumption model for cores used by Kubernetes clusters isn't always transparent in historical contracts. Some organizations discovered surprise costs at renewal. Explicitly clarify with your Broadcom contact how VKS clusters are counted in your contract before rolling out to production at scale.

Conclusion and next steps

Key takeaways. VKS rests on three pillars: the Supervisor as a vSphere-native control plane, CAPI as a standard declarative engine, and VKr as a versioned, decoupled distribution. The three consumption paths (LCI, kubectl, VCFA) address different maturity levels. The YAML manifest is standard CAPI with VMware variables, accessible to any Kubernetes engineer. Day-2 (volumes, LB, observability) relies on known patterns, integrated into the vSphere stack.

Next step. The following article covers industrialization: how to use Terraform and VCF Automation to provision VKS clusters in multi-tenant self-service with formal governance policies and a service catalog. It’s the natural next step once your first kubectl-provisioned cluster runs in production.

Resources.
