Edouard Topin
vcf-9-1 vks kubernetes platform-engineering broadcom

VCF 9.1: Kubernetes & self-service, the platform takes over

VKS linked clones, 500 clusters per Supervisor, simplified Container-as-a-Service and Tech Preview object storage: how VCF 9.1 closes the self-service gap.

2 min read
VMware Cloud Foundation 9.1 — Kubernetes and self-service

In 9.0, VKS was an excellent vSphere-native Kubernetes runtime, but operating the platform was still platform-team work. Provisioning a clean namespace and attaching quotas, registry, ingress and identity to it was a sequence of manual steps that application teams could not trigger on their own. The gap between “operated cluster” and “self-service platform” was bridged with homegrown scripts.

VCF 9.1 attacks that gap head-on. Four changes — linked clones for VKS, scale to 500 clusters per Supervisor, simplified Container-as-a-Service, native object storage in Tech Preview — shift the boundary: what the platform team did by hand becomes a consumable construct. This article decodes each one, with the 9.0 → 9.1 delta and the concrete architecture impact.

[Figure: VKS 9.0 · Linked clones + scale · Self-service 9.1]

VKS & VM Fast-Deploy: linked clones change the scale

Up to 9.0, every VKS node — control plane and worker alike — was a full clone of the VKr OVA: a complete copy of the source disk to the datastore before first boot. On a 6-node cluster, that’s six multi-GB copies to provision, sequentially constrained by datastore throughput. Cluster deployment time — and even more so rolling upgrade time (every node recreated) — was dominated by that copy.

9.1 introduces Fast-Deploy via linked clones for VKS and VM clusters. The mechanics: a read-only parent disk (from the VKr) is instantiated once, then each node creates a delta disk that stores only the blocks changed relative to the parent. The node boots in seconds instead of waiting for a full copy. Conceptually, it’s copy-on-write applied to Kubernetes node provisioning.
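To make the copy-on-write idea concrete, here is a minimal Python sketch. It is not how vSphere implements linked clones: the class names `ParentDisk` and `LinkedClone` and the four-block "disk" are hypothetical, chosen only to show that reads fall through to the shared parent until a node writes its own version of a block.

```python
class ParentDisk:
    """Read-only base image shared by every node (stands in for the VKr disk)."""

    def __init__(self, blocks):
        self._blocks = tuple(blocks)  # immutable: no node can modify the parent

    def read(self, index):
        return self._blocks[index]

    def __len__(self):
        return len(self._blocks)


class LinkedClone:
    """Per-node delta disk: stores only the blocks that differ from the parent."""

    def __init__(self, parent):
        self.parent = parent
        self.delta = {}  # block index -> node-local content

    def read(self, index):
        # Serve the node's own version if it changed this block, else the parent's.
        if index in self.delta:
            return self.delta[index]
        return self.parent.read(index)

    def write(self, index, data):
        if not 0 <= index < len(self.parent):
            raise IndexError(index)
        self.delta[index] = data  # the parent is never touched


parent = ParentDisk(["boot", "kernel", "kubelet", "config"])
nodes = [LinkedClone(parent) for _ in range(6)]  # six nodes, zero block copies
nodes[0].write(3, "config-node0")

print(nodes[0].read(3))                   # config-node0
print(nodes[1].read(3))                   # config
print(sum(len(n.delta) for n in nodes))   # 1
```

Creating the six clones copies nothing, which is the property that removes the full-disk copy from the provisioning path.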

| Aspect | VKS 9.0 (full clone) | VKS 9.1 (linked clone) |
|---|---|---|
| Node creation | Full source-disk copy | Delta disk over a shared parent |
| 6-node cluster deploy | Dominated by N sequential copies | Near-parallel, drastically reduced time |
| Rolling upgrade | Each node = full clone recreated | Each node = delta over existing parent |
| Initial storage footprint | N × image size | 1 × parent + N deltas (growing) |
| I/O coupling | Long copy, datastore saturated | Shared parent read, delta write |

Architect impact. Linked clones aren’t just a speed optimization: they make upgrades operationally trivial. When recreating 200 nodes costs minutes instead of hours, the change window shrinks and the Kubernetes patch SLA becomes tenable without negotiating long maintenance windows. The trade-off to account for: the shared parent disk becomes a read hotspot, and delta disks grow over time. Storage sizing now reasons in parent + delta growth, not in N × a fixed image.
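The sizing shift from N × image to parent + deltas can be written as simple arithmetic. The sketch below is illustrative only: the 20 GB image and 2 GB average delta are made-up figures, not measurements from VCF 9.1, and the model ignores swap files, snapshots and other per-VM overhead.

```python
def full_clone_footprint_gb(nodes, image_gb):
    """Every node holds a complete copy of the image."""
    return nodes * image_gb


def linked_clone_footprint_gb(nodes, image_gb, delta_gb_per_node):
    """One shared parent plus one delta per node."""
    return image_gb + nodes * delta_gb_per_node


def break_even_delta_gb(nodes, image_gb):
    """Average delta size per node at which linked clones stop saving space."""
    return image_gb * (nodes - 1) / nodes


# Hypothetical fleet: 200 nodes, 20 GB image, deltas averaging 2 GB per node.
full = full_clone_footprint_gb(200, 20)
linked = linked_clone_footprint_gb(200, 20, 2)
print(full)    # 4000
print(linked)  # 420
```

The break-even function is the useful part for capacity planning: as deltas grow toward the image size, the space advantage shrinks, which is why delta growth needs to be monitored rather than assumed.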

Scale: 500 Kubernetes clusters per Supervisor

In 9.0, the Supervisor topped out well below a hundred comfortably usable workload clusters — enough for one team, tight for an internal cloud provider serving dozens of tenants. 9.1 raises the bar to 500 Kubernetes clusters per Supervisor.

| Dimension | VKS 9.0 | VKS 9.1 |
|---|---|---|
| Clusters / Supervisor | Low limit (tens) | Up to 500 |
| Target model | Team / small multi-tenant | Large-scale internal cloud provider |
| Density per vSphere cluster | 1 platform = 1 narrow Supervisor | 1 Supervisor = full Kubernetes fleet |

What it unlocks: a single Supervisor can now carry an entire organization’s Kubernetes fleet, instead of fragmenting into several Supervisors (and therefore several governance perimeters, several control planes to operate). For a multi-tenant model where each tenant gets one or more dedicated clusters, 500 clusters/Supervisor changes the map: fewer Supervisors, simpler governance, but a wider blast radius.

Architect impact. 500 clusters is not a free quota. The Supervisor remains a shared control plane: its etcd, its CAPI controllers, its namespace quotas now absorb a far higher reconciliation load. Five hundred clusters means five hundred CAPI reconciliation loops, hundreds of thousands of objects in the Supervisor API server, and an etcd whose latency becomes critical. The rule: treat the figure as a validated architectural ceiling, not a target. Size the Supervisor (HA control plane, etcd IOPS, control-plane observability) before approaching density, and keep headroom.
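The "ceiling, not target" rule can be turned into a small planning check. This is a sketch under stated assumptions: only the 500-cluster ceiling comes from the release; the 20% headroom and the 400 objects per cluster are hypothetical placeholders to be replaced with values measured on your own Supervisor.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SupervisorPlan:
    clusters: int                      # workload clusters you intend to host
    objects_per_cluster: int           # hypothetical: measure this on your platform
    validated_max_clusters: int = 500  # ceiling stated for VCF 9.1
    headroom_pct: int = 20             # assumption: capacity deliberately left unused

    def cluster_budget(self):
        """Clusters you allow yourself once headroom is reserved."""
        return self.validated_max_clusters * (100 - self.headroom_pct) // 100

    def estimated_objects(self):
        """Rough count of objects the Supervisor API server must hold."""
        return self.clusters * self.objects_per_cluster

    def within_budget(self):
        return self.clusters <= self.cluster_budget()


plan = SupervisorPlan(clusters=450, objects_per_cluster=400)
print(plan.cluster_budget())     # 400
print(plan.estimated_objects())  # 180000
print(plan.within_budget())      # False
```

Here a plan of 450 clusters sits under the 500 ceiling but over the self-imposed budget, which is the situation the rule is meant to flag before it shows up as etcd latency.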

Simplified Container-as-a-Service

In 9.0, making a namespace available to an application team looked like this: create the vSphere Namespace, manually attach storage policies, configure CPU/RAM/storage quotas, wire in a registry, deploy or attach an ingress controller, then plumb identity (SSO mapping, RBAC). Six steps, six chances of divergence between tenants, and a platform team in the critical path of every onboarding.

9.1 turns this into a true self-service Container-as-a-Service. Provisioning a namespace becomes a consumable action that automatically inherits VCF constructs: registry, ingress, quotas and identity are derived from the tenant’s VCF perimeter rather than rewired by hand. The application team requests a namespace; it receives an already-governed namespace, with its registry, its ingress entry point and its identity aligned to enterprise SSO.

Figure: Self-service Container-as-a-Service in VCF 9.1 (source: Broadcom, VCF Blog)

| Namespace onboarding step | 9.0 flow (manual) | 9.1 flow (CaaS) |
|---|---|---|
| Namespace creation | Platform-team action | Consumable self-service |
| Storage policy | Manual attachment | Inherited from VCF construct |
| Registry | Separate wiring | Provisioned with the namespace |
| Ingress | Manual deploy / attach | Included in the construct |
| Quotas | Defined by hand per profile | Derived from tenant perimeter |
| Identity / RBAC | Manual SSO mapping | Inherited from VCF identity |

Architect impact. The platform team steps out of the onboarding critical path without losing governance: guardrails (quotas, identity, policies) are encoded in the construct, not applied after the fact. The platform team’s role shifts from repetitive execution to defining tenant profiles. This is exactly the move we were trying to script in 9.0 — except here it’s native and consistent by default.
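As a mental model of "guardrails encoded in the construct", the sketch below separates what the platform team defines once (a tenant profile) from what the application team supplies (a name). It does not reflect the actual VCF 9.1 API or object model: `TenantProfile`, `Namespace`, `provision_namespace` and every field value are hypothetical.

```python
from dataclasses import dataclass, FrozenInstanceError


@dataclass(frozen=True)
class TenantProfile:
    """Defined once by the platform team (all fields are illustrative)."""
    tenant: str
    registry: str
    ingress_class: str
    sso_group: str
    storage_policy: str
    cpu_cores: int
    memory_gib: int


@dataclass(frozen=True)
class Namespace:
    """What the application team receives: a name plus inherited guardrails."""
    name: str
    guardrails: TenantProfile


def provision_namespace(profile, name):
    # The requester supplies only the name; everything else is inherited.
    return Namespace(name=name, guardrails=profile)


profile = TenantProfile(
    tenant="payments",
    registry="registry.example.internal/payments",
    ingress_class="tenant-payments",
    sso_group="grp-payments-dev",
    storage_policy="gold",
    cpu_cores=16,
    memory_gib=64,
)
ns = provision_namespace(profile, "payments-dev")
print(ns.guardrails.registry)  # registry.example.internal/payments

try:
    ns.guardrails = None  # the requester cannot swap out the guardrails
except FrozenInstanceError:
    print("guardrails are read-only")
```

The design point is that two namespaces requested by the same tenant cannot diverge, because neither request carries any governance parameter.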

Native Object Storage (Tech Preview)

Block storage (PVC → VMDK via CNS) and file storage were already self-service in 9.0. The missing piece was S3-compatible object storage, which developers want for artifacts, application backups, datasets and cloud-native application state. In 9.0 you had to leave the platform (external bucket, self-managed MinIO) and therefore break the governance model.

9.1 introduces native, self-service S3-compatible object storage as a Tech Preview. Developers provision buckets through the same deploy / scale / manage workflow as block and file; IT keeps the guardrails (quotas, policies, identity) without becoming a bottleneck. The promise: close the last missing storage category so the platform covers all three axes (block, file, object) under unified governance.

| Storage category | VKS 9.0 | VKS 9.1 |
|---|---|---|
| Block (PVC) | Self-service via CNS | Self-service via CNS |
| File | Self-service | Self-service |
| Object (S3) | Off-platform | Native self-service (Tech Preview) |

Architect impact. The point isn’t to use this in production now — it’s to scope the target today. If your teams consume external S3 or self-managed MinIO, the Tech Preview lets you prototype the migration and measure the governance delta, so you’re ready on GA day without rewriting access patterns.

From operated cluster to self-service platform

Taken in isolation, each of these four changes is an improvement. Taken together, they close the self-service gap end to end.

Linked clones make provisioning and upgrades fast enough to be self-service — without them, exposing cluster creation to teams would saturate the datastore. Scale to 500 clusters/Supervisor makes multi-tenant scale reachable without fragmenting governance — it’s the Kubernetes counterpart to the networking & scale work covered in the previous article. Simplified CaaS encodes governance into the construct rather than into runbooks. Object storage completes the coverage so developers no longer have to leave the platform.

The result: what we built by hand on top of VKS in 9.0 — a journey described step by step in Deploying your first VKS cluster on VCF 9 — becomes native platform behavior in 9.1. The platform team doesn’t disappear; it moves from repetitive execution to defining profiles, guardrails and quotas. This is the shift from the operated cluster to the self-service platform.

Pitfalls & points of attention

Linked clones: storage / I/O coupling to watch
The shared parent disk becomes a read hotspot when many nodes boot simultaneously, and delta disks grow with usage. Size the datastore on parent + delta growth, not on N × a fixed image. Watch parent read latency during mass rolling upgrades: the speed gain can shift into I/O pressure if the datastore is undersized.
500 clusters: etcd capacity is not free
The figure is a validated architectural ceiling, not a target to aim for. Five hundred clusters means as many CAPI reconciliation loops and a massive etcd load on the Supervisor. Size the Supervisor control plane (HA, etcd IOPS, API server memory) and instrument the control plane before approaching density. Keep headroom — a Supervisor at 95% of its reconciliation capacity is fragile.
CaaS: quota governance is defined upfront
Self-service inherits quotas from the tenant construct. If tenant profiles are poorly calibrated, self-service propagates quotas that are too wide (loss of governance) or too tight (blocked provisioning) at scale. Scoping the profiles becomes critical work: that's now where governance is decided, not in manual execution.
Object Storage: Tech Preview, not GA
No production support, an API that is not yet stable, and functionality liable to change or be removed before GA. Use only to evaluate and prepare the target. Don't carry critical workloads or data on it without a fallback to a supported object solution. Track the release notes for the GA date and any API changes.
Identity inheritance: SSO consistency to validate
The CaaS namespace inherits the identity of the tenant's VCF perimeter. If the enterprise SSO mapping (groups, roles) isn't clean upstream, inheritance propagates inconsistent permissions to every new namespace. Validate identity federation and group mapping before opening self-service — fixing it afterward across dozens of namespaces is expensive.
Blast radius: fewer Supervisors = wider perimeter
Consolidating the fleet onto a single Supervisor at 500 clusters simplifies governance but widens the blast radius: a control-plane incident potentially impacts the organization's entire Kubernetes fleet. Arbitrate consolidation vs isolation explicitly, and treat the Supervisor as a maximum-criticality target in the resilience plan.
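The quota-calibration pitfall above lends itself to a mechanical pre-check before self-service is opened. The sketch below is one possible heuristic, not a VCF feature: the function name, the CPU-only model and all numbers are hypothetical. It flags a profile as "too wide" when the expected namespaces could together claim more than the tenant's capacity, and "too tight" when the quota is below a typical team request.

```python
def check_cpu_quota(namespace_quota, expected_namespaces,
                    tenant_capacity, typical_request):
    """Return a list of calibration findings for one tenant profile (CPU cores)."""
    findings = []
    if namespace_quota * expected_namespaces > tenant_capacity:
        findings.append("too wide")   # namespaces could oversubscribe the tenant
    if namespace_quota < typical_request:
        findings.append("too tight")  # ordinary requests would be blocked
    return findings


# Hypothetical tenant: 512 cores, about 20 namespaces, teams usually need 8 cores.
print(check_cpu_quota(64, 20, 512, 8))  # ['too wide']
print(check_cpu_quota(4, 20, 512, 8))   # ['too tight']
print(check_cpu_quota(16, 20, 512, 8))  # []
```

Running a check like this against every profile is cheap compared with discovering a miscalibrated quota after it has been inherited by dozens of namespaces.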

Conclusion

Provisioning ready for self-service

Linked clones make deploy and upgrade fast enough to expose to teams without saturating storage. Upgrades become operationally trivial.

Multi-tenant scale

500 clusters per Supervisor reach internal-cloud-provider scale without fragmenting governance — provided etcd and the control plane are sized for it.

Encoded governance

CaaS and object storage (Tech Preview) move governance into the construct. The platform team defines the profiles instead of executing each onboarding by hand.

Next step. The fourth and final article in the series covers VCF 9.1 security & resilience — the defensive counterpart to this self-service opening: the more you expose the platform, the more security guardrails and resilience mechanisms become structural. Read it alongside the networking & scale article, which lays the network foundations of this same fleet.
