The security of a private platform isn’t decided by a spectacular feature: it’s decided by your patching cadence, by the configuration drift you don’t see, and by the day ransomware encrypts the production datastore. VCF 9.1 attacks these three fronts with changes that directly modify your runbooks — not just your slides.
If you’ve ever deferred a critical ESX patch because the maintenance window didn’t exist, or discovered a host had drifted from baseline three months after the audit, this article is for you. We look at what actually changes, and what it implies for architecture and the recovery plan.
Series 'What's new in VCF 9.1' — 4/4
A mini-series on what’s new in VMware Cloud Foundation 9.1:
- Infrastructure efficiency & TCO
- Networking & scale
- Kubernetes & self-service
- Security & resilience (this article — the last in the series)
This article is based on the official VCF 9.1 documentation and blog (links at the end). Synthesis and analysis are my own.
Live Patching for ESX hosts (TPM)
The hidden cost of ESX patching isn’t applying the patch: it’s the evacuation. To patch a host today you put it in maintenance mode, vMotion every VM elsewhere, reboot, exit maintenance, then rebalance. Multiply that across a fleet of several hundred hosts and you get weeks of negotiated maintenance windows, and a patching cadence that falls behind the CVE rhythm.
VCF 9.1 introduces Live Patching for ESX hosts equipped with a TPM. The principle: the patch is applied directly into kernel memory on the running host, with no reboot and no VM evacuation. The TPM acts as the trust anchor that validates the integrity of the patched code before it is applied live — which is exactly why TPM hardware is a non-negotiable prerequisite, not a convenience option.
| Aspect | VCF 9.0 | VCF 9.1 |
|---|---|---|
| Applying an ESX patch | Maintenance mode required | Live patch in kernel memory (TPM hosts) |
| VM evacuation | Systematic (full vMotion) | None for ~80% of patches |
| Host reboot | Almost always | Avoided for ~80% of patches |
| Hardware prerequisite | None specific | TPM enabled and provisioned |
| Realistic patching cadence | Constrained by windows | Decoupled from windows for the majority |
The exact scope, no embellishment. Live Patching covers roughly 80% of fixes — typically kernel and module security fixes that don’t touch the low-level structures requiring a reinitialization. The remaining ~20% — microcode changes, firmware updates, deep structural kernel changes, major version upgrades — still require a reboot and therefore an evacuation. Live Patching doesn’t remove the maintenance window: it reserves it for the cases that genuinely need it.
Fleet-scale impact. This is where the value shows. On a fleet of 300 hosts, moving 80% of patches to live application divides the number of maintenance windows to coordinate by five, since only the remaining 20% still need one. Concretely: a critical security patch that qualifies for live application can be deployed across the whole TPM-equipped fleet during business hours, without touching workload SLAs. The strategic consequence is a cadence change: you can target weekly security patching instead of a negotiated quarterly cycle.
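The "divided by five" figure can be checked with a few lines of arithmetic. The sketch below is only an illustration: the number of patches per year and the number of hosts evacuated per window are hypothetical inputs, not figures from the VCF documentation, and it assumes every host in the fleet has a TPM.

```python
import math


def yearly_windows(hosts: int, patches_per_year: int,
                   live_percent: int, hosts_per_window: int) -> dict:
    """Count maintenance windows per year, with and without live patching."""
    # Windows needed to roll one reboot-requiring patch across the fleet.
    windows_per_patch = math.ceil(hosts / hosts_per_window)
    # Patches that still need a reboot (integer math avoids float rounding).
    reboot_patches = patches_per_year * (100 - live_percent) // 100
    return {
        "classic": patches_per_year * windows_per_patch,
        "with_live": reboot_patches * windows_per_patch,
    }


# Hypothetical inputs: 10 patches a year, 30 hosts evacuated per window.
result = yearly_windows(hosts=300, patches_per_year=10,
                        live_percent=80, hosts_per_window=30)
print(result)
print("reduction factor:", result["classic"] / result["with_live"])
```

With these assumed inputs the count drops from 100 windows to 20. The ratio stays at five whatever the window size, because it only depends on the share of patches that still require a reboot.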
The heterogeneous estate trap
If part of your fleet has no TPM (older hardware, poorly equipped edge sites), you end up with two patching regimes: live for TPM hosts, classic window for the rest. Mapping the TPM estate is a prerequisite to designing the patching strategy, not a production discovery.
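One way to make the two regimes explicit is to derive them from the inventory instead of discovering them during a rollout. The sketch below assumes a hand-built inventory: the `Host` record, its `tpm_ready` flag, and the host names are hypothetical, and nothing here calls a VCF or vSphere API.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Host:
    name: str
    site: str
    tpm_ready: bool  # assumption: TPM is both enabled and provisioned


def patch_regime(host: Host, patch_is_live_capable: bool) -> str:
    """Live patching needs both a TPM-ready host and a live-capable patch."""
    if host.tpm_ready and patch_is_live_capable:
        return "live"
    return "window"


def plan(hosts, patch_is_live_capable: bool) -> dict:
    """Group host names by the patching regime they fall under."""
    groups = defaultdict(list)
    for host in hosts:
        groups[patch_regime(host, patch_is_live_capable)].append(host.name)
    return dict(groups)


# Hypothetical inventory: two datacenter hosts with a TPM, one edge host without.
inventory = [
    Host("esx-01", "dc1", True),
    Host("esx-02", "dc1", True),
    Host("edge-01", "edge-a", False),
]

security_fix = plan(inventory, patch_is_live_capable=True)
firmware_update = plan(inventory, patch_is_live_capable=False)
print(security_fix)
print(firmware_update)
```

The useful output is the "window" group for a live-capable patch: it lists exactly the hosts that keep the estate on two regimes.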
Continuous compliance (Advanced Cyber Compliance)
In VCF 9.0, compliance is a snapshot. You run a scan, you get a report, you remediate, and between two scans configuration drift accumulates unobserved. The structural problem: the window between two audits is exactly where a mishandled change, an undocumented modification, or a policy regression settles in with no alert.
VCF 9.1 moves compliance from a point-in-time model to a continuous one, with unified security posture management (Advanced Cyber Compliance) covering the whole VCF stack — vCenter, ESX, NSX, vSAN — from a single control plane. Compliance is no longer a photo taken at audit time: it’s a stream. Remediation becomes continuous: a detected drift triggers a remediation action instead of waiting for the next scan cycle.
| Aspect | VCF 9.0 | VCF 9.1 |
|---|---|---|
| Compliance model | On-demand point-in-time scan | Continuous assessment |
| Drift detection | At next scan | Continuous |
| Remediation | Manual after report | Continuous remediation triggered |
| Scope | Component by component | Unified multi-component posture |
| Risk window | Between two audits | Reduced to the detection delay |
What it changes for architects. Continuous compliance shifts the work: less time orchestrating scan campaigns, more time defining accurate baselines and managing noise. The point-in-time model produced a report every quarter; the continuous model produces a permanent event stream. The value depends entirely on baseline quality: a baseline that’s too strict generates a flood of false positives that drowns the real signal; one that’s too lax misses the drifts that matter.
Migration note. If you already run point-in-time compliance scans on 9.0, don’t migrate by turning everything on at once. Continuous remediation against an unrefined baseline can trigger unwanted corrective actions on production. The healthy sequence: enable in observation mode, refine baselines over two to four weeks, then switch automatic remediation on component by component.
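The sequence above can be written down as a dated plan before anything is switched on. This is a planning sketch only: the component order and the one-week spacing between components are assumptions made for illustration, and the function produces a schedule, it does not configure the product.

```python
from datetime import date, timedelta

# Assumed order; pick the one that matches your own risk appetite.
COMPONENTS = ["vCenter", "ESX", "NSX", "vSAN"]


def rollout_schedule(start: date, observation_weeks: int = 4,
                     weeks_between_components: int = 1) -> list:
    """Return (date, action) steps: observe first, then remediate per component."""
    if not 2 <= observation_weeks <= 4:
        raise ValueError("observation period is expected to last 2 to 4 weeks")
    steps = [(start, "enable continuous assessment in observation mode")]
    first_switch = start + timedelta(weeks=observation_weeks)
    for index, component in enumerate(COMPONENTS):
        when = first_switch + timedelta(weeks=index * weeks_between_components)
        steps.append((when, f"enable automatic remediation for {component}"))
    return steps


schedule = rollout_schedule(date(2026, 1, 5), observation_weeks=3)
for when, action in schedule:
    print(when.isoformat(), action)
```

Keeping the observation period as an explicit parameter makes it harder to skip silently when the project is under time pressure.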
On-prem anti-ransomware recovery
The scenario that sinks companies: ransomware encrypts the production datastores, and the backups, connected to the same domain, the same network, sometimes the same credentials, get encrypted too. A backup restored into a still-compromised environment gets re-encrypted within the hour. Recovery without a clean isolated environment isn’t recovery: it’s a relapse.
VCF 9.1 integrates cyber recovery directly into the on-prem platform, built around the concept of an Isolated Recovery Environment (IRE) — often called a “clean room.” The idea: a restore environment physically and logically isolated from the production network and production identity, in which you restore, validate, and remediate before any reconnection. This is the pillar that turns a backup into a real recovery capability against ransomware.
Three technical building blocks make up the capability:
- vSAN for Recovery: a recovery storage tier built on native vSAN snapshots. The snapshots serve as immutable restore points, independent of the compromised primary backup chain.
- Isolated Recovery Environment (IRE): the clean room, a restore environment cut off from the production network and identity. You restore and validate there with no re-encryption risk.
- CrowdStrike EDR integration: the recovery workflow runs a CrowdStrike EDR scan on restored workloads to validate that each restored workload is clean before it is reintroduced to production.
| Aspect | VCF 9.0 | VCF 9.1 |
|---|---|---|
| Cyber recovery | External third-party solution to integrate | Integrated into the on-prem platform |
| Isolated environment (IRE) | Build/operate it yourself | Native clean room concept |
| Restore points | Classic backup chain | Immutable native vSAN snapshots |
| Workload validation | Manual / off-platform | CrowdStrike EDR in the workflow |
| Isolation guarantee | Depends on the in-house design | Network and identity isolation by design |
The key point about the IRE. A clean room isn’t a “second datacenter.” It’s an environment whose isolation is a disciplined property: no network route to production, no shared credentials, no common identity domain. Isolation discipline matters more than the technology: a poorly isolated IRE gives a false sense of security, which is worse than no IRE at all, because the crisis runbook will rely on it on D-day.
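The three isolation properties named above (no route, no shared credentials, no common identity domain) lend themselves to an automated audit. The sketch below works on a hand-maintained description of each environment; the `Environment` record, its fields, and the sample values are all hypothetical, and a real audit would need to source this data from the network and identity systems themselves.

```python
from dataclasses import dataclass, field


@dataclass
class Environment:
    name: str
    identity_domains: set = field(default_factory=set)
    credential_ids: set = field(default_factory=set)
    networks: set = field(default_factory=set)            # subnets it owns
    reachable_networks: set = field(default_factory=set)  # subnets it can route to


def isolation_violations(prod: Environment, ire: Environment) -> list:
    """List every property that breaks the clean room's isolation."""
    violations = []
    shared_domains = prod.identity_domains & ire.identity_domains
    if shared_domains:
        violations.append(f"shared identity domain: {sorted(shared_domains)}")
    shared_credentials = prod.credential_ids & ire.credential_ids
    if shared_credentials:
        violations.append(f"shared credentials: {sorted(shared_credentials)}")
    # A route in either direction counts as a violation.
    routes = (ire.reachable_networks & prod.networks) | (
        prod.reachable_networks & ire.networks)
    if routes:
        violations.append(f"network route between environments: {sorted(routes)}")
    return violations


prod = Environment("prod", {"corp.example"}, {"svc-backup"}, {"10.0.0.0/16"})
clean_room = Environment("ire", {"ire.example"}, {"svc-ire"}, {"172.16.0.0/24"})
leaky_room = Environment("ire-bad", {"corp.example"}, {"svc-backup"},
                         {"172.16.1.0/24"}, {"10.0.0.0/16"})

print(isolation_violations(prod, clean_room))
print(isolation_violations(prod, leaky_room))
```

An empty list is the only acceptable result; anything else means the crisis runbook would be relying on an environment that is not actually isolated.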
Resilience by design: what changes for the recovery plan
Taken individually, these three changes are features. Taken together, they redraw the recovery plan. Here’s how your runbooks must evolve.
The patching runbook changes nature. Before: a quarterly project with negotiated windows, business communication, and a host-by-host rollback plan. After: a continuous process for ~80% of patches, and a window runbook reserved for the ~20% that still reboot. The consequence is that you need two distinct runbooks rather than one adapted runbook, plus a TPM map that determines which host follows which regime.
The compliance runbook moves from campaign to supervision. The skill to hire or train is no longer “knowing how to run and interpret a quarterly scan” but “knowing how to build accurate baselines and manage a drift event stream without drowning.” This is detection engineering work, not point-in-time auditing.
The cyber-recovery runbook becomes testable. The major contribution of the integrated IRE isn’t the technology: it’s that a cyber-recovery exercise becomes a repeatable process rather than an ad hoc project. The ransomware recovery plan must now include: clean room sizing, an audited isolation discipline, an EDR validation procedure before reintroduction, and — above all — a schedule of regular exercises. An IRE that’s never tested isn’t a recovery capability, it’s a hypothesis.
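The reintroduction gate and the exercise schedule can both be expressed as simple, testable rules. The sketch below is a hypothetical encoding of that runbook logic: the verdict strings, the step wording, and the 90-day exercise interval are assumptions for illustration, and it does not call CrowdStrike or VCF APIs.

```python
from datetime import date
from typing import Optional


def next_step(restored_in_ire: bool, edr_verdict: Optional[str]) -> str:
    """Decide the next runbook step for one workload; nothing skips the scan."""
    if not restored_in_ire:
        return "restore into the IRE"
    if edr_verdict is None:
        return "run EDR scan"
    if edr_verdict != "clean":
        return "remediate in the IRE, then rescan"
    return "reintroduce to production"


def exercise_overdue(last_exercise: date, today: date,
                     max_days: int = 90) -> bool:
    """True when the last recovery exercise is older than the agreed interval."""
    return (today - last_exercise).days > max_days


print(next_step(restored_in_ire=True, edr_verdict=None))
print(next_step(restored_in_ire=True, edr_verdict="infected"))
print(next_step(restored_in_ire=True, edr_verdict="clean"))
print(exercise_overdue(date(2026, 1, 1), date(2026, 6, 1)))
```

The point of the gate is that "reintroduce to production" is only reachable through a clean verdict obtained inside the IRE, which is the property a recovery exercise should verify each time it runs.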
Resilience is not a checkbox
Enabling Live Patching, continuous compliance, and the IRE is not enough. These capabilities are only worth something if the runbooks are rewritten, baselines refined, and recovery exercises actually run. Technology shifts the work; it doesn’t remove it.
Pitfalls & points of attention
- Live Patching requires a TPM and doesn't cover everything
- Continuous compliance: noise and false positives
- Continuous remediation against an unrefined baseline in production
- Clean room sizing and isolation discipline
- vSAN for Recovery snapshot capacity
- CrowdStrike licensing and integration prerequisites
Conclusion
Patching decoupled from windows
~80% of ESX patches applied live on TPM hosts, with no evacuation or reboot. The security cadence detaches from maintenance windows, at the cost of two runbooks to maintain.
Compliance as a continuous stream
The risk window between two audits shrinks to the detection delay. The value depends entirely on baseline quality and noise management: this is detection engineering.
Testable recovery
The integrated IRE, vSAN for Recovery, and CrowdStrike EDR turn cyber recovery into a repeatable process — provided you actually exercise it.
End of the series 'What's new in VCF 9.1'
The four parts cover the essentials of VCF 9.1: efficiency & TCO, networking & scale, Kubernetes & self-service, and security & resilience. For the underlying architectural framework, see 'The new VCF 9 architecture' and 'Deploying your first VKS cluster'.
Further reading.
- VCF 9.1 Release Notes — the official detail of the security and resilience features
- VCF 9.1: secure, cost-effective private cloud for production AI — official VCF blog
- Announcing VCF 9.1 — official announcement
- William Lam — community technical deep-dives