VCF 9.1: security & resilience — live patching and anti-ransomware

The security of a private platform isn’t decided by a spectacular feature: it’s decided by your patching cadence, by the configuration drift you don’t see, and by the day ransomware encrypts the production datastore. VCF 9.1 attacks these three fronts with changes that directly modify your runbooks — not just your slides.

If you’ve ever deferred a critical ESX patch because the maintenance window didn’t exist, or discovered a host had drifted from baseline three months after the audit, this article is for you. We look at what actually changes, and what it implies for architecture and the recovery plan.

Zero-downtime patching · Continuous compliance · On-prem anti-ransomware

Series 'What's new in VCF 9.1' — 4/4

A mini-series on what’s new in VMware Cloud Foundation 9.1:

Infrastructure efficiency & TCO
Networking & scale
Kubernetes & self-service
Security & resilience (this article — the last in the series)

Visual credits

This article is based on the official VCF 9.1 documentation and blog (links at the end). Synthesis and analysis are my own.

Live Patching for ESX hosts (TPM)

The hidden cost of ESX patching isn’t applying the patch: it’s the evacuation. To patch a host today you put it in maintenance mode, vMotion every VM elsewhere, reboot, exit maintenance, then rebalance. Multiply that across a fleet of several hundred hosts and you get weeks of negotiated maintenance windows, and a patching cadence that falls behind the CVE rhythm.

VCF 9.1 introduces Live Patching for ESX hosts equipped with a TPM. The principle: the patch is applied directly into kernel memory on the running host, with no reboot and no VM evacuation. The TPM acts as the trust anchor that validates the integrity of the patched code before it is applied live — which is exactly why TPM hardware is a non-negotiable prerequisite, not a convenience option.

Aspect	VCF 9.0	VCF 9.1
Applying an ESX patch	Maintenance mode required	Live patch in kernel memory (TPM hosts)
VM evacuation	Systematic (full vMotion)	None for ~80% of patches
Host reboot	Almost always	Avoided for ~80% of patches
Hardware prerequisite	None specific	TPM enabled and provisioned
Realistic patching cadence	Constrained by windows	Decoupled from windows for the majority

The exact scope, no embellishment. Live Patching covers roughly 80% of fixes — typically kernel and module security fixes that don’t touch the low-level structures requiring a reinitialization. The remaining ~20% — microcode changes, firmware updates, deep structural kernel changes, major version upgrades — still require a reboot and therefore an evacuation. Live Patching doesn’t remove the maintenance window: it reserves it for the cases that genuinely need it.

Fleet-scale impact. This is where the value shows. On a fleet of 300 hosts, moving 80% of patches to live application leaves only the remaining 20% needing a window, which divides the number of maintenance windows to coordinate by five. Concretely: a critical security patch that qualifies for live application can be deployed across every TPM-equipped host during business hours, without touching workload SLAs. The strategic consequence is a change of cadence: you can target weekly security patching instead of a negotiated quarterly cycle.
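The divide-by-five figure follows from simple arithmetic. The sketch below is a back-of-the-envelope model, not a sizing tool: the function name and the figure of 10 patches per year are assumptions chosen for illustration; only the 300 hosts and the 80% live share come from the text above.

```python
def host_windows_per_year(hosts: int, patches_per_year: int, live_share: float) -> int:
    """Host-level maintenance windows still needed per year.

    live_share is the fraction of patches applied live (no reboot, no evacuation).
    """
    if not 0.0 <= live_share <= 1.0:
        raise ValueError("live_share must be between 0 and 1")
    # Patches that still need a reboot, hence an evacuation and a window.
    reboot_patches = round(patches_per_year * (1.0 - live_share))
    return hosts * reboot_patches


# Assumption for illustration: 10 patches per year on a 300-host fleet.
before = host_windows_per_year(hosts=300, patches_per_year=10, live_share=0.0)
after = host_windows_per_year(hosts=300, patches_per_year=10, live_share=0.8)
print(before, after, before / after)
```

Under these assumptions the count drops from 3000 host-level windows to 600. The ratio stays at five whatever the patch volume, as long as the 80% share holds.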

The heterogeneous estate trap

If part of your fleet has no TPM (older hardware, poorly equipped edge sites), you end up with two patching regimes: live for TPM hosts, classic window for the rest. Mapping the TPM estate is a prerequisite to designing the patching strategy, not something to discover in production.
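A TPM map can be as simple as a grouping of the inventory by regime. The following is a minimal sketch under stated assumptions: the `Host` record, its field names, and the host names are hypothetical, and the inventory would in practice come from whatever source of truth you already maintain.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Host:
    name: str
    tpm_present: bool
    tpm_provisioned: bool


def patching_regime(host: Host) -> str:
    """'live' only when the TPM is both present and provisioned."""
    return "live" if host.tpm_present and host.tpm_provisioned else "window"


def tpm_map(hosts: list[Host]) -> dict[str, list[str]]:
    """Group host names by patching regime."""
    regimes: dict[str, list[str]] = {"live": [], "window": []}
    for host in hosts:
        regimes[patching_regime(host)].append(host.name)
    return regimes


# Hypothetical inventory for illustration.
inventory = [
    Host("esx-core-01", tpm_present=True, tpm_provisioned=True),
    Host("esx-core-02", tpm_present=True, tpm_provisioned=False),
    Host("esx-edge-01", tpm_present=False, tpm_provisioned=False),
]
regimes = tpm_map(inventory)
print(regimes)
```

Note the second host: a TPM that is present but not provisioned lands in the classic regime, which is the kind of gap worth finding before the first live patch campaign.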

Continuous compliance (Advanced Cyber Compliance)

In VCF 9.0, compliance is a snapshot. You run a scan, you get a report, you remediate, and between two scans configuration drift lives its own life. The structural problem: the window between two audits is exactly where a mishandled change, an undocumented modification, or a policy regression settles in with no alert.

VCF 9.1 moves compliance from a point-in-time model to a continuous one, with unified security posture management (Advanced Cyber Compliance) covering the whole VCF stack — vCenter, ESX, NSX, vSAN — from a single control plane. Compliance is no longer a photo taken at audit time: it’s a stream. Remediation becomes continuous: a detected drift triggers a remediation action instead of waiting for the next scan cycle.

Aspect	VCF 9.0	VCF 9.1
Compliance model	On-demand point-in-time scan	Continuous assessment
Drift detection	At next scan	Continuous
Remediation	Manual after report	Continuous remediation triggered
Scope	Component by component	Unified multi-component posture
Risk window	Between two audits	Reduced to the detection delay

What it changes for architects. Continuous compliance shifts the work: less time orchestrating scan campaigns, more time defining accurate baselines and managing noise. The point-in-time model produced a report every quarter; the continuous model produces a permanent event stream. The value depends entirely on baseline quality: a baseline that’s too strict generates a flood of false positives that drowns the real signal; one that’s too lax misses the drifts that matter.
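To make the role of the baseline concrete, here is a minimal sketch of drift detection as a comparison between a baseline and an observed configuration. It does not reflect how Advanced Cyber Compliance is implemented; the setting names are hypothetical and only illustrate the idea.

```python
from __future__ import annotations

_MISSING = object()


def detect_drift(
    baseline: dict[str, object], observed: dict[str, object]
) -> list[dict[str, object]]:
    """Return one drift event per baseline setting whose observed value differs."""
    events = []
    for setting, expected in baseline.items():
        actual = observed.get(setting, _MISSING)
        if actual != expected:
            events.append(
                {
                    "setting": setting,
                    "expected": expected,
                    # A setting absent from the observed state is reported as None.
                    "actual": None if actual is _MISSING else actual,
                }
            )
    return events


# Hypothetical setting names, for illustration only.
baseline = {"ssh_enabled": False, "lockdown_mode": "normal", "ntp_configured": True}
observed = {"ssh_enabled": True, "lockdown_mode": "normal"}
events = detect_drift(baseline, observed)
print(events)
```

Only settings listed in the baseline can ever produce an event. That is the mechanism behind the lax-baseline problem: whatever the baseline leaves out drifts silently.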

Migration note. If you already run point-in-time compliance scans on 9.0, don’t migrate by turning everything on at once. Continuous remediation against an unrefined baseline can trigger unwanted corrective actions on production. The healthy sequence: enable in observation mode, refine baselines over two to four weeks, then switch automatic remediation on component by component.
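The staged sequence can be pictured as a per-component mode switch. This is a sketch of the rollout logic only, assuming hypothetical component labels and mode names; it is not a product configuration format.

```python
from __future__ import annotations

from enum import Enum


class Mode(Enum):
    OBSERVE = "observe"      # log drift, change nothing
    REMEDIATE = "remediate"  # apply the corrective action automatically


def action_for_drift(component: str, modes: dict[str, Mode]) -> str:
    """Decide what to do with a drift event on a given component."""
    # Unknown components default to observation: the safe side of the rollout.
    mode = modes.get(component, Mode.OBSERVE)
    return "auto-remediate" if mode is Mode.REMEDIATE else "log-only"


# Start: everything in observation mode while baselines are refined.
rollout = {c: Mode.OBSERVE for c in ("vcenter", "esx", "nsx", "vsan")}
# Later: switch automatic remediation on one component at a time.
rollout["esx"] = Mode.REMEDIATE

print(action_for_drift("esx", rollout), action_for_drift("nsx", rollout))
```

Defaulting to observation for anything not explicitly switched means a newly added component can never be remediated automatically by accident.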

On-prem anti-ransomware recovery

The scenario that loses companies: ransomware encrypts the production datastores, and the backups — connected to the same domain, the same network, sometimes the same credentials — get encrypted too. A backup restored into a still-compromised environment gets re-encrypted within the hour. Recovery without a clean isolated environment isn’t recovery: it’s a relapse.

VCF 9.1 integrates cyber recovery directly into the on-prem platform, built around the concept of an Isolated Recovery Environment (IRE) — often called a “clean room.” The idea: a restore environment physically and logically isolated from the production network and production identity, in which you restore, validate, and remediate before any reconnection. This is the pillar that turns a backup into a real recovery capability against ransomware.

Three technical building blocks make up the capability:

vSAN for Recovery — a recovery storage tier built on native vSAN snapshots. The snapshots serve as immutable restore points, independent of the compromised primary backup chain.

Isolated Recovery Environment (IRE) — the clean room: a restore environment cut off from the production network and identity. You restore and validate there with no re-encryption risk.

CrowdStrike EDR integration — the recovery workflow integrates a CrowdStrike EDR scan on restored workloads, to validate that a restored load is clean before reintroducing it to production.
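The way these three blocks chain together can be sketched as a gate: nothing restored in the clean room returns to production without a clean verdict. The code below is an illustration of that gating logic only. The function name, workload names, and the `is_clean` callable are hypothetical stand-ins; it does not call any VCF or CrowdStrike API.

```python
from __future__ import annotations

from typing import Callable, Iterable


def gate_reintroduction(
    restored: Iterable[str],
    is_clean: Callable[[str], bool],
) -> tuple[list[str], list[str]]:
    """Split restored workloads into (cleared, held) based on a scan verdict."""
    cleared: list[str] = []
    held: list[str] = []
    for workload in restored:
        # A workload returns to production only after an explicit clean verdict.
        (cleared if is_clean(workload) else held).append(workload)
    return cleared, held


# Stand-in for the EDR verdict; a real workflow would query the scanner instead.
fake_verdicts = {"app-db": True, "app-web": False, "file-srv": True}
cleared, held = gate_reintroduction(fake_verdicts, lambda name: fake_verdicts[name])
print(cleared, held)
```

Held workloads stay in the IRE for remediation or for a restore from an older snapshot, which is why the restore points need to be independent of the compromised backup chain.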

Aspect	VCF 9.0	VCF 9.1
Cyber recovery	External third-party solution to integrate	Integrated into the on-prem platform
Isolated environment (IRE)	Build/operate it yourself	Native clean room concept
Restore points	Classic backup chain	Immutable native vSAN snapshots
Workload validation	Manual / off-platform	CrowdStrike EDR in the workflow
Isolation guarantee	Depends on the in-house design	Network and identity isolation by design

The key point about the IRE. A clean room isn’t a “second datacenter.” It’s an environment whose isolation is a disciplined property: no network route to production, no shared credentials, no common identity domain. Isolation discipline matters more than the technology: a poorly isolated IRE gives a false sense of security, which is worse than no IRE at all, because the crisis runbook will rely on it on D-day.
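The three isolation properties lend themselves to a recurring audit. The sketch below checks a declared description of both environments, assuming a hypothetical `Environment` record. It validates the documented design, not live reachability, so it complements rather than replaces an actual network test.

```python
from __future__ import annotations

import ipaddress
from dataclasses import dataclass


@dataclass(frozen=True)
class Environment:
    networks: frozenset[str]        # CIDRs the environment owns
    reachable: frozenset[str]       # CIDRs it has a route to
    credential_ids: frozenset[str]  # identifiers of accounts/secrets in use
    identity_domain: str


def _overlaps(cidrs_a: frozenset[str], cidrs_b: frozenset[str]) -> bool:
    return any(
        ipaddress.ip_network(a).overlaps(ipaddress.ip_network(b))
        for a in cidrs_a
        for b in cidrs_b
    )


def isolation_violations(prod: Environment, ire: Environment) -> list[str]:
    """List every isolation property the IRE breaks with respect to production."""
    violations = []
    if _overlaps(ire.reachable, prod.networks) or _overlaps(prod.reachable, ire.networks):
        violations.append("network route between IRE and production")
    if ire.credential_ids & prod.credential_ids:
        violations.append("shared credentials")
    if ire.identity_domain.lower() == prod.identity_domain.lower():
        violations.append("common identity domain")
    return violations


# Hypothetical declarations for illustration.
prod = Environment(
    frozenset({"10.0.0.0/16"}), frozenset({"10.0.0.0/16"}),
    frozenset({"svc-backup"}), "corp.example",
)
ire = Environment(
    frozenset({"192.168.50.0/24"}), frozenset({"192.168.50.0/24", "10.0.8.0/24"}),
    frozenset({"svc-backup", "ire-admin"}), "ire.example",
)
violations = isolation_violations(prod, ire)
print(violations)
```

In this example the IRE fails on two counts: a leftover route into a production subnet and a reused backup service account. An empty list is the only acceptable result before the IRE goes into the crisis runbook.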

Resilience by design: what changes for the recovery plan

Taken individually, these three changes are features. Taken together, they redraw the recovery plan. Here’s how your runbooks must evolve.

The patching runbook changes nature. Before: a quarterly project with negotiated windows, business communication, and a host-by-host rollback plan. After: a continuous process for roughly 80% of patches, and a window runbook reserved for the ~20% that still reboot. The consequence is that you need two distinct runbooks rather than one adapted runbook, plus a TPM map that determines which host follows which regime.
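Choosing between the two runbooks comes down to two questions per patch and per host. The sketch below encodes that decision, with loud assumptions: the patch-kind labels are hypothetical, and treating every other kind as live-patchable is a simplification. In practice the eligibility of a given patch should be read from its release notes.

```python
from __future__ import annotations

# Patch kinds the article lists as still requiring a reboot (labels are hypothetical).
REBOOT_KINDS = {"microcode", "firmware", "structural-kernel", "major-upgrade"}


def select_runbook(patch_kind: str, host_has_provisioned_tpm: bool) -> str:
    """Return 'live' or 'window' for one (patch, host) pair."""
    if patch_kind in REBOOT_KINDS:
        return "window"  # reboot and evacuation still required
    if not host_has_provisioned_tpm:
        return "window"  # no TPM: classic regime
    return "live"


print(select_runbook("security-fix", True))
print(select_runbook("security-fix", False))
print(select_runbook("firmware", True))
```

The same security fix can therefore follow both runbooks in a single campaign on a heterogeneous estate, which is why the two cannot be merged into one.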

The compliance runbook moves from campaign to supervision. The skill to hire or train is no longer “knowing how to run and interpret a quarterly scan” but “knowing how to build accurate baselines and manage a drift event stream without drowning.” This is detection engineering work, not point-in-time auditing.

The cyber-recovery runbook becomes testable. The major contribution of the integrated IRE isn’t the technology: it’s that a cyber-recovery exercise becomes a repeatable process rather than an ad hoc project. The ransomware recovery plan must now include: clean room sizing, an audited isolation discipline, an EDR validation procedure before reintroduction, and — above all — a schedule of regular exercises. An IRE that’s never tested isn’t a recovery capability, it’s a hypothesis.

Resilience is not a checkbox

Enabling Live Patching, continuous compliance, and the IRE is not enough. These capabilities are only worth something if the runbooks are rewritten, baselines refined, and recovery exercises actually run. Technology shifts the work, it doesn’t remove it.

Pitfalls & points of attention

Live Patching requires a TPM — and doesn't cover everything

Live Patching is only available on hosts with a TPM that is enabled and provisioned. On a heterogeneous estate, hosts without a TPM stay on the classic patching regime. In addition, about 20% of patches (microcode, firmware, structural kernel changes, major version upgrades) still require a reboot and an evacuation. Mapping the TPM estate and classifying patches by regime are prerequisites, not things to discover in production.

Continuous compliance: noise and false positives

A baseline that's too strict generates a continuous flood of false positives that drowns the real signal and ends up ignored by teams. One that's too lax misses the drifts that matter. Enable in observation mode, refine baselines over two to four weeks before any automatic remediation, and treat baseline management as permanent engineering work, not an initial setup.

Continuous remediation against an unrefined baseline in production

Enabling automatic remediation against an unvalidated baseline can trigger unwanted corrective actions on production workloads. Switch automatic remediation on component by component, never all at once across the whole stack, and keep an observation mode active on any recently changed component.

Clean room sizing and isolation discipline

An undersized IRE cannot restore the full critical perimeter within the target RTO: size it on the real recovery scenario, not a convenient subset. More critical still: isolation is a discipline, not a checkbox. No network route to production, no shared credentials, no common identity domain. A poorly isolated IRE gives a false sense of security, which is worse than having no IRE.
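A first-order sizing check compares the time to restore and validate the critical perimeter with the target RTO. The model below is deliberately crude and every figure in it is an assumption for illustration: it treats restore throughput as constant and validation as a fixed overhead.

```python
from __future__ import annotations


def fits_rto(
    data_tb: float,
    restore_tb_per_hour: float,
    validation_hours: float,
    rto_hours: float,
) -> tuple[float, bool]:
    """Return (estimated recovery hours, whether that fits within the RTO)."""
    if restore_tb_per_hour <= 0:
        raise ValueError("restore_tb_per_hour must be positive")
    total_hours = data_tb / restore_tb_per_hour + validation_hours
    return total_hours, total_hours <= rto_hours


# Assumed figures: 120 TB critical perimeter, 4 h of validation, 24 h RTO.
well_sized = fits_rto(120, restore_tb_per_hour=10, validation_hours=4, rto_hours=24)
undersized = fits_rto(120, restore_tb_per_hour=4, validation_hours=4, rto_hours=24)
print(well_sized, undersized)
```

With these numbers, a clean room that restores 10 TB per hour finishes in 16 hours, while one limited to 4 TB per hour needs 34 hours and misses the RTO. Only a real exercise gives you the throughput figure to plug in.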

vSAN for Recovery snapshot capacity

The native vSAN snapshots used as restore points consume capacity, and that consumption grows with workload change rate and retention depth. Size vSAN for Recovery capacity on the real change rate and target retention, and monitor consumption as a finite resource so that you don't discover a silent snapshot-creation failure on the day of the incident.
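The relationship between change rate, retention, and capacity can be estimated with a simple linear model. This is a rough upper bound under stated assumptions, not a vSAN sizing formula: it ignores deduplication, compression, and blocks rewritten several times within the retention period. The 80% alert threshold and all figures are illustrative.

```python
def snapshot_capacity_tb(
    protected_tb: float, daily_change_pct: float, retention_days: int
) -> float:
    """Upper-bound estimate: each day's changed data is kept for the full retention."""
    return protected_tb * daily_change_pct / 100 * retention_days


def capacity_alert(used_tb: float, total_tb: float, threshold: float = 0.8) -> bool:
    """True when snapshot usage reaches the alert threshold (assumed 80%)."""
    return used_tb / total_tb >= threshold


# Assumed figures: 200 TB protected, 3% daily change, 14 days of retention.
needed_tb = snapshot_capacity_tb(200, daily_change_pct=3, retention_days=14)
print(needed_tb, capacity_alert(needed_tb, total_tb=100))
```

Here the estimate is 84 TB, which would already trip the alert on a 100 TB recovery tier. Doubling the retention or the change rate doubles the requirement, so both need to be measured rather than guessed.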

CrowdStrike licensing and integration prerequisites

EDR scanning in the recovery workflow assumes an operational CrowdStrike integration: licenses, sensor deployment, and connectivity from the isolated IRE without breaking isolation. Validate licensing prerequisites and test the EDR integration from the clean room before relying on it in the crisis runbook — not during the incident.

Conclusion

Patching decoupled from windows

Roughly 80% of ESX patches are applied live on TPM hosts, with no evacuation or reboot. The security cadence detaches from maintenance windows, but you now have two runbooks to maintain.

Compliance as a continuous stream

The risk window between two audits shrinks to the detection delay. The value depends entirely on baseline quality and noise management: this is detection engineering.

Testable recovery

The integrated IRE, vSAN for Recovery, and CrowdStrike EDR turn cyber recovery into a repeatable process — provided you actually exercise it.

End of the series 'What's new in VCF 9.1'

The four parts cover the essentials of VCF 9.1: efficiency & TCO, networking & scale, Kubernetes & self-service, and security & resilience. For the underlying architectural framework, see "The new VCF 9 architecture" and "Deploying your first VKS cluster".

Further reading.

VCF 9.1 Release Notes — the official detail of the security and resilience features
VCF 9.1: secure, cost-effective private cloud for production AI — official VCF blog
Announcing VCF 9.1 — official announcement
William Lam — community technical deep-dives
