Overview
vSphere High Availability restarts VMs after a host failure — it takes minutes and results in a cold reboot for the protected workload. For most applications, that is acceptable. For a small class of workloads, it is not: payment processing systems, real-time control applications, and tier-0 databases where even a brief interruption is commercially or operationally unacceptable.
vSphere Fault Tolerance (FT) addresses this by running a live shadow copy of the protected VM on a separate host at all times. The shadow VM, called the secondary VM, is kept continuously synchronised with the primary VM: the primary’s execution state, memory changes, and I/O results are streamed to the secondary as they occur. If the primary host fails, the secondary VM takes over within a fraction of a second, with no restart and no loss of committed data. The failover is transparent to the guest OS, and clients connected to the VM see at most a brief pause.
How FT Works — vLockstep
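FT was originally built on a technology called vLockstep to keep the primary and secondary VMs in synchrony. The primary VM runs normally on its host. Every non-deterministic event (interrupts, timer ticks, DMA results) is captured and replayed on the secondary host, so the secondary follows exactly the same execution path as the primary. Note that vLockstep record-and-replay is the mechanism of legacy single-vCPU FT; multi-vCPU (SMP) FT in vSphere 6.0 and later reaches the same outcome with Fast Checkpointing, which ships frequent checkpoints of changed memory and device state to the secondary instead of replaying individual events.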
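The record-and-replay idea can be illustrated with a toy model. The sketch below is not how ESXi is implemented; `ToyVM`, `run_primary`, and `replay_on_secondary` are hypothetical names, and the "VM" is just a counter. It only shows the principle: if all non-deterministic inputs are logged on the primary and fed to the secondary in the same order, both end in the same state.

```python
import random
from dataclasses import dataclass, field
from typing import List, Tuple

Event = Tuple[str, int]  # (kind, payload), a stand-in for a non-deterministic input


@dataclass
class ToyVM:
    """Toy 'VM': its state changes only through the events it is given."""
    counter: int = 0
    history: List[int] = field(default_factory=list)

    def apply(self, event: Event) -> None:
        kind, value = event
        if kind == "interrupt":
            self.counter += value
        elif kind == "timer":
            self.counter *= 2
        elif kind == "dma":
            self.counter ^= value
        self.history.append(self.counter)


def run_primary(vm: ToyVM, steps: int, seed: int) -> List[Event]:
    """Run the primary and record every non-deterministic event it sees."""
    rng = random.Random(seed)
    log: List[Event] = []
    for _ in range(steps):
        event = (rng.choice(["interrupt", "timer", "dma"]), rng.randint(1, 255))
        log.append(event)   # capture before applying, as a logging channel would
        vm.apply(event)
    return log


def replay_on_secondary(vm: ToyVM, log: List[Event]) -> None:
    """Feed the recorded events to the secondary in the original order."""
    for event in log:
        vm.apply(event)


primary, secondary = ToyVM(), ToyVM()
event_log = run_primary(primary, steps=50, seed=7)
replay_on_secondary(secondary, event_log)
print(primary.history == secondary.history)  # True: identical execution path
```

The important property is that `apply` itself is deterministic; all randomness lives in the logged events, which is why replaying the log is sufficient.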
The replication stream flows over a dedicated VMkernel network called the FT Logging network. This network carries a continuous stream of execution events and memory updates from the primary to the secondary. The secondary host is constantly processing this stream, maintaining a VM that is always at most a few milliseconds behind the primary’s current state.
When the secondary takes over after a primary host failure, it is already running and holds all state the primary had acknowledged. Clients see at most a brief pause, roughly the time it takes for an in-flight network packet to be retransmitted, rather than the minutes of a VM restart.
FT vs vSphere HA
| Characteristic | vSphere HA | Fault Tolerance |
|---|---|---|
| Recovery mechanism | VM restarts on another host | Secondary VM takes over instantly |
| RPO (data loss) | Data in flight at the time of failure is lost | Zero — secondary is always current |
| RTO (downtime) | Minutes (VM restart + boot time) | Near-zero (sub-second) |
| Resource overhead | Minimal — no duplicate VM | High — secondary consumes equal CPU and memory |
| Scope of protection | Host failure | Host failure |
| Configuration complexity | Low | High |
| Licence requirement | Included with vSphere | Enterprise Plus for more than 2 vCPUs |
FT is not a replacement for HA — it is a complement. HA remains enabled on the cluster and handles FT’s own recovery orchestration (choosing where to create a new secondary if the current secondary fails, for example).
Licensing
The number of vCPUs a Fault Tolerance-protected VM can have depends on the vSphere licence:
- Standard licence: FT protects VMs with up to 2 vCPUs
- Enterprise Plus licence: FT protects VMs with up to 8 vCPUs
Legacy FT, the only form available before vSphere 6.0, was limited to 1 vCPU regardless of licence. SMP FT, introduced in vSphere 6.0 and matured in 6.5, extended protection to multi-vCPU VMs. vSphere 8.0 retains the 8 vCPU maximum with Enterprise Plus.
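As a quick illustration, the licence limits above can be expressed as a lookup. This is a minimal sketch: the licence keys and the function name `ft_vcpu_allowed` are hypothetical, and the limits are taken directly from the list above.

```python
# vCPU ceilings per licence, as stated in this section.
MAX_FT_VCPUS = {
    "standard": 2,
    "enterprise_plus": 8,
}


def ft_vcpu_allowed(licence: str, vcpus: int) -> bool:
    """Return True if a VM with `vcpus` vCPUs can be FT-protected under `licence`."""
    if licence not in MAX_FT_VCPUS:
        raise ValueError(f"unknown licence: {licence!r}")
    return 1 <= vcpus <= MAX_FT_VCPUS[licence]


print(ft_vcpu_allowed("standard", 4))         # False: needs Enterprise Plus
print(ft_vcpu_allowed("enterprise_plus", 4))  # True
```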
FT Requirements
FT imposes stricter prerequisites than HA or DRS because the replication mechanism is more demanding:
- vSphere HA must be enabled on the cluster — FT relies on HA for secondary VM management and orchestration after a failover
- Shared storage: VM files (configuration, logs, swap) must be on shared storage accessible by both the primary and secondary host
- Hardware virtualisation: Hardware virtualisation (Intel VT-x or AMD-V) must be enabled in the host BIOS, and the CPU must support hardware MMU virtualisation (Intel EPT or AMD RVI/NPT). FT’s replication mechanism depends on both
- CPU compatibility: Both the primary and secondary host CPUs must be vMotion-compatible. FT can work without EVC but EVC is recommended in mixed-generation clusters
- FT Logging VMkernel: A dedicated VMkernel adapter tagged for FT Logging traffic must exist on each participating host. 10 GbE is required; the network should be dedicated and low-latency
- SSL certificate checking: Must be enabled in vCenter Server settings
- No active snapshots: FT cannot be enabled on a VM that currently has snapshots
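The checklist above lends itself to a pre-flight check. The sketch below works on plain records that an administrator might populate from an inventory export; `ClusterFacts`, `HostFacts`, and `ft_prerequisite_problems` are hypothetical names, not vSphere API types. CPU compatibility between hosts is deliberately left out because it cannot be reduced to a single flag.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ClusterFacts:               # hypothetical record
    ha_enabled: bool
    ssl_cert_checking: bool


@dataclass
class HostFacts:                  # hypothetical record
    name: str
    hw_mmu_virtualisation: bool   # Intel EPT or AMD RVI/NPT available
    ft_logging_gbps: float        # 0 means no FT Logging VMkernel adapter
    sees_vm_datastore: bool       # VM files reachable from this host


def ft_prerequisite_problems(cluster: ClusterFacts, primary: HostFacts,
                             secondary: HostFacts, snapshot_count: int) -> List[str]:
    """Return human-readable problems; an empty list means the checks passed."""
    problems: List[str] = []
    if not cluster.ha_enabled:
        problems.append("vSphere HA is not enabled on the cluster")
    if not cluster.ssl_cert_checking:
        problems.append("SSL certificate checking is disabled in vCenter")
    if snapshot_count > 0:
        problems.append(f"VM has {snapshot_count} snapshot(s)")
    for host in (primary, secondary):
        if not host.sees_vm_datastore:
            problems.append(f"{host.name}: VM files are not on storage visible to this host")
        if not host.hw_mmu_virtualisation:
            problems.append(f"{host.name}: no hardware MMU virtualisation (EPT or RVI)")
        if host.ft_logging_gbps < 10:
            problems.append(f"{host.name}: FT Logging adapter missing or slower than 10 GbE")
    return problems


cluster = ClusterFacts(ha_enabled=True, ssl_cert_checking=True)
esx1 = HostFacts("esx1", True, 10, True)
esx2 = HostFacts("esx2", True, 1, True)   # only 1 GbE for FT Logging
for problem in ft_prerequisite_problems(cluster, esx1, esx2, snapshot_count=0):
    print(problem)
```

Returning a list of problems rather than a single boolean makes it easier to fix everything in one pass before enabling FT.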
Unsupported Features
The lockstep replication mechanism imposes constraints on which VM features can be active while FT is enabled:
| Unsupported feature | Reason |
|---|---|
| Snapshots | State diverges between primary and secondary when a snapshot is taken |
| Storage vMotion | Disk location changes cannot be replicated via vLockstep |
| Linked clones | Not compatible with FT’s disk requirements |
| vVols datastores | FT requires VMFS or NFS shared storage |
| Disk encryption | I/O filter used for encryption is incompatible with FT |
| TPM (Trusted Platform Module) | Not supported |
| VBS (Virtualization-based Security) | Not supported |
| Physical RDMs | Only virtual RDMs are supported (and only for legacy 1-vCPU FT) |
| USB/sound/serial/parallel ports | Device pass-through cannot be replicated |
| NPIV | N-Port ID Virtualisation incompatible |
| Virtual disks larger than 2 TB | Not supported |
| Hot-plug of devices | Device state changes cannot be replicated mid-flight |
Note that VADP (vSphere Storage APIs for Data Protection) disk-only snapshots are supported and can be used for backup of FT-protected VMs, provided the backup software uses VADP rather than native snapshot management.
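The table above can be turned into a simple screening function for candidate VMs. This is a sketch under stated assumptions: the flag names in `UNSUPPORTED_FLAGS` and the `vm_config` dictionary layout are invented for illustration and do not correspond to real vSphere property names. Storage vMotion and hot-plug are operations rather than configuration states, so they are not modelled here.

```python
from typing import Dict, List

# Hypothetical config flag -> label from the unsupported-features table.
UNSUPPORTED_FLAGS: Dict[str, str] = {
    "has_snapshots": "Snapshots",
    "is_linked_clone": "Linked clones",
    "on_vvol_datastore": "vVols datastores",
    "disk_encryption": "Disk encryption",
    "has_tpm": "TPM (Trusted Platform Module)",
    "vbs_enabled": "VBS (Virtualization-based Security)",
    "has_physical_rdm": "Physical RDMs",
    "has_passthrough_ports": "USB/sound/serial/parallel ports",
    "uses_npiv": "NPIV",
}
MAX_DISK_TB = 2.0  # per-disk limit from the table


def ft_blockers(vm_config: dict) -> List[str]:
    """List the table entries that would prevent enabling FT on this VM."""
    blockers = [label for flag, label in UNSUPPORTED_FLAGS.items()
                if vm_config.get(flag)]
    if any(size > MAX_DISK_TB for size in vm_config.get("disk_sizes_tb", [])):
        blockers.append("Virtual disks larger than 2 TB")
    return blockers


candidate = {"has_tpm": True, "disk_sizes_tb": [0.5, 4.0]}
print(ft_blockers(candidate))
```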
FT and DRS
DRS does not automatically migrate FT secondary VMs. The secondary is placed by vCenter when FT is enabled and remains on that host unless an administrator intervenes or FT itself triggers a new secondary placement after a failover event. DRS does manage the primary VM normally.
Anti-affinity between the primary and secondary is maintained automatically — vCenter ensures they never run on the same host. If DRS would otherwise want to migrate the primary to the host running the secondary, it will not do so.
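The anti-affinity rule can be sketched as two small checks. The placement heuristic here (pick the eligible host with the most free memory) is purely an assumption for illustration and is not vCenter’s actual algorithm; `pick_secondary_host` and `primary_move_allowed` are hypothetical names.

```python
from typing import Dict, Optional


def pick_secondary_host(primary_host: str,
                        free_memory_gb: Dict[str, float]) -> Optional[str]:
    """Choose a host for the secondary, never the one running the primary."""
    eligible = {h: free for h, free in free_memory_gb.items() if h != primary_host}
    if not eligible:
        return None  # no second host available, so FT cannot be established
    # Assumed heuristic: most free memory wins.
    return max(eligible, key=eligible.get)


def primary_move_allowed(target_host: str, secondary_host: str) -> bool:
    """A migration of the primary is rejected if it would co-locate the pair."""
    return target_host != secondary_host


hosts = {"esx1": 64.0, "esx2": 128.0, "esx3": 96.0}
secondary = pick_secondary_host("esx2", hosts)
print(secondary)                                 # esx3
print(primary_move_allowed("esx3", secondary))   # False
```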
Operational Considerations
Resource overhead: Every FT-protected VM effectively doubles the CPU and memory consumption in the cluster — the secondary consumes equal resources on a second host even though it produces no output while the primary is running. This cost is the trade-off for zero RPO and near-zero RTO.
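A back-of-the-envelope calculation makes the doubling concrete. The function below is a sketch with a hypothetical name and input format; it counts each FT-protected VM twice and every other VM once.

```python
from typing import Iterable, Tuple

VmSpec = Tuple[int, int, bool]  # (vCPUs, memory in GB, FT-protected?)


def cluster_demand_with_ft(vms: Iterable[VmSpec]) -> Tuple[int, int]:
    """Total vCPUs and memory (GB) the cluster must supply, secondaries included."""
    total_vcpus = 0
    total_memory_gb = 0
    for vcpus, memory_gb, ft_protected in vms:
        copies = 2 if ft_protected else 1   # primary plus secondary under FT
        total_vcpus += vcpus * copies
        total_memory_gb += memory_gb * copies
    return total_vcpus, total_memory_gb


inventory = [
    (4, 16, True),    # FT-protected database
    (2, 8, False),    # ordinary application server
]
print(cluster_demand_with_ft(inventory))  # (10, 40)
```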
Bandwidth consumption: The FT Logging network carries a continuous real-time stream. A VM with high CPU activity or high memory write rates generates more FT traffic. A dedicated 10 GbE FT Logging network is the baseline; in environments with multiple FT-protected VMs per host, bandwidth planning for the FT network is essential.
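For bandwidth planning, one simple approach is to sum the measured FT Logging rate of each protected VM on a host and compare it with the link speed, leaving some headroom for bursts. The sketch below assumes per-VM rates have been measured separately; the 20% headroom figure and the function names are assumptions, not VMware guidance.

```python
from typing import Sequence


def ft_link_utilisation(per_vm_gbps: Sequence[float], link_gbps: float = 10.0) -> float:
    """Fraction of the FT Logging link consumed by the given VMs."""
    return sum(per_vm_gbps) / link_gbps


def fits_with_headroom(per_vm_gbps: Sequence[float], link_gbps: float = 10.0,
                       headroom: float = 0.2) -> bool:
    """True if total FT traffic stays below the link speed minus the headroom."""
    return sum(per_vm_gbps) <= link_gbps * (1.0 - headroom)


measured = [1.5, 2.0, 3.0]   # hypothetical measured rates in Gbit/s
print(ft_link_utilisation(measured))
print(fits_with_headroom(measured))
```

If the check fails, the usual options are to spread FT-protected VMs across more hosts or to provision a faster FT Logging link.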
Failure of the secondary: If the secondary host fails (rather than the primary), the primary VM continues running normally. The failure is detected and a new secondary is created automatically on another host in the cluster, with vSphere HA choosing the placement, which re-establishes protection. During the period between secondary failure and new secondary placement, the primary VM is unprotected: HA will restart it if the primary host then also fails, but without FT’s zero-RPO guarantee.
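The failure cases can be summarised as a small state model. This is a conceptual sketch only: `FtPair`, `handle_host_failure`, and `restore_protection` are hypothetical names, and host selection is reduced to "first healthy host" for brevity. Failure handling and re-protection are kept as separate steps so that the unprotected window is visible.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class FtPair:
    primary_host: str
    secondary_host: Optional[str]

    @property
    def protected(self) -> bool:
        return self.secondary_host is not None


def handle_host_failure(pair: FtPair, failed_host: str,
                        healthy_hosts: List[str]) -> str:
    """Apply a host failure to the pair and report what happened."""
    if failed_host == pair.primary_host:
        if pair.secondary_host is not None:
            # Secondary is promoted; the VM keeps running but is now unprotected.
            pair.primary_host, pair.secondary_host = pair.secondary_host, None
            return "failover"
        # No secondary existed: fall back to an ordinary HA restart.
        survivors = [h for h in healthy_hosts if h != failed_host]
        if not survivors:
            raise RuntimeError("no host left to restart the VM on")
        pair.primary_host = survivors[0]
        return "ha_restart"
    if failed_host == pair.secondary_host:
        pair.secondary_host = None   # primary keeps running, unprotected
        return "secondary_lost"
    return "unaffected"


def restore_protection(pair: FtPair, healthy_hosts: List[str]) -> bool:
    """Create a new secondary on any healthy host other than the primary's."""
    if pair.protected:
        return True
    for host in healthy_hosts:
        if host != pair.primary_host:
            pair.secondary_host = host
            return True
    return False


demo = FtPair("esx1", "esx2")
print(handle_host_failure(demo, "esx2", ["esx1", "esx3"]), demo.protected)
print(restore_protection(demo, ["esx1", "esx3"]), demo.secondary_host)
```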
Summary
vSphere Fault Tolerance is the strongest availability option in the vSphere toolbox, delivering zero RPO and sub-second RTO by continuously running a synchronised shadow VM on a separate host. The cost is a doubling of CPU and memory consumption for every protected VM, plus strict operational constraints: no snapshots, no storage migrations, a limited vCPU count, and a dedicated high-bandwidth network. FT is appropriate for a small number of critical workloads where availability requirements genuinely exceed what HA can provide, rather than as a general-purpose protection mechanism for an entire cluster.