High Availability

Configure and manage Proxmox VE HA: groups, resource policies, and failover behavior.

Proxmox VE HA automatically restarts VMs on a healthy node when a node failure is detected. Cloud-PVE pre-configures the HA stack (Corosync, fencing, watchdog) for you.

How HA works

  1. Corosync monitors cluster heartbeats between nodes.
  2. If a node misses heartbeats beyond the timeout, it is declared offline.
  3. Fencing (STONITH) isolates the failed node (power-off via IPMI/iDRAC) to prevent split-brain.
  4. HA Manager restarts the VMs that were running on the failed node on surviving nodes.

The entire process takes 20–60 seconds depending on your watchdog and fencing configuration.

Enabling HA for a VM

  1. Go to Datacenter → HA → Resources
  2. Click Add
  3. Select the VM and set:
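    • Max Restart: number of times HA tries to restart the VM on the same node after a failed start (default: 1)
    • Max Relocate: number of times HA tries to relocate the VM to another node once the restart attempts are exhausted (default: 1)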
    • Group: assign to an HA group (optional)
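The same settings can also be applied from a node shell with ha-manager add. The sketch below only prints the command so it can be reviewed before running it. The helper name ha_add_cmd, the VM ID 100, and the group name production are illustrative assumptions.

```shell
#!/bin/sh
# Print (do not execute) the ha-manager command that registers a VM as an HA resource.
# Usage: ha_add_cmd VMID [MAX_RESTART] [MAX_RELOCATE] [GROUP]
# ha_add_cmd is a hypothetical helper; only the printed ha-manager call is real CLI.
ha_add_cmd() {
    vmid="$1"
    max_restart="${2:-1}"    # same default as the GUI field
    max_relocate="${3:-1}"   # same default as the GUI field
    group="${4:-}"           # optional HA group

    # VM IDs are numeric; reject anything else.
    case "$vmid" in
        ''|*[!0-9]*) echo "error: VMID must be numeric" >&2; return 1 ;;
    esac

    cmd="ha-manager add vm:${vmid} --state started --max_restart ${max_restart} --max_relocate ${max_relocate}"
    if [ -n "$group" ]; then
        cmd="${cmd} --group ${group}"
    fi
    printf '%s\n' "$cmd"
}

ha_add_cmd 100 2 1 production
# prints: ha-manager add vm:100 --state started --max_restart 2 --max_relocate 1 --group production
```

Once the printed command looks right, it can be pasted into a root shell on any cluster node.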

HA Groups

HA groups define node preferences for VM placement. Go to Datacenter → HA → Groups and create a group, for example:

Group: production
Nodes: node1:3, node2:2, node3:1

Higher priority numbers mean the node is preferred. VMs in this group will prefer node1, fall back to node2, then node3.
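To see how a priority list translates into a preference order, the sketch below sorts a Nodes string by priority. The helper name failover_order is hypothetical, and treating a node without an explicit priority as priority 0 is an assumption. It only illustrates the ordering rule and does not query the cluster.

```shell
#!/bin/sh
# Print the nodes of an HA group, most preferred first.
# Input: a comma-separated "node:priority" list, as in the Nodes field.
# failover_order is a hypothetical helper, not a Proxmox VE command.
failover_order() {
    printf '%s\n' "$1" | tr ',' '\n' | tr -d ' ' |
        # Emit "priority<TAB>node"; a missing priority is assumed to be 0.
        awk -F: 'NF { print ($2 == "" ? 0 : $2) "\t" $1 }' |
        # Highest priority first; -s keeps the input order for equal priorities.
        sort -s -k1,1nr |
        cut -f2
}

failover_order "node1:3, node2:2, node3:1"
# prints node1, node2, node3 (one per line)

# On a cluster node, the group itself could be created from the shell with:
#   ha-manager groupadd production --nodes "node1:3,node2:2,node3:1"
```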

Resource states

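State      Meaning
started    VM should be running; HA keeps it running and recovers it after a node failure
stopped    VM should be stopped; HA keeps it stopped but still relocates it if its node fails
disabled   VM is stopped and is not recovered after a node failure; also used to clear an error state
ignored    HA does not touch this VM; actions on it bypass the HA stack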
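A resource's requested state can also be changed from a node shell with ha-manager set. The sketch below validates the state name and prints the command rather than running it. The helper name ha_set_state_cmd and the VM ID 100 are illustrative assumptions.

```shell
#!/bin/sh
# Print (do not execute) the ha-manager command that changes a VM's requested state.
# Usage: ha_set_state_cmd VMID STATE
# ha_set_state_cmd is a hypothetical helper; only the printed ha-manager call is real CLI.
ha_set_state_cmd() {
    vmid="$1"
    state="$2"

    # Accept only the four states listed in the table above.
    case "$state" in
        started|stopped|disabled|ignored) ;;
        *) echo "error: unknown HA state '$state'" >&2; return 1 ;;
    esac

    printf 'ha-manager set vm:%s --state %s\n' "$vmid" "$state"
}

ha_set_state_cmd 100 disabled
# prints: ha-manager set vm:100 --state disabled
```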

Testing failover

To test HA without a real hardware failure:

# On the node to test (run as root)
systemctl stop pve-cluster corosync

Watch the Datacenter → HA view. Within ~30 seconds, your VMs should appear on another node. The test node is treated as failed, so expect it to be fenced.

Important: Only simulate failure on one node at a time. With a 3-node cluster, losing 2 nodes simultaneously breaks quorum.
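The arithmetic behind this warning can be sketched as follows, assuming one vote per node and no external quorum device: a cluster stays quorate only while more than half of its votes are present. The function names are hypothetical.

```shell
#!/bin/sh
# Majority quorum, assuming one vote per node and no QDevice.
# quorum_votes and has_quorum are hypothetical helper names.

# Minimum number of votes needed in a cluster of $1 nodes.
quorum_votes() {
    echo $(( $1 / 2 + 1 ))
}

# Succeeds if $2 online nodes are enough for a cluster of $1 nodes.
has_quorum() {
    [ "$2" -ge "$(quorum_votes "$1")" ]
}

for up in 3 2 1; do
    if has_quorum 3 "$up"; then
        echo "3-node cluster, $up up: quorate"
    else
        echo "3-node cluster, $up up: NOT quorate"
    fi
done
# prints:
#   3-node cluster, 3 up: quorate
#   3-node cluster, 2 up: quorate
#   3-node cluster, 1 up: NOT quorate
```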

Monitoring HA

Check the HA status:

ha-manager status

View HA logs for the local resource manager (LRM) and the cluster resource manager (CRM):

journalctl -u pve-ha-lrm -n 50
journalctl -u pve-ha-crm -n 50
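For a quick health check, the status output can be filtered for resources that are not in the started state. The sketch below assumes that ha-manager status prints one line per resource in the form "service vm:100 (node1, started)". The helper name ha_not_started is hypothetical, and the sample input is made up for illustration rather than captured from a real cluster.

```shell
#!/bin/sh
# Read `ha-manager status` output on stdin and list resources whose state is not "started".
# Assumed line format: service <sid> (<node>, <state>)
# ha_not_started is a hypothetical helper, not a Proxmox VE command.
ha_not_started() {
    awk '$1 == "service" {
        node = $3; state = $4
        gsub(/[(),]/, "", node)     # "(node1," -> "node1"
        gsub(/[(),]/, "", state)    # "started)" -> "started"
        if (state != "started") print $2, node, state
    }'
}

# Illustrative sample input (not real output):
ha_not_started <<'EOF'
quorum OK
service vm:100 (node1, started)
service vm:101 (node2, stopped)
EOF
# prints: vm:101 node2 stopped
```

On a cluster node, the same filter would be used as: ha-manager status | ha_not_started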