Proxmox VE — Monitoring and Maintenance


Monitoring Proxmox — built-in node/VM metrics, RRD graphs, external monitoring with InfluxDB and Grafana, log management, and routine maintenance tasks.


What Proxmox Exposes Out of the Box

Proxmox VE includes built-in monitoring at every level of the management hierarchy — datacenter, node, and individual VM. No external tools are required to get a working picture of resource utilisation, but the built-in tooling has limitations in depth and retention that make external monitoring systems worth adding to any production environment.

Datacenter-Level Dashboard

The Datacenter | Summary view provides an aggregate health overview of the entire cluster: node online/offline status, guest counts, and cluster-wide CPU, memory, and storage utilisation.

The Datacenter | Search view lists all nodes, VMs, and containers with real-time sortable columns: CPU usage, memory usage, disk usage, and uptime. This gives a quick ranking of the highest-utilisation workloads in the cluster at any given moment.
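The same inventory is available from the shell via pvesh, which wraps the Proxmox API. A quick sketch; the --type filter limits the output to VMs and containers:

# List every guest in the cluster with status and resource usage
pvesh get /cluster/resources --type vm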

Node-Level Metrics

Selecting any node and clicking its Summary tab shows per-node resource graphs: CPU usage, server load average, memory usage, and network traffic.

All graphs are RRD-based (Round Robin Database), the same technology used by tools like Cacti and MRTG. RRD graphs are stored locally on each node and retained at progressively coarser granularity over time:

Time Range      Granularity
Last hour       Full resolution (~1 sample per 60 seconds)
Last day        Averaged to ~30-minute intervals
Last week       Averaged to ~2-hour intervals
Last month      Averaged to ~8-hour intervals
Last year       Daily averages

This means Proxmox’s built-in graphs answer “what is happening right now and in the past few hours” effectively, but lose detail for longer-term capacity planning. An external metrics system is needed for fine-grained historical analysis.
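The raw RRD samples behind the graphs are also exposed through the API, which is handy for quick ad-hoc checks from the shell. A minimal sketch using pvesh (the node name pmx-01 is a placeholder):

# Fetch the last hour of averaged metrics for node pmx-01
pvesh get /nodes/pmx-01/rrddata --timeframe hour --cf AVERAGE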

VM and Container Metrics

Individual VM and container Summary tabs show the same CPU, memory, network, and disk graphs for that specific workload. For KVM VMs with the QEMU guest agent installed, additional metrics become available: the guest's IP addresses and filesystem usage reported from inside the guest.

Install the QEMU guest agent inside the VM:

# On Debian/Ubuntu guests
apt-get install qemu-guest-agent
systemctl enable qemu-guest-agent
systemctl start qemu-guest-agent

Enable it in the Proxmox VM settings at VM | Options | QEMU Guest Agent | Enabled.
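After enabling the option (it takes effect once the VM has been fully stopped and started again), the agent can be queried from the host. Two illustrative checks, using VM ID 100 as a placeholder:

# Verify the agent inside VM 100 responds
qm agent 100 ping

# Show guest network interfaces and their IP addresses
qm agent 100 network-get-interfaces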

External Metrics — InfluxDB and Grafana

Proxmox VE has a built-in mechanism to ship metrics to external time-series databases. Navigate to Datacenter | Metric Server | Add and select either InfluxDB or Graphite as the target.

For InfluxDB v2, configure the server address, the port (8086 for HTTP), and the organization, bucket, and API token created on the InfluxDB side.

Once saved, Proxmox begins shipping node and guest metrics to InfluxDB at regular intervals. No agent installation is required on the Proxmox nodes; the statistics daemon (pvestatd) pushes the metrics directly.
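Behind the scenes, the definition is written to /etc/pve/status.cfg on the cluster filesystem. An illustrative InfluxDB v2 entry (all values are placeholders):

influxdb: influxdb2
        server 192.168.1.50
        port 8086
        influxdbproto https
        organization proxmox
        bucket proxmox
        token <api-token>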

Grafana is typically paired with InfluxDB as the visualisation layer. The Proxmox community maintains pre-built Grafana dashboards that can be imported directly by their grafana.com dashboard ID.

The combination of InfluxDB + Grafana fills the gaps in Proxmox’s built-in monitoring: long-term retention, custom alerting, threshold-based notifications, and correlation of Proxmox metrics with application-level metrics from other systems.

SMART Disk Health Monitoring

Disk failures are a leading cause of unplanned downtime in physical server environments. Proxmox supports S.M.A.R.T. monitoring via the smartmontools package:

apt-get install smartmontools

Once installed, smartd runs as a background daemon and polls physical disks for S.M.A.R.T. attributes. When a disk reports errors (reallocated sectors, pending sectors, uncorrectable errors, or excessive temperature), smartd sends an email to the root user’s configured email address.

Email notifications include: node name, device ID, serial number, and the nature of the error. If the error persists, notification repeats every 24 hours.

Configure the root email address at Datacenter | Permissions | Users | root | Edit | Email.
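smartd's behaviour is controlled by /etc/smartd.conf. The Debian default directive scans all disks automatically; the -m flag names the mail recipient, which matches the root address configured above:

# /etc/smartd.conf
DEVICESCAN -d removable -n standby -m root -M exec /usr/share/smartmontools/smartd-runner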

Manual SMART checks:

# Check SMART status of a specific disk
smartctl -a /dev/sda

# Run a short self-test
smartctl -t short /dev/sda

# Run a long self-test
smartctl -t long /dev/sda

# View test results
smartctl -l selftest /dev/sda

Log Management

Proxmox generates logs at multiple levels: cluster-wide service logs, per-node systemd journals, and per-task logs for operations started from the GUI or API.

Cluster-level logs live in /var/log/pve/:

Log File                      Contents
/var/log/pve/pvedaemon.log    Proxmox API daemon activity; useful when the GUI is inaccessible
/var/log/pve/pvestatd.log     Statistics daemon log
/var/log/pve/pveproxy.log     Web proxy log for GUI access

Systemd service logs are accessed via journalctl:

# Corosync cluster engine log
journalctl -u corosync -f

# Proxmox cluster daemon
journalctl -u pve-cluster -f

# HA manager logs
journalctl -u pve-ha-crm -f
journalctl -u pve-ha-lrm -f

# Proxmox API daemon
journalctl -u pvedaemon -f

Node syslog is accessible from the GUI: Node | Syslog shows a live scrolling view of the system log. This is useful for quick inspection without needing shell access.

Task History under each node (Node | Task History) shows a log of all GUI-initiated operations — VM starts, stops, migrations, backup jobs — with success/failure status and duration.
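The same history is available from the shell; a brief sketch:

# List recent tasks on this node
pvenode task list

# Show the full log of one task by its UPID (as printed in the list)
pvenode task log <UPID>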

pveperf — Hardware Benchmark

pveperf is a built-in benchmark utility that quickly measures the hardware capabilities of a Proxmox node:

pveperf                # benchmark the CPU and the root filesystem
pveperf /var/lib/vz    # optionally target a specific mount point

Output includes CPU bogomips, regex evaluations per second, hard disk size and buffered read throughput, average seek time, filesystem syncs per second, and DNS resolution times (internal and external).

Run pveperf on a fresh node before deploying workloads to establish a baseline, then run it periodically to detect performance degradation. A significant drop in buffered reads or seek time often indicates storage hardware beginning to fail.
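To make the periodic runs systematic, the output can be appended to a log for later comparison. A minimal sketch using a cron entry (the schedule and log path are arbitrary choices):

# /etc/cron.d/pveperf-baseline: record a baseline at 03:00 on the 1st of each month
0 3 1 * * root pveperf >> /var/log/pveperf-baseline.log 2>&1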

Node Maintenance Procedure

Before taking a node offline for hardware maintenance, firmware updates, or kernel upgrades, migrate all running VMs and containers off the node first:

# List all running VMs on the current node
qm list | grep running

# Migrate a specific running VM to another node
qm migrate 100 pmx-02 --online

# For containers (LXC cannot live-migrate; stop first, then migrate offline)
pct stop 101
pct migrate 101 pmx-02

# Or let Proxmox handle the stop/migrate/start cycle in one step
pct migrate 101 pmx-02 --restart

Via the GUI, right-click each running VM, select Migrate, choose the target node, and confirm. Containers must be stopped first and migrated offline, as shown above.
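For nodes running many guests, a shell loop can drain them in one pass. A hedged sketch (the target node pmx-02 mirrors the examples above; review the output of qm list before running):

# Migrate every running VM on this node to pmx-02, one at a time
for vmid in $(qm list | awk '$3 == "running" {print $1}'); do
    qm migrate "$vmid" pmx-02 --online
done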

Once all VMs are off the node, apply the maintenance changes safely. When complete, verify the node rejoins the cluster and check quorum:

pvecm status    # Verify all nodes online and quorum is healthy

Updating Proxmox

Via GUI: Node | Updates | Refresh to fetch the current package list, then Upgrade to apply available updates. The GUI shows a list of packages pending upgrade before applying changes.

Via CLI:

apt-get update && apt-get dist-upgrade

Use dist-upgrade rather than upgrade: plain upgrade never installs new packages or removes existing ones, which Proxmox kernel updates routinely require, so it can leave the node in a partially upgraded state.

After a kernel update, the node requires a reboot. Always migrate VMs off the node before rebooting to avoid the HA system triggering an automatic failover (which takes 60+ seconds). A planned, manual migration is faster and cleaner than an HA-driven recovery.
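After an update it is worth confirming whether the running kernel is still the newest installed one before scheduling the reboot. A quick check:

# Kernel currently running
uname -r

# Kernels installed and registered with the bootloader
proxmox-boot-tool kernel list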

Kernel Cleanup

Each Proxmox kernel update installs a new kernel while leaving the previous versions in place. Over time, old kernel packages accumulate and consume disk space in the /boot partition. Clean them up:

# Check which kernels are installed and which is active
proxmox-boot-tool status

# List installed Proxmox kernel packages
dpkg --list | grep pve-kernel

# Remove a specific old kernel
apt-get remove pve-kernel-5.15.30-1-pve

# Remove all unused kernels automatically
apt-get autoremove

After removing old kernels, refresh the boot configuration:

proxmox-boot-tool refresh

Keep at least one previous kernel installed as a fallback in case the new kernel has issues.

Zabbix Agent for Deep Monitoring

For environments that use Zabbix as their central monitoring platform, install the Zabbix agent on each Proxmox node:

apt-get install zabbix-agent

Configure /etc/zabbix/zabbix_agentd.conf:

Server=<zabbix_server_ip>
ServerActive=<zabbix_server_ip>:10051
Hostname=pmx-01

Restart the agent:

systemctl restart zabbix-agent

In Zabbix, add the node as a host with the Template OS Linux template. This auto-discovers disk, network, memory, and CPU items with pre-built triggers and graphs. For Proxmox-specific metrics, community Zabbix templates are available that query the Proxmox API directly.
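Community templates that query the Proxmox API need read-only credentials. A hedged sketch of creating a dedicated audit user and API token with pveum (the user and token names are illustrative):

# Create a monitoring user and grant read-only access cluster-wide
pveum user add zabbix@pve
pveum acl modify / --users zabbix@pve --roles PVEAuditor

# Issue an API token (without privilege separation) for the template to use
pveum user token add zabbix@pve monitoring --privsep 0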

Zabbix’s trigger system enables threshold-based alerting: alert when CPU I/O wait exceeds 20% for five minutes, when a node drops out of cluster membership, when available storage falls below 10%, or when a disk SMART attribute deteriorates. This proactive alerting is not available in Proxmox’s built-in monitoring.