Proxmox VE — Ceph Distributed Storage


Ceph on Proxmox — deploying a hyper-converged Ceph cluster, OSDs, monitors, pools, and using Ceph RBD for VM storage.

Tags: proxmox, ceph, distributed-storage, hyper-converged, rbd, ceph-osd

Overview

Ceph is a distributed storage system designed with one fundamental goal: no single point of failure. Unlike NFS (which has a single server) or iSCSI (which has a single target), Ceph distributes data across many nodes and many disks simultaneously. Any node or disk can fail and the cluster continues to serve data normally, automatically rebuilding redundant copies on remaining hardware.

Proxmox VE has supported Ceph since version 3 and deeply integrated it starting with version 5. As of Proxmox VE 5.0, you can run Ceph directly on the same nodes that run your VMs — a configuration called hyper-converged infrastructure (HCI). This eliminates the need for separate dedicated storage servers, making Ceph accessible to smaller environments that cannot justify dedicated storage hardware.

Ceph integrates with Proxmox at two levels. Ceph RBD (RADOS Block Device) provides block storage for VM disk images, functioning like a shared SAN that all Proxmox nodes can access simultaneously — enabling live migration and high availability without NFS. CephFS provides a POSIX-compliant shared filesystem, useful for ISOs, backups, and container templates.
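For orientation, once both backends are configured they show up as entries in /etc/pve/storage.cfg, roughly like the sketch below (storage IDs and the pool name are illustrative, not defaults):

rbd: ceph-vm
        pool vm-store
        content images,rootdir
        krbd 0

cephfs: cephfs-iso
        content iso,vztmpl,backup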


Ceph Architecture

Ceph is composed of several daemon types. Understanding what each one does is essential before deploying.

OSD — Object Storage Daemon

An OSD (Object Storage Daemon) manages a single physical disk. Every drive in the Ceph cluster runs one OSD daemon. OSDs are responsible for:

- storing object data on the local disk
- replicating that data to other OSDs
- detecting peer failures via heartbeats and reporting them to the monitors
- recovering and backfilling data when OSDs fail or are added

In a healthy Ceph cluster, each OSD handles reads, writes, and background replication concurrently. Adding more disks (more OSDs) linearly increases both capacity and throughput.

Monitor (MON)

Monitors maintain the authoritative cluster map — the definitive record of which OSDs exist, which are up or down, and how data is distributed across them. Monitors use a Paxos-based quorum to agree on the cluster state. A three-monitor cluster can tolerate one monitor failure; a five-monitor cluster can tolerate two.

Monitors do not store data. They are lightweight processes that coordinate cluster state. In small Proxmox clusters, monitors typically co-exist with OSD nodes.

Manager (MGR)

The Ceph Manager daemon (MGR) handles cluster-wide metrics collection, the Ceph dashboard, and module-based extensions (including the Proxmox integration modules). At least one MGR must run in a healthy cluster. The Proxmox GUI Ceph management panel communicates through the MGR.
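On a hyper-converged Proxmox setup the first manager is normally created for you alongside the first monitor. If one needs to be added manually, something along these lines works on the pveceph versions this article targets (verify the exact subcommand with pveceph help on your release):

# Create a manager daemon on the current node
pveceph createmgr

# Optionally enable the built-in Ceph dashboard module
# (requires the ceph-mgr-dashboard package)
ceph mgr module enable dashboard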

MDS — Metadata Server

The Metadata Server is only required for CephFS — the POSIX-compliant shared filesystem layer. It stores filesystem directory trees and file metadata. If you are using Ceph only for RBD block storage (the most common Proxmox configuration), you do not need MDS.

CRUSH Map

The CRUSH (Controlled Replication Under Scalable Hashing) algorithm determines where data is placed in the cluster without any centralized lookup. Given an object ID and a CRUSH map (describing the physical topology of the cluster — racks, hosts, OSDs), CRUSH computes which OSDs should store replicas of that object. This means no OSD ever needs to ask a central server where to write or read data.

The CRUSH map also defines failure domains. In a properly configured cluster, Ceph ensures that replicas of the same data land on different hosts (or different racks, if the map is configured that way). This means losing an entire physical host does not cause data loss as long as replicas are on other hosts.
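You can watch CRUSH do this computation yourself: given a pool and any object name, the cluster reports which PG the object hashes to and which OSDs CRUSH selects for it. The pool name below is just an example:

# Ask where an object named "test-object" in pool "vm-store" would be placed
ceph osd map vm-store test-object

# Show the CRUSH hierarchy (hosts, OSDs) that defines the failure domains
ceph osd crush tree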


Placement Groups (PGs)

Ceph does not track individual objects directly against OSDs. Instead, objects are grouped into Placement Groups (PGs). Each PG is assigned to a set of OSDs by CRUSH. When a new OSD is added or an existing one fails, Ceph rebalances PGs across OSDs, moving groups of objects rather than individual files.

The number of PGs per pool affects cluster performance. Too few PGs cause uneven data distribution; too many waste RAM, since each PG carries overhead on every OSD. The Ceph documentation provides a calculator, and the general formula is:

Total PGs = (Number of OSDs × 100) / Replica count
Round up to the nearest power of 2

Example for a 6-OSD cluster with 3 replicas:

(6 × 100) / 3 = 200 → round up to 256 PGs

For a small Ceph cluster with fewer than 5 OSDs, Ceph recommends 128 PGs per pool. Adjust incrementally on a live cluster — PG changes trigger rebalancing and are resource-intensive.
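As a quick sanity check, the rule above can be scripted; this is purely an illustrative shell sketch of the formula, not a Ceph tool:

# Illustrative helper: suggested pg_num for a given OSD count and replica size
osds=6
size=3
raw=$(( osds * 100 / size ))
pg=1
while [ "$pg" -lt "$raw" ]; do pg=$(( pg * 2 )); done   # round up to a power of 2
echo "suggested pg_num: $pg"                            # prints 256 for 6 OSDs / 3 replicas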


Minimum Requirements

Ceph requires at least three nodes for a production cluster. This is non-negotiable for quorum and data safety:

- three monitors are the minimum that can keep quorum through the failure of one node
- the default 3-way replication needs three separate hosts so each replica lives in a different failure domain

A two-node Ceph cluster cannot achieve quorum — a single node failure creates a 1:1 split where neither side can declare itself authoritative. The cluster would stop accepting writes to protect data integrity.

On the network side, two separate networks are strongly recommended:

- a public network, used by clients (VMs), monitors, and the Proxmox nodes to reach Ceph
- a cluster network, carrying only OSD-to-OSD replication, recovery, and heartbeat traffic

Separating these networks prevents replication traffic from saturating the same network that serves VM disk I/O. Each network should be at least 10 GbE in production.
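In ceph.conf the two networks correspond to the public_network and cluster_network options; the subnets below are examples matching the init command shown later:

[global]
        public_network = 192.168.20.0/24
        cluster_network = 10.0.0.0/24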


Deploying Ceph from the Proxmox GUI

As of Proxmox VE 5.0, the entire Ceph deployment can be managed from the Proxmox GUI. The process is:

Step 1 — Install Ceph on Each Node

Navigate to Node → Ceph → Install for each node. This installs the Ceph packages from the Proxmox Ceph repository. After installation, the Ceph configuration tab becomes active.

Via CLI:

pveceph install --version quincy

Step 2 — Initialize Ceph Configuration (First Node Only)

On the first node: Node → Ceph → Configuration → Initialize

Specify the public network CIDR and (optionally) the separate cluster network CIDR.

Via CLI:

pveceph init --network 192.168.20.0/24 --cluster-network 10.0.0.0/24

This creates /etc/pve/ceph.conf; on each node, /etc/ceph/ceph.conf is a symlink pointing to it, and pmxcfs distributes the file to all cluster nodes automatically.

Step 3 — Create Monitors

Create a monitor on each of the three nodes: Node → Ceph → Monitor → Create Monitor

Via CLI (run on each node):

pveceph createmon
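Once monitors exist on all three nodes, confirm that they have formed a quorum:

# Compact monitor and quorum overview
ceph mon stat

# Detailed quorum information, including the current leader
ceph quorum_status --format json-pretty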

Step 4 — Create OSDs

On each node, for each data disk: Node → Ceph → OSD → Create OSD, then select the disk.

Via CLI:

# Create an OSD for disk /dev/sdb
pveceph createosd /dev/sdb

# Create OSD with a specific filesystem (only relevant for legacy FileStore OSDs)
pveceph createosd -fstype ext4 /dev/sdb

Each disk becomes one OSD daemon. The disk is formatted and a new Ceph OSD starts automatically.
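Before moving on, it is worth confirming that every new OSD is up and in:

# OSDs grouped by host; all should show "up" and "in"
ceph osd tree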

Step 5 — Create a Pool

Node → Ceph → Pools → Create Pool. Set the pool name, PG count, and replication size.

Via CLI:

# Create a pool named "vm-store" with 128 placement groups
pveceph createpool vm-store --pg_num 128

# Or with ceph commands directly
ceph osd pool create vm-store 128 128
ceph osd pool set vm-store size 3
ceph osd pool set vm-store min_size 2

Step 6 — Add Pool as Proxmox Storage

Datacenter → Storage → Add → RBD. Enter the pool name, monitor IPs, and storage ID. This adds the Ceph pool as a Proxmox storage backend available to all cluster nodes for VM disk images.
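The same storage definition can be added from the CLI with pvesm; the storage ID ceph-vm is an example, and on a hyper-converged cluster the monitor addresses can usually be omitted:

# Register the Ceph pool as an RBD storage backend for VM disks and containers
pvesm add rbd ceph-vm --pool vm-store --content images,rootdir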


Pool Replication Settings

Setting     Value       Meaning
size        3           Each object is written to 3 different OSDs
min_size    2           Writes are accepted while at least 2 replicas are available
pg_num      128–1024    Number of placement groups (depends on cluster size)

With size=3 and min_size=2, losing one complete node is tolerable — the cluster continues serving data from the two remaining replicas. Losing two nodes simultaneously drops the cluster below min_size and it stops accepting writes to protect integrity.
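The effective values for an existing pool can be read back at any time:

# Inspect the replication settings of the pool
ceph osd pool get vm-store size
ceph osd pool get vm-store min_size
ceph osd pool get vm-store pg_num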

Pool management commands:

# Set replica count
ceph osd pool set rbd size 3

# Set minimum replicas
ceph osd pool set rbd min_size 2

# Increase PG count (do in steps on live clusters)
ceph osd pool set rbd pg_num 256
ceph osd pool set rbd pgp_num 256

# List all pools
ceph osd lspools

# Delete a pool (requires explicit confirmation flag)
ceph osd pool delete vm-store vm-store --yes-i-really-really-mean-it

CephFS vs. RBD

Proxmox supports two Ceph storage interfaces:

Ceph RBD

RBD (RADOS Block Device) presents Ceph storage as raw block devices. Each VM disk is a separate RBD image — functionally equivalent to a raw disk from the VM’s perspective. RBD is the recommended interface for VM disk images in Proxmox because:

- QEMU talks to RBD natively through librbd, with no filesystem layer in between
- images are thin-provisioned and support snapshots and clones
- every node sees the same images, so live migration requires no disk copying

For LXC containers, enable the KRBD option on the RBD storage definition. KRBD uses a kernel-level RBD driver that LXC can use for container root filesystems. Without KRBD, RBD is only accessible to QEMU/KVM VMs.
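If the storage was originally added without KRBD, the option can be enabled afterwards (using the example storage ID from above):

# Turn on the kernel RBD driver for an existing RBD storage definition
pvesm set ceph-vm --krbd 1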

CephFS

CephFS is a POSIX-compliant shared filesystem built on top of the same Ceph cluster. It requires the MDS daemon and provides a mountable directory rather than block devices. In Proxmox, CephFS is useful for:

- ISO images and container templates that every node needs to read
- backup storage (vzdump targets)
- any other files that must be shared across the whole cluster

CephFS is not recommended for VM disk images — POSIX filesystem semantics add overhead compared to direct RBD block access.
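On recent Proxmox releases CephFS can also be set up from the CLI, roughly as sketched below; exact subcommand names vary between versions, so treat this as an outline rather than a literal recipe:

# Create a metadata server on the current node
pveceph mds create

# Create the CephFS filesystem and register it as a Proxmox storage
pveceph fs create --name cephfs --add-storage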


OSD Journal

With the traditional FileStore backend, every Ceph OSD write goes to the OSD journal first. The journal is a small, fast-write partition where Ceph batches incoming writes before committing them to the OSD’s main data area. The journal provides crash consistency — if an OSD crashes mid-write, the journal allows recovery.

By default the journal is co-located with the OSD on the same disk. For better write performance, journals can be placed on separate SSDs:

# Create OSD with its journal on a separate SSD device
pveceph createosd /dev/sdb --journal_dev /dev/sdc

Placing journals on enterprise SSDs with power-loss protection significantly improves write throughput for spinning-disk OSD clusters. The recommended ratio is one SSD journal device per 4–6 spinning OSD drives.

Important: Losing an OSD’s journal partition while an OSD is online causes data loss for that OSD. Always use enterprise-grade SSDs with power-loss protection for journal devices.


Monitoring Ceph Health

# Cluster status summary
ceph -s

# Live event log
ceph -w

# Detailed health output (errors and warnings)
ceph health detail

# OSD tree (shows which OSDs are on which nodes)
ceph osd tree

# List pools with usage
ceph df

# Per-OSD statistics
ceph osd df

During maintenance (disk replacement, node reboot), prevent Ceph from rebalancing data unnecessarily:

# Prevent OSDs from being marked out during maintenance
ceph osd set noout

# Perform maintenance (reboot node, replace disk, etc.)

# Resume normal operation after maintenance
ceph osd unset noout

Failing to set noout before rebooting a node means Ceph will mark the node's OSDs out after a timeout (10 minutes by default) and begin rebalancing data onto the remaining OSDs, generating heavy replication traffic and degrading cluster performance while the node is only temporarily offline.


Hyper-Converged vs. Dedicated Ceph Cluster

Hyper-converged (VMs and Ceph on same nodes):

- fewer servers to buy, power, and maintain
- storage grows together with compute as nodes are added
- VMs and Ceph daemons compete for CPU, RAM, and network bandwidth on the same hosts

Dedicated Ceph cluster (separate storage nodes):

- storage and compute can be sized and scaled independently
- Ceph daemons never contend with VM workloads for resources
- higher cost and more hardware to operate

For most Proxmox deployments that are not running hundreds of VMs or extremely I/O-intensive workloads, hyper-converged Ceph on three nodes is a practical and cost-effective design.


pveceph Command Reference

Command                               Function
pveceph install                       Install Ceph packages on the node
pveceph init --network <CIDR>         Initialize Ceph configuration
pveceph createmon                     Create a monitor on the current node
pveceph createosd /dev/X              Create an OSD for a disk
pveceph createpool <name>             Create a new pool
pveceph destroymon <id>               Remove a monitor
pveceph destroyosd <id>               Remove an OSD
pveceph status                        Show cluster, monitor, OSD, and MDS status
pveceph start <service>               Start a Ceph daemon
pveceph stop <service>                Stop a Ceph daemon
pveceph purge                         Remove Ceph and all data from the node (destructive)

Ceph on Proxmox transforms a cluster of ordinary servers into a fault-tolerant, scale-out storage platform. When properly sized and networked, it provides the shared storage foundation that enables live migration and HA without any dedicated storage hardware.