To build a hyper-converged Proxmox + Ceph cluster, you must use at least three
(preferably identical) servers for the setup.
The recommendations below should be seen as rough guidance for choosing
hardware. Therefore, it is still essential to adapt them to your specific
needs. You should test your setup and monitor health and performance
continuously.
CPU
Ceph services can be classified into two categories:
- Intensive CPU usage, benefiting from high CPU base frequencies and multiple
  cores. Members of that category are:
  - Object Storage Daemon (OSD) services
  - Meta Data Service (MDS), used for CephFS
- Moderate CPU usage, not needing multiple CPU cores. These are:
  - Monitor (MON) services
  - Manager (MGR) services
As a simple rule of thumb, you should assign at least one CPU core (or thread)
to each Ceph service to provide the minimum resources required for stable and
durable Ceph performance.
For example, if you plan to run a Ceph monitor, a Ceph manager and 6 Ceph OSD
services on a node, you should reserve 8 CPU cores purely for Ceph when
targeting basic and stable performance.
Note that OSD CPU usage depends mostly on the performance of the disks. The
higher the possible IOPS (IO operations per second) of a disk, the more CPU
can be utilized by an OSD service.
For modern enterprise SSDs, like NVMe drives that can permanently sustain a
high IOPS load of over 100,000 with sub-millisecond latency, each OSD can use
multiple CPU threads; for example, four to six CPU threads utilized per
NVMe-backed OSD is likely for very high-performance disks.
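
To make the sizing rule above concrete, here is a minimal sketch that
estimates how many CPU cores (or threads) to reserve for Ceph on a single
node. The per-OSD thread count is an assumption you would adapt to your
disks: one for a basic setup, four to six for very fast NVMe-backed OSDs.

  # Minimal sketch: CPU cores/threads to reserve for Ceph on one node,
  # following the one-core-per-service rule of thumb from this section.
  def ceph_cpu_reservation(num_osds, num_mons=1, num_mgrs=1, threads_per_osd=1):
      # threads_per_osd: 1 for a basic setup, 4-6 for fast NVMe-backed OSDs
      return num_osds * threads_per_osd + num_mons + num_mgrs

  # Example from the text: 1 MON + 1 MGR + 6 OSDs -> 8 cores for basic performance
  print(ceph_cpu_reservation(num_osds=6))                     # 8
  # The same node with NVMe-backed OSDs sized at 4 threads each -> 26 threads
  print(ceph_cpu_reservation(num_osds=6, threads_per_osd=4))  # 26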
Memory
Especially in a hyper-converged setup, the memory consumption needs to be
carefully planned out and monitored. In addition to the predicted memory usage
of virtual machines and containers, you must also account for having enough
memory available for Ceph to provide excellent and stable performance.
As a rule of thumb, for roughly 1 TiB of data, 1 GiB of memory will be used
by an OSD. While the usage might be less under normal conditions, it will use
the most during critical operations like recovery, re-balancing or
backfilling. That means that you should avoid maxing out your available
memory already during normal operation, but rather leave some headroom to
cope with outages.
The OSD service itself will use additional memory. The Ceph BlueStore backend of
the daemon requires by default 3-5 GiB of memory (adjustable).
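
As a minimal sketch of these rules of thumb, the following estimate adds the
roughly 1 GiB per TiB of stored data to a per-daemon BlueStore figure
(assumed 4 GiB here, within the 3-5 GiB default range mentioned above and
adjustable in Ceph, e.g. via the osd_memory_target option):

  # Minimal sketch: memory to set aside for the OSDs of one node, combining
  # the ~1 GiB per TiB of data rule with an assumed per-daemon BlueStore figure.
  def ceph_osd_memory_gib(num_osds, data_tib_per_osd, bluestore_gib_per_osd=4.0):
      per_osd = data_tib_per_osd * 1.0 + bluestore_gib_per_osd
      return num_osds * per_osd

  # Example assumption: 6 OSDs, each expected to hold about 8 TiB of data.
  print(ceph_osd_memory_gib(6, 8.0))  # 72.0 GiB reserved for the OSDs alone

Memory for virtual machines, containers and the other Ceph services comes on
top of this estimate.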
Network
We recommend a network bandwidth of at least 10 Gbps to be used exclusively
for Ceph traffic. A meshed network setup is also an option for three- to
five-node clusters, if there are no 10+ Gbps switches available.
The volume of traffic, especially during recovery, will interfere with other
services on the same network; in particular, the latency-sensitive Proxmox VE
corosync cluster stack can be affected, resulting in possible loss of cluster
quorum. Moving the Ceph traffic to dedicated and physically separated
networks will avoid such interference, not only for corosync, but also for
the networking services provided by any virtual guests.
For estimating your bandwidth needs, you need to take the performance of your
disks into account. While a single HDD might not saturate a 1 Gbps link,
multiple HDD OSDs per node can already saturate 10 Gbps.
If modern NVMe-attached SSDs are used, a single one can already saturate
10 Gbps of bandwidth, or more. For such high-performance setups we recommend
at least 25 Gbps, while even 40 Gbps or 100+ Gbps might be required to
utilize the full performance potential of the underlying disks.
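
One rough way to sanity-check a planned link speed is to compare the
aggregate throughput of the OSD disks in a node against the Ceph network
bandwidth; the per-disk throughput figures below are illustrative
assumptions, not measurements:

  # Minimal sketch: does the aggregate per-node disk throughput exceed the
  # Ceph network link, i.e. would the network become the bottleneck?
  def network_is_bottleneck(num_osds, disk_write_mbps, link_gbps):
      aggregate_gbps = num_osds * disk_write_mbps * 8 / 1000.0  # MB/s -> Gbit/s
      return aggregate_gbps > link_gbps

  # Example assumptions: 6 HDD OSDs at ~250 MB/s each vs. a 10 Gbps link.
  print(network_is_bottleneck(6, 250, 10))   # True  (12 Gbps > 10 Gbps)
  # A single NVMe OSD at ~3000 MB/s vs. a 25 Gbps link.
  print(network_is_bottleneck(1, 3000, 25))  # False (24 Gbps < 25 Gbps)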
If unsure, we recommend using three physically separate networks for
high-performance setups:
- one very high bandwidth (25+ Gbps) network for Ceph (internal) cluster
  traffic.
- one high bandwidth (10+ Gbps) network for Ceph (public) traffic between the
  Ceph servers and Ceph clients. Depending on your needs, this can also be
  used to host the virtual guest traffic and the VM live-migration traffic.
- one medium bandwidth (1 Gbps) network exclusively for the latency-sensitive
  corosync cluster communication.
Disks
When planning the size of your Ceph cluster, it is important to take the
recovery time into consideration. Especially with small clusters, recovery
can take a long time. It is recommended that you use SSDs instead of HDDs in
small setups to reduce recovery time, minimizing the likelihood of a
subsequent failure event during recovery.
In general, SSDs will provide more IOPS than spinning disks. With this in
mind, in addition to the higher cost, it may make sense to implement a
class-based separation of pools. Another way to speed up OSDs is to use a
faster disk as a journal or DB/Write-Ahead-Log (WAL) device, see
creating Ceph OSDs.
If a faster disk is used for multiple OSDs, a proper balance between the
number of OSDs and the WAL/DB (or journal) disk must be selected; otherwise,
the faster disk becomes the bottleneck for all linked OSDs.
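
To illustrate that balance, the sketch below compares the combined write
throughput of the OSD data disks against the shared DB/WAL device; the
throughput numbers are example assumptions only:

  # Minimal sketch: is a single fast DB/WAL device shared by several OSDs
  # likely to become the write bottleneck for those OSDs?
  def wal_db_is_bottleneck(num_osds, osd_write_mbps, wal_db_write_mbps):
      return num_osds * osd_write_mbps > wal_db_write_mbps

  # Example assumptions: 8 HDD OSDs at ~180 MB/s sharing one NVMe at ~1500 MB/s.
  print(wal_db_is_bottleneck(8, 180, 1500))   # False (1440 MB/s < 1500 MB/s)
  # The same NVMe shared by 12 such OSDs would likely become the bottleneck.
  print(wal_db_is_bottleneck(12, 180, 1500))  # True  (2160 MB/s > 1500 MB/s)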
Aside from the disk type, Ceph performs best with evenly sized disks that are
evenly distributed across the nodes. For example, 4 x 500 GB disks within
each node are better than a mixed setup with a single 1 TB disk and three
250 GB disks.
You also need to balance OSD count and single OSD capacity. More capacity
allows you to increase storage density, but it also means that a single OSD
failure forces Ceph to recover more data at once.
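
To make that trade-off concrete, here is a rough estimate of how much data a
single OSD failure puts back into recovery and how long the rebuild might
take; the utilization and recovery rate used are example assumptions:

  # Minimal sketch: rough recovery estimate after a single OSD failure.
  def recovery_hours(osd_capacity_tib, utilization, recovery_mbps):
      # Data to re-replicate is roughly the used capacity of the failed OSD.
      data_mib = osd_capacity_tib * utilization * 1024 * 1024  # TiB -> MiB
      return data_mib / recovery_mbps / 3600                   # treat MiB/s ~ MB/s

  # Example assumptions: a 70% full 16 TiB OSD, recovered at ~500 MB/s.
  print(round(recovery_hours(16, 0.7, 500), 1))  # ~6.5 hours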
Avoid RAID
As Ceph handles data object redundancy and multiple parallel writes to disks
(OSDs) on its own, using a RAID controller normally doesn’t improve
performance or availability. On the contrary, Ceph is designed to handle whole
disks on its own, without any abstraction in between. RAID controllers are not
designed for the Ceph workload and may complicate things and sometimes even
reduce performance, as their write and caching algorithms may interfere with
the ones from Ceph.
Avoid RAID controllers. Use host bus adapter (HBA) instead.