this post was submitted on 04 Aug 2023

Experiments in Ceph (with Proxmox) (lemmyonline.com)

submitted 1 year ago by xtremeownage@lemmyonline.com to c/selfhosted@lemmy.world

So, last month, my Kubernetes cluster decided to literally eat shit while I was away at a work conference.

When I returned, I decided to try something a tad different, by rolling out Proxmox to all of my servers.

Well, I am a huge fan of hyper-converged and clustered architectures for my home network / lab, so I decided to give Ceph another try.

I have previously used it with relative success on Kubernetes (via rook/ceph), and currently leverage Longhorn.

Cluster Details

Kube01 - Optiplex SFF

i7-8700 / 32G DDR4
1T Samsung 980 NVMe
128G KIOXIA NVMe (Boot disk)
512G Sata SSD
10G via ConnectX-3

Kube02 - R730XD

2x E5-2697a v4 (32c / 64t)
256G DDR4
128T of spinning disk.
2x 1T 970 evo
2x 1T 970 evo plus
A few more NVMes, and Sata
Nvidia Tesla P4 GPU.
2x Google Coral TPU
10G intel networking

Kube05 - HP z240

i5-6500 / 28G ram
2T Samsung 970 Evo plus NVMe
512G Samsung boot NVMe
10G via ConnectX-3

Kube06 - Optiplex Micro

i7-6700 / 16G DDR4
Liteon 256G Sata SSD (boot)
1T Samsung 980

Attempt number one.

I installed and configured Ceph on Kube01 and Kube05.

I used a mixture of 5x 970 evo / 970 evo plus / 980 NVMe drives, and expected it to work pretty decently.

It didn't. The IO was so bad, it was causing my servers to crash.

I ended up removing ceph, and using LVM / ZFS for the time being.

Here are some benchmarks I found online:

https://docs.google.com/spreadsheets/d/1E9-eXjzsKboiCCX-0u0r5fAjjufLKayaut_FOPxYZjc/edit#gid=0

https://www.proxmox.com/images/download/pve/docs/Proxmox-VE_Ceph-Benchmark-202009-rev2.pdf

The TL;DR, after lots of research: don't use consumer SSDs. Only use enterprise SSDs.

Attempt / Experiment Number 2.

I ended up ordering 5x 1T Samsung PM863a enterprise SATA drives.

After reinstalling Ceph, I put three of the drives into Kube05, and one more into Kube01 (no ports / power for adding more than a single SATA disk...).

Then I put the cluster together. At first, performance wasn't great (but it was still 10x the performance of the first attempt!). After updating the CRUSH map to set the failure domain to OSD rather than host, performance picked up quite dramatically.

This is due to the current imbalance of storage per host: Kube05 has 3T of drives, Kube01 has 1T, and there is no storage elsewhere.
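To make the imbalance concrete, here is a minimal Python sketch of how many replicas can land in distinct failure domains with this layout. It is a toy model, not how CRUSH actually selects OSDs. The OSD ids are made up for illustration, and a pool size of 3 is an assumption (the post does not state the pool size).

```python
from collections import Counter

# OSD layout from the post: three 1T OSDs on kube05, one on kube01.
# The OSD ids are hypothetical, chosen only for illustration.
OSDS = {
    "osd.0": "kube05",
    "osd.1": "kube05",
    "osd.2": "kube05",
    "osd.3": "kube01",
}


def placeable_replicas(osds, failure_domain, size=3):
    """Return how many of `size` replicas can sit in distinct failure domains."""
    if failure_domain == "host":
        buckets = set(osds.values())  # at most one replica per host
    elif failure_domain == "osd":
        buckets = set(osds)           # at most one replica per OSD
    else:
        raise ValueError(f"unknown failure domain: {failure_domain}")
    return min(size, len(buckets))


def replicas_per_host(chosen_osds, osds):
    """Count how many replicas of one object end up on each host."""
    return Counter(osds[osd] for osd in chosen_osds)


print(placeable_replicas(OSDS, "host"))  # only two hosts hold OSDs
print(placeable_replicas(OSDS, "osd"))   # four OSDs, so all three replicas fit
print(replicas_per_host(["osd.0", "osd.1", "osd.3"], OSDS))
```

With the failure domain at OSD level all three copies can be placed, but several of them may share a host, so losing Kube05 could take out most or all copies of some data.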

BUT.... since this was a very successful test, and it was able to deliver enough IOPS to run my I/O-heavy Kubernetes workloads, I decided to take it another step.

A few notes-

Can you guess which drive is the Samsung 980, and which drives are enterprise SATA SSDs? (Look at the latency column.)

Future - Attempt #3

The next goal is to properly distribute OSDs.

Since I am maxed out on the number of 2.5" SATA drives I can deploy, I picked up some NVMe.

5x 1T Samsung PM963 M.2 NVMe.

I picked up a pair of dual-slot half-height bifurcation cards for Kube02. This will allow me to place 4 of these into it, with dedicated bandwidth to the CPU.

The remaining one will be placed inside of Kube01, to replace the 1T Samsung 980 NVMe.

This should give me a pretty decent distribution of data, and with all enterprise drives, it should deliver pretty acceptable performance.

More to come....

top 8 comments

[–] MangoPenguin@lemmy.blahaj.zone 4 points 1 year ago (1 children)

Ceph seems neat, but the fact that it can't even function with normal SSDs points to something very wrong with how it's designed. It seems like it has an absurd overhead.

[–] xtremeownage@lemmyonline.com 2 points 1 year ago

I believe it's a data-safety thing, similar to how ZFS's ZIL works.

That is, a write isn't completed until it's actually written. With consumer SSDs, this means waiting for the write to reach the flash. With enterprise SSDs, the write only has to reach the write cache, which is safe to acknowledge thanks to PLP (power loss protection).

As with anything though, you can disable those safety features.
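A rough way to see this behaviour on a given drive is to time small writes that are each followed by a flush. The sketch below is only a queue-depth-1 probe using the Python standard library, not a substitute for a proper benchmark tool, and the file name `probe.bin` is arbitrary.

```python
import os
import tempfile
import time


def sync_write_latencies(path, count=50, block_size=4096):
    """Write `count` blocks, calling fsync after each; return per-write seconds."""
    block = b"\0" * block_size
    latencies = []
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        for _ in range(count):
            start = time.perf_counter()
            os.write(fd, block)
            os.fsync(fd)  # block until the OS reports the data as flushed to the device
            latencies.append(time.perf_counter() - start)
    finally:
        os.close(fd)
    return latencies


if __name__ == "__main__":
    # A temp directory is used here only so the sketch runs anywhere;
    # point the path at a filesystem on the drive you actually want to test.
    with tempfile.TemporaryDirectory() as tmp:
        lat = sync_write_latencies(os.path.join(tmp, "probe.bin"))
        avg_ms = 1000 * sum(lat) / len(lat)
        print(f"average synced 4K write latency: {avg_ms:.3f} ms")
```

Note that a temp directory on tmpfs lives in RAM, so the numbers only mean something when the file sits on the SSD under test.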

"absurd overhead."

Actually a massive understatement. I threw together over 5 million IOPS worth of disks, to barely squeeze 100k IOPS out of the cluster! It's EXTREMELY inefficient, compared to.... well, pretty much any other option. I mean, writing encrypted zip files to SD card storage can be faster in some circumstances. lol
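As a quick back-of-the-envelope check of those two figures (both are the round numbers quoted above, not fresh measurements):

```python
raw_iops = 5_000_000    # "over 5 million" IOPS of combined drive ratings
cluster_iops = 100_000  # roughly what the cluster delivered

efficiency = cluster_iops / raw_iops
overhead_factor = raw_iops // cluster_iops

print(f"{efficiency:.1%} of the rated IOPS, about a {overhead_factor}x reduction")
```

Since the raw figure was "over" 5 million and the cluster "barely" reached 100k, the real ratio is at most about 2%.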

But, it's reliable, fault-tolerant storage, which is instantly available (i.e., no replication, syncing, etc.).

[–] 30021190@lemmy.cloud.aboutcher.co.uk 1 points 1 year ago* (last edited 1 year ago) (1 children)

Ceph works best if you have an identical OSD quantity, type and capacity across the cluster. It also works best on a 3+ node cluster.

I ran a mixed SATA SSD/HDD 256GB/4TB cluster and it was always a bit pants. Now I have 7x 1TB SSDs per node (4 nodes) and it works fantastically.

Proxmox uses replica 3/2 with failure at host level, but you may find that EC works better for your mixed infra. As you noticed, you can't meet the 3-host failure domain, and setting the failure domain to OSD means data may be kept on a single host, so it would need to traverse the network to reach the other machine.
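For a rough sense of the capacity trade-off between replication and erasure coding, here is a small Python sketch. It uses the 4T of raw capacity from the post (3T on Kube05 plus 1T on Kube01). The EC profile k=2, m=1 is purely a hypothetical choice for illustration, and the model ignores full ratios, metadata overhead and uneven OSD placement.

```python
def usable_replicated(raw_tb, size=3):
    """Usable capacity when every object is stored `size` times."""
    return raw_tb / size


def usable_erasure_coded(raw_tb, k, m):
    """Usable capacity when objects are split into k data and m coding chunks."""
    return raw_tb * k / (k + m)


raw_tb = 4.0  # 3T on kube05 + 1T on kube01, nominal drive sizes

print(f"replica 3:   {usable_replicated(raw_tb):.2f} T usable")
print(f"EC k=2, m=1: {usable_erasure_coded(raw_tb, 2, 1):.2f} T usable")
```

Under these assumptions EC roughly doubles the usable space, at the cost of more CPU and network work per write, so whether it is actually faster on mixed hardware would need testing.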

You may also need more than a single 10Gb nic too as you might start hitting bandwidth issues.

[–] xtremeownage@lemmyonline.com 0 points 1 year ago (1 children)

"Proxmox uses replica 3/2 failure at host level"

I ended up having to set the failure domain to OSD, rather than host.... at least, until the next group of 5 enterprise SSDs arrives. Once it does, it will allow me to set up a fairly even distribution of data across all three 10G nodes.

"You may also need more than a single 10Gb nic too as you might start hitting bandwidth issues."

Knock on wood, I don't "think" I have enough heavy bandwidth loads for this to be a huge issue, at least, with the exception of when the backups are running. Most of my workloads use fast random I/O. (databases, kubernetes, etc.)

BUT.... I do have 40G networking on the R730XD already, and I have enough 40G NICs lying around to build a full mesh 40G network between those three nodes if needed.

[–] 30021190@lemmy.cloud.aboutcher.co.uk 1 points 1 year ago

So my production setup is 2x10Gb bonded NICs for networking and 2x10Gb bonded NICs for Ceph/cluster stuff. I suspect that when Ceph is being heavily used you may see bottlenecks. However, once you have host-based failure, in theory your data should be closer to the correct host and not have an issue. But on a basic level it's like having 3 copies of the data, one on each host, so it doesn't save you any storage, it just reduces the risks during failure.

Thinking about it, you may actually see better results with ZFS and replication jobs, as there are fewer overheads and ZFS sending is incremental. You'd obviously just lose X minutes of data instead of X seconds with Ceph.

[–] eros@lemmy.world 0 points 1 year ago (1 children)

Nice writeup. As long as you can throw fast drives, fast networking and plenty of RAM at it, Ceph is happy.

Ceph seems to work fine on my cluster at work. For less than $40k I replaced my whole VMware vSAN cluster and we're saving as much again in software licensing over the next 5 years with buying support from Proxmox. Also much lighter as far as administrative tasks to keep it up to date and running well.

3x Supermicro SSG-110P-NTR10

Intel Xeon Gold 5713
256 GB RAM
10 Intel D7-P5510 3.84TB NVME
2 Micron 5400 Max
Onboard dual 10GbE
Mellanox ConnectX4 Dual SFP28 25GbE
5 year NBD parts warranty

[–] xtremeownage@lemmyonline.com 3 points 1 year ago (1 children)

Have you done any measurements of IOPS? Just curious to know.

[–] eros@lemmy.world 2 points 1 year ago

I haven't, but I'll run some and try to remember to post back.