Production cluster - architecture


We had a call, and EcoBytes, IndieHosters and expressed interest in sharing a production cluster.

We need to define the desired architecture.

How we build the staging cluster depends on that, since the purpose of the staging cluster is to test features before production.

I have several questions:

  • baremetal vs cloud

My guess is that we’ll use Hetzner Cloud for the VMs, and the Hetzner auction for the baremetal machines.
Apparently has experience with physical machines.
Even at Indie, we ultimately want to have our own machines, so this looks like a great opportunity.
On the other hand, I think building on baremetal is more complex, so it’ll take more time to set up.

  • load balancer

If we go with Hetzner, we can have a failover IP, point our DNS to this IP, and make sure this IP is always attached to a healthy host.
Do you have an idea of how to achieve that on your baremetal machines?
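
To make the "always attached to a healthy host" part concrete, here is a minimal sketch of the decision logic in Python. Everything in it is an assumption for illustration: the host names, the `healthy` mapping (which would come from whatever health check we end up using), and the `reassign` callback (which would wrap the provider's failover IP API or a baremetal equivalent). It only shows when the IP should move, not how.

```python
from typing import Callable, Mapping, Optional


def choose_failover_target(
    current: Optional[str],
    healthy: Mapping[str, bool],
) -> Optional[str]:
    """Return the host that should hold the failover IP.

    Keeps the current holder if it is healthy (avoids needless moves),
    otherwise picks the first healthy host in name order.
    Returns None when no host is healthy.
    """
    if current is not None and healthy.get(current, False):
        return current
    for host in sorted(healthy):
        if healthy[host]:
            return host
    return None


def reconcile(
    current: Optional[str],
    healthy: Mapping[str, bool],
    reassign: Callable[[str], None],
) -> Optional[str]:
    """Move the failover IP only when the current holder is unhealthy.

    `reassign` is a hypothetical callback that performs the actual move.
    Returns the host holding the IP after reconciliation.
    """
    target = choose_failover_target(current, healthy)
    if target is not None and target != current:
        reassign(target)
        return target
    # Either nothing changed, or no healthy host exists: leave the IP in place.
    return current
```

Running `reconcile` periodically from more than one place would need some form of leader election, otherwise two checkers could fight over the IP. That part is left out here.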

  • persistence layer

The idea was to experiment with Rook, and try various failure modes to get a sense of whether it is reliable or not.
This means that nodes are generic: we have both storage and compute on the same nodes, and both are provisioned with k8s.
If we see the need, we can split the nodes into separate storage and compute nodes.
And if we see that Rook is not up to the job, we can build a standalone Ceph cluster.
Do you have strong opinions on this topic?

  • encryption

If we use our own hardware, it would make sense to encrypt the underlying disks. Do you have experience with that?
(I mean, not with encrypting disks as such, but with how to boot the servers in a secure way.)
Or maybe we could handle this at the Ceph level; I’ll see if that’s viable.

Then, I think we have to discuss which hardware to get and how we do staging, then buy the hardware, set up the staging cluster, and start to play!

If you have other doubts/questions, please let us know!


I think that’s a sensible approach.

That would also work, but it is relatively recent (introduced in Luminous, which was released less than one year ago). I assume the goal is not to have keys managed by the user (please correct me if I’m wrong). If that’s the case, using LUKS instead may be a better choice.