Unplanned maintenance [fixed]

pierreok · December 22, 2019, 8:16

Our Kubernetes cluster is having issues right now. I’m investigating and will keep you posted on the status as I progress.

pierreok · December 22, 2019, 8:26

We hit a limit: containers are consuming all the IPs available on each node, which is causing this disruption.

Yesterday we provisioned more services, which brought us closer to that limit. Last night the backups kicked off and used up even more IPs, until we ended up in the downtime we are in now.

I’ll increase the number of IPs available per node, which means I’ll need to reboot the whole cluster.
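For anyone wondering why a per-node limit exists: each node gets its own pod subnet, and the size of that subnet caps how many pod IPs the node can hand out. Here is a rough sketch of the arithmetic. The subnets below are made-up examples, not our actual ranges, and the real usable count depends on the CNI plugin (some also reserve a gateway address), so treat the result as an upper bound.

```python
import ipaddress


def pod_ips_per_node(node_cidr: str) -> int:
    """Upper bound on pod IPs in one node's subnet.

    Subtracts the network and broadcast addresses. This is a simplifying
    assumption: the exact number reserved depends on the CNI plugin in use.
    """
    net = ipaddress.ip_network(node_cidr)
    return net.num_addresses - 2


# Hypothetical node subnets, for illustration only.
for cidr in ("10.244.1.0/25", "10.244.1.0/24", "10.244.0.0/23"):
    print(cidr, pod_ips_per_node(cidr))
```

Each bit removed from the prefix length roughly doubles the addresses a node can give to containers, which is why widening the per-node range gives headroom for things like backup jobs that start many short-lived pods.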

I’ll do this now so that we start from a clean state and don’t hit this disruption again.

You might see some of your services restart. If you have the high availability option, you should be fine.

I’ll let you know as I progress.

pierreok · December 22, 2019, 10:17

Ok, I couldn’t perform the maintenance right away because of another issue on our storage cluster.

This is now fixed.

I also tried the procedure on our ingress servers and all went well, so I’ll start rolling it out to our 3 workers.
It should be over in less than 30 minutes.

pierreok · December 22, 2019, 11:33

Ok, everything is back to normal.

For the curious nerds:

The Ceph storage cluster was not using the 10G backend network for all OSDs. This is now fixed, so it might even be a bit faster.
I updated the node CIDR range, deleted every node, and recreated them with kubeadm.
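For the even more curious, a small sketch of how a cluster-wide pod CIDR gets carved into per-node ranges. In kubeadm-style clusters the controller manager assigns each node a slice of the cluster pod CIDR, with the slice size set by its `--node-cidr-mask-size` flag. A node keeps the range it was given when it registered, which is most likely why the nodes had to be deleted and recreated to pick up the new size. The cluster CIDR and mask size below are illustrative assumptions, not our real values, and `node_ranges` is a hypothetical helper.

```python
import ipaddress
from itertools import islice


def node_ranges(cluster_cidr: str, node_mask: int, count: int) -> list[str]:
    """Return the first `count` per-node subnets carved out of the cluster CIDR.

    Mimics sequential allocation of equally sized node ranges; the real
    allocator may hand ranges out in a different order.
    """
    cluster = ipaddress.ip_network(cluster_cidr)
    subnets = cluster.subnets(new_prefix=node_mask)
    return [str(s) for s in islice(subnets, count)]


# Assumed values: a /16 pod network split into /23 node ranges for 3 workers.
print(node_ranges("10.244.0.0/16", 23, 3))
```

The trade-off is that larger node ranges mean fewer of them fit in the cluster CIDR: under these assumed values, a /16 split into /23 slices leaves room for 128 nodes instead of the 256 you would get with /24 slices.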

All green, time to go for a walk in the forest!