Vengeful CentOS 8 Has Been Cancelled – boot partition full, ZFS removed itself, KVM changed the rules, firewall dead


CentOS Project Cancelled, and the Kubernetes Project Is Removing Docker Support

Recent news suggests that DevOps engineers and system administrators will be working overtime for the next few months. On its way to becoming defunct, CentOS is becoming as unstable as a leading-edge beta distro.

Our server was becoming sluggish for no good reason, so I figured I should check how full the partitions were:

$ df -h

showed me that the /boot partition was 100% full, most likely the result of a failed update. The kernel-removal command given all over the net no longer works on CentOS 8, but the same advice also says to edit /etc/yum.conf and lower the number of retained kernels, after which all should be fine.

Okay, I checked how to remove some old kernels:

the usual command to remove kernels didn't work
figured that lowering installonly_limit in /etc/yum.conf should help.
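A minimal sketch of that config change. I'm using a sample file under /tmp here so a typo can't hurt the live system; on the real server the file is /etc/yum.conf, and the RHEL 8 default for installonly_limit is 3:

```shell
# Sample stand-in for /etc/yum.conf (assumption: real path is /etc/yum.conf).
cat > /tmp/yum.conf <<'EOF'
[main]
gpgcheck=1
installonly_limit=3
EOF

# Keep 2 kernels instead of the default 3.
sed -i 's/^installonly_limit=.*/installonly_limit=2/' /tmp/yum.conf

# Confirm the change took.
grep '^installonly_limit' /tmp/yum.conf
```

As the rest of this story shows, don't set it to 1: dnf treats the running kernel as protected and will refuse to operate.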

I announced to the universe that I would be doing maintenance in the morning, before people started working. The plan was concrete: reboot, thereby removing the offending kernels, then run a dnf update for good measure while I was at it.

In the morning I woke up and got straight to work. The reboot was supposed to be followed by dnf update, but the word "followed" ended up taking about two hours.

The Nightmare Started Only After I Woke Up

After rebooting, /boot was still full, so I tried running dnf commands. However, I was repeatedly warned that the operation would delete the kernel:

# dnf update
Last metadata expiration check: 1:36:12 ago on Thu 17 Dec 2020 09:07:44 AM IST.
Error:
Problem: The operation would result in removing the following protected packages: kernel-core
(try to add '--skip-broken' to skip uninstallable packages or '--nobest' to use not only best candidate packages)

Apparently this error is caused by setting installonly_limit in yum.conf too low; I think I had set it to 1.

It turns out that passing installonly_limit directly on the dnf command line does work:

dnf remove --oldinstallonly --setopt installonly_limit=2

The /boot partition was immediately cleaned up, but new surprises awaited me.

Now ZFS Got Messed Up

Apparently a recent CentOS update neutralized and utterly removed the zfs and zpool packages, so none of the VMs could boot without their storage.

Docker Sidenote

Docker Compose on the same server was using the ZFS mount for storage; since the storage wasn't there, Docker was kind enough to create an empty directory owned by root in its place.

Docker loves to drop empty directories wherever a volume path was mistakenly designated and doesn't exist. By the way, this is a feature. I already knew about these droppings, so I realized I couldn't mount the ZFS dataset while Docker's empty directory sat in the spot where the real directory was hiding inside the broken ZFS volume. Docker's replacement, CRI-O, is not supposed to run as root the way Docker does, so I hope that whatever volumes it creates by mistake can be deleted by ordinary users.
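One way to hunt down those droppings before re-mounting is to look for empty directories under the mountpoint. A sketch, using a hypothetical stand-in path (/tmp/appdata simulates the real ZFS mountpoint, postgres-data the volume Docker invented):

```shell
# Simulate docker's empty root-owned "dropping" under a stand-in mountpoint.
mkdir -p /tmp/appdata/postgres-data

# Any empty directory listed here would shadow the real data once ZFS mounts:
find /tmp/appdata -mindepth 1 -type d -empty

# Remove the empty droppings so `zfs mount` can put the real directory back.
find /tmp/appdata -mindepth 1 -type d -empty -delete
```

On the live server, eyeball the first find's output before running the -delete pass, in case a genuinely empty-but-wanted directory is in there.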

At the end of this story, I forgot to restart the Docker Compose stack that other apps and users actually depend on.

Re-installing ZFS

At some point in recent history, after the last reboot, the zfs and zpool yum packages were removed due to incompatibility with the new kernel. Reminds me of the suicidal Ubuntu GNOME of grandpa yore.

# yum install zfs

worked, but the kernel module was unaffected, so there was still no access to the VM storage.

Research showed that ZFS does not yet officially support CentOS 8.3 (and CentOS 8 will NOT be LTS).

# modprobe zfs
modprobe: FATAL: Module zfs not found in directory /lib/modules/4.18.0-193.10.el8.x86_64

modprobe zfs gave that nasty error and loaded nothing, as the output of the following was empty:

# lsmod | grep zfs

The Solution Is to Install the Testing Version of ZFS

After searching and stressing, I finally found out that the ordinary ZFS release will not support the now-demoted CentOS 8, which is itself being replaced by the CentOS Stream beta.

So I ran:

yum install http://download.zfsonlinux.org/epel/zfs-release.el8_3.noarch.rpm
sudo dnf config-manager --enable zfs-testing
sudo dnf install kernel-devel zfs

Now ZFS is back up. Yay, thrills!

So I moved on to try to launch my VMs:
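For completeness, the sanity checks after installing the testing packages looked roughly like this. These run against the live host's kernel and pools, so treat it as a sketch rather than a transcript:

```shell
modprobe zfs          # should now load cleanly against the running kernel
lsmod | grep zfs      # confirm the module is actually loaded

zpool import          # list pools that are available to import
zpool import -a       # import everything it found
zpool status          # check pool health before touching the VMs
zfs mount -a          # bring the datasets back under their mountpoints
```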

virsh start vmsteve

gives:

error: operation failed: guest CPU doesn't match specification: missing features: hle,rtm,tsx-

I guess that dnf update messed even more stuff up. This looks like a bug that was seen many months ago in Fedora, yet nothing prevented it from reaching CentOS 8.

virsh edit vmsteve

showed that the VM should have some flexibility, since it had:

virsh managedsave-dumpxml vmsteve
....
<cpu mode='host-model' check='partial'>
<topology sockets='2' dies='1' cores='1' threads='1'/>
</cpu>

but apparently KVM overrides this in the managedsave XML settings.

My older VMs were able to come up after editing the <cpu> settings in their managedsave XML:

virsh managedsave-edit vmsteve

<cpu mode='custom' match='exact' check='full'>
<model fallback='forbid'>Cascadelake-Server</model>
<vendor>Intel</vendor>
<topology sockets='2' dies='1' cores='1' threads='1'/>
<feature policy='require' name='ss'/>
<feature policy='require' name='vmx'/>
<feature policy='require' name='hypervisor'/>
<feature policy='require' name='tsc_adjust'/>
<feature policy='require' name='umip'/>
<feature policy='require' name='pku'/>
<feature policy='require' name='md-clear'/>
<feature policy='require' name='stibp'/>
<feature policy='require' name='arch-capabilities'/>
<feature policy='require' name='xsaves'/>
<feature policy='require' name='ibpb'/>
<feature policy='require' name='amd-ssbd'/>
<feature policy='require' name='rdctl-no'/>
<feature policy='require' name='ibrs-all'/>
<feature policy='require' name='skip-l1dfl-vmentry'/>
<feature policy='require' name='mds-no'/>
<feature policy='require' name='pschange-mc-no'/>
<feature policy='disable' name='hle'/>
<feature policy='disable' name='rtm'/>
</cpu>

However, the newer VMs had never been managed-saved; for those, it worked to edit the domain XML directly (this didn't work for the VMs that had managedsave XMLs):

virsh edit vmsteve

Putting the XML above in place of the simple <cpu> settings worked.

Now my VMs started.
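If hand-editing the saved image feels risky, libvirt also lets you simply discard the stale managedsave state so the guest cold-boots from its (already fixed) domain XML. A managed save is like hibernation, so the guest loses its in-memory state, but after a host kernel and CPU-model change that is often what you want anyway. Using the same example domain name:

```shell
# Throw away the stale saved state; the next start is a clean boot
# that uses the domain XML from `virsh edit`.
virsh managedsave-remove vmsteve
virsh start vmsteve
```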

Finally, No Network Connectivity

The VMs now started, but they still lacked network connectivity: no IP address, and no communication at all.

Alas! firewalld was disabled; enabling it made everything work as it had before this saga. The firewall handles the routing for the VMs, so they won't get an IP address from the LAN DHCP server without the firewall running on the host.

systemctl enable firewalld --now
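To confirm the firewall side after enabling it, something like the following (host commands, sketch only):

```shell
systemctl is-active firewalld     # expect "active"
firewall-cmd --state              # expect "running"
firewall-cmd --list-all           # inspect the zone the VM bridge sits in
```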

I fixed a little more of the stuff I had expected to deal with from the beginning, and everything was back as it was.

Salvation has now come to the land of Zion
ובא לציון גואל ("And a redeemer shall come to Zion")