Comment: do more config after the OS build
Revision 36 as of 2024-07-12 16:04:10
added Longhorn backups
Line 5: | Line 5: |
* Alma Linux 9.1 x64 | * Alma Linux 9.4 x64 |
Line 9: | Line 9: |
* 128gb SSD | * 128gb SSD SK hynix SC311 SATA M.2 * I bought the three of them for $405 in total, so $135 AUD each, in March 2023. I also later picked up an extra node to run as the controller, [[../kalina| kalina]], and whatever other services it might need. The controller is fairly heavyweight, so the Raspberry Pi won't cut it, and I don't want to jam it on the machine hosting my webserver at the same time. * Alma Linux 9.4 x64 * Lenovo !ThinkCentre M710q * Intel Core i5-6500T @ 2.50GHz (4-core no HT) * 16gb DDR4-2133 (DIMMs specced for 2667) * 256gb SSD SK hynix PC601 NVMe M.2 * I bought this one for $159 in November 2023 I've also added a dedicated Mikrotik hEX S router (model RB760iGS) to the setup, it gives a dedicated /24 subnet to the cluster, routed to the rest of my LAN but '''without using NAT'''. Now I get to learn OSPF, and BGP once I add MetalLB to the cluster for ingress. |
Line 12: | Line 23: |
I last touched this in April 2023 and it was very annoying to get as far as I did. Next time I look at it, I think I will rebuild the cluster from scratch again, and use a different guide. Something with actual explanations and a few opinions, like this one: https://github.com/hobby-kube/guide | This has been such a chore getting things working smoothly, it's just so damn finicky and it makes my notes a mess. I've tried to clean them up, but I'll keep all the failure notes in a section at the end. * Doing it mostly raw with kubeadm sucked, the docs are completely unopinionated and give you every option at every instance of there being a choice. Great if you know what you're doing, but if you know what you're doing then you don't really need those docs * This guide looks like an improvement, something with actual explanations and a few opinions: https://github.com/hobby-kube/guide * So many guides assume you're doing this in the cloud, which is a fair assumption for a beginner starting with no infra, but they make so many logical leaps that you have to fill in the gaps yourself, or they just can't be applied to your own bare metal |
Line 17: | Line 32: |
== Another rebuild attempt in late 2023 == A few changes for this one: * I'm going to use Rancher this time, or that guide linked above * Alma 9.2 because it's the latest * Move them to the "subnet" of 192.168.1.32/29 so I can configure the router to give them DHCP options easily * persica1 / 192.168.1.33 * persica2 / 192.168.1.34 * persica3 / 192.168.1.35 * Put the controller node onto asval rather than illustrious, which in this case might be the rancher docker container * asval / 192.168.1.32 (should probably be a static IP) * persica / CNAME to asval * Go with Longhorn for PVCs * Dunno what to do about ingress yet === Prepare asval controller node === ==== OS imaging ==== Using the Raspberry Pi Imager app, start with RPi OS Lite 64-bit, suitable for the RPi 3B+ It lets you make some customisations before flashing, which is really nice: * Set hostname to asval * Enable SSH * Password auth (I would use SSH but it didn't work right for me and I couldn't sudo later) * Set username and password * `pi // <something new>` * No WLAN * Set locale to Australia/Sydney, us keyboard * Disable telemetry ==== Config ==== * Login as `pi@asval` and copy your SSH key there * Install base packages {{{ apt install -y vim git screen ack }}} * Enable I2C bus for the RTC * Run `raspi-config` * Interface Options -> I2C -> Enable * `dtparam=i2c_arm=on` has already been enabled in `/boot/config.txt` for us * i2c_dev module should now/already be loaded so we're ready to go I hope * Reboot now, it can't hurt * Install i2c tools {{{ apt install -y i2c-tools }}} * Detect the device on i2c bus: `i2cdetect -y 1` * Should appear at 0x68 * Enable the kernel driver for it, or something, by adding a devicetree overlay. 
Append this to the end of /boot/config.txt {{{
dtoverlay=i2c-rtc,ds3231
}}}
 * Reboot again to load the device tree overlay that we just configured
 * Again detect the device on i2c bus: `i2cdetect -y 1`
   * Should appear at 0x68, BUT with "UU" at the address this time
 * Remove the fake hardware clock {{{
systemctl disable fake-hwclock --now
apt purge -y fake-hwclock
}}}
 * In theory everything just works now thanks to a udev rule: https://www.raspberrypi.org/forums/viewtopic.php?t=209700 {{{
root@asval:~# cat /lib/udev/rules.d/85-hwclock.rules
# Set the System Time from the Hardware Clock and set the kernel's timezone
# value to the local timezone when the kernel clock module is loaded.

KERNEL=="rtc0", RUN+="/usr/lib/udev/hwclock-set $root/$name"
}}}
 * Install chrony so it manages the hardware clock {{{
apt install -y chrony
}}}

It'll do the rest once it's installed and synced. Try some commands to see how it's faring: {{{
chronyc sources
chronyc tracking
}}}

==== Disable unneeded stuff ====

I'm using asval as a network appliance, so I don't need the wifi and bluetooth radios.

https://sleeplessbeastie.eu/2018/12/31/how-to-disable-onboard-wifi-and-bluetooth-on-raspberry-pi-3/

 1. Edit your `/boot/config.txt` and add: {{{
dtoverlay=disable-wifi
dtoverlay=disable-bt
}}}
   * The linked page above uses pi3-disable-foo, which are deprecated names
 1. Disable hciuart daemon used for bluetooth modem access {{{
systemctl disable --now hciuart
}}}
 1. 
Reboot I guess

==== TFTP server ====

 * Install the daemon {{{
apt install -y tftpd-hpa
}}}
 * Copy your stuff into `/srv/tftp` {{{
root@illustrious:~# tree /srv/tftp
/srv/tftp
├── BOOTX64.EFI
├── default.efi
├── grub
│   ├── grub.cfg
│   ├── grub.cfg-01-64-00-6a-70-e6-73 -> persica3
│   ├── grub.cfg-01-64-00-6a-78-50-ed -> persica2
│   ├── grub.cfg-01-98-90-96-be-89-52 -> persica1
│   ├── grubx64.efi
│   ├── persica1
│   ├── persica2
│   └── persica3
├── images
│   └── Alma-9.1
│       ├── initrd.img
│       └── vmlinuz
├── ipxe.efi
└── shimx64.efi

3 directories, 14 files
}}}
 * make an ssh keypair {{{
ssh-keygen -t ed25519
}}}
 * Dump your key onto the source server then steal its data

== k8s notes ==

 * Make a simple 3-node cluster
 * Single-node control plane will run externally, on illustrious
 * Use kubeadm to build the cluster: https://kubernetes.io/docs/setup/production-environment/tools/kubeadm/install-kubeadm/
 * Selected containerd as the container runtime
 * Will use Flannel as the networking plugin
 * Allocated IPs:
   * persica1 / 192.168.1.31
   * persica2 / 192.168.1.32
   * persica3 / 192.168.1.33
 * Ingress: undecided so far
 * Cgroup driver: let's use systemd
 * k8s version: whatever is latest right now (2023-04-04)

== Build notes ==

=== Per node ===

 * Update the BIOS using this guide: https://www.dell.com/support/kbdoc/en-au/000131486/update-the-dell-bios-in-a-linux-or-ubuntu-environment#updatebios2015
   * Despite the usual Dell docs saying you need to make a DOS boot disk and run the flash updater app from there, it turns out that the BIOS Flash Update target (mash F12 to get the one-time boot menu) can read the `9020MA19.exe` file from a FAT32 filesystem on a USB stick just fine
   * Not sure if this only works in UEFI mode or not, but I kinda don't care because we ''want'' to be in UEFI mode
   * This applies to systems made from 2015 or later
   * The latest BIOS update for the Optiplex 9020M is version A19, released
 * Set BIOS to full UEFI mode, no legacy
 * We'll be using DHCP, so find the MAC address 
so we can give it a consistent IP address when it boots * Add the MAC address and IP assignment to dnsmasq on calico (a pihole box) * `/etc/dnsmasq.d/02-pihole-dhcp-persica-cluster.conf` * Something like this {{{ dhcp-host=98:90:96:BE:89:52,set:persica,192.168.1.31,persica1,5m # one dhcp-host line per host dhcp-boot=tag:persica,grub/grubx64.efi,illustrious.thighhighs.top,192.168.1.12 }}} * Run `pihole restartdns` after making changes * PXE boot for kickstart install, which will hit calico for DHCP, then illustrious for the boot image and kickstart config * tftpd-hpa is running on illustrious * Upstream repo mirror: https://repo.almalinux.org/almalinux/9/BaseOS/x86_64/os/EFI/BOOT/ * Drop that content in `/srv/tftp/` {{{ root@illustrious:/srv/tftp# tree . ├── BOOTX64.EFI ├── default.efi ├── grub │ ├── grub.cfg │ ├── grub.cfg-01-98-90-96-be-89-52 │ └── grubx64.efi ├── images │ └── Alma-9.1 │ ├── initrd.img │ └── vmlinuz ├── ipxe.efi └── shimx64.efi }}} * Add a grub config fragment for the host's MAC address: `grub.cfg-01-xx-xx-xx-xx-xx-xx` * Make sure the grub config has the correct URL for its kickstart config * kickstart file served from `/data/www/illustrious/ks`: https://illustrious.thighhighs.top/ks/persica1.ks.cfg * Make sure your per-host config file has the correct name * KS references: * Reference manual: https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/9/html/performing_an_advanced_rhel_9_installation/kickstart-commands-and-options-reference_installing-rhel-as-an-experienced-user#keyboard-required_kickstart-commands-for-system-configuration * Generator tool: https://access.redhat.com/labs/kickstartconfig/ * k8s doesn't play well with swap so we need to disable it. 
Provision a minimal swap volume of 1gb, then disable it later This was useful for figuring out the TFTP stuff for the first time: https://askubuntu.com/questions/1183487/grub2-efi-boot-via-pxe-load-config-file-automatically Paths are hardcoded into the `grubx64.efi` binary, meaning HDD and PXE versions aren't the same. Make sure you put all the grub stuff in a `grub/` directory. Check the `$prefix` to see where it's searching: === UEFI settings === Get to the UEFI * Probably get stuck in windows for first boot * Win, then "UEFI", get to advanced startup options * Boot with Advanced Boot Options * Troubleshoot, Advanced Options, UEFI Firmware Settings, Restart Record details * Get the LOM MAC Address from Settings, General, System Info Change settings * General * Boot Sequence * Select UEFI boot list * Advanced Boot Options * Disable Legacy OPROMs * UEFI Boot Path Security * Set to Never * Date/Time * Set clock to approx correct for UTC time * System Configuration * Integrated NIC * Enable UEFI Network Stack * Enabled w/ PXE * SATA Operation * AHCI * SMART Reporting * Disabled, we don't need it * Audio * Disable all audio, we don't need it * Security * TPM Security * Check everything except Clear * Activated * CPU XD support * Enabled * Secure Boot * Secure Boot Enable * Disabled * Performance * Multi-core support: All * Speedstep: Enabled * C-states: Enabled * Limit CPUID: Disabled * Turboboost: Enabled * Power Management * AC Recovery: Power On * Deep Sleep Control: Disabled * USB Wake Support: Enable USB wake from Standby * Wake on LAN/WLAN: LAN with PXE Boot * Block Sleep: Enable blocking of sleep * POST Behaviour * Keyboard Errors: Disable error detection * Virtualisation support * Enable VT * Enable VT-d * Enable Trusted Execution Reboot and go back in again. * Boot only from IPv4 with NIC (PXE boot) === Ansible management after kickstart build === This is getting everything to the state where I can bootstrap the cluster. 
I should ansible'ise everything, making minimal assumptions about the kickstart part of the process. I'm keeping a simple ansible repo in `~/git/persica-ansible/` I have a basic set of roles to get the nodes into a workable state, right before I invoke `kubeadm` for the first time. {{{ |
== Intro ==

So here's the gist of this setup, third or fourth attempt now:

 * I'm going to use Rancher for the control plane
 * Alma 9.4 because it's the latest
 * Move it to a new subnet of 192.168.3.0/26 and put that behind the new Mikrotik router, helena. This means DHCP stays within the cluster, though the PXE service host is still outside.
   * persica1 / 192.168.3.3
   * persica2 / 192.168.3.4
   * persica3 / 192.168.3.5
   * kalina / 192.168.3.2
   * persica / CNAME to kalina for the Rancher web interface
 * Try using Longhorn for PVCs, though Portworx could be on the cards as well. At least I understand it now
 * Will try using MetalLB for non-http ingress

== Hardware prep for the cluster nodes ==

Set up each new node like so, it's stuff that we only need to do once, when we receive the hardware:

 * k8s nodes: [[servers/HardwarePrep/DellOptiplex9020Micro]]
 * controller: [[servers/HardwarePrep/LenovoThinkCentreM710q]]

== Prepare azusa for PXE services ==

This is needed so we can build kalina and the persica nodes consistently and easily. It can be used for other systems on the LAN as well, it's not just for this cluster.

Build [[servers/azusa]] as the network services node, directions on how to configure these components are on her page. The boot flow goes like this:

 * Client netboots in UEFI mode and performs DHCP to get an IP address and PXE options
 * helena (router) points to azusa as the PXE boot `next-server`
 * azusa serves `grubx64.efi` as the EFI bootloader, via its TFTP server
 * grub reads `grub.cfg` and fetches menu entries specific to the client, based on its MAC address, also via TFTP
 * The client boots the kickstart installer target, fetching `vmlinuz` and `initrd.img` from azusa via TFTP
 * Kickstart begins thanks to kernel cmdline options, fetching the kickstart config from azusa, now via HTTP

=== ansible management for the cluster ===

azusa will also host the ansible repo for managing the cluster.
Once a node is built with kickstart and online, we'll run an ansible playbook against it to get it up to spec. Make minimal assumptions about the kickstart part of the process, let ansible do the rest.

 * Log in as myself, `furinkan`
 * Repo for the cluster is in `~/git/ansible/`
 * Valid targets are simple: {{{
make kalina   # just the controller
make persica  # controller and k8s nodes
}}}

=== Have nice SSH config so azusa can connect to each k8s node easily ===

Make yourself a little config in `~/.ssh/config`
{{{
Host *
    User root
    IdentityFile ~/git/ansible/sshkey_ed25519
}}}

== Prepare kalina controller node ==

Now build kalina:

 1. Kickstart-build kalina using the configs on azusa
 2. Run ansible against kalina, this will configure the OS and install docker.
 3. Check that docker works {{{
docker run hello-world
}}}
 4. Push the certs from illustrious to kalina, we're using real publicly trusted CA-signed certs: https://ranchermanager.docs.rancher.com/pages-for-subheaders/rancher-on-a-single-node-with-docker#option-c-bring-your-own-certificate-signed-by-a-recognized-ca
   * On illustrious: {{{
cd /etc/ssl/
rsync -avx \
    STAR_thighhighs_top.key \
    STAR_thighhighs_top.crtbundled \
    STAR_thighhighs_top.key.2023 \
    STAR_thighhighs_top.crtbundled.2023 \
    root@kalina:/etc/ssl/
}}}
   * Then on kalina: {{{
chown root:root /etc/ssl/STAR_thighhighs_top.*
}}}

== Run Rancher on kalina ==

If you're doing this on an ARM system follow this guide, it just tells you to specify an exact version so you know it's built with arm64 support: https://ranchermanager.docs.rancher.com/how-to-guides/advanced-user-guides/enable-experimental-features/rancher-on-arm64

Well I'm on x86 now so that doesn't matter, but I'm still going to specify an exact version because I'm sensible and want a repeatable build with no surprises.
{{{
docker run -d --restart=unless-stopped \
    -p 443:443 \
    -v /etc/ssl/STAR_thighhighs_top.crtbundled:/etc/rancher/ssl/cert.pem \
    -v /etc/ssl/STAR_thighhighs_top.key:/etc/rancher/ssl/key.pem \
    --privileged \
    rancher/rancher:v2.6.6 \
    --no-cacerts
}}}

It'll take some time to start. Then you can try hitting the Rancher web UI: https://kalina.thighhighs.top/

Log in with the local user password as directed, then let it set the new admin password. Record it somewhere safe, and set the server URL to https://persica.thighhighs.top because that's how we're going to access the cluster once we're done.

== Build the k8s nodes ==

Manually kick the BIOS of each node to do a one-time PXE boot (mash F12 during POST), then let it do its thing.

== Ansible-ise the k8s nodes ==

On azusa, run ansible against the hosts to configure the OS and install docker.
{{{
make persica
}}}

== Stand up the cluster ==

We're following these instructions: https://ranchermanager.docs.rancher.com/pages-for-subheaders/use-existing-nodes

 1. From the Dashboard click the Create button
 2. Select ''Use existing nodes and create a cluster using RKE''
 3. Fill in the details
   * cluster name: persica
   * leave most options as default
   * I'm picking k8s version `v1.20.15-rancher2-2` so it matches what we run at work, and I can test upgrades at home
   * set the docker root directory to `/persist/docker` because we're moving to a disk with plenty of space, separate to the OS
   * ''Allow unsupported versions'' of Docker is already enabled; we need this because we're using a much newer distro and docker version
   * Hit Next to go to the next page
 4. Check the boxes for all three cluster roles, all nodes will perform all roles
 5. Go ahead and run the supplied command on each node.
I like to do it one at a time so I can watch it {{{
docker run -d --privileged --restart=unless-stopped --net=host -v /etc/kubernetes:/etc/kubernetes -v /var/run:/var/run rancher/rancher-agent:v2.6.6 --server https://persica.thighhighs.top --token lx5qjbl4dn7zkpbmt5qqz8qfdvtgsl2x5ft95j8lh785bxrjjccq2t --etcd --controlplane --worker
docker logs recursing_proskuriakova -f
}}}

Give it like 10min, eventually the container logs that you're following will stop, because the container terminates once all the k8s components are up and running.

== Install kubectl on controller kalina ==

This friggen sucks for older versions, no package management for you! https://kubernetes.io/docs/tasks/tools/install-kubectl-linux/#install-kubectl-binary-with-curl-on-linux

Once you've got it installed, go to Rancher and Explore the persica cluster (https://persica.thighhighs.top/dashboard/c/c-gfnh7/explorer#cluster-events), then copy the kubeconfig to your clipboard with the button in the toolbar at the top of the screen.

Go paste that into `~/.kube/config` in your account on kalina, now you can run `kubectl` there!

Add this to your `~/.bashrc` for cool [[https://kubernetes.io/docs/tasks/tools/install-kubectl-linux/#enable-shell-autocompletion| tab-completion]]:
{{{
if hash kubectl 2>/dev/null ; then
    source <(kubectl completion bash)
fi
}}}

== Install Longhorn cluster storage manager ==

This is done from the builtin Helm charts, let it go to work. It's a couple of simple clicks: https://persica.thighhighs.top/dashboard/c/c-gfnh7/apps/charts?category=Storage

For some reason the predefined things you can configure on the helm chart '''don't''' include the local path to the disk on each node. Which is pretty bloody obvious you'd think, but no. It'll default to `/var/lib/longhorn` or something unless you override it.

 1. Install into the ''System'' project
 2. Do customise helm options before install
 3. Go to the Edit YAML page and change the `defaultDataPath` to `/persist/longhorn/` instead
 4. 
Now you can run the install

I tried out this dude's demo app that uses flask and redis to deploy a trivial website, that was a nifty test of all the bits working together as expected:

 * https://ranchergovernment.com/blog/article-simple-rke2-longhorn-and-rancher-install#longhorn-gui
 * https://raw.githubusercontent.com/clemenko/k8s_yaml/master/flask_simple_nginx.yml

Blessedly the ingress just works. No idea what to do yet to make a service that presents itself on public IPs.

=== Backups ===

I tried using Backblaze B2 as an S3 backend for backups, but couldn't get it working.
{{{
error listing backup volume names: failed to execute: /var/lib/longhorn/engine-binaries/rancher-mirrored-longhornio-longhorn-engine-v1.3.3/longhorn [backup ls --volume-only s3.us-west-001.backblazeb2.com], output driver is not supported
, stderr, time="2024-07-08T12:33:34Z" level=error msg="driver is not supported"
, error exit status 1

https://forums.rancher.com/t/longhorn-ibm-object-storage-backup-configuration/19175
Here, the backup target format should be s3://<your-bucket-name>@<your-aws-region>/mypath/

s3://persica-longhorn-backups@s3.us-west-001.backblazeb2.com/

https://github.com/longhorn/longhorn/issues/1552#issuecomment-678389544

error listing backup volume names: failed to execute: /var/lib/longhorn/engine-binaries/rancher-mirrored-longhornio-longhorn-engine-v1.3.3/longhorn [backup ls --volume-only s3://persica-longhorn-backups@s3.us-west-001.backblazeb2.com/], output failed to list objects with param: { Bucket: "persica-longhorn-backups", Delimiter: "/", Prefix: "/" } error: AWS Error: MissingEndpoint 'Endpoint' configuration is required for this service <nil>
, stderr, time="2024-07-08T12:44:07Z" level=error msg="Failed to list s3" error="failed to list objects with param: {\n Bucket: \"persica-longhorn-backups\",\n Delimiter: \"/\",\n Prefix: \"/\"\n} error: AWS Error: MissingEndpoint 'Endpoint' configuration is required for this service <nil>\n" pkg=s3
time="2024-07-08T12:44:07Z" level=error msg="failed to list objects with param: {\n Bucket: \"persica-longhorn-backups\",\n Delimiter: \"/\",\n Prefix: \"/\"\n} error: AWS Error: MissingEndpoint 'Endpoint' configuration is required for this service <nil>\n"
, error exit status 1

https://persica-longhorn-backups.s3.us-west-001.backblazeb2.com
}}}

So I gave up and used NFS served up by [[servers/iowa]], which worked straight away.

 * Create the shared folder
 * Allow access from `192.168.3.0/24`
 * Give Longhorn the correctly-formatted URI for the NFS share: `nfs://iowa.thighhighs.top:/volume1/longhorn-backups`

== Prepare dummy DNS records so we can test ingress and load balancing ==

Apps need ingress, and ingress means you need hostnames to refer to stuff. Let's add these to our zone:
{{{
# Dodgy roundrobin for "load balancing" or ingress connections, which are terminated by a proxy on any node
persicanodes 300 IN A 192.168.3.3
persicanodes 300 IN A 192.168.3.4
persicanodes 300 IN A 192.168.3.5

# Now some unique names for all the apps we're going to try
app1.persica 300 IN CNAME persicanodes
app2.persica 300 IN CNAME persicanodes
app3.persica 300 IN CNAME persicanodes
app4.persica 300 IN CNAME persicanodes
app5.persica 300 IN CNAME persicanodes

# These will be BGP or Layer2 MetalLB IPs
lb1.persica 300 IN A 192.168.3.65
lb2.persica 300 IN A 192.168.3.66
lb3.persica 300 IN A 192.168.3.67
lb4.persica 300 IN A 192.168.3.68
lb5.persica 300 IN A 192.168.3.69
}}}

== Load balancing with MetalLB ==

I thought I wouldn't need it, but it looks like I do, if I want sensible useful functionality.

Here's an explanation of why I want to use MetalLB, and it's not just for BGP-based configs: https://github.com/kubernetes/ingress-nginx/blob/main/docs/deploy/baremetal.md

Install it:

 1. RTFM: https://metallb.universe.tf/installation/
 2. 
Grab the manifest and pull it into the repo, I'm using this one as it's similar to work: https://github.com/metallb/metallb/blob/v0.9/manifests/metallb.yaml
 3. Create the namespace first, I'm putting it into the System project: {{{#!yaml
apiVersion: v1
kind: Namespace
metadata:
  name: metallb-system
  # This is the System project on the prod cluster
  annotations:
    field.cattle.io/projectId: c-gfnh7:p-db8t4
  labels:
    app: metallb
}}}
 4. Create the metallb resources: `kubectl apply -f 01-metallb.yaml`
 5. Create the memberlist secret that the nodes need to communicate: `kubectl -n metallb-system create secret generic memberlist --from-literal=secretkey="$(openssl rand -base64 128)"` (in a Makefile recipe the `$` needs doubling to `$$`)
 6. Set up the configmap to configure its behaviour, they have a fully documented example here: https://github.com/metallb/metallb/blob/v0.9/manifests/example-config.yaml
 7. Apply the config: `kubectl apply -f 02-config.yaml`

=== Configure BGP ===

https://metallb.universe.tf/configuration/#bgp-configuration

Let's go for an iBGP design here - the router and the nodes all belong to the same private AS, number 64520

On helena:
{{{
/routing/bgp/connection/add name=persica1 remote.address=192.168.3.3 as=64520 local.role=ibgp
/routing/bgp/connection/add name=persica2 remote.address=192.168.3.4 as=64520 local.role=ibgp
/routing/bgp/connection/add name=persica3 remote.address=192.168.3.5 as=64520 local.role=ibgp
}}}

And in metallb we drop this config in:
{{{#!yaml
data:
  config: |
    peers:
    - peer-address: 192.168.3.1
      peer-asn: 64520
      my-asn: 64520
    address-pools:
    - name: persica-lb
      protocol: bgp
      addresses:
      - 192.168.3.64/26
      avoid-buggy-ips: true
      auto-assign: false
      bgp-advertisements:
      - aggregation-length: 32
        localpref: 100
        communities:
        - no-export
    bgp-communities:
      # "Do not advertise this route to external BGP peers"
      no-export: 65535:65281
      # "Do not advertise this route to any peer"
      no-advertise: 65535:65282
}}}

The moment I apply this, helena sees a connection from the persica nodes, awesome.
Then we just need to define a load-balanced service in k8s, and the nodes will start advertising the address. With a bit of faffing, it does just that. Had to force it to pick the IP I wanted, it uses .64 initially which I don't want. Our version doesn't respect the request by annotation, but `spec.loadBalancerIP` works (though it's deprecated).

=== Try MetalLB in Layer 2 mode first ===

NB: this is old

I'll use it in L2 mode with ARP/NDP I think. Just need to dedicate a bunch of IPs to it so it can manage the traffic to them.

Holy crap I think I got it working.

 * We'll use it in L2 mode, no BGP yet
 * Set aside 192.168.3.64 - 192.168.3.127 for load balanced services
 * Install it via Rancher helm chart interface, no config
 * Push a simple address pool and advertisement config {{{#!yaml |
Line 277: | Line 331: |
- name: Configure persica k8s cluster hosts: persica roles: - role: common tags: common - role: docker_for_kube tags: docker_for_kube - role: kube_daemons tags: kube_daemons }}} === Initialise the control plane === This is manual of course, no ansible here. https://kubernetes.io/docs/setup/production-environment/tools/kubeadm/create-cluster-kubeadm/#initializing-your-control-plane-node 1. This will be a single-node control plane, but we should specify `--control-plane-endpoint` anyway. persica1 is going to be our control plane. 2. Our Pod network add-on will be Flannel. We can specify `--pod-network-cidr` but I'll try without first. 3. It'll detect containerd 4. The default `--apiserver-advertise-address` will be fine, let it autodetect I added a custom CNAME record to local pihole (calico) and Gandi (public service), for `persica-endpoint` => `persica1`. Unlike the DHCP stuff, this is in the general DNS web interface, not a custom config file. After a bunch of faffing around to fix up the firewall config, bridge filtering kernel module, and enabling ipv4 forwarding, the init begins after passing preflight checks. 
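
For the record, the kernel-side part of that faffing (bridge filtering and IPv4 forwarding) can be sketched roughly like this. It's a minimal sketch, not the exact steps I ran: the `write_k8s_sysctl` helper and the `k8s.conf` file name are made up for illustration, and the two `bridge-nf-call` keys are an assumption borrowed from the usual kubeadm prerequisites rather than something recorded in these notes.

```shell
# Write a sysctl fragment that enables bridge netfilter hooks and IPv4 forwarding.
# $1 is the target directory (normally /etc/sysctl.d).
# Hypothetical helper; "k8s.conf" is an arbitrary file name for this sketch.
write_k8s_sysctl() {
    cat > "$1/k8s.conf" <<'EOF'
net.bridge.bridge-nf-call-iptables = 1
net.bridge.bridge-nf-call-ip6tables = 1
net.ipv4.ip_forward = 1
EOF
}

# As root on a node:
#   modprobe br_netfilter          # the bridge sysctls only exist once this is loaded
#   write_k8s_sysctl /etc/sysctl.d
#   sysctl --system                # re-read every sysctl fragment
```

Taking the directory as an argument means the helper can be tried against a scratch directory before pointing it at `/etc/sysctl.d`.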
{{{ [root@persica1 ~]# kubeadm init --control-plane-endpoint=persica-endpoint [init] Using Kubernetes version: v1.27.1 [preflight] Running pre-flight checks [WARNING Firewalld]: firewalld is active, please ensure ports [6443 10250] are open or your cluster may not function correctly [preflight] Pulling images required for setting up a Kubernetes cluster [preflight] This might take a minute or two, depending on the speed of your internet connection [preflight] You can also perform this action in beforehand using 'kubeadm config images pull' W0415 03:43:19.958609 39430 images.go:80] could not find officially supported version of etcd for Kubernetes v1.27.1, falling back to the nearest etcd version (3.5.7-0) W0415 03:43:52.646765 39430 checks.go:835] detected that the sandbox image "registry.k8s.io/pause:3.6" of the container runtime is inconsistent with that used by kubeadm. It is recommended that using "registry.k8s.io/pause:3.9" as the CRI sandbox image. [certs] Using certificateDir folder "/etc/kubernetes/pki" [certs] Generating "ca" certificate and key [certs] Generating "apiserver" certificate and key [certs] apiserver serving cert is signed for DNS names [kubernetes kubernetes.default kubernetes.default.svc kubernetes.default.svc.cluster.local persica-endpoint persica1] and IPs [10.96.0.1 192.168.1.31] [certs] Generating "apiserver-kubelet-client" certificate and key [certs] Generating "front-proxy-ca" certificate and key [certs] Generating "front-proxy-client" certificate and key [certs] Generating "etcd/ca" certificate and key [certs] Generating "etcd/server" certificate and key [certs] etcd/server serving cert is signed for DNS names [localhost persica1] and IPs [192.168.1.31 127.0.0.1 ::1] [certs] Generating "etcd/peer" certificate and key [certs] etcd/peer serving cert is signed for DNS names [localhost persica1] and IPs [192.168.1.31 127.0.0.1 ::1] [certs] Generating "etcd/healthcheck-client" certificate and key [certs] Generating "apiserver-etcd-client" 
certificate and key [certs] Generating "sa" key and public key [kubeconfig] Using kubeconfig folder "/etc/kubernetes" [kubeconfig] Writing "admin.conf" kubeconfig file [kubeconfig] Writing "kubelet.conf" kubeconfig file [kubeconfig] Writing "controller-manager.conf" kubeconfig file [kubeconfig] Writing "scheduler.conf" kubeconfig file [kubelet-start] Writing kubelet environment file with flags to file "/var/lib/kubelet/kubeadm-flags.env" [kubelet-start] Writing kubelet configuration to file "/var/lib/kubelet/config.yaml" [kubelet-start] Starting the kubelet [control-plane] Using manifest folder "/etc/kubernetes/manifests" [control-plane] Creating static Pod manifest for "kube-apiserver" [control-plane] Creating static Pod manifest for "kube-controller-manager" [control-plane] Creating static Pod manifest for "kube-scheduler" [etcd] Creating static Pod manifest for local etcd in "/etc/kubernetes/manifests" W0415 03:44:21.781505 39430 images.go:80] could not find officially supported version of etcd for Kubernetes v1.27.1, falling back to the nearest etcd version (3.5.7-0) [wait-control-plane] Waiting for the kubelet to boot up the control plane as static Pods from directory "/etc/kubernetes/manifests". This can take up to 4m0s [kubelet-check] Initial timeout of 40s passed. Unfortunately, an error has occurred: timed out waiting for the condition This error is likely caused by: - The kubelet is not running - The kubelet is unhealthy due to a misconfiguration of the node in some way (required cgroups disabled) If you are on a systemd-powered system, you can try to troubleshoot the error with the following commands: - 'systemctl status kubelet' - 'journalctl -xeu kubelet' Additionally, a control plane component may have crashed or exited when started by the container runtime. To troubleshoot, list all containers using your preferred container runtimes CLI. 
Here is one example how you may list all running Kubernetes containers by using crictl: - 'crictl --runtime-endpoint unix:///var/run/containerd/containerd.sock ps -a | grep kube | grep -v pause' Once you have found the failing container, you can inspect its logs with: - 'crictl --runtime-endpoint unix:///var/run/containerd/containerd.sock logs CONTAINERID' error execution phase wait-control-plane: couldn't initialize a Kubernetes cluster To see the stack trace of this error execute with --v=5 or higher }}} No worky :/ https://serverfault.com/questions/1116281/kubeadm-1-25-init-failed-on-debian-11-with-containerd-connection-refused Maybe I need the control plane on a separate node after all. I'll try illustrious. * copy containerd/config.toml to illustrious * apt install -y apt-transport-https ca-certificates curl * curl -fsSLo /etc/apt/trusted.gpg.d/kubernetes-archive-keyring.gpg https://packages.cloud.google.com/apt/doc/apt-key.gpg * prep repo defn {{{ |
# https://metallb.universe.tf/configuration/#layer-2-configuration
apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
  name: metallb-pool-1
  namespace: metallb-system
spec:
  addresses:
  - 192.168.3.65-192.168.3.126
---
apiVersion: metallb.io/v1beta1
kind: L2Advertisement
metadata:
  name: metallb-pool-1
  namespace: metallb-system
# Not needed because L2Advertisement claims all IPAddressPools by default
spec:
  ipAddressPools:
  - metallb-pool-1
}}}
 * Copy the existing redis service in the example, and add an external access route to it as a secondary service {{{#!yaml
---
apiVersion: v1
kind: Service
metadata:
  namespace: flask
  name: redis-ext
  labels:
    name: redis
    kubernetes.io/name: "redis"
spec:
  selector:
    app: redis
  ports:
  - name: redis
    protocol: TCP
    port: 6379
  type: LoadBalancer
}}}

It's really as simple as adding `type: LoadBalancer`, then MetalLB selects the next free IP itself and binds it.

=== Try it in BGP mode next ===

TBC

== Making ingress work - was this for the kubeadm method? ==

I don't understand this well enough, but I want to use ingress-nginx. Here's a page about it, albeit not using raw kubectl: https://kubernetes.github.io/ingress-nginx/kubectl-plugin/

Maybe this one too: https://medium.com/tektutor/using-nginx-ingress-controller-in-kubernetes-bare-metal-setup-890eb4e7772

== Things that suck ==

=== cgroups ===

Alma 9 introduces cgroups v2, which weren't a thing on CentOS 7. That means you have to deal with them now. They tend to break docker a lot, so just revert to v1 cgroups.
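
Before changing anything, it's worth confirming which cgroup hierarchy a node is actually running. A minimal sketch, assuming GNU `stat` is available (it is on Alma); the `cgroup_version` function name is made up for illustration. The check relies on the filesystem type mounted at `/sys/fs/cgroup`: `cgroup2fs` means the unified v2 hierarchy, `tmpfs` means the old v1 layout.

```shell
# Map the filesystem type of /sys/fs/cgroup to a cgroup version.
# $1: the type string as printed by `stat -fc %T /sys/fs/cgroup`
# Hypothetical helper name, not part of any tool.
cgroup_version() {
    case "$1" in
        cgroup2fs) echo v2 ;;       # unified hierarchy
        tmpfs)     echo v1 ;;       # legacy per-controller hierarchy
        *)         echo unknown ;;
    esac
}

# Report what this machine is using right now
cgroup_version "$(stat -fc %T /sys/fs/cgroup 2>/dev/null)"
```

An unmodified node should report v2, and v1 once the kernel option to revert has been applied and the node rebooted.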
How it manifests: * For context: fucking cgroups, k3s dies instantly * https://github.com/rancher/rancher/issues/35201#issuecomment-947331154 * https://groups.google.com/g/linux.debian.bugs.dist/c/Z-Cc0WmlEGA/m/NB6XGDsnAwAJ * Finally found a simple solution: https://github.com/rancher/rancher/issues/36165 Fix it: * Append an option to the kernel cmdline, this'll do it for you: {{{ grubby --update-kernel=ALL --args="systemd.unified_cgroup_hierarchy=0" }}} * Then reboot for it to take effect === Networking kernel modules === The problem: you fixed cgroups but now you get an error like this when Rancher starts up: {{{ I1125 03:57:50.129406 93 network_policy_controller.go:163] Starting network policy controller F1125 03:57:50.130225 93 network_policy_controller.go:404] failed to run iptables command to create KUBE-ROUTER-FORWARD chain due to running [/usr/bin/iptables -t filter -S KUBE-ROUTER-FORWARD 1 --wait]: exit status 3: iptables v1.8.8 (legacy): can't initialize iptables table `filter': Table does not exist (do you need to insmod?) Perhaps iptables or your kernel needs to be upgraded. panic: F1125 03:57:50.130225 93 network_policy_controller.go:404] failed to run iptables command to create KUBE-ROUTER-FORWARD chain due to running [/usr/bin/iptables -t filter -S KUBE-ROUTER-FORWARD 1 --wait]: exit status 3: iptables v1.8.8 (legacy): can't initialize iptables table `filter': Table does not exist (do you need to insmod?) Perhaps iptables or your kernel needs to be upgraded. }}} And then it explodes and the container dies. Turns out you need some iptables modules loaded. This fixed it the first time: {{{ [root@kalina ~]# modprobe iptable_nat [root@kalina ~]# modprobe br_netfilter }}} But it happened again the next time I rebuilt the cluster. 
You gotta make it stick by adding config fragments to `/etc/modules-load.d` Explanations: * This kinda describes the issue: https://slack-archive.rancher.com/t/9761163/hey-folks-i-have-a-quick-question-for-a-newbie-i-have-setup- * Yeah it turns out that the rancher container fucking dies in the arse with no explanation when you don't have the iptables modules loaded, duhhhh. I figured that out and made them load on-boot like so: https://forums.centos.org/viewtopic.php?t=72040 === Firewalls === Now whyTF can't persica2 and persica3 contact services on persica1..? Aha, firewalld is running on persica1, and it shouldn't be. Need to disable it using ansible as well. {{{ systemctl disable firewalld.service --now }}} Yeah that's jank, but hey it's what they tell you to do! https://ranchermanager.docs.rancher.com/how-to-guides/advanced-user-guides/open-ports-with-firewalld "We recommend disabling firewalld. For Kubernetes 1.19.x and higher, firewalld must be turned off." === Cleanup and try again === Find that it doesn't work and you can't make it work, awesome. Tear it all down and start again, killing every container, nuking files, and starting from scratch: https://ranchermanager.docs.rancher.com/how-to-guides/new-user-guides/manage-clusters/clean-cluster-nodes#directories-and-files Eventually, you get a cluster with three working nodes in it!! === Installing older versions of kubectl === Running an older version of k8s and need an older version of kubectl to go with it? You're shit out of luck, my friend! https://kubernetes.io/blog/2023/08/15/pkgs-k8s-io-introduction/ They moved to new package repos in 2023, and as of early 2024 the old repos are gone! The new repos only have v1.24 and newer, so if you need anything older it's just not there. Looks like our last option is: "You can directly download binaries instead of using packages. 
As an example, see ''Without a package manager'' instructions in "[[https://kubernetes.io/docs/setup/production-environment/tools/kubeadm/install-kubeadm| Installing kubeadm]]" document. And you end up here: https://kubernetes.io/docs/tasks/tools/install-kubectl-linux/#install-kubectl-binary-with-curl-on-linux Here's a modern way of defining the repo on debian-type systems btw: {{{ # Our cluster is k8s v1.23 so we can use kubectl as late as 1.24 curl -fsSL https://pkgs.k8s.io/core:/stable:/v1.24/deb/Release.key | gpg --dearmor -o /etc/apt/keyrings/kubernetes-apt-keyring.gpg |
URIs: https://apt.kubernetes.io/ Suites: kubernetes-xenial Architectures: amd64 Components: main Signed-By: /etc/apt/trusted.gpg.d/kubernetes-archive-keyring.gpg X-Repolib-ID: Kubernetes |
URIs: https://pkgs.k8s.io/core:/stable:/v1.24/deb/ Suites: / Architectures: arm64 Signed-By: /etc/apt/keyrings/kubernetes-apt-keyring.gpg |
}}} * apt update * apt install -y kubelet kubeadm kubectl * apt-mark hold kubelet kubeadm kubectl Now try kubeadm again. ---- Oh sonovabitch! Config not well described: https://github.com/containerd/containerd/issues/6964 Fixed config /etc/containerd/config.toml: {{{ version = 2 disabled_plugins = [] [plugins."io.containerd.grpc.v1.cri".containerd.runtimes] [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc] base_runtime_spec = "" cni_conf_dir = "" cni_max_conf_num = 0 container_annotations = [] pod_annotations = [] privileged_without_host_devices = false runtime_engine = "" runtime_path = "" runtime_root = "" runtime_type = "io.containerd.runc.v2" [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options] BinaryName = "" CriuImagePath = "" CriuPath = "" CriuWorkPath = "" IoGid = 0 IoUid = 0 NoNewKeyring = false NoPivotRoot = false Root = "" ShimCgroup = "" SystemdCgroup = true # They suggest pinning this image, so we'll do that. This is the out-of-box default. # https://kubernetes.io/docs/setup/production-environment/container-runtimes/#override-pause-image-containerd [plugins."io.containerd.grpc.v1.cri"] sandbox_image = "registry.k8s.io/pause:3.9" }}} We could/should be using kubeadm init with a configuration file: https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ {{{ Apr 15 04:48:26 illustrious.thighhighs.top systemd[1]: Started kubelet: The Kubernetes Node Agent. Apr 15 04:48:26 illustrious.thighhighs.top kubelet[12354]: Flag --container-runtime-endpoint has been deprecated, This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information. Apr 15 04:48:26 illustrious.thighhighs.top kubelet[12354]: Flag --pod-infra-container-image has been deprecated, will be removed in a future release. Image garbage collector will get sandbox image information from CRI. }}} But screw that. 
Because guess what, it's also poorly documented! === Initialising the control plane now actually works === {{{ kubeadm init --control-plane-endpoint=persica-endpoint Setup my `~/.kube/` config stuff as directed. Apparently this is an uber-superuser, so I shouldn't be using it regularly. Oh. cat <<EOF > kubeconfig_example.yml apiVersion: kubeadm.k8s.io/v1beta3 kind: ClusterConfiguration # Will be used as the target "cluster" in the kubeconfig clusterName: "persica" # Will be used as the "server" (IP or DNS name) of this cluster in the kubeconfig controlPlaneEndpoint: "persica-endpoint.thighhighs.top:6443" # The cluster CA key and certificate will be loaded from this local directory certificatesDir: "/etc/kubernetes/pki" EOF # on illustrious kubeadm kubeconfig user --config kubeconfig_example.yml --client-name furinkan --validity-period 8760h }}} Now try adding a pod network. We'll use Flannel, and find the docs ourselves: https://github.com/flannel-io/flannel#deploying-flannel-manually {{{ # from suomi kubectl --context=persica-admin apply -f https://github.com/flannel-io/flannel/releases/latest/download/kube-flannel.yml kubectl --context=persica-admin get pods --all-namespaces NAMESPACE NAME READY STATUS RESTARTS AGE kube-flannel kube-flannel-ds-zr6fb 0/1 CrashLoopBackOff 1 (16s ago) 34s kube-system coredns-5d78c9869d-mp7p9 0/1 ContainerCreating 0 66m kube-system coredns-5d78c9869d-tlsc6 0/1 ContainerCreating 0 66m kube-system etcd-illustrious.thighhighs.top 1/1 Running 1 66m kube-system kube-apiserver-illustrious.thighhighs.top 1/1 Running 1 66m kube-system kube-controller-manager-illustrious.thighhighs.top 1/1 Running 1 66m kube-system kube-proxy-5mntm 1/1 Running 0 66m kube-system kube-scheduler-illustrious.thighhighs.top 1/1 Running 1 66m }}} Doesn't work because we don't have the same podCIDR, and the default isn't compatible with whatever kubeadm does? FFS! 
https://devops.stackexchange.com/questions/5898/how-to-get-kubernetes-pod-network-cidr Okay so I can either nuke the cluster and reinstantiate it with podCIDR, or just reinstall the network plugin or something. Let's try the latter. * get the current podCIDR: https://devops.stackexchange.com/a/14867 * kubeadm config print init-defaults | grep serviceSubnet * wget https://github.com/flannel-io/flannel/releases/latest/download/kube-flannel.yml * Edit it * Reapply it? kubectl apply -f kube-flannel.yml * Is it still crashlooping? kubectl get pods --all-namespaces Yeah. === Fukkit try again === {{{ # on illustrious kubeadm reset rm -rf /etc/cni/net.d/ rm -rf ~/.kube/ # fix the init: https://github.com/flannel-io/flannel/issues/728#issuecomment-308878912 kubeadm init --control-plane-endpoint=persica-endpoint.thighhighs.top --pod-network-cidr=10.244.0.0/16 # Fix up my kubectl creds again # install flannel again kubectl apply -f https://github.com/flannel-io/flannel/releases/latest/download/kube-flannel.yml # is it working now? kubectl get pods --all-namespaces # IT FUCKING WORKS!! }}} Now we join some worker nodes to the cluster, finally. {{{ # on persica1 kubeadm join persica-endpoint.thighhighs.top:6443 --token FOO.FOOFOOFOO \ --discovery-token-ca-cert-hash sha256:BARBARBARBARBARBARBARBARBARBARBARBARBARBARBARBARBARBARBARBARBAR [preflight] Running pre-flight checks [preflight] Reading configuration from the cluster... [preflight] FYI: You can look at this config file with 'kubectl -n kube-system get cm kubeadm-config -o yaml' [kubelet-start] Writing kubelet configuration to file "/var/lib/kubelet/config.yaml" [kubelet-start] Writing kubelet environment file with flags to file "/var/lib/kubelet/kubeadm-flags.env" [kubelet-start] Starting the kubelet [kubelet-start] Waiting for the kubelet to perform the TLS Bootstrap... This node has joined the cluster: * Certificate signing request was sent to apiserver and a response was received. 
* The Kubelet was informed of the new secure connection details. Run 'kubectl get nodes' on the control-plane to see this node join the cluster. }}} It's joined but apparently `NotReady`: {{{ root@illustrious:~# kubectl get nodes NAME STATUS ROLES AGE VERSION illustrious.thighhighs.top NotReady control-plane 17m v1.27.1 persica1 NotReady <none> 2m7s v1.27.0 }}} Apparently coredns won't start because of taints, as described here: * https://serverfault.com/questions/1064936/coredns-pods-stuck-in-pending-state * No explanation as to why the taints aren't going away * Similar problem here * Someone says to just restart containerd Fuck yoooooouuu, now the coredns containers are running. I probably shouldn't have jumped the gun and joined all the worker nodes... I need to kick them so they start properly. {{{ root@illustrious:~# kubectl get pods --all-namespaces NAMESPACE NAME READY STATUS RESTARTS AGE kube-flannel kube-flannel-ds-4p4wd 0/1 Init:0/2 0 21m kube-flannel kube-flannel-ds-6qfrm 0/1 Init:0/2 0 12m kube-flannel kube-flannel-ds-kb94w 0/1 Init:0/2 0 12m kube-flannel kube-flannel-ds-vctrt 1/1 Running 0 30m kube-system coredns-5d78c9869d-dqnkh 1/1 Running 0 36m kube-system coredns-5d78c9869d-rbmhm 1/1 Running 0 36m kube-system etcd-illustrious.thighhighs.top 1/1 Running 2 36m kube-system kube-apiserver-illustrious.thighhighs.top 1/1 Running 2 36m kube-system kube-controller-manager-illustrious.thighhighs.top 1/1 Running 0 36m kube-system kube-proxy-8dl56 0/1 ContainerCreating 0 12m kube-system kube-proxy-dppxt 0/1 ContainerCreating 0 21m kube-system kube-proxy-ljk6c 1/1 Running 0 36m kube-system kube-proxy-t7gcn 0/1 ContainerCreating 0 12m kube-system kube-scheduler-illustrious.thighhighs.top 1/1 Running 2 36m }}} Try deleting and re-adding a node. 
From https://stackoverflow.com/a/54220808/806927 {{{ # on illustrious kubectl get nodes kubectl drain persica1 kubectl drain persica1 --ignore-daemonsets --delete-local-data kubectl delete node persica1 # on persica1 kubeadm reset then join again }}} Looks like the kube-proxy is having trouble starting on persica1. And while it's only a warning, I bet it's more significant than that. {{{ root@illustrious:~# kubectl get pods --all-namespaces NAMESPACE NAME READY STATUS RESTARTS AGE kube-flannel kube-flannel-ds-gjq5h 0/1 Init:0/2 0 3m33s kube-flannel kube-flannel-ds-vctrt 1/1 Running 0 41m kube-system coredns-5d78c9869d-dqnkh 1/1 Running 0 47m kube-system coredns-5d78c9869d-rbmhm 1/1 Running 0 47m kube-system etcd-illustrious.thighhighs.top 1/1 Running 2 47m kube-system kube-apiserver-illustrious.thighhighs.top 1/1 Running 2 47m kube-system kube-controller-manager-illustrious.thighhighs.top 1/1 Running 0 47m kube-system kube-proxy-ljk6c 1/1 Running 0 47m kube-system kube-proxy-xpv58 0/1 ContainerCreating 0 3m33s kube-system kube-scheduler-illustrious.thighhighs.top 1/1 Running 2 47m root@illustrious:~# kubectl get events --namespace=kube-system | grep pod/kube-proxy-xpv58 4m29s Normal Scheduled pod/kube-proxy-xpv58 Successfully assigned kube-system/kube-proxy-xpv58 to persica1 9s Warning FailedCreatePodSandBox pod/kube-proxy-xpv58 Failed to create pod sandbox: open /run/systemd/resolve/resolv.conf: no such file or directory # on persica1 mkdir /run/systemd/resolve ln -s /etc/resolv.conf /run/systemd/resolve/resolv.conf wtf now there's another error: root@illustrious:~# kubectl get events --namespace=kube-system | grep pod/kube-proxy-grqhf 20s Normal Scheduled pod/kube-proxy-grqhf Successfully assigned kube-system/kube-proxy-grqhf to persica1 6s Warning FailedCreatePodSandBox pod/kube-proxy-grqhf Failed to create pod sandbox: rpc error: code = InvalidArgument desc = failed to create containerd container: create container failed validation: container.Runtime.Name must 
be set: invalid argument }}} I think I haven't deployed a good containerd config everywhere yet. Deployed that, and suddenly the damn kube-proxy and kube-flannel containers are working. Now I can add the other two nodes, still need to fix the resolv.conf manually. {{{ root@illustrious:~# kubectl get nodes -o wide NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME illustrious.thighhighs.top Ready control-plane 78m v1.27.1 192.168.1.12 <none> Ubuntu 22.04.2 LTS 5.15.0-69-generic containerd://1.6.20 persica1 Ready <none> 21m v1.27.0 192.168.1.31 <none> AlmaLinux 9.1 (Lime Lynx) 5.14.0-162.6.1.el9_1.x86_64 containerd://1.6.20 persica2 Ready <none> 2m41s v1.27.0 192.168.1.32 <none> AlmaLinux 9.1 (Lime Lynx) 5.14.0-162.6.1.el9_1.x86_64 containerd://1.6.20 persica3 Ready <none> 33s v1.27.0 192.168.1.33 <none> AlmaLinux 9.1 (Lime Lynx) 5.14.0-162.6.1.el9_1.x86_64 containerd://1.6.20 }}} Good enough for now! == Making ingress work == I don't understand this well enough, but I want to use ingress-nginx. Here's a page about it, albeit not using raw kubectl: https://kubernetes.github.io/ingress-nginx/kubectl-plugin/ Maybe this one too: https://medium.com/tektutor/using-nginx-ingress-controller-in-kubernetes-bare-metal-setup-890eb4e7772 == Making load balancing work == I thought I wouldn't need it, but it looks like I do, if I want sensible useful functionality. Here's an explanation of why I want to use Metal LB, and it's not just for BGP-based configs: https://github.com/kubernetes/ingress-nginx/blob/main/docs/deploy/baremetal.md I'll use it in L2 mode with ARP/NDP I think. Just need to dedicate a bunch of IPs to it so it can manage the traffic to them. |
apt update apt install -y kubectl }}} === You can't use selinux === It just breaks way too much shit, it's not worth it. Install something new and it doesn't work? You'll forever be wondering "is it selinux" immediately after it fails. |
persica cluster
This is a cluster of three identical nodes, named persica1/2/3
- Alma Linux 9.4 x64
- Dell Optiplex 9020 Micro
- Intel Core i5-4590T @ 2.00 GHz
- 16gb DDR3-1600
- 128gb SSD SK hynix SC311 SATA M.2
- I bought the three of them for $405 in total, so $135 AUD each, in March 2023.
I also later picked up an extra node to run as the controller, kalina, and whatever other services it might need. The controller is fairly heavyweight, so the Raspberry Pi won't cut it, and I don't want to jam it on the machine hosting my webserver at the same time.
- Alma Linux 9.4 x64
- Lenovo ThinkCentre M710q
- Intel Core i5-6500T @ 2.50GHz (4-core no HT)
- 16gb DDR4-2133 (DIMMs specced for 2667)
- 256gb SSD SK hynix PC601 NVMe M.2
- I bought this one for $159 in November 2023
I've also added a dedicated Mikrotik hEX S router (model RB760iGS) to the setup. It gives the cluster a dedicated /24 subnet, routed to the rest of my LAN but without using NAT. Now I get to learn OSPF, and BGP once I add MetalLB to the cluster for ingress.
This has been such a chore getting things working smoothly, it's just so damn finicky and it makes my notes a mess. I've tried to clean them up, but I'll keep all the failure notes in a section at the end.
- Doing it mostly raw with kubeadm sucked, the docs are completely unopinionated and give you every option at every instance of there being a choice. Great if you know what you're doing, but if you know what you're doing then you don't really need those docs
- This guide looks like an improvement, something with actual explanations and a few opinions: https://github.com/hobby-kube/guide
- So many guides assume you're doing this in the cloud, which is a fair assumption for starting as a beginner with no infra, but they make so many logical leaps that you have to fill in the gaps yourself, or they just can't be applied on your own baremetal
Contents
- persica cluster
- Intro
- Hardware prep for the cluster nodes
- Prepare azusa for PXE services
- Prepare kalina controller node
- Run Rancher on kalina
- Build the k8s nodes
- Ansible-ise the k8s nodes
- Stand up the cluster
- Install kubectl on controller kalina
- Install Longhorn cluster storage manager
- Prepare dummy DNS records so we can test ingress and load balancing
- Load balancing with MetalLB
- Making ingress work - was this for the kubeadm method?
- Things that suck
Intro
So here's the gist of this setup, third or fourth attempt now:
- I'm going to use Rancher for the controlplane
- Alma 9.4 because it's the latest
- Move it to a new subnet of 192.168.3.0/26 and put that behind the new Mikrotik router, helena. This means DHCP stays within the cluster, though the PXE service host is still outside.
- persica1 / 192.168.3.3
- persica2 / 192.168.3.4
- persica3 / 192.168.3.5
- kalina / 192.168.3.2
- persica / CNAME to kalina for the Rancher web interface
- Try using Longhorn for PVCs, though Portworx could be on the cards as well. At least I understand it now
- Will try using MetalLB for non-http ingress
Hardware prep for the cluster nodes
Set up each new node like so. This is stuff we only need to do once, when we receive the hardware:
k8s nodes: servers/HardwarePrep/DellOptiplex9020Micro
controller: servers/HardwarePrep/LenovoThinkCentreM710q
Prepare azusa for PXE services
This is needed so we can build kalina and the persica nodes consistently and easily. It can be used for other systems on the LAN as well, it's not just for this cluster.
Build servers/azusa as the network services node, directions on how to configure these components are on her page.
- Client netboots in UEFI mode and performs DHCP to get an IP address and PXE options
- helena (router) points to azusa as the PXE boot next-server
- azusa serves grubx64.efi as the EFI bootloader, via its TFTP server
- grub reads grub.cfg and fetches menu entries specific to the client, based on its MAC address, also via TFTP
- The client boots the kickstart installer target, fetching vmlinuz and initrd.img from azusa via TFTP
- Kickstart begins thanks to kernel cmdline options, fetching the kickstart config from azusa, now via HTTP
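The per-client menu lookup in the boot sequence above can be sketched as a small shell helper. This is only an illustration: the `menus/` directory and the `<mac>.cfg` naming are assumptions made up for the example, not necessarily how azusa's TFTP tree is laid out.

```shell
# Hypothetical helper: turn a client MAC address into the name of a
# per-client menu file. The menus/ prefix and naming scheme are made up.
mac_to_menu_file() {
    # Lowercase the hex digits and swap colons for dashes
    mac=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]' | tr ':' '-')
    printf 'menus/%s.cfg\n' "$mac"
}

mac_to_menu_file "AA:BB:CC:00:11:22"
```

Normalising the MAC first means the file on the TFTP server can be named one way regardless of how the address was written down.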
ansible management for the cluster
azusa will also host the ansible repo for managing the cluster.
Once a node is built with kickstart and online, we'll run an ansible playbook against it to get it up to spec. Make minimal assumptions about the kickstart part of the process, let ansible do the rest.
Login as myself, furinkan
Repo for the cluster is in ~/git/ansible/
Valid targets are simple:
make kalina   # just the controller
make persica  # controller and k8s nodes
Have nice SSH config so azusa can connect to each k8s node easily
Make yourself a little config in ~/.ssh/config
Host *
    User root
    IdentityFile ~/git/ansible/sshkey_ed25519
Prepare kalina controller node
Now build kalina:
- Kickstart-build kalina using the configs on azusa
- Run ansible against kalina, this will configure the OS and install docker.
Check that docker works
docker run hello-world
Push the certs from illustrious to kalina, we're using real publicly trusted CA-signed certs: https://ranchermanager.docs.rancher.com/pages-for-subheaders/rancher-on-a-single-node-with-docker#option-c-bring-your-own-certificate-signed-by-a-recognized-ca
On illustrious:
cd /etc/ssl/
rsync -avx \
    STAR_thighhighs_top.key \
    STAR_thighhighs_top.crtbundled \
    STAR_thighhighs_top.key.2023 \
    STAR_thighhighs_top.crtbundled.2023 \
    root@kalina:/etc/ssl/
Then on kalina:
chown root:root /etc/ssl/STAR_thighhighs_top.*
Run Rancher on kalina
If you're doing this on an ARM system follow this guide, it just tells you to specify an exact version so you know it's built with arm64 support: https://ranchermanager.docs.rancher.com/how-to-guides/advanced-user-guides/enable-experimental-features/rancher-on-arm64
Well I'm on x86 now so that doesn't matter, but I'm still going to specify an exact version because I'm sensible and want a repeatable build with no surprises.
docker run -d --restart=unless-stopped \
    -p 443:443 \
    -v /etc/ssl/STAR_thighhighs_top.crtbundled:/etc/rancher/ssl/cert.pem \
    -v /etc/ssl/STAR_thighhighs_top.key:/etc/rancher/ssl/key.pem \
    --privileged \
    rancher/rancher:v2.6.6 \
    --no-cacerts
It'll take some time to start. Then you can try hitting the Rancher web UI: https://kalina.thighhighs.top/
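Rather than refreshing the browser by hand, the wait can be scripted. Here's a minimal sketch of a generic retry helper; `wait_until` is a name I made up, and the attempt count and delay are arbitrary.

```shell
# Retry a command until it succeeds or we run out of attempts.
# usage: wait_until ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
wait_until() {
    attempts=$1
    delay=$2
    shift 2
    i=1
    while [ "$i" -le "$attempts" ]; do
        # Run the remaining arguments as the command to test
        if "$@"; then
            return 0
        fi
        i=$((i + 1))
        sleep "$delay"
    done
    return 1
}
```

Something like `wait_until 60 10 curl -fsS -o /dev/null https://kalina.thighhighs.top/` would then poll the web UI every 10 seconds for up to 10 minutes, assuming curl is installed and trusts the certificate.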
Login with the local user password as directed, then let it set the new admin password. Record it somewhere safe, and set the server URL to https://persica.thighhighs.top because that's how we're going to access the cluster once we're done.
Build the k8s nodes
Manually kick the BIOS of each node to do a one-time PXE boot (mash F12 during POST), then let it do its thing.
Ansible-ise the k8s nodes
On azusa, run ansible against the hosts to configure the OS and install docker.
make persica
Stand up the cluster
We're following these instructions: https://ranchermanager.docs.rancher.com/pages-for-subheaders/use-existing-nodes
- From the Dashboard click the Create button
Select Use existing nodes and create a cluster using RKE
- Fill in the details
- cluster name: persica
- leave most options as default
I'm picking k8s version v1.20.15-rancher2-2 so it matches what we run at work, and I can test upgrades at home
set the docker root directory to /persist/docker because we're moving to a disk with plenty of space, separate to the OS
Allow unsupported versions of Docker is already enabled; we need this because we're using a much newer distro and docker version
- Hit Next to go to the next page
- Check the boxes for all three cluster roles, all nodes will perform all roles
Go ahead and run the supplied command on each node. I like to do it one at a time so I can watch it
docker run -d --privileged --restart=unless-stopped --net=host -v /etc/kubernetes:/etc/kubernetes -v /var/run:/var/run rancher/rancher-agent:v2.6.6 --server https://persica.thighhighs.top --token lx5qjbl4dn7zkpbmt5qqz8qfdvtgsl2x5ft95j8lh785bxrjjccq2t --etcd --controlplane --worker
docker logs recursing_proskuriakova -f
Give it like 10min, eventually the container logs that you're following will die, because the container terminates once all the k8s components are up and running.
Install kubectl on controller kalina
This friggen sucks for older versions, no package management for you!
Once you've got it installed, go to Rancher and Explore the persica cluster (https://persica.thighhighs.top/dashboard/c/c-gfnh7/explorer#cluster-events), then copy the kubeconfig to your clipboard with the button in the toolbar at the top of the screen.
Go paste that into ~/.kube/config in your account on kalina, now you can run kubectl there!
Add this to your ~/.bashrc for cool tab-completion:
if hash kubectl 2>/dev/null ; then
    source <(kubectl completion bash)
fi
Install Longhorn cluster storage manager
This is done from the builtin Helm charts, let it go to work. It's a couple of simple clicks: https://persica.thighhighs.top/dashboard/c/c-gfnh7/apps/charts?category=Storage
For some reason the predefined things you can configure on the helm chart don't include the local path to the disk on each node. Which is pretty bloody obvious you'd think, but no. It'll default to /var/lib/longhorn or something unless you override it.
Install into the System project
- Do customise helm options before install
Go to the Edit YAML page and change the defaultDataPath to /persist/longhorn/ instead
- Now you can run the install
I tried out this dude's demo app that uses flask and redis to deploy a trivial website, that was a nifty test of all the bits working together as expected:
https://ranchergovernment.com/blog/article-simple-rke2-longhorn-and-rancher-install#longhorn-gui
https://raw.githubusercontent.com/clemenko/k8s_yaml/master/flask_simple_nginx.yml
Blessedly the ingress just works. No idea what to do yet to make a service that presents itself on public IPs.
Backups
I tried using Backblaze B2 as an S3 backend for backups, but couldn't get it working.
error listing backup volume names: failed to execute: /var/lib/longhorn/engine-binaries/rancher-mirrored-longhornio-longhorn-engine-v1.3.3/longhorn [backup ls --volume-only s3.us-west-001.backblazeb2.com], output driver is not supported , stderr, time="2024-07-08T12:33:34Z" level=error msg="driver is not supported" , error exit status 1

https://forums.rancher.com/t/longhorn-ibm-object-storage-backup-configuration/19175
Here, the backup target format should be s3://<your-bucket-name>@<your-aws-region>/mypath/

s3://persica-longhorn-backups@s3.us-west-001.backblazeb2.com/

https://github.com/longhorn/longhorn/issues/1552#issuecomment-678389544

error listing backup volume names: failed to execute: /var/lib/longhorn/engine-binaries/rancher-mirrored-longhornio-longhorn-engine-v1.3.3/longhorn [backup ls --volume-only s3://persica-longhorn-backups@s3.us-west-001.backblazeb2.com/], output failed to list objects with param: { Bucket: "persica-longhorn-backups", Delimiter: "/", Prefix: "/" } error: AWS Error: MissingEndpoint 'Endpoint' configuration is required for this service <nil> , stderr, time="2024-07-08T12:44:07Z" level=error msg="Failed to list s3" error="failed to list objects with param: {\n Bucket: \"persica-longhorn-backups\",\n Delimiter: \"/\",\n Prefix: \"/\"\n} error: AWS Error: MissingEndpoint 'Endpoint' configuration is required for this service <nil>\n" pkg=s3 time="2024-07-08T12:44:07Z" level=error msg="failed to list objects with param: {\n Bucket: \"persica-longhorn-backups\",\n Delimiter: \"/\",\n Prefix: \"/\"\n} error: AWS Error: MissingEndpoint 'Endpoint' configuration is required for this service <nil>\n" , error exit status 1

https://persica-longhorn-backups.s3.us-west-001.backblazeb2.com
So I gave up and used NFS served up by servers/iowa, which worked straight away.
- Create the shared folder
Allow access from 192.168.3.0/24
Give Longhorn the correctly-formatted URI for the NFS share: nfs://iowa.thighhighs.top:/volume1/longhorn-backups
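The first S3 failure above ("driver is not supported") came from a target with no URI scheme at all. A rough shape check like the one below would have caught that before pasting it into Longhorn. It's a sketch only: `backup_target_kind` is a made-up name, and it only knows the two schemes that appear in these notes.

```shell
# Classify a Longhorn backup target string by its URI shape.
# Only the two forms seen in these notes are recognised.
backup_target_kind() {
    case "$1" in
        s3://*@*/*) echo s3 ;;           # s3://<bucket>@<region>/<path>
        nfs://*:/*) echo nfs ;;          # nfs://<host>:/<export>
        *)          echo unsupported ;;
    esac
}

backup_target_kind "s3.us-west-001.backblazeb2.com"
backup_target_kind "s3://persica-longhorn-backups@s3.us-west-001.backblazeb2.com/"
backup_target_kind "nfs://iowa.thighhighs.top:/volume1/longhorn-backups"
```

Passing this check says nothing about whether the backend actually works: the second S3 attempt had the right shape and still died with MissingEndpoint.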
Prepare dummy DNS records so we can test ingress and load balancing
Apps need ingress, and ingress means you need hostnames to refer to stuff. Let's add these to our zone:
# Dodgy roundrobin for "load balancing" or ingress connections, which are terminated by a proxy on any node
persicanodes 300 IN A 192.168.3.3
persicanodes 300 IN A 192.168.3.4
persicanodes 300 IN A 192.168.3.5

# Now some unique names for all the apps we're going to try
app1.persica 300 IN CNAME persicanodes
app2.persica 300 IN CNAME persicanodes
app3.persica 300 IN CNAME persicanodes
app4.persica 300 IN CNAME persicanodes
app5.persica 300 IN CNAME persicanodes

# These will be BGP or Layer2 MetalLB IPs
lb1.persica 300 IN A 192.168.3.65
lb2.persica 300 IN A 192.168.3.66
lb3.persica 300 IN A 192.168.3.67
lb4.persica 300 IN A 192.168.3.68
lb5.persica 300 IN A 192.168.3.69
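If more load balancer names are needed later, the repetitive lbN records can be generated rather than typed. A small sketch, where `gen_lb_records` is a hypothetical helper and the TTL and naming simply copy the records above:

```shell
# Print A records lbN.persica for COUNT consecutive IPs,
# starting at 192.168.3.FIRST.
gen_lb_records() {
    count=$1
    first=$2
    n=1
    while [ "$n" -le "$count" ]; do
        printf 'lb%d.persica 300 IN A 192.168.3.%d\n' "$n" $((first + n - 1))
        n=$((n + 1))
    done
}

gen_lb_records 5 65
```

Run as shown, it should print the same five lb records as above, ready to paste into the zone.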
Load balancing with MetalLB
I thought I wouldn't need it, but it looks like I do, if I want sensible useful functionality. Here's an explanation of why I want to use Metal LB, and it's not just for BGP-based configs: https://github.com/kubernetes/ingress-nginx/blob/main/docs/deploy/baremetal.md
Install it:
Grab the manifest and pull it into the repo, I'm using this one as it's similar to work: https://github.com/metallb/metallb/blob/v0.9/manifests/metallb.yaml
Create the namespace first, I'm putting it into the System project:
apiVersion: v1
kind: Namespace
metadata:
  name: metallb-system
  # This is the System project on the prod cluster
  annotations:
    field.cattle.io/projectId: c-gfnh7:p-db8t4
  labels:
    app: metallb
Create the metallb resources: kubectl apply -f 01-metallb.yaml
Create the memberlist secret that the nodes need to communicate: kubectl -n metallb-system create secret generic memberlist --from-literal=secretkey="$(openssl rand -base64 128)"
Setup the configmap to configure its behaviour, they have a fully documented example here: https://github.com/metallb/metallb/blob/v0.9/manifests/example-config.yaml
Apply the config: kubectl apply -f 02-config.yaml
Configure BGP
https://metallb.universe.tf/configuration/#bgp-configuration
Let's go for an iBGP design here - we both belong to the same private AS, number 64520
On helena:
/routing/bgp/connection/add name=persica1 remote.address=192.168.3.3 as=64520 local.role=ibgp
/routing/bgp/connection/add name=persica2 remote.address=192.168.3.4 as=64520 local.role=ibgp
/routing/bgp/connection/add name=persica3 remote.address=192.168.3.5 as=64520 local.role=ibgp
And in metallb we drop this config in:
data:
  config: |
    peers:
    - peer-address: 192.168.3.1
      peer-asn: 64520
      my-asn: 64520
    address-pools:
    - name: persica-lb
      protocol: bgp
      addresses:
      - 192.168.3.64/26
      avoid-buggy-ips: true
      auto-assign: false
      bgp-advertisements:
      - aggregation-length: 32
        localpref: 100
        communities:
        - no-export
    bgp-communities:
      # "Do not advertise this route to external BGP peers"
      no-export: 65535:65281
      # "Do not advertise this route to any peer"
      no-advertise: 65535:65282
The moment I apply this, helena sees a connection from the persica nodes, awesome.
Then we just need to define a loadbalanced service in k8s, and the nodes will start advertising the address.
With a bit of faffing, it does just that. Had to force it to pick the IP I wanted, it uses .64 initially which I don't want. Our version doesn't respect the request by annotation, but spec.loadBalancerIP works (though it's deprecated).
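For reference, here's roughly what pinning the IP looks like. This is a sketch, not the manifest I actually applied: `render_lb_service` is a made-up helper, and the service name, namespace, label and port in the example call are placeholders.

```shell
# Emit a LoadBalancer Service manifest that pins a specific IP.
# usage: render_lb_service NAME NAMESPACE APP_LABEL PORT IP
render_lb_service() {
    cat <<EOF
---
apiVersion: v1
kind: Service
metadata:
  name: $1
  namespace: $2
spec:
  type: LoadBalancer
  # Deprecated field, but the one this MetalLB version honoured
  loadBalancerIP: $5
  selector:
    app: $3
  ports:
  - protocol: TCP
    port: $4
EOF
}

# All values here are placeholders for illustration
render_lb_service demo-ext default demo 8080 192.168.3.65
```

Pipe the output into `kubectl apply -f -` to create the service. Because the pool has auto-assign turned off, a service that doesn't ask for an address explicitly shouldn't be handed one from it.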
Try MetalLB in Layer 2 mode first
NB: this is old
I'll use it in L2 mode with ARP/NDP I think. Just need to dedicate a bunch of IPs to it so it can manage the traffic to them.
Holy crap I think I got it working.
- We'll use it in L2 mode, no BGP yet
- Set aside 192.168.3.64 - 192.168.3.127 for load balanced services
- Install it via Rancher helm chart interface, no config
Push a simple address pool and advertisement config
---
# https://metallb.universe.tf/configuration/#layer-2-configuration
apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
  name: metallb-pool-1
  namespace: metallb-system
spec:
  addresses:
  - 192.168.3.65-192.168.3.126
---
apiVersion: metallb.io/v1beta1
kind: L2Advertisement
metadata:
  name: metallb-pool-1
  namespace: metallb-system
# Not needed because L2Advertisement claims all IPAddressPools by default
spec:
  ipAddressPools:
  - metallb-pool-1
Copy the existing redis service in the example, and add an external access route to it as a secondary service
---
apiVersion: v1
kind: Service
metadata:
  namespace: flask
  name: redis-ext
  labels:
    name: redis
    kubernetes.io/name: "redis"
spec:
  selector:
    app: redis
  ports:
  - name: redis
    protocol: TCP
    port: 6379
  type: LoadBalancer
It's really as simple as adding type: LoadBalancer, then MetalLB selects the next free IP itself and binds it.
Try it in BGP mode next
TBC
Making ingress work - was this for the kubeadm method?
I don't understand this well enough, but I want to use ingress-nginx. Here's a page about it, albeit not using raw kubectl: https://kubernetes.github.io/ingress-nginx/kubectl-plugin/
Maybe this one too: https://medium.com/tektutor/using-nginx-ingress-controller-in-kubernetes-bare-metal-setup-890eb4e7772
Things that suck
cgroups
Alma 9 uses cgroups v2 by default, which weren't a thing on CentOS 7. That means you have to deal with them now. They tend to break docker a lot, so just revert to v1 cgroups.
How it manifests:
- For context: fucking cgroups, k3s dies instantly
https://github.com/rancher/rancher/issues/35201#issuecomment-947331154
https://groups.google.com/g/linux.debian.bugs.dist/c/Z-Cc0WmlEGA/m/NB6XGDsnAwAJ
Finally found a simple solution: https://github.com/rancher/rancher/issues/36165
Fix it:
Append an option to the kernel cmdline, this'll do it for you:
grubby --update-kernel=ALL --args="systemd.unified_cgroup_hierarchy=0"
- Then reboot for it to take effect
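After the reboot it's worth checking that the change actually stuck. A minimal sketch, going by the filesystem type mounted at /sys/fs/cgroup (cgroup2fs means v2, tmpfs means v1); the function names are my own invention.

```shell
# Map the filesystem type mounted at /sys/fs/cgroup to a cgroup version.
cgroup_version_from_fstype() {
    case "$1" in
        cgroup2fs) echo "v2" ;;
        tmpfs)     echo "v1" ;;
        *)         echo "unknown" ;;
    esac
}

# Report the cgroup version of the running system.
current_cgroup_version() {
    cgroup_version_from_fstype "$(stat -fc %T /sys/fs/cgroup 2>/dev/null)"
}
```

Running `current_cgroup_version` on a node should print v1 once the kernel argument has taken effect.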
Networking kernel modules
The problem: you fixed cgroups but now you get an error like this when Rancher starts up:
I1125 03:57:50.129406 93 network_policy_controller.go:163] Starting network policy controller
F1125 03:57:50.130225 93 network_policy_controller.go:404] failed to run iptables command to create KUBE-ROUTER-FORWARD chain due to running [/usr/bin/iptables -t filter -S KUBE-ROUTER-FORWARD 1 --wait]: exit status 3: iptables v1.8.8 (legacy): can't initialize iptables table `filter': Table does not exist (do you need to insmod?) Perhaps iptables or your kernel needs to be upgraded.
panic: F1125 03:57:50.130225 93 network_policy_controller.go:404] failed to run iptables command to create KUBE-ROUTER-FORWARD chain due to running [/usr/bin/iptables -t filter -S KUBE-ROUTER-FORWARD 1 --wait]: exit status 3: iptables v1.8.8 (legacy): can't initialize iptables table `filter': Table does not exist (do you need to insmod?) Perhaps iptables or your kernel needs to be upgraded.
And then it explodes and the container dies.
Turns out you need some iptables modules loaded. This fixed it the first time:
[root@kalina ~]# modprobe iptable_nat
[root@kalina ~]# modprobe br_netfilter
But it happened again the next time I rebuilt the cluster. You gotta make it stick by adding config fragments to /etc/modules-load.d
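A minimal sketch of making it stick: systemd-modules-load reads *.conf files from /etc/modules-load.d at boot, one module name per line. The filename k8s.conf and the function name are arbitrary choices of mine; the directory is a parameter so you can try it somewhere harmless first.

```shell
# Write a modules-load.d fragment listing the two modules that had to be
# modprobe'd by hand. The target directory defaults to /etc/modules-load.d;
# the k8s.conf filename is an arbitrary (hypothetical) choice.
write_modules_conf() {
    conf_dir="${1:-/etc/modules-load.d}"
    printf '%s\n' iptable_nat br_netfilter > "${conf_dir}/k8s.conf"
}
```

Run `write_modules_conf` as root with no arguments to write the real file, then reboot (or modprobe once more) and confirm with `lsmod | grep br_netfilter`.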
Explanations:
This kinda describes the issue: https://slack-archive.rancher.com/t/9761163/hey-folks-i-have-a-quick-question-for-a-newbie-i-have-setup-
Yeah it turns out that the rancher container fucking dies in the arse with no explanation when you don't have the iptables modules loaded, duhhhh. I figured that out and made them load on-boot like so: https://forums.centos.org/viewtopic.php?t=72040
Firewalls
Now whyTF can't persica2 and persica3 contact services on persica1..? Aha, firewalld is running on persica1, and it shouldn't be. Need to disable it using ansible as well.
systemctl disable firewalld.service --now
Yeah that's jank, but hey it's what they tell you to do! https://ranchermanager.docs.rancher.com/how-to-guides/advanced-user-guides/open-ports-with-firewalld
"We recommend disabling firewalld. For Kubernetes 1.19.x and higher, firewalld must be turned off."
Cleanup and try again
Find that it doesn't work and you can't make it work, awesome. Tear it all down and start again, killing every container, nuking files, and starting from scratch: https://ranchermanager.docs.rancher.com/how-to-guides/new-user-guides/manage-clusters/clean-cluster-nodes#directories-and-files
Eventually, you get a cluster with three working nodes in it!!
Installing older versions of kubectl
Running an older version of k8s and need an older version of kubectl to go with it? You're shit out of luck, my friend!
https://kubernetes.io/blog/2023/08/15/pkgs-k8s-io-introduction/
They moved to new package repos in 2023, and as of early 2024 the old repos are gone! The new repos only have v1.24 and newer, so if you need anything older it's just not there.
Looks like our last option is: "You can directly download binaries instead of using packages. As an example, see Without a package manager instructions in "Installing kubeadm" document."
And you end up here: https://kubernetes.io/docs/tasks/tools/install-kubectl-linux/#install-kubectl-binary-with-curl-on-linux
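The binary download boils down to a predictable URL on dl.k8s.io. A rough sketch, assuming you want a 1.23.x patch release such as v1.23.17 (pick whichever matches your cluster) and that the function name is just something I made up:

```shell
# Build the download URL for a kubectl release binary on Linux.
# Arguments: version (e.g. v1.23.17), architecture (e.g. amd64 or arm64).
kubectl_download_url() {
    printf 'https://dl.k8s.io/release/%s/bin/linux/%s/kubectl\n' "$1" "$2"
}
```

Then it's `curl -fLO "$(kubectl_download_url v1.23.17 amd64)"`, followed by `install -m 0755 kubectl /usr/local/bin/kubectl` to put it on the path.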
Here's a modern way of defining the repo on debian-type systems btw:
# Our cluster is k8s v1.23 so we can use kubectl as late as 1.24
curl -fsSL https://pkgs.k8s.io/core:/stable:/v1.24/deb/Release.key | gpg --dearmor -o /etc/apt/keyrings/kubernetes-apt-keyring.gpg

cat <<EOF > /etc/apt/sources.list.d/kubernetes.sources
X-Repolib-Name: Kubernetes
Enabled: yes
Types: deb
URIs: https://pkgs.k8s.io/core:/stable:/v1.24/deb/
Suites: /
Architectures: arm64
Signed-By: /etc/apt/keyrings/kubernetes-apt-keyring.gpg
EOF

apt update
apt install -y kubectl
You can't use selinux
It just breaks way too much shit, it's not worth it. Install something new and it doesn't work? You'll forever be wondering "is it selinux" immediately after it fails.