Managing Kubernetes Nodes with Ansible

Repeatable node lifecycle automation for Rancher-managed Kubernetes clusters.

This repository provides Ansible automation for administering nodes in Rancher-managed Kubernetes clusters. Its goal is to make node operations repeatable: discover existing cluster nodes, prepare new servers, join them to a cluster, and run routine maintenance through playbooks instead of one-off shell commands.

Scope

Use this repo to manage the node lifecycle around Kubernetes:

discover existing Kubernetes nodes and build Ansible inventory
configure baseline packages, users, firewall, and OS prerequisites
prepare a new node before it joins a cluster
fetch Rancher registration data through the Rancher API
execute the Rancher node registration command on the target host
run routine node operations such as package updates and service restarts

Rancher remains the source of truth for Kubernetes cluster membership. Ansible prepares nodes and builds the Rancher registration command from data fetched through the Rancher API. Do not use this repo to create Kubernetes clusters, bypass Rancher's RKE2 join flow, or replace Rancher's cluster management model.

Start with GETTING_STARTED.md if this is a fresh clone.

Operator Workflow

For most work, the flow is:

configure local access
-> refresh inventory from Kubernetes
-> choose a target with --limit
-> run the needed playbook
-> refresh inventory again if cluster membership changed

Basic setup:

make configure
make doctor
make graph
make ping

Per-user configuration is stored in config.yml, which is gitignored and auto-loaded by Ansible and the helper scripts. You do not need to source it.

Requirements

The runner needs:

ansible and ansible-playbook
kubectl
kubeconfig contexts for the clusters you want to manage
SSH access to the managed nodes
for Rancher node joins, a Rancher API token from a dedicated remote/external Rancher user

Managed nodes do not need Ansible installed; they need SSH access and Python for Ansible modules.

Do not use the local Rancher admin account for automation tokens.

Install repo dependencies:

make install

make install installs Ansible, Ansible collections, and local helper dependencies. It does not install kubectl; install kubectl separately from the Kubernetes project documentation or your package manager.

Inventory Model

Inventory has two sources:

static/manual inventory in inventory/*.yml
dynamic Kubernetes inventory from inventory/k8s-nodes.sh

inventory/k8s-nodes.sh is an Ansible dynamic inventory script. On every Ansible run it calls kubectl get nodes for each context configured in KUBE_CONTEXTS. There is no cached file to refresh: a node that was added or removed shows up on the next playbook invocation.

Context names become Ansible group names by replacing - with _:

kubectl context: gem-cluster-01
Ansible group:   gem_cluster_01
group vars:      inventory/group_vars/gem_cluster_01.yml
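
The mapping above is a plain character substitution. A minimal shell sketch of it follows; the function names are hypothetical and are not taken from inventory/k8s-nodes.sh:

```shell
# Hypothetical helper: derive the Ansible group name from a kubectl context
# name by replacing every "-" with "_", as described above.
context_to_group() {
  printf '%s\n' "$1" | sed 's/-/_/g'
}

# Hypothetical helper: path of the group vars file that belongs to a context.
context_to_group_vars() {
  printf 'inventory/group_vars/%s.yml\n' "$(context_to_group "$1")"
}
```

For example, context_to_group gem-cluster-01 prints gem_cluster_01, which is the name to pass as target_cluster.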

Per Kubernetes node, the generated inventory stores:

private_ipv4: Kubernetes InternalIP
public_ipv4: Kubernetes ExternalIP, or mintfit.io/public-ip annotation
kubernetes_version: kubelet version
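
The public_ipv4 fallback can be pictured as a small selection function. This is only an illustration of the rule above, assuming the ExternalIP is preferred when both values exist; the function name and argument order are made up for the example:

```shell
# Hypothetical helper: choose public_ipv4 for a node.
# $1 = Kubernetes ExternalIP (may be empty)
# $2 = value of the mintfit.io/public-ip annotation (may be empty)
pick_public_ipv4() {
  local external_ip="$1" annotation_ip="$2"
  if [ -n "$external_ip" ]; then
    printf '%s\n' "$external_ip"
  elif [ -n "$annotation_ip" ]; then
    printf '%s\n' "$annotation_ip"
  else
    return 1   # no public address is known for this node
  fi
}
```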

The main group hierarchy is:

all
|-- bastion
|-- cluster_nodes
|   |-- gem_cluster_01
|   |   |-- gem_cluster_01_controlplane
|   |   `-- gem_cluster_01_worker
|   `-- gem_mgmt
`-- manual
    |-- manual_bastion
    |-- manual_controlplane
    `-- manual_worker

Tag groups such as bare_metal, cloud_instance, privileged, and unprivileged are built from per-tag Kubernetes labels:

kubectl --context=gem-cluster-01 label node <node-name> \
  mintfit.io/group.bare_metal=true mintfit.io/group.unprivileged=true

The next ansible-playbook (or make graph) invocation picks up the new labels automatically. As a side benefit, the same labels work as kubectl selectors: kubectl get nodes -l mintfit.io/group.bare_metal=true lists the tagged nodes directly.

Allowed tag names live in tag-names.yml. Labels with an unlisted tag are ignored with a warning when the dynamic inventory runs.
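
The allow-list behaviour can be sketched in shell as follows. The real check lives in the dynamic inventory script and reads tag-names.yml; this sketch takes the allowed names as arguments instead and ignores the label value, both of which are simplifying assumptions:

```shell
# Hypothetical sketch: map a node label key to a tag group name and ignore
# tags that are not in the allowed list.
# $1 = label key, e.g. mintfit.io/group.bare_metal
# remaining arguments = allowed tag names
label_to_tag_group() {
  local label="$1"
  shift
  local prefix="mintfit.io/group."
  case "$label" in
    "$prefix"*) ;;
    *) return 1 ;;                      # not a tag label at all
  esac
  local tag="${label#"$prefix"}"
  local allowed
  for allowed in "$@"; do
    if [ "$tag" = "$allowed" ]; then
      printf '%s\n' "$tag"
      return 0
    fi
  done
  printf 'warning: ignoring unlisted tag "%s"\n' "$tag" >&2
  return 1
}
```

Calling label_to_tag_group mintfit.io/group.bare_metal bare_metal privileged prints bare_metal, while an unlisted tag prints a warning on stderr and returns a non-zero status.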

Connection Model

The same inventory works from a local workstation and from an automation runner inside the cluster.

`network_location`	Address used by Ansible	ProxyJump
`local`	Bastion via public IP; other nodes via private IP	yes, through the cluster bastion
`cluster`	private IP for every node	no

Set network_location with make configure, or override per run:

ansible-playbook playbooks/ping.yml -e network_location=cluster

Each cluster defines its bastion in inventory/group_vars/<cluster>.yml with bastion_inventory_name.
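
The address rules in the table above can be written down as a short decision function. This is a sketch for reasoning about connectivity problems, not the inventory's actual variable logic; the function name and its arguments are hypothetical:

```shell
# Hypothetical sketch of the address choice per network_location.
# $1 = network_location (local | cluster)
# $2 = "yes" if the host is the cluster bastion, anything else otherwise
# $3 = public_ipv4, $4 = private_ipv4
connection_address() {
  local location="$1" is_bastion="$2" public_ip="$3" private_ip="$4"
  case "$location" in
    local)
      if [ "$is_bastion" = "yes" ]; then
        printf '%s\n' "$public_ip"
      else
        printf '%s\n' "$private_ip"   # reached through ProxyJump via the bastion
      fi
      ;;
    cluster)
      printf '%s\n' "$private_ip"     # runner is inside the private network
      ;;
    *)
      printf 'unknown network_location: %s\n' "$location" >&2
      return 2
      ;;
  esac
}
```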

New Node Lifecycle

Use this workflow for a fresh server that should join a Rancher-managed cluster.

1. Add The Node Manually

Add the new host to inventory/05-manual.yml under manual_worker or manual_controlplane.

Manual hosts connect as root through their public IP and do not use the cluster bastion.

Example:

manual:
  children:
    manual_worker:
      hosts:
        gem-c01-w03:
          public_ipv4: "178.105.200.132"
          private_ipv4: "10.0.0.13"

2. Prepare The Selected Node Profile

Each cluster defines one or more node profiles in inventory/group_vars/<cluster>.yml. A node profile is the supported node contract for that cluster: OS family/version, kernel policy, kernel modules, and required node packages.

The selected profile is controlled by node_profile_name:

node_profiles:
  ubuntu_24_04_rke2_calico_longhorn:
    os:
      distribution: Ubuntu
      major_version: "24"
    kernel:
      install_version: "6.8.0-106-generic"
      allowed_regex: "^6\\.8\\.0-106-generic$"
    required_kernel_modules:
      - br_netfilter
      - overlay
    required_packages: []

node_profile_name: ubuntu_24_04_rke2_calico_longhorn
node_profile: "{{ node_profiles[node_profile_name] }}"

Roles still consume simple compatibility variables. The cluster vars derive those from the selected profile:

node_kernel_version: "{{ node_profile.kernel.install_version }}"
rke2_prep_allowed_distribution: "{{ node_profile.os.distribution }}"
rke2_prep_allowed_distribution_major_version: "{{ node_profile.os.major_version }}"
rke2_prep_allowed_kernel_regex: "{{ node_profile.kernel.allowed_regex }}"
rke2_prep_required_kernel_modules: "{{ node_profile.required_kernel_modules | default([]) }}"
rke2_prep_required_packages: "{{ node_profile.required_packages | default([]) }}"

If the node does not run the selected profile's kernel, run:

ansible-playbook playbooks/node-kernel.yml \
  --limit gem-c01-w03 \
  -e target_cluster=gem_cluster_01 \
  -e node_kernel_auto_reboot=true

For gem_cluster_01, the currently selected profile installs and allows:

node_profile_name: ubuntu_24_04_rke2_calico_longhorn
node_profile:
  kernel:
    install_version: "6.8.0-106-generic"
    allowed_regex: "^6\\.8\\.0-106-generic$"

Package holds are disabled by default. Enable them only when you intentionally want to freeze kernel packages:

ansible-playbook playbooks/node-kernel.yml \
  --limit gem-c01-w03 \
  -e target_cluster=gem_cluster_01 \
  -e node_kernel_auto_reboot=true \
  -e node_kernel_hold=true

3. Configure Rancher API Access

The join playbook normally uses the Rancher API to fetch the cluster registration token and build the registration command itself.

In inventory/group_vars/<cluster>.yml:

rancher_node_join_rancher_url: "https://rancher.gem.mintfit.hamburg"
rancher_node_join_cluster_name: "gem-cluster-01"

Store rancher_node_join_api_token in Ansible Vault, automation runner secrets, or another secret backend.

For local development, copy .rancher.env.example to .rancher.env and set:

export RANCHER_API_TOKEN="token-xxxxx:yyyyy"

The token must come from a dedicated remote/external Rancher user with the smallest permissions needed to read the target cluster and its registration tokens. Do not generate it from the local Rancher admin account.
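
Before running a join, a quick shape check can catch an empty or obviously wrong token. The sketch below assumes API tokens look like the token-xxxxx:yyyyy example above; it checks the shape only and says nothing about whether Rancher accepts the token. The function name is hypothetical:

```shell
# Hypothetical pre-flight check: does the value look like "token-<name>:<secret>"?
# $1 = token to check (defaults to $RANCHER_API_TOKEN)
rancher_token_looks_valid() {
  local token="${1:-${RANCHER_API_TOKEN:-}}"
  case "$token" in
    token-?*:?*) return 0 ;;   # prefix, non-empty name, colon, non-empty secret
    *) return 1 ;;
  esac
}
```

A typical use is rancher_token_looks_valid || echo "check .rancher.env" after sourcing .rancher.env.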

4. Join The Node

Run the join playbook with an explicit --limit. Lifecycle playbooks refuse to run without a limit unless explicitly overridden.

ansible-playbook playbooks/rancher-node-join.yml \
  --limit gem-c01-w03 \
  -e target_cluster=gem_cluster_01 \
  -e rancher_node_role=worker

For control-plane nodes:

ansible-playbook playbooks/rancher-node-join.yml \
  --limit gem-c01-cp04 \
  -e target_cluster=gem_cluster_01 \
  -e rancher_node_roles=etcd,controlplane,worker

The playbook runs:

node_baseline
-> users
-> rke2_prep
-> rancher_node_join

rke2_prep performs profile preflight checks before Rancher registration. For gem_cluster_01, it currently blocks nodes that do not match the selected profile's known-good kernel pattern:

rke2_prep_allowed_kernel_regex: "{{ node_profile.kernel.allowed_regex }}"
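
To see in advance whether a host would pass this check, the same pattern can be tested by hand. The sketch below uses grep -E rather than Ansible's regex engine, which behaves the same for this simple pattern; the function name is hypothetical, and the pattern is the gem_cluster_01 value after YAML unescaping:

```shell
# Hypothetical preflight sketch: succeed only if the kernel release matches
# the allowed pattern.
# $1 = kernel release (for example from `uname -r`), $2 = extended regex
kernel_is_allowed() {
  local kernel="$1" pattern="$2"
  printf '%s\n' "$kernel" | grep -Eq -- "$pattern"
}

# In YAML the pattern is written with "\\."; the regex itself uses "\.".
allowed='^6\.8\.0-106-generic$'
if kernel_is_allowed "$(uname -r)" "$allowed"; then
  echo "kernel matches the selected profile"
else
  echo "kernel does not match; run playbooks/node-kernel.yml first"
fi
```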

5. Verify And Remove Manual Entry

After the node appears in Kubernetes:

make graph

Then remove the temporary host from inventory/05-manual.yml. The generated Kubernetes inventory owns the node from this point on.
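
Before removing the manual entry, it can help to confirm that the node is not only listed but also Ready. A small wrapper around kubectl wait is one way to do that; the function name is hypothetical and the 300s default timeout is an arbitrary choice:

```shell
# Hypothetical helper: block until the node reports Ready in the given context.
# $1 = kubectl context, $2 = node name, $3 = timeout (default 300s)
wait_for_node_ready() {
  local context="$1" node="$2" timeout="${3:-300s}"
  kubectl --context="$context" wait --for=condition=Ready "node/$node" --timeout="$timeout"
}
```

For example: wait_for_node_ready gem-cluster-01 gem-c01-w03. Note that kubectl wait fails immediately if the node object does not exist yet, so run it only after the node shows up in make graph.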

Reset A Node For Join Testing

To remove a joined node and test the join workflow again on the same host:

ansible-playbook playbooks/rancher-node-reset.yml \
  --limit gem-c01-w03 \
  -e target_cluster=gem_cluster_01 \
  -e rancher_node_reset_confirm=true

This playbook is destructive. It requires both --limit and rancher_node_reset_confirm=true.

It removes the Kubernetes node object when present, stops Rancher/RKE2/K3s services, runs vendor uninstall scripts when present, unmounts kubelet pod mounts, and deletes local Rancher/RKE2/K3s state.

Before deleting the Kubernetes node object, it checks the Rancher/CAPI management cluster for a matching Machine. A confirmed reset deletes the matching Machine first so Rancher is not left with NodeDeleted / NodeNotFound status. Prefer removing the node through Rancher first when this is not a reset test.

It does not remove normal baseline management state such as team users, SSH keys, MOTD, sysctl files, or installed baseline packages.

After reset:

make graph

Then add the host back to inventory/05-manual.yml if it is no longer present there, and run the New Node Lifecycle steps again.

Rancher Registration Helper

To inspect the command generated from Rancher API data:

scripts/rancher-registration-command.sh \
  --target-cluster gem_cluster_01 \
  --worker

With .rancher.env configured:

make rancher-registration-command

List visible Rancher clusters:

scripts/rancher-registration-command.sh --list-clusters

Use a cluster id when names are ambiguous:

scripts/rancher-registration-command.sh \
  --cluster-id c-m-xxxxx \
  --worker

Include explicit node addresses:

scripts/rancher-registration-command.sh \
  --target-cluster gem_cluster_01 \
  --worker \
  --address 178.105.200.132 \
  --internal-address 10.0.0.13

The printed command contains a node registration token. Treat it as sensitive.

Routine Operations

Always use --limit for mutating playbooks unless you intentionally want a larger scope.
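
For playbooks that do not enforce a limit themselves, a local shell wrapper can provide the same guard. This is a sketch of the idea, not something shipped in this repo; the function name and the ANSIBLE_PLAYBOOK_BIN override are hypothetical:

```shell
# Hypothetical wrapper: refuse to call ansible-playbook unless --limit or -l
# is present in the arguments.
run_limited_playbook() {
  local arg has_limit=no
  for arg in "$@"; do
    case "$arg" in
      --limit|--limit=*|-l|-l?*) has_limit=yes ;;
    esac
  done
  if [ "$has_limit" != "yes" ]; then
    echo "refusing to run without --limit" >&2
    return 64
  fi
  # ANSIBLE_PLAYBOOK_BIN is an assumption of this sketch; it allows a dry run
  # with another command such as echo.
  "${ANSIBLE_PLAYBOOK_BIN:-ansible-playbook}" "$@"
}
```

Setting ANSIBLE_PLAYBOOK_BIN=echo prints the arguments instead of running the playbook, which is handy for checking a command line before executing it.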

Read-only playbooks:

Playbook	Purpose
`playbooks/ping.yml`	connection test
`playbooks/system-info.yml`	OS, CPU, memory, uptime, kubelet, disk usage
`playbooks/node-config-report.yml`	print redacted node config and network report
`playbooks/diagnose-node.yml`	collect journals and system state into `./diagnostics/`

Mutating playbooks:

Playbook	Purpose
`playbooks/node-kernel.yml`	install/select approved kernel before node join
`playbooks/rancher-node-join.yml`	prepare and join a manual node through Rancher
`playbooks/rancher-node-reset.yml`	remove Rancher/RKE2/K3s state for rejoin testing
`playbooks/rke2-prep.yml`	RKE2 sysctl, preflight checks, etcd user, worker packages
`playbooks/users.yml`	manage team users and SSH keys
`playbooks/node-baseline.yml`	install baseline packages and MOTD
`playbooks/firewall.yml`	apply UFW baseline
`playbooks/unattended-upgrades.yml`	configure security-only unattended upgrades
`playbooks/apt-update.yml`	cordon, drain, apt upgrade, optional reboot, uncordon
`playbooks/rancher-upgrade.yml`	upgrade Rancher Manager Helm release
`playbooks/bootstrap-ansible-user.yml`	create the dedicated Ansible user

Ops and emergency playbooks:

Playbook	Purpose
`playbooks/network-static-ip.yml`	pin private interface via netplan
`playbooks/netplan-migrate-management.yml`	migrate management CP `enp7s0` from manual networkd to netplan
`playbooks/restart-systemd-networkd.yml`	restart `systemd-networkd`
`playbooks/restart-kubelet.yml`	restart `rke2-server`, `rke2-agent`, `k3s`, or `k3s-agent`
`playbooks/restart-containerd.yml`	restart containerd; this recycles pods

Examples:

ansible-playbook playbooks/system-info.yml --limit gem-c01-w03

ansible-playbook playbooks/node-config-report.yml \
  --limit gem-rancher-k3s-03

ansible-playbook playbooks/apt-update.yml \
  --limit gem_cluster_01_worker \
  -e auto_reboot=true

Use -e dist_upgrade=true only when you intentionally want apt dist-upgrade instead of a normal apt upgrade.

Use playbooks/node-config-report.yml when comparing node configuration. It prints live routing, networkd/netplan/resolver state, selected service status, OS facts, package versions, sysctl/module/firewall state, and selected config files from /etc/netplan, /etc/systemd/network, /run/systemd/network, /etc/cloud, /etc/rancher/k3s, and /etc/rancher/rke2. Token-like fields and kubeconfig certificate/key data are redacted by default.

The default stdout mode is a compact summary. To print every collected command and config file to the Ansible output:

ansible-playbook playbooks/node-config-report.yml \
  --limit gem-rancher-k3s-03 \
  -e node_config_stdout_mode=full

Report sections can be enabled or disabled independently. All are enabled by default:

ansible-playbook playbooks/node-config-report.yml \
  --limit gem-rancher-k3s-03 \
  -e node_config_section_os=false \
  -e node_config_section_packages=false \
  -e node_config_section_network=true

Available section vars are:

node_config_section_os
node_config_section_services
node_config_section_packages
node_config_section_network
node_config_section_firewall
node_config_section_kubernetes
node_config_section_config_files
node_config_section_journals
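
When only one or two sections are of interest, typing eight -e flags is tedious. The helper below is a hypothetical convenience that prints the flags for all section vars listed above, enabling only the sections named as arguments:

```shell
# Hypothetical helper: print -e arguments that enable only the named report
# sections and disable all others. Names are the suffixes of the vars above.
report_section_args() {
  local all="os services packages network firewall kubernetes config_files journals"
  local section wanted value
  for section in $all; do
    value=false
    for wanted in "$@"; do
      if [ "$section" = "$wanted" ]; then
        value=true
      fi
    done
    printf -- '-e node_config_section_%s=%s ' "$section" "$value"
  done
  printf '\n'
}
```

Use it with unquoted command substitution so the shell splits the flags into separate arguments: ansible-playbook playbooks/node-config-report.yml --limit gem-rancher-k3s-03 $(report_section_args network firewall).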

To also fetch the full report tarball, including recent service journals:

ansible-playbook playbooks/node-config-report.yml \
  --limit gem-rancher-k3s-03 \
  -e node_config_save_report=true

Saved reports are written to diagnostics/node-config/. Use -e node_config_include_sensitive=true only if you intentionally need raw values in the printed or saved report.

For the Rancher management control-plane netplan migration, keep Hetzner console access open. The playbook runs with serial: 1, drains one node, migrates its network config, waits for it to become Ready, uncordons it, and then continues with the next node:

ansible-playbook playbooks/netplan-migrate-management.yml --diff

The playbook backs up /etc/netplan/50-cloud-init.yaml and /etc/systemd/network/10-enp7s0.network, writes /etc/netplan/60-enp7s0-static.yaml, restarts systemd-networkd, and verifies that enp7s0 is using /run/systemd/network/10-netplan-enp7s0.network with the 10.0.0.1 default gateway. To test only one node, add --limit gem-rancher-k3s-01. To skip Kubernetes drain, pass -e netplan_drain_node=false.

Make Targets

Run make help to list available targets.

Common targets:

Target	Purpose
`make configure`	write local `config.yml`
`make doctor`	sanity-check local tools and config
`make graph`	print the inventory tree (queries kubectl)
`make ping`	smoke-test Ansible connectivity
`make node-kernel`	run `playbooks/node-kernel.yml`
`make rancher-upgrade`	run `playbooks/rancher-upgrade.yml`
`make rancher-node-join`	run `playbooks/rancher-node-join.yml`
`make rancher-node-reset`	run `playbooks/rancher-node-reset.yml`
`make rancher-registration-command`	print generated Rancher registration command

The Make targets are convenient shortcuts for routine default runs. For node lifecycle operations, use direct ansible-playbook commands so required values such as --limit, target_cluster, and confirmation flags are explicit in the command history.

Repository Layout

ansible/
|-- README.md
|-- GETTING_STARTED.md
|-- Makefile                    # operator shortcuts
|-- ansible.cfg                 # inventory, callbacks, fact cache, role paths
|-- requirements.yml            # Ansible collection requirements
|-- config.yml.example          # template written to config.yml by make configure
|-- tag-names.yml               # allowed inventory tag groups
|-- .rancher.env.example        # local Rancher API token template
|-- inventory/
|   |-- 00-static.yml           # static group hierarchy and bastions
|   |-- 05-manual.yml           # temporary/fresh hosts before dynamic inventory owns them
|   |-- k8s-nodes.sh            # dynamic inventory: queries `kubectl get nodes`
|   `-- group_vars/             # global, manual, cluster, and role-group vars
|-- playbooks/
|   |-- apt-update.yml
|   |-- bootstrap-ansible-user.yml
|   |-- diagnose-node.yml
|   |-- firewall.yml
|   |-- network-static-ip.yml
|   |-- netplan-migrate-management.yml
|   |-- node-baseline.yml
|   |-- node-config-report.yml
|   |-- node-kernel.yml
|   |-- ping.yml
|   |-- rancher-upgrade.yml
|   |-- rancher-node-join.yml
|   |-- rancher-node-reset.yml
|   |-- restart-containerd.yml
|   |-- restart-kubelet.yml
|   |-- restart-systemd-networkd.yml
|   |-- rke2-prep.yml
|   |-- system-info.yml
|   |-- unattended-upgrades.yml
|   |-- users.yml
|   `-- templates/
|-- roles/
|   |-- firewall/
|   |-- node_baseline/
|   |-- node_kernel/
|   |-- rancher_node_join/
|   |-- rancher_node_reset/
|   |-- rke2_prep/
|   |-- unattended_upgrades/
|   `-- users/
|-- files/
|   |-- README.md
|   `-- keys/                   # managed users' public SSH keys
`-- scripts/
    |-- configure.sh
    |-- doctor.sh
    |-- install-deps.sh
    |-- lib-config.sh
    `-- rancher-registration-command.sh

Troubleshooting

Run make doctor first. It checks the common local setup problems.

Symptom	Likely cause	Fix
`Connection timed out` to private IP	missing bastion config while in `local` mode	check `bastion_inventory_name` and the bastion public IP
`Permission denied (publickey)`	wrong SSH user or key	check `config.yml` and key permissions
`BECOME password:` prompt hangs	user has no passwordless sudo	run with `-K` or run `make bootstrap-user`
`context X unreachable or lacks get nodes permission`	kubeconfig context is wrong or lacks RBAC	verify with `kubectl --context=X get nodes`
`inventory/group_vars/<slug>.yml does not exist`	context slug has no cluster vars file	create the file or rename the kube context
Rancher API returns `401`	wrong token type, expired token, or bad `.rancher.env`	use a Rancher API token, not a node registration token
join waits forever for Kubernetes node	Rancher agent installed but RKE2/CNI is unhealthy	check `rancher-system-agent`, `rke2-agent`, and CNI pod logs
preflight rejects kernel	node does not match cluster-approved kernel	run `playbooks/node-kernel.yml` or update the approved kernel after testing

Useful one-node test:

ansible-playbook playbooks/system-info.yml \
  --limit gem-c01-w03 \
  --check \
  --diff

Inspect resolved inventory:

ansible-inventory --host gem-c01-w03
ansible-inventory --graph