Managing Kubernetes Nodes with Ansible
Repeatable node lifecycle automation for Rancher-managed Kubernetes clusters.
This repository provides Ansible automation for administering nodes in Rancher-managed Kubernetes clusters. Its goal is to make node operations repeatable: discover existing cluster nodes, prepare new servers, join them to a cluster, and run routine maintenance through playbooks instead of one-off shell commands.
Scope
Use this repo to manage the node lifecycle around Kubernetes:
- discover existing Kubernetes nodes and build Ansible inventory
- configure baseline packages, users, firewall, and OS prerequisites
- prepare a new node before it joins a cluster
- fetch Rancher registration data through the Rancher API
- execute the Rancher node registration command on the target host
- run routine node operations such as package updates and service restarts
Rancher remains the source of truth for Kubernetes cluster membership. Ansible prepares nodes and generates the Rancher registration command from the Rancher API; it does not create clusters or bypass Rancher's RKE2 join flow.
Use this repo when you want repeatable node operations. Do not use it to create Kubernetes clusters, bypass Rancher, or replace Rancher's cluster management model.
Start with GETTING_STARTED.md if this is a fresh clone.
Table Of Contents
[[TOC]]
Operator Workflow
For most work, the flow is:
configure local access
-> refresh inventory from Kubernetes
-> choose a target with --limit
-> run the needed playbook
-> refresh inventory again if cluster membership changed
Basic setup:
make configure
make doctor
make graph
make ping
Per-user configuration is stored in config.yml, which is gitignored and
auto-loaded by Ansible and the helper scripts. You do not need to source it.
Requirements
The runner needs:
- ansible and ansible-playbook
- kubectl
- kubeconfig contexts for the clusters you want to manage
- SSH access to the managed nodes
- for Rancher node joins, a Rancher API token from a dedicated remote/external Rancher user
Managed nodes do not need Ansible installed; they need SSH access and Python for Ansible modules.
Do not use the local Rancher admin account for automation tokens.
Install repo dependencies:
make install
make install installs Ansible, Ansible collections, and local helper
dependencies. It does not install kubectl; install kubectl separately from
the Kubernetes project documentation or your package manager.
Inventory Model
Inventory has two sources:
- static/manual inventory in inventory/*.yml
- dynamic Kubernetes inventory from inventory/k8s-nodes.sh
inventory/k8s-nodes.sh is an Ansible dynamic-inventory script. On every
Ansible run it calls kubectl get nodes for each context configured in
KUBE_CONTEXTS. There is no cached file to refresh: a node that is added or
removed shows up on the next playbook invocation.
Context names become Ansible group names by replacing - with _:
kubectl context: gem-cluster-01
Ansible group: gem_cluster_01
group vars: inventory/group_vars/gem_cluster_01.yml
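The naming rule above can be sketched in a few lines of POSIX shell. The function names `context_to_group` and `group_vars_path` are hypothetical and are not taken from inventory/k8s-nodes.sh; they only illustrate the mapping.

```shell
# Hypothetical helper: map a kubectl context name to its Ansible group name
# by replacing every "-" with "_".
context_to_group() {
    printf '%s\n' "$1" | tr '-' '_'
}

# Hypothetical helper: derive the group_vars file path for a context.
group_vars_path() {
    printf 'inventory/group_vars/%s.yml\n' "$(context_to_group "$1")"
}

context_to_group gem-cluster-01   # prints gem_cluster_01
group_vars_path gem-cluster-01    # prints inventory/group_vars/gem_cluster_01.yml
```

Because the group name is derived from the context name, renaming a kube context also changes which group_vars file applies to the cluster.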
Per Kubernetes node, the generated inventory stores:
- private_ipv4: Kubernetes InternalIP
- public_ipv4: Kubernetes ExternalIP, or the mintfit.io/public-ip annotation
- kubernetes_version: kubelet version
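As a rough illustration of the public address fallback, the sketch below picks between the two sources. It assumes the ExternalIP takes precedence and the annotation is used only when no ExternalIP is reported; the function name `pick_public_ipv4` is hypothetical and the real script may resolve this differently.

```shell
# Hypothetical helper: choose the value stored as public_ipv4.
# $1 is the node's ExternalIP (may be empty),
# $2 is the value of the mintfit.io/public-ip annotation (may be empty).
# Assumption: ExternalIP wins when both are present.
pick_public_ipv4() {
    external_ip=$1
    annotation_ip=$2
    if [ -n "$external_ip" ]; then
        printf '%s\n' "$external_ip"
    else
        printf '%s\n' "$annotation_ip"
    fi
}

pick_public_ipv4 "" 203.0.113.20   # prints 203.0.113.20
```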
The main group hierarchy is:
all
|-- bastion
|-- cluster_nodes
| |-- gem_cluster_01
| | |-- gem_cluster_01_controlplane
| | `-- gem_cluster_01_worker
| `-- gem_mgmt
`-- manual
    |-- manual_bastion
    |-- manual_controlplane
    `-- manual_worker
Tag groups such as bare_metal, cloud_instance, privileged, and
unprivileged are built from per-tag Kubernetes labels:
kubectl --context=gem-cluster-01 label node <node-name> \
mintfit.io/group.bare_metal=true mintfit.io/group.unprivileged=true
The next ansible-playbook (or make graph) invocation picks the new labels
up automatically. Side benefit: kubectl get nodes -l mintfit.io/group.bare_metal=true
filters directly.
Allowed tag names live in tag-names.yml. Labels with an unlisted tag are ignored with a warning when the dynamic inventory runs.
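The allow-list behaviour can be sketched as follows. This is a minimal illustration, not the code in inventory/k8s-nodes.sh: the function name `label_to_tag_group` is hypothetical, and the allowed names are passed as a plain space-separated string instead of being read from tag-names.yml.

```shell
# Hypothetical helper: turn a node label key into a tag group name.
# $1 is the label key, $2 is a space-separated list of allowed tag names.
# Prints the group name on success. Returns 1 for labels that are not tag
# labels, and warns on stderr before returning 1 for unlisted tags.
label_to_tag_group() {
    label_key=$1
    allowed=$2
    case $label_key in
        mintfit.io/group.*) tag=${label_key#mintfit.io/group.} ;;
        *) return 1 ;;
    esac
    for name in $allowed; do
        if [ "$name" = "$tag" ]; then
            printf '%s\n' "$tag"
            return 0
        fi
    done
    printf 'warning: ignoring unlisted tag "%s"\n' "$tag" >&2
    return 1
}

label_to_tag_group mintfit.io/group.bare_metal "bare_metal cloud_instance"   # prints bare_metal
```

Keeping the allow-list separate from the labels means a typo in a label produces a warning instead of silently creating a new inventory group.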
Connection Model
The same inventory works from a local workstation and from an automation runner inside the cluster.
| network_location | Address used by Ansible | ProxyJump |
|---|---|---|
| local | Bastion via public IP; other nodes via private IP | yes, through the cluster bastion |
| cluster | private IP for every node | no |
Set network_location with make configure, or override per run:
ansible-playbook playbooks/ping.yml -e network_location=cluster
Each cluster defines its bastion in inventory/group_vars/<cluster>.yml with
bastion_inventory_name.
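The connection rules above can be expressed as a small decision function. This is a sketch under assumptions: the function name `connection_for` and its argument order are hypothetical, and it assumes the bastion itself is reached directly, without ProxyJump, in local mode.

```shell
# Hypothetical helper mirroring the connection rules.
# $1 network_location (local or cluster)
# $2 "yes" if the host is the cluster bastion, anything else otherwise
# $3 public IP, $4 private IP
# Prints "<address> <proxyjump>" where proxyjump is yes or no.
connection_for() {
    location=$1
    is_bastion=$2
    public_ip=$3
    private_ip=$4
    case $location in
        cluster)
            printf '%s no\n' "$private_ip"
            ;;
        local)
            if [ "$is_bastion" = "yes" ]; then
                # Assumption: the bastion is contacted directly.
                printf '%s no\n' "$public_ip"
            else
                printf '%s yes\n' "$private_ip"
            fi
            ;;
        *)
            printf 'unknown network_location: %s\n' "$location" >&2
            return 1
            ;;
    esac
}

connection_for local no 203.0.113.11 10.0.0.13   # prints 10.0.0.13 yes
```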
New Node Lifecycle
Use this workflow for a fresh server that should join a Rancher-managed cluster.
1. Add The Node Manually
Add the new host to inventory/05-manual.yml under
manual_worker or manual_controlplane.
Manual hosts connect as root through their public IP and do not use the cluster bastion.
Example:
manual:
  children:
    manual_worker:
      hosts:
        gem-c01-w03:
          public_ipv4: "178.105.200.132"
          private_ipv4: "10.0.0.13"
2. Prepare The Selected Node Profile
Each cluster defines one or more node profiles in
inventory/group_vars/<cluster>.yml. A node profile is the supported node
contract for that cluster: OS family/version, kernel policy, kernel modules,
and required node packages.
The selected profile is controlled by node_profile_name:
node_profiles:
  ubuntu_24_04_rke2_calico_longhorn:
    os:
      distribution: Ubuntu
      major_version: "24"
    kernel:
      install_version: "6.8.0-106-generic"
      allowed_regex: "^6\\.8\\.0-106-generic$"
    required_kernel_modules:
      - br_netfilter
      - overlay
    required_packages: []
node_profile_name: ubuntu_24_04_rke2_calico_longhorn
node_profile: "{{ node_profiles[node_profile_name] }}"
Roles still consume simple compatibility variables. The cluster vars derive those from the selected profile:
node_kernel_version: "{{ node_profile.kernel.install_version }}"
rke2_prep_allowed_distribution: "{{ node_profile.os.distribution }}"
rke2_prep_allowed_distribution_major_version: "{{ node_profile.os.major_version }}"
rke2_prep_allowed_kernel_regex: "{{ node_profile.kernel.allowed_regex }}"
rke2_prep_required_kernel_modules: "{{ node_profile.required_kernel_modules | default([]) }}"
rke2_prep_required_packages: "{{ node_profile.required_packages | default([]) }}"
If the node does not run the selected profile's kernel, run:
ansible-playbook playbooks/node-kernel.yml \
--limit gem-c01-w03 \
-e target_cluster=gem_cluster_01 \
-e node_kernel_auto_reboot=true
For gem_cluster_01, the current selected profile installs and allows:
node_profile_name: ubuntu_24_04_rke2_calico_longhorn
node_profile:
  kernel:
    install_version: "6.8.0-106-generic"
    allowed_regex: "^6\\.8\\.0-106-generic$"
Package holds are disabled by default. Enable them only when you intentionally want to freeze kernel packages:
ansible-playbook playbooks/node-kernel.yml \
--limit gem-c01-w03 \
-e target_cluster=gem_cluster_01 \
-e node_kernel_auto_reboot=true \
-e node_kernel_hold=true
3. Configure Rancher API Access
The join playbook normally uses the Rancher API to fetch the cluster registration token and build the registration command itself.
In inventory/group_vars/<cluster>.yml:
rancher_node_join_rancher_url: "https://rancher.gem.mintfit.hamburg"
rancher_node_join_cluster_name: "gem-cluster-01"
Store rancher_node_join_api_token in Ansible Vault, automation runner secrets,
or another secret backend.
For local development, copy .rancher.env.example to
.rancher.env and set:
export RANCHER_API_TOKEN="token-xxxxx:yyyyy"
The token must come from a dedicated remote/external Rancher user with the smallest permissions needed to read the target cluster and its registration tokens. Do not generate it from the local Rancher admin account.
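A quick shape check can catch an obviously wrong value before the playbook calls the Rancher API. The sketch below assumes that API tokens follow the `token-<id>:<secret>` shape shown in the example above; the function name `looks_like_api_token` is hypothetical, and passing the check says nothing about whether Rancher will accept the token.

```shell
# Hypothetical pre-flight check for the value exported in .rancher.env.
# Assumption: an API token looks like "token-<id>:<secret>".
# Returns 0 when the shape matches, 1 otherwise.
looks_like_api_token() {
    case $1 in
        token-?*:?*) return 0 ;;
        *) return 1 ;;
    esac
}

if looks_like_api_token "token-abcde:examplesecret"; then
    echo "token shape looks plausible"
fi
```

A check like this mainly helps when a node registration token was pasted by mistake, since such a value would not have this shape under the stated assumption.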
4. Join The Node
Run the join playbook with an explicit --limit. Lifecycle playbooks refuse to
run without a limit unless explicitly overridden.
ansible-playbook playbooks/rancher-node-join.yml \
--limit gem-c01-w03 \
-e target_cluster=gem_cluster_01 \
-e rancher_node_role=worker
For control-plane nodes:
ansible-playbook playbooks/rancher-node-join.yml \
--limit gem-c01-cp04 \
-e target_cluster=gem_cluster_01 \
-e rancher_node_roles=etcd,controlplane,worker
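To illustrate how a comma-separated role list could translate into registration flags, here is a minimal sketch. It assumes each role maps to a flag of the same name, in line with the `--worker` flag used by scripts/rancher-registration-command.sh; the function name `roles_to_flags` is hypothetical and the role itself may build the command differently.

```shell
# Hypothetical helper: convert "etcd,controlplane,worker" into flags.
# Assumption: each role maps to a flag with the same name.
roles_to_flags() {
    flags=""
    old_ifs=$IFS
    IFS=,
    for role in $1; do
        case $role in
            etcd|controlplane|worker)
                flags="$flags --$role"
                ;;
            *)
                IFS=$old_ifs
                printf 'unknown role: %s\n' "$role" >&2
                return 1
                ;;
        esac
    done
    IFS=$old_ifs
    # Drop the leading space left by the first append.
    printf '%s\n' "${flags# }"
}

roles_to_flags etcd,controlplane,worker   # prints --etcd --controlplane --worker
```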
The playbook runs:
node_baseline
-> users
-> rke2_prep
-> rancher_node_join
rke2_prep performs profile preflight checks before Rancher registration. For
gem_cluster_01, it currently blocks nodes that do not match the selected
profile's known-good kernel pattern:
rke2_prep_allowed_kernel_regex: "{{ node_profile.kernel.allowed_regex }}"
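The kernel preflight boils down to matching the running kernel release against the profile's regular expression. The sketch below reproduces that idea locally with `grep -E`, which stands in for whatever regex engine the role uses; the function name `kernel_allowed` is hypothetical.

```shell
# Hypothetical preflight check: does a kernel release match the allowed regex?
# $1 is the kernel release, for example the output of `uname -r`.
# $2 is an extended regular expression.
# Returns 0 on a match, non-zero otherwise.
kernel_allowed() {
    printf '%s\n' "$1" | grep -Eq -- "$2"
}

# The YAML value "^6\\.8\\.0-106-generic$" is double-quoted, so each "\\"
# is a single backslash once parsed. In single shell quotes that is:
if kernel_allowed "6.8.0-106-generic" '^6\.8\.0-106-generic$'; then
    echo "kernel matches the selected profile"
fi
```

Running `kernel_allowed "$(uname -r)" '^6\.8\.0-106-generic$'` on a candidate node gives a quick indication of whether playbooks/node-kernel.yml is needed before the join.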
5. Verify And Remove Manual Entry
After the node appears in Kubernetes:
make graph
Then remove the temporary host from inventory/05-manual.yml. The generated
Kubernetes inventory owns the node from this point on.
Reset A Node For Join Testing
To remove a joined node and test the join workflow again on the same host:
ansible-playbook playbooks/rancher-node-reset.yml \
--limit gem-c01-w03 \
-e target_cluster=gem_cluster_01 \
-e rancher_node_reset_confirm=true
This playbook is destructive. It requires both --limit and
rancher_node_reset_confirm=true.
It removes the Kubernetes node object when present, stops Rancher/RKE2/K3s services, runs vendor uninstall scripts when present, unmounts kubelet pod mounts, and deletes local Rancher/RKE2/K3s state.
Before deleting the Kubernetes node object, it checks the Rancher/CAPI
management cluster for a matching Machine. A confirmed reset deletes the
matching Machine first so Rancher is not left with NodeDeleted /
NodeNotFound status. Prefer removing the node through Rancher first when this
is not a reset test.
It does not remove normal baseline management state such as team users, SSH keys, MOTD, sysctl files, or installed baseline packages.
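The two safety requirements of the reset playbook, an explicit --limit and rancher_node_reset_confirm=true, can also be checked in a wrapper before Ansible starts. This is only a sketch of such a guard: the function name `reset_args_ok` is hypothetical, and the playbook's own checks remain the authority.

```shell
# Hypothetical wrapper-side guard for the reset playbook arguments.
# Succeeds only when the arguments contain --limit with a value and
# rancher_node_reset_confirm=true (passed after -e).
reset_args_ok() {
    has_limit=no
    has_confirm=no
    while [ $# -gt 0 ]; do
        case $1 in
            --limit)
                if [ -n "${2:-}" ]; then has_limit=yes; fi
                ;;
            --limit=?*)
                has_limit=yes
                ;;
            rancher_node_reset_confirm=true)
                has_confirm=yes
                ;;
        esac
        shift
    done
    [ "$has_limit" = yes ] && [ "$has_confirm" = yes ]
}

if reset_args_ok --limit gem-c01-w03 -e rancher_node_reset_confirm=true; then
    echo "arguments look complete"
fi
```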
After reset:
make graph
Then add the host back to inventory/05-manual.yml if it is no longer present
there and run the normal node lifecycle again.
Rancher Registration Helper
To inspect the command generated from Rancher API data:
scripts/rancher-registration-command.sh \
--target-cluster gem_cluster_01 \
--worker
With .rancher.env configured:
make rancher-registration-command
List visible Rancher clusters:
scripts/rancher-registration-command.sh --list-clusters
Use a cluster id when names are ambiguous:
scripts/rancher-registration-command.sh \
--cluster-id c-m-xxxxx \
--worker
Include explicit node addresses:
scripts/rancher-registration-command.sh \
--target-cluster gem_cluster_01 \
--worker \
--address 178.105.200.132 \
--internal-address 10.0.0.13
The printed command contains a node registration token. Treat it as sensitive.
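If the command has to be pasted into a ticket or a log, the token can be masked first. The sketch below assumes the token appears as `--token <value>` or `--token=<value>` in the printed command; the function name `redact_registration_token` and the sample command are illustrative only.

```shell
# Hypothetical helper: mask the value that follows --token.
# Reads the command on stdin and writes the redacted command to stdout.
# Assumption: the token appears as "--token <value>" or "--token=<value>".
redact_registration_token() {
    sed 's/\(--token[ =]\)[^ ][^ ]*/\1[redacted]/g'
}

# Illustrative input, not real output from the helper script.
printf '%s\n' 'sh -s - --server https://rancher.example --token abc123 --worker' \
    | redact_registration_token
# prints: sh -s - --server https://rancher.example --token [redacted] --worker
```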
Routine Operations
Always use --limit for mutating playbooks unless you intentionally want a
larger scope.
Read-only playbooks:
| Playbook | Purpose |
|---|---|
| playbooks/ping.yml | connection test |
| playbooks/system-info.yml | OS, CPU, memory, uptime, kubelet, disk usage |
| playbooks/node-config-report.yml | print redacted node config and network report |
| playbooks/diagnose-node.yml | collect journals and system state into ./diagnostics/ |
Mutating playbooks:
| Playbook | Purpose |
|---|---|
| playbooks/node-kernel.yml | install/select approved kernel before node join |
| playbooks/rancher-node-join.yml | prepare and join a manual node through Rancher |
| playbooks/rancher-node-reset.yml | remove Rancher/RKE2/K3s state for rejoin testing |
| playbooks/rke2-prep.yml | RKE2 sysctl, preflight checks, etcd user, worker packages |
| playbooks/users.yml | manage team users and SSH keys |
| playbooks/node-baseline.yml | install baseline packages and MOTD |
| playbooks/firewall.yml | apply UFW baseline |
| playbooks/unattended-upgrades.yml | configure security-only unattended upgrades |
| playbooks/apt-update.yml | cordon, drain, apt upgrade, optional reboot, uncordon |
| playbooks/rancher-upgrade.yml | upgrade Rancher Manager Helm release |
| playbooks/bootstrap-ansible-user.yml | create the dedicated Ansible user |
Ops and emergency playbooks:
| Playbook | Purpose |
|---|---|
| playbooks/network-static-ip.yml | pin private interface via netplan |
| playbooks/netplan-migrate-management.yml | migrate management CP enp7s0 from manual networkd to netplan |
| playbooks/restart-systemd-networkd.yml | restart systemd-networkd |
| playbooks/restart-kubelet.yml | restart rke2-server, rke2-agent, k3s, or k3s-agent |
| playbooks/restart-containerd.yml | restart containerd; this recycles pods |
Examples:
ansible-playbook playbooks/system-info.yml --limit gem-c01-w03
ansible-playbook playbooks/node-config-report.yml \
--limit gem-rancher-k3s-03
ansible-playbook playbooks/apt-update.yml \
--limit gem_cluster_01_worker \
-e auto_reboot=true
Use -e dist_upgrade=true only when you intentionally want apt dist-upgrade
instead of a normal apt upgrade.
Use playbooks/node-config-report.yml when comparing node configuration. It
prints live routing, networkd/netplan/resolver state, selected service status,
OS facts, package versions, sysctl/module/firewall state, and selected config
files from /etc/netplan, /etc/systemd/network, /run/systemd/network,
/etc/cloud, /etc/rancher/k3s, and /etc/rancher/rke2. Token-like fields
and kubeconfig certificate/key data are redacted by default.
The default stdout mode is a compact summary. To print every collected command and config file to the Ansible output:
ansible-playbook playbooks/node-config-report.yml \
--limit gem-rancher-k3s-03 \
-e node_config_stdout_mode=full
Report sections can be enabled or disabled independently. All are enabled by default:
ansible-playbook playbooks/node-config-report.yml \
--limit gem-rancher-k3s-03 \
-e node_config_section_os=false \
-e node_config_section_packages=false \
-e node_config_section_network=true
Available section vars are:
node_config_section_os
node_config_section_services
node_config_section_packages
node_config_section_network
node_config_section_firewall
node_config_section_kubernetes
node_config_section_config_files
node_config_section_journals
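When several sections are toggled at once, the -e arguments can be generated from short name=value pairs. This is a convenience sketch only: the function name `section_args` is hypothetical and is not part of the repository.

```shell
# Hypothetical helper: expand "name=value" pairs into -e arguments for
# playbooks/node-config-report.yml. Only the section names listed above
# are accepted; anything else is rejected with a message on stderr.
section_args() {
    args=""
    for pair in "$@"; do
        case $pair in
            os=*|services=*|packages=*|network=*|firewall=*|kubernetes=*|config_files=*|journals=*)
                args="$args -e node_config_section_$pair"
                ;;
            *)
                printf 'unknown section: %s\n' "$pair" >&2
                return 1
                ;;
        esac
    done
    # Drop the leading space left by the first append.
    printf '%s\n' "${args# }"
}

section_args os=false packages=false
# prints: -e node_config_section_os=false -e node_config_section_packages=false
```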
To also fetch the full report tarball, including recent service journals:
ansible-playbook playbooks/node-config-report.yml \
--limit gem-rancher-k3s-03 \
-e node_config_save_report=true
Saved reports are written to diagnostics/node-config/. Use
-e node_config_include_sensitive=true only if you intentionally need raw
values in the printed or saved report.
For the Rancher management control-plane netplan migration, keep Hetzner
console access open. The playbook runs with serial: 1, drains one node,
migrates its network config, waits for it to become Ready, uncordons it, and
then continues with the next node:
ansible-playbook playbooks/netplan-migrate-management.yml --diff
The playbook backs up /etc/netplan/50-cloud-init.yaml and
/etc/systemd/network/10-enp7s0.network, writes
/etc/netplan/60-enp7s0-static.yaml, restarts systemd-networkd, and verifies
that enp7s0 is using /run/systemd/network/10-netplan-enp7s0.network with
the 10.0.0.1 default gateway. To test only one node, add
--limit gem-rancher-k3s-01. To skip Kubernetes drain, pass
-e netplan_drain_node=false.
Make Targets
Run make help to list available targets.
Common targets:
| Target | Purpose |
|---|---|
| make configure | write local config.yml |
| make doctor | sanity-check local tools and config |
| make graph | print the inventory tree (queries kubectl) |
| make ping | smoke-test Ansible connectivity |
| make node-kernel | run playbooks/node-kernel.yml |
| make rancher-upgrade | run playbooks/rancher-upgrade.yml |
| make rancher-node-join | run playbooks/rancher-node-join.yml |
| make rancher-node-reset | run playbooks/rancher-node-reset.yml |
| make rancher-registration-command | print generated Rancher registration command |
The Make targets are convenient shortcuts for routine default runs. For node
lifecycle operations, use direct ansible-playbook commands so required values
such as --limit, target_cluster, and confirmation flags are explicit in the
command history.
Repository Layout
ansible/
|-- README.md
|-- GETTING_STARTED.md
|-- Makefile # operator shortcuts
|-- ansible.cfg # inventory, callbacks, fact cache, role paths
|-- requirements.yml # Ansible collection requirements
|-- config.yml.example # template written to config.yml by make configure
|-- tag-names.yml # allowed inventory tag groups
|-- .rancher.env.example # local Rancher API token template
|-- inventory/
| |-- 00-static.yml # static group hierarchy and bastions
| |-- 05-manual.yml # temporary/fresh hosts before dynamic inventory owns them
| |-- k8s-nodes.sh # dynamic inventory: queries `kubectl get nodes`
| `-- group_vars/ # global, manual, cluster, and role-group vars
|-- playbooks/
| |-- apt-update.yml
| |-- bootstrap-ansible-user.yml
| |-- diagnose-node.yml
| |-- firewall.yml
| |-- network-static-ip.yml
| |-- netplan-migrate-management.yml
| |-- node-baseline.yml
| |-- node-config-report.yml
| |-- node-kernel.yml
| |-- ping.yml
| |-- rancher-upgrade.yml
| |-- rancher-node-join.yml
| |-- rancher-node-reset.yml
| |-- restart-containerd.yml
| |-- restart-kubelet.yml
| |-- restart-systemd-networkd.yml
| |-- rke2-prep.yml
| |-- system-info.yml
| |-- unattended-upgrades.yml
| |-- users.yml
| `-- templates/
|-- roles/
| |-- firewall/
| |-- node_baseline/
| |-- node_kernel/
| |-- rancher_node_join/
| |-- rancher_node_reset/
| |-- rke2_prep/
| |-- unattended_upgrades/
| `-- users/
|-- files/
| |-- README.md
| `-- keys/ # managed users' public SSH keys
`-- scripts/
    |-- configure.sh
    |-- doctor.sh
    |-- install-deps.sh
    |-- lib-config.sh
    `-- rancher-registration-command.sh
Troubleshooting
Run make doctor first. It checks the common local setup problems.
| Symptom | Likely cause | Fix |
|---|---|---|
| Connection timed out to private IP | missing bastion config while in local mode | check bastion_inventory_name and the bastion public IP |
| Permission denied (publickey) | wrong SSH user or key | check config.yml and key permissions |
| BECOME password: prompt hangs | user has no passwordless sudo | run with -K or run make bootstrap-user |
| context X unreachable or lacks get nodes permission | kubeconfig context is wrong or lacks RBAC | verify with kubectl --context=X get nodes |
| inventory/group_vars/<slug>.yml does not exist | context slug has no cluster vars file | create the file or rename the kube context |
| Rancher API returns 401 | wrong token type, expired token, or bad .rancher.env | use a Rancher API token, not a node registration token |
| join waits forever for Kubernetes node | Rancher agent installed but RKE2/CNI is unhealthy | check rancher-system-agent, rke2-agent, and CNI pod logs |
| preflight rejects kernel | node does not match cluster-approved kernel | run playbooks/node-kernel.yml or update the approved kernel after testing |
Useful one-node test:
ansible-playbook playbooks/system-info.yml \
--limit gem-c01-w03 \
--check \
--diff
Inspect resolved inventory:
ansible-inventory --host gem-c01-w03
ansible-inventory --graph