Managing Kubernetes Nodes with Ansible

Repeatable node lifecycle automation for Rancher-managed Kubernetes clusters.

This repository provides Ansible automation for administering nodes in Rancher-managed Kubernetes clusters. Its goal is to make node operations repeatable: discover existing cluster nodes, prepare new servers, join them to a cluster, and run routine maintenance through playbooks instead of one-off shell commands.

Scope

Use this repo to manage the node lifecycle around Kubernetes:

  • discover existing Kubernetes nodes and build Ansible inventory
  • configure baseline packages, users, firewall, and OS prerequisites
  • prepare a new node before it joins a cluster
  • fetch Rancher registration data through the Rancher API
  • execute the Rancher node registration command on the target host
  • run routine node operations such as package updates and service restarts

Rancher remains the source of truth for Kubernetes cluster membership. Ansible prepares nodes and generates the Rancher registration command from the Rancher API; it does not create clusters or bypass Rancher's RKE2 join flow.

Use this repo when you want repeatable node operations. Do not use it to create Kubernetes clusters, bypass Rancher, or replace Rancher's cluster management model.

Start with GETTING_STARTED.md if this is a fresh clone.

Table Of Contents

[[TOC]]

Operator Workflow

For most work, the flow is:

configure local access
-> refresh inventory from Kubernetes
-> choose a target with --limit
-> run the needed playbook
-> refresh inventory again if cluster membership changed

Basic setup:

make configure
make doctor
make graph
make ping

Per-user configuration is stored in config.yml, which is gitignored and auto-loaded by Ansible and the helper scripts. You do not need to source it.

Requirements

The runner needs:

  • ansible and ansible-playbook
  • kubectl
  • kubeconfig contexts for the clusters you want to manage
  • SSH access to the managed nodes
  • for Rancher node joins, a Rancher API token from a dedicated remote/external Rancher user

Managed nodes do not need Ansible installed; they need SSH access and Python for Ansible modules.

Do not use the local Rancher admin account for automation tokens.

Install repo dependencies:

make install

make install installs Ansible, Ansible collections, and local helper dependencies. It does not install kubectl; install kubectl separately from the Kubernetes project documentation or your package manager.

Inventory Model

Inventory has two sources:

  • static/manual inventory in inventory/*.yml
  • dynamic Kubernetes inventory from inventory/k8s-nodes.sh

inventory/k8s-nodes.sh is an Ansible dynamic-inventory script. On every Ansible run it calls kubectl get nodes for each context configured in KUBE_CONTEXTS. There is no cached file to refresh: a node that was added or removed shows up on the next playbook invocation.

Context names become Ansible group names by replacing - with _:

kubectl context: gem-cluster-01
Ansible group:   gem_cluster_01
group vars:      inventory/group_vars/gem_cluster_01.yml
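
The naming rule above can be sketched in a few lines of POSIX shell. The function names context_to_group and group_vars_path are hypothetical and are not taken from inventory/k8s-nodes.sh; they only illustrate the mapping.

```shell
# Hypothetical helper: turn a kubectl context name into an Ansible group name
# by replacing every "-" with "_".
context_to_group() {
  printf '%s\n' "$1" | tr '-' '_'
}

# Hypothetical helper: path of the cluster vars file that belongs to a context.
group_vars_path() {
  printf 'inventory/group_vars/%s.yml\n' "$(context_to_group "$1")"
}

context_to_group gem-cluster-01
group_vars_path gem-cluster-01
```

A context whose name already contains no dashes, such as gem_mgmt, maps to itself.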

Per Kubernetes node, the generated inventory stores:

  • private_ipv4: Kubernetes InternalIP
  • public_ipv4: Kubernetes ExternalIP, or mintfit.io/public-ip annotation
  • kubernetes_version: kubelet version
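
As a rough illustration of how these host variables reach Ansible, the sketch below prints the JSON shape that Ansible expects from an inventory script (groups with hosts, plus _meta.hostvars). The helpers emit_inventory and pick_public_ip are hypothetical, the version string is a placeholder, and the ExternalIP-before-annotation precedence is an assumption read from the list above; the real script may build its output differently.

```shell
# Hypothetical helper: prefer the ExternalIP ($1), fall back to the
# mintfit.io/public-ip annotation value ($2) when ExternalIP is empty.
pick_public_ip() {
  printf '%s\n' "${1:-$2}"
}

# Hypothetical helper: print a one-host inventory in Ansible's script format.
# Usage: emit_inventory GROUP HOST PRIVATE_IP PUBLIC_IP KUBELET_VERSION
emit_inventory() {
  group=$1 host=$2 private_ip=$3 public_ip=$4 version=$5
  cat <<EOF
{
  "$group": {"hosts": ["$host"]},
  "_meta": {
    "hostvars": {
      "$host": {
        "private_ipv4": "$private_ip",
        "public_ipv4": "$public_ip",
        "kubernetes_version": "$version"
      }
    }
  }
}
EOF
}

# Placeholder values; "v1.x.y" stands in for the kubelet version.
emit_inventory gem_cluster_01_worker gem-c01-w03 10.0.0.13 \
  "$(pick_public_ip "" 178.105.200.132)" v1.x.y
```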

The main group hierarchy is:

all
|-- bastion
|-- cluster_nodes
|   |-- gem_cluster_01
|   |   |-- gem_cluster_01_controlplane
|   |   `-- gem_cluster_01_worker
|   `-- gem_mgmt
`-- manual
    |-- manual_bastion
    |-- manual_controlplane
    `-- manual_worker

Tag groups such as bare_metal, cloud_instance, privileged, and unprivileged are built from per-tag Kubernetes labels:

kubectl --context=gem-cluster-01 label node <node-name> \
  mintfit.io/group.bare_metal=true mintfit.io/group.unprivileged=true

The next ansible-playbook or make graph invocation picks up the new labels automatically. Because the tags are ordinary Kubernetes labels, they also work as kubectl selectors, for example kubectl get nodes -l mintfit.io/group.bare_metal=true.

Allowed tag names live in tag-names.yml. Labels with an unlisted tag are ignored with a warning when the dynamic inventory runs.
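
The label-to-group rule can be pictured with the sketch below. It is not the code in inventory/k8s-nodes.sh: the function label_to_group is hypothetical, the allowed list is hard-coded instead of being read from tag-names.yml, and treating only the value true as a match is an assumption.

```shell
# Allowed tags. In the repo these live in tag-names.yml; hard-coded for the sketch.
ALLOWED_TAGS="bare_metal cloud_instance privileged unprivileged"

# Hypothetical helper. Usage: label_to_group LABEL_KEY LABEL_VALUE
# Prints the tag group for a mintfit.io/group.<tag>=true label.
# An unlisted tag prints a warning on stderr and no group.
label_to_group() {
  key=$1 value=$2
  case $key in
    mintfit.io/group.*) tag=${key#mintfit.io/group.} ;;
    *) return 0 ;;                       # not a tag label, nothing to do
  esac
  [ "$value" = "true" ] || return 0      # assumption: only "true" enables a tag
  for allowed in $ALLOWED_TAGS; do
    if [ "$tag" = "$allowed" ]; then
      printf '%s\n' "$tag"
      return 0
    fi
  done
  printf 'warning: ignoring unlisted tag %s\n' "$tag" >&2
}

label_to_group mintfit.io/group.bare_metal true
label_to_group mintfit.io/group.gpu true
```

The first call prints the group name; the second prints only a warning on stderr because gpu is not in the allowed list.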

Connection Model

The same inventory works from a local workstation and from an automation runner inside the cluster.

| network_location | Address used by Ansible | ProxyJump |
|---|---|---|
| local | Bastion via public IP; other nodes via private IP | yes, through the cluster bastion |
| cluster | private IP for every node | no |

Set network_location with make configure, or override per run:

ansible-playbook playbooks/ping.yml -e network_location=cluster

Each cluster defines its bastion in inventory/group_vars/<cluster>.yml with bastion_inventory_name.
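
The connection model above reduces to two small decisions per host. The sketch below restates it in shell. The function names are hypothetical, the addresses are documentation placeholders, and it assumes the bastion itself is reached directly rather than through ProxyJump.

```shell
# Hypothetical helper. Usage: ansible_address LOCATION IS_BASTION PUBLIC_IP PRIVATE_IP
# Only the bastion in local mode is addressed by its public IP.
ansible_address() {
  if [ "$1" = "local" ] && [ "$2" = "yes" ]; then
    printf '%s\n' "$3"
  else
    printf '%s\n' "$4"
  fi
}

# Hypothetical helper. Usage: needs_proxyjump LOCATION IS_BASTION
# Succeeds when the host is reached through the cluster bastion.
needs_proxyjump() {
  [ "$1" = "local" ] && [ "$2" != "yes" ]
}

ansible_address local yes 203.0.113.10 10.0.0.2    # bastion from a workstation
ansible_address local no 203.0.113.11 10.0.0.13    # worker from a workstation
ansible_address cluster no 203.0.113.11 10.0.0.13  # worker from inside the cluster
```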

New Node Lifecycle

Use this workflow for a fresh server that should join a Rancher-managed cluster.

1. Add The Node Manually

Add the new host to inventory/05-manual.yml under manual_worker or manual_controlplane.

Manual hosts connect as root through their public IP and do not use the cluster bastion.

Example:

manual:
  children:
    manual_worker:
      hosts:
        gem-c01-w03:
          public_ipv4: "178.105.200.132"
          private_ipv4: "10.0.0.13"

2. Prepare The Selected Node Profile

Each cluster defines one or more node profiles in inventory/group_vars/<cluster>.yml. A node profile is the supported node contract for that cluster: OS family/version, kernel policy, kernel modules, and required node packages.

The selected profile is controlled by node_profile_name:

node_profiles:
  ubuntu_24_04_rke2_calico_longhorn:
    os:
      distribution: Ubuntu
      major_version: "24"
    kernel:
      install_version: "6.8.0-106-generic"
      allowed_regex: "^6\\.8\\.0-106-generic$"
    required_kernel_modules:
      - br_netfilter
      - overlay
    required_packages: []

node_profile_name: ubuntu_24_04_rke2_calico_longhorn
node_profile: "{{ node_profiles[node_profile_name] }}"

Roles still consume simple compatibility variables. The cluster vars derive those from the selected profile:

node_kernel_version: "{{ node_profile.kernel.install_version }}"
rke2_prep_allowed_distribution: "{{ node_profile.os.distribution }}"
rke2_prep_allowed_distribution_major_version: "{{ node_profile.os.major_version }}"
rke2_prep_allowed_kernel_regex: "{{ node_profile.kernel.allowed_regex }}"
rke2_prep_required_kernel_modules: "{{ node_profile.required_kernel_modules | default([]) }}"
rke2_prep_required_packages: "{{ node_profile.required_packages | default([]) }}"

If the node does not run the selected profile's kernel, run:

ansible-playbook playbooks/node-kernel.yml \
  --limit gem-c01-w03 \
  -e target_cluster=gem_cluster_01 \
  -e node_kernel_auto_reboot=true

For gem_cluster_01, the current selected profile installs and allows:

node_profile_name: ubuntu_24_04_rke2_calico_longhorn
node_profile:
  kernel:
    install_version: "6.8.0-106-generic"
    allowed_regex: "^6\\.8\\.0-106-generic$"

Package holds are disabled by default. Enable them only when you intentionally want to freeze kernel packages:

ansible-playbook playbooks/node-kernel.yml \
  --limit gem-c01-w03 \
  -e target_cluster=gem_cluster_01 \
  -e node_kernel_auto_reboot=true \
  -e node_kernel_hold=true

3. Configure Rancher API Access

The join playbook normally uses the Rancher API to fetch the cluster registration token and build the registration command itself.

In inventory/group_vars/<cluster>.yml:

rancher_node_join_rancher_url: "https://rancher.gem.mintfit.hamburg"
rancher_node_join_cluster_name: "gem-cluster-01"

Store rancher_node_join_api_token in Ansible Vault, automation runner secrets, or another secret backend.

For local development, copy .rancher.env.example to .rancher.env and set:

export RANCHER_API_TOKEN="token-xxxxx:yyyyy"

The token must come from a dedicated remote/external Rancher user with the smallest permissions needed to read the target cluster and its registration tokens. Do not generate it from the local Rancher admin account.
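
A quick local sanity check can catch the most common mix-up, pasting something that is not an API token. The sketch below only tests the token-<id>:<secret> shape shown in the example above. The function looks_like_api_token is hypothetical and does not contact Rancher, so a well-formed but expired or under-privileged token still passes.

```shell
# Hypothetical helper: succeed when the value has the "token-<id>:<secret>" shape.
looks_like_api_token() {
  case $1 in
    token-?*:?*) return 0 ;;
    *) return 1 ;;
  esac
}

if looks_like_api_token "${RANCHER_API_TOKEN:-}"; then
  echo "RANCHER_API_TOKEN has the expected shape"
else
  echo "RANCHER_API_TOKEN is unset or does not look like an API token" >&2
fi
```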

4. Join The Node

Run the join playbook with an explicit --limit. Lifecycle playbooks refuse to run without a limit unless explicitly overridden.

ansible-playbook playbooks/rancher-node-join.yml \
  --limit gem-c01-w03 \
  -e target_cluster=gem_cluster_01 \
  -e rancher_node_role=worker

For control-plane nodes:

ansible-playbook playbooks/rancher-node-join.yml \
  --limit gem-c01-cp04 \
  -e target_cluster=gem_cluster_01 \
  -e rancher_node_roles=etcd,controlplane,worker

The playbook runs:

node_baseline
-> users
-> rke2_prep
-> rancher_node_join

rke2_prep performs profile preflight checks before Rancher registration. For gem_cluster_01, it currently blocks nodes that do not match the selected profile's known-good kernel pattern:

rke2_prep_allowed_kernel_regex: "{{ node_profile.kernel.allowed_regex }}"
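
To see ahead of time whether a host would pass this kernel check, the same regular expression can be tried by hand. This is a minimal sketch, not the rke2_prep implementation: kernel_allowed is a hypothetical helper, and the pattern is the unescaped form of the YAML value shown in the profile.

```shell
# Hypothetical helper. Usage: kernel_allowed KERNEL_RELEASE EXTENDED_REGEX
# Succeeds when the kernel release string matches the pattern.
kernel_allowed() {
  printf '%s\n' "$1" | grep -Eq -- "$2"
}

# YAML "^6\\.8\\.0-106-generic$" is this regex once the YAML escaping is removed.
allowed_regex='^6\.8\.0-106-generic$'

if kernel_allowed "$(uname -r)" "$allowed_regex"; then
  echo "kernel $(uname -r) matches the selected profile"
else
  echo "kernel $(uname -r) does not match $allowed_regex"
fi
```

Run it on the target host. A mismatch means playbooks/node-kernel.yml should run before the join.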

5. Verify And Remove Manual Entry

After the node appears in Kubernetes:

make graph

Then remove the temporary host from inventory/05-manual.yml. The generated Kubernetes inventory owns the node from this point on.

Reset A Node For Join Testing

To remove a joined node and test the join workflow again on the same host:

ansible-playbook playbooks/rancher-node-reset.yml \
  --limit gem-c01-w03 \
  -e target_cluster=gem_cluster_01 \
  -e rancher_node_reset_confirm=true

This playbook is destructive. It requires both --limit and rancher_node_reset_confirm=true.
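
If you wrap this playbook in your own script, the same two requirements can be mirrored before Ansible starts. The sketch below is a hypothetical wrapper-side check. The playbook enforces its own guard inside Ansible, and reset_args_ok is not part of this repo.

```shell
# Hypothetical helper: succeed only when the arguments contain both a limit
# and the explicit confirmation flag.
reset_args_ok() {
  have_limit=no have_confirm=no
  for arg in "$@"; do
    case $arg in
      --limit|--limit=*|-l) have_limit=yes ;;
      rancher_node_reset_confirm=true) have_confirm=yes ;;
    esac
  done
  [ "$have_limit" = yes ] && [ "$have_confirm" = yes ]
}

if reset_args_ok --limit gem-c01-w03 -e rancher_node_reset_confirm=true; then
  echo "arguments look complete"
fi
```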

It removes the Kubernetes node object when present, stops Rancher/RKE2/K3s services, runs vendor uninstall scripts when present, unmounts kubelet pod mounts, and deletes local Rancher/RKE2/K3s state.

Before deleting the Kubernetes node object, it checks the Rancher/CAPI management cluster for a matching Machine. A confirmed reset deletes the matching Machine first so Rancher is not left with NodeDeleted / NodeNotFound status. Prefer removing the node through Rancher first when this is not a reset test.

It does not remove normal baseline management state such as team users, SSH keys, MOTD, sysctl files, or installed baseline packages.

After reset:

make graph

Then add the host back to inventory/05-manual.yml if it is no longer present there and run the normal node lifecycle again.

Rancher Registration Helper

To inspect the command generated from Rancher API data:

scripts/rancher-registration-command.sh \
  --target-cluster gem_cluster_01 \
  --worker

With .rancher.env configured:

make rancher-registration-command

List visible Rancher clusters:

scripts/rancher-registration-command.sh --list-clusters

Use a cluster id when names are ambiguous:

scripts/rancher-registration-command.sh \
  --cluster-id c-m-xxxxx \
  --worker

Include explicit node addresses:

scripts/rancher-registration-command.sh \
  --target-cluster gem_cluster_01 \
  --worker \
  --address 178.105.200.132 \
  --internal-address 10.0.0.13

The printed command contains a node registration token. Treat it as sensitive.

Routine Operations

Always use --limit for mutating playbooks unless you intentionally want a larger scope.

Read-only playbooks:

| Playbook | Purpose |
|---|---|
| playbooks/ping.yml | connection test |
| playbooks/system-info.yml | OS, CPU, memory, uptime, kubelet, disk usage |
| playbooks/node-config-report.yml | print redacted node config and network report |
| playbooks/diagnose-node.yml | collect journals and system state into ./diagnostics/ |

Mutating playbooks:

| Playbook | Purpose |
|---|---|
| playbooks/node-kernel.yml | install/select approved kernel before node join |
| playbooks/rancher-node-join.yml | prepare and join a manual node through Rancher |
| playbooks/rancher-node-reset.yml | remove Rancher/RKE2/K3s state for rejoin testing |
| playbooks/rke2-prep.yml | RKE2 sysctl, preflight checks, etcd user, worker packages |
| playbooks/users.yml | manage team users and SSH keys |
| playbooks/node-baseline.yml | install baseline packages and MOTD |
| playbooks/firewall.yml | apply UFW baseline |
| playbooks/unattended-upgrades.yml | configure security-only unattended upgrades |
| playbooks/apt-update.yml | cordon, drain, apt upgrade, optional reboot, uncordon |
| playbooks/rancher-upgrade.yml | upgrade Rancher Manager Helm release |
| playbooks/bootstrap-ansible-user.yml | create the dedicated Ansible user |

Ops and emergency playbooks:

| Playbook | Purpose |
|---|---|
| playbooks/network-static-ip.yml | pin private interface via netplan |
| playbooks/netplan-migrate-management.yml | migrate management CP enp7s0 from manual networkd to netplan |
| playbooks/restart-systemd-networkd.yml | restart systemd-networkd |
| playbooks/restart-kubelet.yml | restart rke2-server, rke2-agent, k3s, or k3s-agent |
| playbooks/restart-containerd.yml | restart containerd; this recycles pods |

Examples:

ansible-playbook playbooks/system-info.yml --limit gem-c01-w03

ansible-playbook playbooks/node-config-report.yml \
  --limit gem-rancher-k3s-03

ansible-playbook playbooks/apt-update.yml \
  --limit gem_cluster_01_worker \
  -e auto_reboot=true

Use -e dist_upgrade=true only when you intentionally want apt dist-upgrade instead of a normal apt upgrade.

Use playbooks/node-config-report.yml when comparing node configuration. It prints live routing, networkd/netplan/resolver state, selected service status, OS facts, package versions, sysctl/module/firewall state, and selected config files from /etc/netplan, /etc/systemd/network, /run/systemd/network, /etc/cloud, /etc/rancher/k3s, and /etc/rancher/rke2. Token-like fields and kubeconfig certificate/key data are redacted by default.

The default stdout mode is a compact summary. To print every collected command and config file to the Ansible output:

ansible-playbook playbooks/node-config-report.yml \
  --limit gem-rancher-k3s-03 \
  -e node_config_stdout_mode=full

Report sections can be enabled or disabled independently. All are enabled by default:

ansible-playbook playbooks/node-config-report.yml \
  --limit gem-rancher-k3s-03 \
  -e node_config_section_os=false \
  -e node_config_section_packages=false \
  -e node_config_section_network=true

Available section vars are:

node_config_section_os
node_config_section_services
node_config_section_packages
node_config_section_network
node_config_section_firewall
node_config_section_kubernetes
node_config_section_config_files
node_config_section_journals

To also fetch the full report tarball, including recent service journals:

ansible-playbook playbooks/node-config-report.yml \
  --limit gem-rancher-k3s-03 \
  -e node_config_save_report=true

Saved reports are written to diagnostics/node-config/. Use -e node_config_include_sensitive=true only if you intentionally need raw values in the printed or saved report.
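
To give a feel for what redaction of token-like fields means in practice, the sketch below masks the value of any YAML-style key whose name contains token. It is an illustration only: redact_tokens is a hypothetical helper, and the rules the playbook actually applies may cover more or different fields.

```shell
# Hypothetical helper: read config text on stdin and mask the value of every
# "key: value" line whose key contains "token".
redact_tokens() {
  sed -E 's/^([[:space:]]*[A-Za-z_-]*token[A-Za-z_-]*[[:space:]]*:[[:space:]]*).*$/\1REDACTED/'
}

printf 'server: https://rancher.example:9345\ntoken: abc123\n' | redact_tokens
```

Lines whose key does not contain token, such as server, pass through unchanged.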

For the Rancher management control-plane netplan migration, keep Hetzner console access open. The playbook runs with serial: 1, drains one node, migrates its network config, waits for it to become Ready, uncordons it, and then continues with the next node:

ansible-playbook playbooks/netplan-migrate-management.yml --diff

The playbook backs up /etc/netplan/50-cloud-init.yaml and /etc/systemd/network/10-enp7s0.network, writes /etc/netplan/60-enp7s0-static.yaml, restarts systemd-networkd, and verifies that enp7s0 is using /run/systemd/network/10-netplan-enp7s0.network with the 10.0.0.1 default gateway. To test only one node, add --limit gem-rancher-k3s-01. To skip Kubernetes drain, pass -e netplan_drain_node=false.
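
If you want to double-check the gateway by hand after a migration, the default route can be read from ip route output. The sketch below is a manual aid, not the verification code used by the playbook, and default_gateway is a hypothetical helper.

```shell
# Hypothetical helper: read `ip route show default` output on stdin and print
# the gateway of the default route that uses the given interface.
# Usage on a node: ip route show default | default_gateway enp7s0
default_gateway() {
  awk -v dev="$1" '
    $1 == "default" && $2 == "via" {
      for (i = 4; i <= NF; i++)
        if ($i == "dev" && $(i + 1) == dev) { print $3; exit }
    }'
}

# Sample input in the format printed by `ip route show default`.
printf 'default via 10.0.0.1 dev enp7s0 proto static\n' | default_gateway enp7s0
```

On a migrated management node the result should be 10.0.0.1; an empty result means no default route uses enp7s0.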

Make Targets

Run make help to list available targets.

Common targets:

| Target | Purpose |
|---|---|
| make configure | write local config.yml |
| make doctor | sanity-check local tools and config |
| make graph | print the inventory tree (queries kubectl) |
| make ping | smoke-test Ansible connectivity |
| make node-kernel | run playbooks/node-kernel.yml |
| make rancher-upgrade | run playbooks/rancher-upgrade.yml |
| make rancher-node-join | run playbooks/rancher-node-join.yml |
| make rancher-node-reset | run playbooks/rancher-node-reset.yml |
| make rancher-registration-command | print generated Rancher registration command |

The Make targets are convenient shortcuts for routine default runs. For node lifecycle operations, use direct ansible-playbook commands so required values such as --limit, target_cluster, and confirmation flags are explicit in the command history.

Repository Layout

ansible/
|-- README.md
|-- GETTING_STARTED.md
|-- Makefile                    # operator shortcuts
|-- ansible.cfg                 # inventory, callbacks, fact cache, role paths
|-- requirements.yml            # Ansible collection requirements
|-- config.yml.example          # template written to config.yml by make configure
|-- tag-names.yml               # allowed inventory tag groups
|-- .rancher.env.example        # local Rancher API token template
|-- inventory/
|   |-- 00-static.yml           # static group hierarchy and bastions
|   |-- 05-manual.yml           # temporary/fresh hosts before dynamic inventory owns them
|   |-- k8s-nodes.sh            # dynamic inventory: queries `kubectl get nodes`
|   `-- group_vars/             # global, manual, cluster, and role-group vars
|-- playbooks/
|   |-- apt-update.yml
|   |-- bootstrap-ansible-user.yml
|   |-- diagnose-node.yml
|   |-- firewall.yml
|   |-- network-static-ip.yml
|   |-- netplan-migrate-management.yml
|   |-- node-baseline.yml
|   |-- node-config-report.yml
|   |-- node-kernel.yml
|   |-- ping.yml
|   |-- rancher-upgrade.yml
|   |-- rancher-node-join.yml
|   |-- rancher-node-reset.yml
|   |-- restart-containerd.yml
|   |-- restart-kubelet.yml
|   |-- restart-systemd-networkd.yml
|   |-- rke2-prep.yml
|   |-- system-info.yml
|   |-- unattended-upgrades.yml
|   |-- users.yml
|   `-- templates/
|-- roles/
|   |-- firewall/
|   |-- node_baseline/
|   |-- node_kernel/
|   |-- rancher_node_join/
|   |-- rancher_node_reset/
|   |-- rke2_prep/
|   |-- unattended_upgrades/
|   `-- users/
|-- files/
|   |-- README.md
|   `-- keys/                   # managed users' public SSH keys
`-- scripts/
    |-- configure.sh
    |-- doctor.sh
    |-- install-deps.sh
    |-- lib-config.sh
    `-- rancher-registration-command.sh

Troubleshooting

Run make doctor first. It checks the common local setup problems.

| Symptom | Likely cause | Fix |
|---|---|---|
| Connection timed out to private IP | missing bastion config while in local mode | check bastion_inventory_name and the bastion public IP |
| Permission denied (publickey) | wrong SSH user or key | check config.yml and key permissions |
| BECOME password: prompt hangs | user has no passwordless sudo | run with -K or run make bootstrap-user |
| context X unreachable or lacks get nodes permission | kubeconfig context is wrong or lacks RBAC | verify with kubectl --context=X get nodes |
| inventory/group_vars/<slug>.yml does not exist | context slug has no cluster vars file | create the file or rename the kube context |
| Rancher API returns 401 | wrong token type, expired token, or bad .rancher.env | use a Rancher API token, not a node registration token |
| join waits forever for Kubernetes node | Rancher agent installed but RKE2/CNI is unhealthy | check rancher-system-agent, rke2-agent, and CNI pod logs |
| preflight rejects kernel | node does not match cluster-approved kernel | run playbooks/node-kernel.yml or update the approved kernel after testing |

Useful one-node test:

ansible-playbook playbooks/system-info.yml \
  --limit gem-c01-w03 \
  --check \
  --diff

Inspect resolved inventory:

ansible-inventory --host gem-c01-w03
ansible-inventory --graph