CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

Repository Overview

This is a Nix Flake-based NixOS configuration repository for managing a homelab infrastructure consisting of 16 server configurations. The repository uses a modular architecture with shared system configurations, reusable service modules, and per-host customization.

Common Commands

Building Configurations

# List all available configurations
nix flake show

# Build a specific host configuration locally (without deploying)
nixos-rebuild build --flake .#<hostname>

# Build and check a configuration
nix build .#nixosConfigurations.<hostname>.config.system.build.toplevel

Important: Do NOT pipe nix build commands to other commands like tail or head. Piping can hide errors and make builds appear successful when they actually failed. Always run nix build without piping to see the full output.

# BAD - hides errors
nix build .#create-host 2>&1 | tail -20

# GOOD - shows all output and errors
nix build .#create-host

Deployment

Do not deploy changes automatically. Deployments are normally done by merging to the master branch and then triggering the auto-upgrade on the specific host.

Flake Management

# Check flake for errors
nix flake check

Do not run nix flake update. Flake updates should only be done manually by the user.

Development Environment

# Enter development shell (provides ansible, python3)
nix develop

Secrets Management

Secrets are handled with SOPS. Do not edit .sops.yaml or any file within secrets/; ask the user to make the change if necessary.

Git Commit Messages

Commit messages should follow the format: topic: short description

Examples:

  • flake: add opentofu to devshell
  • template2: add proxmox image configuration
  • terraform: add VM deployment configuration

Architecture

Directory Structure

  • /flake.nix - Central flake defining all 16 NixOS configurations
  • /hosts/<hostname>/ - Per-host configurations
    • default.nix - Entry point, imports configuration.nix and services
    • configuration.nix - Host-specific settings (networking, hardware, users)
  • /system/ - Shared system-level configurations applied to ALL hosts
    • Core modules: nix.nix, sshd.nix, sops.nix, acme.nix, autoupgrade.nix
    • Monitoring: node-exporter and promtail on every host
  • /services/ - Reusable service modules, selectively imported by hosts
    • home-assistant/ - Home automation stack
    • monitoring/ - Observability stack (Prometheus, Grafana, Loki, Tempo)
    • ns/ - DNS services (authoritative, resolver)
    • http-proxy/, ca/, postgres/, nats/, jellyfin/, etc.
  • /secrets/ - SOPS-encrypted secrets with age encryption
  • /common/ - Shared configurations (e.g., VM guest agent)
  • /playbooks/ - Ansible playbooks for fleet management
  • /.sops.yaml - SOPS configuration with age keys for all servers

Configuration Inheritance

Each host follows this import pattern:

hosts/<hostname>/default.nix
  └─> configuration.nix (host-specific)
      ├─> ../../system (ALL shared system configs - applied to every host)
      ├─> ../../services/<service> (selective service imports)
      └─> ../../common/vm (if VM)

All hosts automatically get:

  • Nix binary cache (nix-cache.home.2rjus.net)
  • SSH with root login enabled
  • SOPS secrets management with auto-generated age keys
  • Internal ACME CA integration (ca.home.2rjus.net)
  • Daily auto-upgrades with auto-reboot
  • Prometheus node-exporter + Promtail (logs to monitoring01)
  • Custom root CA trust
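
For illustration, a new host's entry files typically take roughly this shape (the hostname and service imports below are placeholders, not copied from an actual host):

# hosts/<hostname>/default.nix (illustrative; real hosts differ)
{ ... }:
{
  imports = [
    ./configuration.nix
    ../../services/jellyfin
  ];
}

# hosts/<hostname>/configuration.nix (illustrative)
{ ... }:
{
  imports = [
    ../../system
    ../../common/vm
  ];
  networking.hostName = "examplehost";
}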

Active Hosts

Production servers managed by rebuild-all.sh:

  • ns1, ns2 - Primary/secondary DNS servers (10.69.13.5/6)
  • ca - Internal Certificate Authority
  • ha1 - Home Assistant + Zigbee2MQTT + Mosquitto
  • http-proxy - Reverse proxy
  • monitoring01 - Full observability stack (Prometheus, Grafana, Loki, Tempo, Pyroscope)
  • jelly01 - Jellyfin media server
  • nix-cache01 - Binary cache server
  • pgdb1 - PostgreSQL database
  • nats1 - NATS messaging server
  • auth01 - Authentication service

Template/test hosts:

  • template1 - Base template for cloning new hosts
  • nixos-test1 - Test environment

Flake Inputs

  • nixpkgs - NixOS 25.11 stable (primary)
  • nixpkgs-unstable - Unstable channel (available via overlay as pkgs.unstable.<package>)
  • sops-nix - Secrets management
  • Custom packages from git.t-juice.club:
    • backup-helper - Backup automation module
    • alerttonotify - Alert routing
    • labmon - Lab monitoring
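
A rough sketch of the corresponding inputs section of flake.nix, based on the list above (the URLs for the custom packages from git.t-juice.club are assumptions and may differ from the real flake):

# Sketch only; the custom-package URLs are assumptions
inputs = {
  nixpkgs.url = "github:NixOS/nixpkgs/nixos-25.11";
  nixpkgs-unstable.url = "github:NixOS/nixpkgs/nixos-unstable";
  sops-nix.url = "github:Mic92/sops-nix";
  backup-helper.url = "git+https://git.t-juice.club/torjus/backup-helper.git";
  alerttonotify.url = "git+https://git.t-juice.club/torjus/alerttonotify.git";
  labmon.url = "git+https://git.t-juice.club/torjus/labmon.git";
};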

Network Architecture

  • Domain: home.2rjus.net
  • Infrastructure subnet: 10.69.13.x
  • DNS: ns1/ns2 provide authoritative DNS with primary-secondary setup
  • Internal CA for ACME certificates (no Let's Encrypt)
  • Centralized monitoring at monitoring01
  • Static networking via systemd-networkd
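
For example, a host's static network configuration with systemd-networkd might look like the following sketch (the interface name, address, and gateway are illustrative placeholders):

# Illustrative static-IP config; interface, address, and gateway are placeholders
{
  networking.useDHCP = false;
  systemd.network = {
    enable = true;
    networks."10-lan" = {
      matchConfig.Name = "ens18";
      address = [ "10.69.13.42/24" ];
      gateway = [ "10.69.13.1" ];
      dns = [ "10.69.13.5" "10.69.13.6" ];
    };
  };
}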

Secrets Management

  • Uses SOPS with age encryption
  • Each server has unique age key in .sops.yaml
  • Keys auto-generated at /var/lib/sops-nix/key.txt on first boot
  • Shared secrets: /secrets/secrets.yaml
  • Per-host secrets: /secrets/<hostname>/
  • All production servers can decrypt shared secrets; host-specific secrets require specific host keys
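
A typical sops-nix declaration on a host looks roughly like this (the secret name and owner are placeholders, not real secrets from this repository):

# Illustrative secret declaration; secret name and owner are placeholders
{
  sops.defaultSopsFile = ../../secrets/secrets.yaml;
  sops.age.keyFile = "/var/lib/sops-nix/key.txt";
  sops.secrets."example-service/api-token" = {
    owner = "example-service";
  };
}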

Auto-Upgrade System

All hosts pull updates daily from:

git+https://git.t-juice.club/torjus/nixos-servers.git

Configured in /system/autoupgrade.nix:

  • Random delay to avoid simultaneous upgrades
  • Auto-reboot after successful upgrade
  • Systemd service: nixos-upgrade.service
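
Based on the bullets above, /system/autoupgrade.nix likely resembles this sketch (the exact delay value is an assumption):

# Sketch of the auto-upgrade settings; the delay value is an assumption
{
  system.autoUpgrade = {
    enable = true;
    flake = "git+https://git.t-juice.club/torjus/nixos-servers.git";
    dates = "daily";
    randomizedDelaySec = "45min";
    allowReboot = true;
  };
}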

Proxmox VM Provisioning with OpenTofu

The repository includes automated workflows for building Proxmox VM templates and deploying VMs using OpenTofu (Terraform).

Building and Deploying Templates

Template VMs are built from hosts/template2 and deployed to Proxmox using Ansible:

# Build NixOS image and deploy to Proxmox as template
nix develop -c ansible-playbook -i playbooks/inventory.ini playbooks/build-and-deploy-template.yml

This playbook:

  1. Builds the Proxmox image using nixos-rebuild build-image --image-variant proxmox
  2. Uploads the .vma.zst image to Proxmox at /var/lib/vz/dump
  3. Restores it as VM ID 9000
  4. Converts it to a template

Template configuration (hosts/template2):

  • Minimal base system with essential packages (age, vim, wget, git)
  • Cloud-init configured for NoCloud datasource (no EC2 metadata timeout)
  • DHCP networking on ens18
  • SSH key-based root login
  • prepare-host.sh script for cleaning machine-id, SSH keys, and regenerating age keys
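
A rough sketch of the template2 highlights listed above (option values are illustrative, not copied from hosts/template2):

# Illustrative only; values are not copied from hosts/template2
{ pkgs, ... }:
{
  environment.systemPackages = with pkgs; [ age vim wget git ];

  # Cloud-init restricted to the NoCloud datasource
  services.cloud-init.enable = true;
  services.cloud-init.settings.datasource_list = [ "NoCloud" ];

  # DHCP on ens18 via systemd-networkd
  systemd.network.networks."10-ens18" = {
    matchConfig.Name = "ens18";
    networkConfig.DHCP = "ipv4";
  };

  # Key-based root login; the key below is a placeholder
  services.openssh.settings.PermitRootLogin = "prohibit-password";
  users.users.root.openssh.authorizedKeys.keys = [ "ssh-ed25519 AAAA... example" ];
}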

Deploying VMs with OpenTofu

VMs are deployed from templates using OpenTofu in the /terraform directory:

cd terraform
tofu init     # First time only
tofu apply    # Deploy VMs

Configuration files:

  • main.tf - Proxmox provider configuration
  • variables.tf - Provider variables (API credentials)
  • vm.tf - VM resource definitions
  • terraform.tfvars - Actual credentials (gitignored)

Example VM deployment includes:

  • Clone from template VM
  • Cloud-init configuration (SSH keys, network, DNS)
  • Custom CPU/memory/disk sizing
  • VLAN tagging
  • QEMU guest agent

OpenTofu outputs the VM's IP address after deployment for easy SSH access.

Template Rebuilding and Terraform State

When the Proxmox template is rebuilt (via build-and-deploy-template.yml), the template name may change. This would normally cause Terraform to want to recreate all existing VMs, but that's unnecessary since VMs are independent once cloned.

Solution: The terraform/vms.tf file includes a lifecycle rule to ignore certain attributes that don't need management:

lifecycle {
  ignore_changes = [
    clone,            # Template name can change without recreating VMs
    startup_shutdown, # Proxmox sets defaults (-1) that we don't need to manage
  ]
}

This means:

  • clone: Existing VMs are not affected by template name changes; only new VMs use the updated template
  • startup_shutdown: Proxmox sets default startup order/delay values (-1) that Terraform would otherwise try to remove
  • You can safely update default_template_name in terraform/variables.tf without recreating VMs
  • tofu plan won't show spurious changes for Proxmox-managed defaults

When rebuilding the template:

  1. Run nix develop -c ansible-playbook -i playbooks/inventory.ini playbooks/build-and-deploy-template.yml
  2. Update default_template_name in terraform/variables.tf if the name changed
  3. Run tofu plan - should show no VM recreations (only template name in state)
  4. Run tofu apply - updates state without touching existing VMs
  5. New VMs created after this point will use the new template

Adding a New Host

  1. Create /hosts/<hostname>/ directory
  2. Copy structure from template1 or similar host
  3. Add host entry to flake.nix nixosConfigurations (see the sketch after this list)
  4. Add the hostname to the DNS zone files, merge to master, and run the auto-upgrade on the DNS servers
  5. User clones template host
  6. User runs prepare-host.sh on the new host; this deletes files that should be regenerated (SSH host keys, machine-id, etc.), generates a new age key, and prints the public key
  7. This key is then added to .sops.yaml
  8. Create /secrets/<hostname>/ if needed
  9. Configure networking (static IP, DNS servers)
  10. Commit changes, and merge to master.
  11. Deploy by running nixos-rebuild boot --flake URL#<hostname> on the host.
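
For step 3, a hypothetical nixosConfigurations entry might look like this (the hostname and module list are placeholders; the real flake.nix may use a helper function instead):

# Hypothetical entry; hostname and module list are placeholders
nixosConfigurations.newhost01 = nixpkgs.lib.nixosSystem {
  system = "x86_64-linux";
  modules = [
    ./hosts/newhost01
    sops-nix.nixosModules.sops
  ];
};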

Important Patterns

Overlay usage: Access unstable packages via pkgs.unstable.<package> (defined in flake.nix overlay-unstable)
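
A minimal sketch of an overlay of this kind (the real overlay-unstable in flake.nix may differ in detail):

# Sketch; the real overlay in flake.nix may differ
overlay-unstable = final: prev: {
  unstable = import nixpkgs-unstable { system = prev.system; };
};

# Usage in a module:
environment.systemPackages = [ pkgs.unstable.somepackage ];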

Service composition: Services in /services/ are designed to be imported by multiple hosts. Keep them modular and reusable.

Hardware configuration reuse: Multiple hosts share /hosts/template/hardware-configuration.nix for VM instances.

State version: All hosts use stateVersion "23.11" - do not change this on existing hosts.

Firewall: Disabled on most hosts (trusted network). Enable selectively in host configuration if needed.

Monitoring Stack

All hosts ship metrics and logs to monitoring01:

  • Metrics: Prometheus scrapes node-exporter from all hosts
  • Logs: Promtail ships logs to Loki on monitoring01
  • Access: Grafana at monitoring01 for visualization
  • Tracing: Tempo for distributed tracing
  • Profiling: Pyroscope for continuous profiling
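
As a hedged sketch, the per-host exporter configuration in /system/ is roughly of this shape (the Loki URL, ports, and labels are assumptions):

# Illustrative per-host monitoring config; URL, ports, and labels are assumptions
{ config, ... }:
{
  # Node metrics scraped by Prometheus on monitoring01
  services.prometheus.exporters.node.enable = true;

  # Ship the systemd journal to Loki on monitoring01
  services.promtail = {
    enable = true;
    configuration = {
      server = { http_listen_port = 9080; grpc_listen_port = 0; };
      positions.filename = "/var/cache/promtail/positions.yaml";
      clients = [ { url = "http://monitoring01.home.2rjus.net:3100/loki/api/v1/push"; } ];
      scrape_configs = [ {
        job_name = "journal";
        journal.labels = {
          job = "systemd-journal";
          host = config.networking.hostName;
        };
        relabel_configs = [ {
          source_labels = [ "__journal__systemd_unit" ];
          target_label = "unit";
        } ];
      } ];
    };
  };
}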

DNS Architecture

  • ns1 (10.69.13.5) - Primary authoritative DNS + resolver
  • ns2 (10.69.13.6) - Secondary authoritative DNS (AXFR from ns1)
  • Zone files managed in /services/ns/
  • All hosts point to ns1/ns2 for DNS resolution
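
Purely for illustration, a primary zone with AXFR to ns2 could be expressed like this using BIND; the actual module under /services/ns/ may use different DNS software and a different zone layout:

# Illustrative primary-zone sketch; the real /services/ns/ module may differ
{
  services.bind = {
    enable = true;
    zones."home.2rjus.net" = {
      master = true;
      file = "/etc/bind/zones/home.2rjus.net.zone";
      slaves = [ "10.69.13.6" ];  # allow zone transfer (AXFR) to ns2
    };
  };
}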