# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Repository Overview

This is a Nix Flake-based NixOS configuration repository for managing a homelab infrastructure consisting of 16 server configurations. The repository uses a modular architecture with shared system configurations, reusable service modules, and per-host customization.

## Common Commands

### Building Configurations

```bash
# List all available configurations
nix flake show

# Build a specific host configuration locally (without deploying)
nixos-rebuild build --flake .#

# Build and check a configuration
nix build .#nixosConfigurations..config.system.build.toplevel
```

**Important:** Do NOT pipe `nix build` commands to other commands like `tail` or `head`. Piping can hide errors and make builds appear successful when they actually failed. Always run `nix build` without piping to see the full output.

```bash
# BAD - hides errors
nix build .#create-host 2>&1 | tail -20

# GOOD - shows all output and errors
nix build .#create-host
```

### Deployment

Do not automatically deploy changes. Deployments are usually done by updating the master branch, and then triggering the auto update on the specific host.

### SSH Commands

Do not run SSH commands directly. If a command needs to be run on a remote host, provide the command to the user and ask them to run it manually.

### Sharing Command Output via Loki

All hosts have the `pipe-to-loki` script for sending command output or terminal sessions to Loki, allowing users to share output with Claude without copy-pasting.

**Pipe mode** - send command output:

```bash
command | pipe-to-loki                 # Auto-generated ID
command | pipe-to-loki --id my-test    # Custom ID
```

**Session mode** - record interactive terminal session:

```bash
pipe-to-loki --record                    # Start recording, exit to send
pipe-to-loki --record --id my-session    # With custom ID
```

The script prints the session ID which the user can share.
Query results with:

```logql
{job="pipe-to-loki"}                     # All entries
{job="pipe-to-loki", id="my-test"}       # Specific ID
{job="pipe-to-loki", host="testvm01"}    # From specific host
{job="pipe-to-loki", type="session"}     # Only sessions
```

### Testing Feature Branches on Hosts

All hosts have the `nixos-rebuild-test` helper script for testing feature branches before merging:

```bash
# On the target host, test a feature branch
nixos-rebuild-test boot
nixos-rebuild-test switch

# Additional arguments are passed through to nixos-rebuild
nixos-rebuild-test boot my-feature --show-trace
```

When working on a feature branch that requires testing on a live host, suggest using this command instead of the full flake URL syntax.

### Flake Management

```bash
# Check flake for errors
nix flake check
```

Do not run `nix flake update`. This should only be done manually by the user.

### Development Environment

```bash
# Enter development shell
nix develop
```

The devshell provides: `ansible`, `tofu` (OpenTofu), `bao` (OpenBao CLI), `create-host`, and `homelab-deploy`.

**Important:** When suggesting commands that use devshell tools, always use `nix develop -c ` syntax rather than assuming the user is already in a devshell. For example:

```bash
# Good - works regardless of current shell
nix develop -c tofu plan

# Avoid - requires user to be in devshell
tofu plan
```

**OpenTofu:** Use the `-chdir` option instead of `cd` when running tofu commands in subdirectories:

```bash
# Good - uses -chdir option
nix develop -c tofu -chdir=terraform plan
nix develop -c tofu -chdir=terraform/vault apply

# Avoid - changing directories
cd terraform && tofu plan
```

### Secrets Management

Secrets are managed by OpenBao (Vault) using AppRole authentication. Most hosts use the `vault.secrets` option defined in `system/vault-secrets.nix` to fetch secrets at boot. Terraform manages the secrets and AppRole policies in `terraform/vault/`.
### Git Workflow

**Important:** Never commit directly to `master` unless the user explicitly asks for it. Always create a feature branch for changes.

**Important:** Never amend commits to `master` unless the user explicitly asks for it. Amending rewrites history and causes issues for deployed configurations.

**Important:** Do not use `gh pr create` to create pull requests. The git server does not support GitHub CLI for PR creation. Instead, push the branch and let the user create the PR manually via the web interface.

When starting a new plan or task, the first step should typically be to create and checkout a new branch with an appropriate name (e.g., `git checkout -b dns-automation` or `git checkout -b fix-nginx-config`).

### Plan Management

When creating plans for large features, follow this workflow:

1. When implementation begins, save a copy of the plan to `docs/plans/` (e.g., `docs/plans/feature-name.md`)
2. Once the feature is fully implemented, move the plan to `docs/plans/completed/`

### Git Commit Messages

Commit messages should follow the format: `topic: short description`

Examples:

- `flake: add opentofu to devshell`
- `template2: add proxmox image configuration`
- `terraform: add VM deployment configuration`

### Clipboard

To copy text to the clipboard, pipe to `wl-copy` (Wayland):

```bash
echo "text" | wl-copy
```

### NixOS Options and Packages Lookup

Two MCP servers are available for searching NixOS options and packages:

- **nixpkgs-options** - Search and look up NixOS configuration option documentation
- **nixpkgs-packages** - Search and look up Nix packages from nixpkgs

**Session Setup:** At the start of each session, index the nixpkgs revision from `flake.lock` to ensure documentation matches the project's nixpkgs version:

1. Read `flake.lock` and find the `nixpkgs` node's `rev` field
2. Call `index_revision` with that git hash (both servers share the same index)

**Options Tools (nixpkgs-options):**

- `search_options` - Search for options by name or description (e.g., query "nginx" or "postgresql")
- `get_option` - Get full details for a specific option (e.g., `services.loki.configuration`)
- `get_file` - Fetch the source file from nixpkgs that declares an option

**Package Tools (nixpkgs-packages):**

- `search_packages` - Search for packages by name or description (e.g., query "nginx" or "python")
- `get_package` - Get full details for a specific package by attribute path (e.g., `firefox`, `python312Packages.requests`)
- `get_file` - Fetch the source file from nixpkgs that defines a package

This ensures documentation matches the exact nixpkgs version (currently NixOS 25.11) used by this flake.

### Lab Monitoring

The **lab-monitoring** MCP server provides access to Prometheus metrics and Loki logs. Use the `/observability` skill for detailed reference on:

- Available Prometheus jobs and exporters
- Loki labels and LogQL query syntax
- Bootstrap log monitoring for new VMs
- Common troubleshooting workflows

The skill contains up-to-date information about all scrape targets, host labels, and example queries.

### Deploying to Test Hosts

The **homelab-deploy** MCP server enables remote deployments to test-tier hosts via NATS messaging.
**Available Tools:**

- `deploy` - Deploy NixOS configuration to test-tier hosts
- `list_hosts` - List available deployment targets

**Deploy Parameters:**

- `hostname` - Target a specific host (e.g., `vaulttest01`)
- `role` - Deploy to all hosts with a specific role (e.g., `vault`)
- `all` - Deploy to all test-tier hosts
- `action` - nixos-rebuild action: `switch` (default), `boot`, `test`, `dry-activate`
- `branch` - Git branch or commit to deploy (default: `master`)

**Examples:**

```
# List available hosts
list_hosts()

# Deploy to a specific host
deploy(hostname="vaulttest01", action="switch")

# Dry-run deployment
deploy(hostname="vaulttest01", action="dry-activate")

# Deploy to all hosts with a role
deploy(role="vault", action="switch")
```

**Note:** Only test-tier hosts with `homelab.deploy.enable = true` and the listener service running will respond to deployments.

**Deploying to Prod Hosts:**

The MCP server only deploys to test-tier hosts. For prod hosts, use the CLI directly:

```bash
nix develop -c homelab-deploy -- deploy \
  --nats-url nats://nats1.home.2rjus.net:4222 \
  --nkey-file ~/.config/homelab-deploy/admin-deployer.nkey \
  --branch \
  --action switch \
  deploy.prod.
```

Subject format: `deploy..` (e.g., `deploy.prod.monitoring01`, `deploy.test.testvm01`)

**Verifying Deployments:**

After deploying, use the `nixos_flake_info` metric from nixos-exporter to verify the host is running the expected revision:

```promql
nixos_flake_info{instance=~"vaulttest01.*"}
```

The `current_rev` label contains the git commit hash of the deployed flake configuration.
## Architecture

### Directory Structure

- `/flake.nix` - Central flake defining all NixOS configurations
- `/hosts//` - Per-host configurations
  - `default.nix` - Entry point, imports configuration.nix and services
  - `configuration.nix` - Host-specific settings (networking, hardware, users)
- `/system/` - Shared system-level configurations applied to ALL hosts
  - Core modules: nix.nix, sshd.nix, vault-secrets.nix, acme.nix, autoupgrade.nix
  - Additional modules: motd.nix (dynamic MOTD), packages.nix (base packages), root-user.nix (root config), homelab-deploy.nix (NATS listener)
  - Monitoring: node-exporter and promtail on every host
- `/modules/` - Custom NixOS modules
  - `homelab/` - Homelab-specific options (see "Homelab Module Options" section below)
- `/lib/` - Nix library functions
  - `dns-zone.nix` - DNS zone generation functions
  - `monitoring.nix` - Prometheus scrape target generation functions
- `/services/` - Reusable service modules, selectively imported by hosts
  - `home-assistant/` - Home automation stack
  - `monitoring/` - Observability stack (Prometheus, Grafana, Loki, Tempo)
  - `ns/` - DNS services (authoritative, resolver, zone generation)
  - `vault/` - OpenBao (Vault) secrets server
  - `actions-runner/` - GitHub Actions runner
  - `http-proxy/`, `postgres/`, `nats/`, `jellyfin/`, etc.
- `/common/` - Shared configurations (e.g., VM guest agent)
- `/docs/` - Documentation and plans
  - `plans/` - Future plans and proposals
  - `plans/completed/` - Completed plans (moved here when done)
- `/playbooks/` - Ansible playbooks for fleet management

### Configuration Inheritance

Each host follows this import pattern:

```
hosts//default.nix
  └─> configuration.nix (host-specific)
        ├─> ../../system (ALL shared system configs - applied to every host)
        ├─> ../../services/ (selective service imports)
        └─> ../../common/vm (if VM)
```

All hosts automatically get:

- Nix binary cache (nix-cache.home.2rjus.net)
- SSH with root login enabled
- OpenBao (Vault) secrets management via AppRole
- Internal ACME CA integration (OpenBao PKI at vault.home.2rjus.net)
- Daily auto-upgrades with auto-reboot
- Prometheus node-exporter + Promtail (logs to monitoring01)
- Monitoring scrape target auto-registration via `homelab.monitoring` options
- Custom root CA trust
- DNS zone auto-registration via `homelab.dns` options

### Active Hosts

Production servers:

- `ns1`, `ns2` - Primary/secondary DNS servers (10.69.13.5/6)
- `vault01` - OpenBao (Vault) secrets server + PKI CA
- `ha1` - Home Assistant + Zigbee2MQTT + Mosquitto
- `http-proxy` - Reverse proxy
- `monitoring01` - Full observability stack (Prometheus, Grafana, Loki, Tempo, Pyroscope)
- `jelly01` - Jellyfin media server
- `nix-cache01` - Binary cache server + GitHub Actions runner
- `pgdb1` - PostgreSQL database
- `nats1` - NATS messaging server

Test/staging hosts:

- `testvm01`, `testvm02`, `testvm03` - Test-tier VMs for branch testing and deployment validation

Template hosts:

- `template1`, `template2` - Base templates for cloning new hosts

### Flake Inputs

- `nixpkgs` - NixOS 25.11 stable (primary)
- `nixpkgs-unstable` - Unstable channel (available via overlay as `pkgs.unstable.`)
- `nixos-exporter` - NixOS module for exposing flake revision metrics (used to verify deployments)
- `homelab-deploy` - NATS-based remote deployment tool for test-tier hosts
- Custom packages from git.t-juice.club:
  - `alerttonotify` - Alert routing

### Network Architecture

- Domain: `home.2rjus.net`
- Infrastructure subnet: `10.69.13.x`
- DNS: ns1/ns2 provide authoritative DNS with primary-secondary setup
- Internal CA for ACME certificates (no Let's Encrypt)
- Centralized monitoring at monitoring01
- Static networking via systemd-networkd

### Secrets Management

Most hosts use OpenBao (Vault) for secrets:

- Vault server at `vault01.home.2rjus.net:8200`
- AppRole authentication with credentials at `/var/lib/vault/approle/`
- Secrets defined in Terraform (`terraform/vault/secrets.tf`)
- AppRole policies in Terraform (`terraform/vault/approle.tf`)
- NixOS module: `system/vault-secrets.nix` with `vault.secrets.` options
- `extractKey` option extracts a single key from vault JSON as a plain file
- Secrets fetched at boot by `vault-secret-.service` systemd units
- Fallback to cached secrets in `/var/lib/vault/cache/` when Vault is unreachable
- Provision AppRole credentials: `nix develop -c ansible-playbook playbooks/provision-approle.yml -e hostname=`

### Auto-Upgrade System

All hosts pull updates daily from:

```
git+https://git.t-juice.club/torjus/nixos-servers.git
```

Configured in `/system/autoupgrade.nix`:

- Random delay to avoid simultaneous upgrades
- Auto-reboot after successful upgrade
- Systemd service: `nixos-upgrade.service`

### Proxmox VM Provisioning with OpenTofu

The repository includes automated workflows for building Proxmox VM templates and deploying VMs using OpenTofu (Terraform).

#### Building and Deploying Templates

Template VMs are built from `hosts/template2` and deployed to Proxmox using Ansible:

```bash
# Build NixOS image and deploy to Proxmox as template
nix develop -c ansible-playbook -i playbooks/inventory.ini playbooks/build-and-deploy-template.yml
```

This playbook:

1. Builds the Proxmox image using `nixos-rebuild build-image --image-variant proxmox`
2. Uploads the `.vma.zst` image to Proxmox at `/var/lib/vz/dump`
3. Restores it as VM ID 9000
4. Converts it to a template

Template configuration (`hosts/template2`):

- Minimal base system with essential packages (age, vim, wget, git)
- Cloud-init configured for NoCloud datasource (no EC2 metadata timeout)
- DHCP networking on ens18
- SSH key-based root login
- `prepare-host.sh` script for cleaning machine-id, SSH keys, and regenerating age keys

#### Deploying VMs with OpenTofu

VMs are deployed from templates using OpenTofu in the `/terraform` directory:

```bash
cd terraform
tofu init     # First time only
tofu apply    # Deploy VMs
```

Configuration files:

- `main.tf` - Proxmox provider configuration
- `variables.tf` - Provider variables (API credentials)
- `vm.tf` - VM resource definitions
- `terraform.tfvars` - Actual credentials (gitignored)

Example VM deployment includes:

- Clone from template VM
- Cloud-init configuration (SSH keys, network, DNS)
- Custom CPU/memory/disk sizing
- VLAN tagging
- QEMU guest agent
- Automatic Vault credential provisioning via `vault_wrapped_token`

OpenTofu outputs the VM's IP address after deployment for easy SSH access.

**Automatic Vault Credential Provisioning:**

VMs can receive Vault (OpenBao) credentials automatically during bootstrap:

1. OpenTofu generates a wrapped token via `terraform/vault/` and stores it in the VM configuration
2. Cloud-init passes `VAULT_WRAPPED_TOKEN` and `NIXOS_FLAKE_BRANCH` to the bootstrap script
3. The bootstrap script unwraps the token to obtain AppRole credentials
4. Credentials are written to `/var/lib/vault/approle/` before the NixOS rebuild

This eliminates the need for manual `provision-approle.yml` playbook runs on new VMs. Bootstrap progress is logged to Loki with `job="bootstrap"` labels.

#### Template Rebuilding and Terraform State

When the Proxmox template is rebuilt (via `build-and-deploy-template.yml`), the template name may change.
This would normally cause Terraform to want to recreate all existing VMs, but that's unnecessary since VMs are independent once cloned.

**Solution**: The `terraform/vms.tf` file includes a lifecycle rule to ignore certain attributes that don't need management:

```hcl
lifecycle {
  ignore_changes = [
    clone,             # Template name can change without recreating VMs
    startup_shutdown,  # Proxmox sets defaults (-1) that we don't need to manage
  ]
}
```

This means:

- **clone**: Existing VMs are not affected by template name changes; only new VMs use the updated template
- **startup_shutdown**: Proxmox sets default startup order/delay values (-1) that Terraform would otherwise try to remove
- You can safely update `default_template_name` in `terraform/variables.tf` without recreating VMs
- `tofu plan` won't show spurious changes for Proxmox-managed defaults

**When rebuilding the template:**

1. Run `nix develop -c ansible-playbook -i playbooks/inventory.ini playbooks/build-and-deploy-template.yml`
2. Update `default_template_name` in `terraform/variables.tf` if the name changed
3. Run `tofu plan` - should show no VM recreations (only template name in state)
4. Run `tofu apply` - updates state without touching existing VMs
5. New VMs created after this point will use the new template

### Adding a New Host

See [docs/host-creation.md](docs/host-creation.md) for the complete host creation pipeline, including:

- Using the `create-host` script to generate host configurations
- Deploying VMs and secrets with OpenTofu
- Monitoring the bootstrap process via Loki
- Verification and troubleshooting steps

**Note:** DNS A records and Prometheus node-exporter scrape targets are auto-generated from the host's `systemd.network.networks` static IP configuration. No manual zone file or Prometheus config editing is required.
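
As an illustration of the note above, a host's static address might be declared roughly as follows. This is only a sketch using the standard NixOS `systemd.network` options. The unit name `10-lan`, the address `10.69.13.250/24`, and the gateway `10.69.13.1` are placeholders rather than values from this repository, and it assumes the zone and scrape-target generators read the `address` field. The interface name `ens18` and the ns1/ns2 addresses are taken from elsewhere in this document.

```nix
{ ... }:
{
  systemd.network.enable = true;

  # "10-lan" is a placeholder unit name.
  systemd.network.networks."10-lan" = {
    matchConfig.Name = "ens18";

    # Placeholder static address; assumed to be what the generated
    # DNS A record and node-exporter target are derived from.
    address = [ "10.69.13.250/24" ];

    # Placeholder gateway.
    gateway = [ "10.69.13.1" ];

    # ns1 and ns2, as listed under "DNS Architecture".
    dns = [ "10.69.13.5" "10.69.13.6" ];
  };
}
```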
### Important Patterns

**Overlay usage**: Access unstable packages via `pkgs.unstable.` (defined in flake.nix overlay-unstable)

**Service composition**: Services in `/services/` are designed to be imported by multiple hosts. Keep them modular and reusable.

**Hardware configuration reuse**: Multiple hosts share `/hosts/template/hardware-configuration.nix` for VM instances.

**State version**: All hosts use stateVersion `"23.11"` - do not change this on existing hosts.

**Firewall**: Disabled on most hosts (trusted network). Enable selectively in host configuration if needed.

**Shell scripts**: Use `pkgs.writeShellApplication` instead of `pkgs.writeShellScript` or `pkgs.writeShellScriptBin` for creating shell scripts. `writeShellApplication` provides automatic shellcheck validation, sets strict bash options (`set -euo pipefail`), and allows declaring `runtimeInputs` for dependencies. When referencing the executable path (e.g., in `ExecStart`), use `lib.getExe myScript` to get the proper `bin/` path.
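
A minimal sketch of the shell-script pattern described above. The script name `disk-report` and the `disk-report` systemd service are hypothetical and exist only for illustration; they are not part of this repository.

```nix
{ pkgs, lib, ... }:
let
  # Hypothetical script, for illustration only.
  diskReport = pkgs.writeShellApplication {
    name = "disk-report";

    # Packages listed here are added to PATH when the script runs.
    runtimeInputs = [ pkgs.coreutils pkgs.gawk ];

    # No shebang or strict-mode line is needed here;
    # writeShellApplication adds both and runs shellcheck at build time.
    text = ''
      df --output=target,pcent | awk 'NR > 1 { print $1, $2 }'
    '';
  };
in
{
  # Hypothetical one-shot unit that runs the script.
  systemd.services.disk-report = {
    description = "Print per-mount disk usage (example)";
    serviceConfig = {
      Type = "oneshot";
      # lib.getExe resolves to the script's path under bin/ in the store.
      ExecStart = lib.getExe diskReport;
    };
  };
}
```

Because `runtimeInputs` puts `df` and `awk` on the script's `PATH`, the script does not depend on what happens to be installed on the host.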
### Monitoring Stack

All hosts ship metrics and logs to `monitoring01`:

- **Metrics**: Prometheus scrapes node-exporter from all hosts
- **Logs**: Promtail ships logs to Loki on monitoring01
- **Access**: Grafana at monitoring01 for visualization
- **Tracing**: Tempo for distributed tracing
- **Profiling**: Pyroscope for continuous profiling

**Scrape Target Auto-Generation:**

Prometheus scrape targets are automatically generated from host configurations, following the same pattern as DNS zone generation:

- **Node-exporter**: All flake hosts with static IPs are automatically added as node-exporter targets
- **Service targets**: Defined via `homelab.monitoring.scrapeTargets` in service modules
- **External targets**: Non-flake hosts defined in `/services/monitoring/external-targets.nix`
- **Library**: `lib/monitoring.nix` provides `generateNodeExporterTargets` and `generateScrapeConfigs`

Service modules declare their scrape targets directly via `homelab.monitoring.scrapeTargets`. The Prometheus config on monitoring01 auto-generates scrape configs from all hosts. See "Homelab Module Options" section for available options.

To add monitoring targets for non-NixOS hosts, edit `/services/monitoring/external-targets.nix`.
### DNS Architecture

- `ns1` (10.69.13.5) - Primary authoritative DNS + resolver
- `ns2` (10.69.13.6) - Secondary authoritative DNS (AXFR from ns1)
- All hosts point to ns1/ns2 for DNS resolution

**Zone Auto-Generation:**

DNS zone entries are automatically generated from host configurations:

- **Flake-managed hosts**: A records extracted from `systemd.network.networks` static IPs
- **CNAMEs**: Defined via `homelab.dns.cnames` option in host configs
- **External hosts**: Non-flake hosts defined in `/services/ns/external-hosts.nix`
- **Serial number**: Uses `self.sourceInfo.lastModified` (git commit timestamp)

Hosts are automatically excluded from DNS if:

- `homelab.dns.enable = false` (e.g., template hosts)
- No static IP configured (e.g., DHCP-only hosts)
- Network interface is a VPN/tunnel (wg*, tun*, tap*)

To add DNS entries for non-NixOS hosts, edit `/services/ns/external-hosts.nix`.

### Homelab Module Options

The `modules/homelab/` directory defines custom options used across hosts for automation and metadata.

**Host options (`homelab.host.*`):**

- `tier` - Deployment tier: `test` or `prod`. Test-tier hosts can receive remote deployments and have different credential access.
- `priority` - Alerting priority: `high` or `low`. Controls alerting thresholds for the host.
- `role` - Primary role designation (e.g., `dns`, `database`, `bastion`, `vault`)
- `labels` - Free-form key-value metadata for host categorization

**DNS options (`homelab.dns.*`):**

- `enable` (default: `true`) - Include host in DNS zone generation
- `cnames` (default: `[]`) - List of CNAME aliases pointing to this host

**Monitoring options (`homelab.monitoring.*`):**

- `enable` (default: `true`) - Include host in Prometheus node-exporter scrape targets
- `scrapeTargets` (default: `[]`) - Additional scrape targets exposed by this host

**Deploy options (`homelab.deploy.*`):**

- `enable` (default: `false`) - Enable NATS-based remote deployment listener. When enabled, the host listens for deployment commands via NATS and can be targeted by the `homelab-deploy` MCP server.
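
For orientation, the options above might be combined in a host's `configuration.nix` roughly like this. It is a sketch only: the values, the label key `location`, and the alias `example-alias` are hypothetical, and it assumes `labels` is an attribute set of strings and `cnames` a list of strings. `scrapeTargets` is left out because its entry format is not described here.

```nix
{ ... }:
{
  homelab.host = {
    tier = "test";       # "test" or "prod"
    priority = "low";    # "high" or "low"
    role = "dns";        # one of the example roles listed above
    labels = {
      location = "rack1"; # hypothetical free-form label
    };
  };

  # Hypothetical alias; assumes cnames is a list of strings.
  homelab.dns.cnames = [ "example-alias" ];

  # Both default to true; shown only to make the defaults explicit.
  homelab.dns.enable = true;
  homelab.monitoring.enable = true;

  # Opt this host in to NATS-based remote deployments.
  homelab.deploy.enable = true;
}
```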