# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Repository Overview

This is a Nix Flake-based NixOS configuration repository for managing a homelab infrastructure consisting of 16 server configurations. The repository uses a modular architecture with shared system configurations, reusable service modules, and per-host customization.

## Common Commands

### Building Configurations

```bash
# List all available configurations
nix flake show

# Build a specific host configuration locally (without deploying)
nixos-rebuild build --flake .#<hostname>

# Build and check a configuration
nix build .#nixosConfigurations.<hostname>.config.system.build.toplevel
```

**Important:** Do NOT pipe `nix build` commands to other commands like `tail` or `head`. Piping can hide errors and make builds appear successful when they have actually failed. Always run `nix build` without piping to see the full output.

```bash
# BAD - hides errors
nix build .#create-host 2>&1 | tail -20

# GOOD - shows all output and errors
nix build .#create-host
```

### Deployment

Do not automatically deploy changes. Deployments are usually done by updating the master branch and then triggering the auto-update on the specific host.

### Flake Management

```bash
# Check flake for errors
nix flake check
```

Do not run `nix flake update`. Flake updates should only be done manually by the user.

### Development Environment

```bash
# Enter development shell (provides ansible, python3)
nix develop
```

### Secrets Management

Secrets are handled by sops. Do not edit `.sops.yaml` or any file within `secrets/`. Ask the user to make the change if necessary.

### Git Workflow

**Important:** Never commit directly to `master` unless the user explicitly asks for it. Always create a feature branch for changes. When starting a new plan or task, the first step should typically be to create and check out a new branch with an appropriate name (e.g., `git checkout -b dns-automation` or `git checkout -b fix-nginx-config`).

### Plan Management

When creating plans for large features, follow this workflow:

1. When implementation begins, save a copy of the plan to `docs/plans/` (e.g., `docs/plans/feature-name.md`)
2. Once the feature is fully implemented, move the plan to `docs/plans/completed/`

### Git Commit Messages

Commit messages should follow the format: `topic: short description`

Examples:

- `flake: add opentofu to devshell`
- `template2: add proxmox image configuration`
- `terraform: add VM deployment configuration`

### Clipboard

To copy text to the clipboard, pipe to `wl-copy` (Wayland):

```bash
echo "text" | wl-copy
```

### NixOS Options and Packages Lookup

Two MCP servers are available for searching NixOS options and packages:

- **nixpkgs-options** - Search and lookup NixOS configuration option documentation
- **nixpkgs-packages** - Search and lookup Nix packages from nixpkgs

**Session Setup:** At the start of each session, index the nixpkgs revision from `flake.lock` to ensure documentation matches the project's nixpkgs version:

1. Read `flake.lock` and find the `nixpkgs` node's `rev` field
2. Call `index_revision` with that git hash (both servers share the same index)
**Options Tools (nixpkgs-options):**

- `search_options` - Search for options by name or description (e.g., query "nginx" or "postgresql")
- `get_option` - Get full details for a specific option (e.g., `services.loki.configuration`)
- `get_file` - Fetch the source file from nixpkgs that declares an option

**Package Tools (nixpkgs-packages):**

- `search_packages` - Search for packages by name or description (e.g., query "nginx" or "python")
- `get_package` - Get full details for a specific package by attribute path (e.g., `firefox`, `python312Packages.requests`)
- `get_file` - Fetch the source file from nixpkgs that defines a package

This ensures documentation matches the exact nixpkgs version (currently NixOS 25.11) used by this flake.

## Architecture

### Directory Structure

- `/flake.nix` - Central flake defining all NixOS configurations
- `/hosts/<hostname>/` - Per-host configurations
  - `default.nix` - Entry point, imports configuration.nix and services
  - `configuration.nix` - Host-specific settings (networking, hardware, users)
- `/system/` - Shared system-level configurations applied to ALL hosts
  - Core modules: nix.nix, sshd.nix, sops.nix, acme.nix, autoupgrade.nix
  - Monitoring: node-exporter and promtail on every host
- `/modules/` - Custom NixOS modules
  - `homelab/` - Homelab-specific options (DNS automation, monitoring scrape targets)
- `/lib/` - Nix library functions
  - `dns-zone.nix` - DNS zone generation functions
  - `monitoring.nix` - Prometheus scrape target generation functions
- `/services/` - Reusable service modules, selectively imported by hosts
  - `home-assistant/` - Home automation stack
  - `monitoring/` - Observability stack (Prometheus, Grafana, Loki, Tempo)
  - `ns/` - DNS services (authoritative, resolver, zone generation)
  - `http-proxy/`, `ca/`, `postgres/`, `nats/`, `jellyfin/`, etc.
- `/secrets/` - SOPS-encrypted secrets with age encryption
- `/common/` - Shared configurations (e.g., VM guest agent)
- `/docs/` - Documentation and plans
  - `plans/` - Future plans and proposals
  - `plans/completed/` - Completed plans (moved here when done)
- `/playbooks/` - Ansible playbooks for fleet management
- `/.sops.yaml` - SOPS configuration with age keys for all servers
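To make the layout concrete, here is a minimal sketch of what a host entry point might look like, assuming a hypothetical host named `example01` that pulls in its host-specific settings and one service module (real hosts differ in which services they import):

```nix
# hosts/example01/default.nix -- hypothetical host, for illustration only
{ ... }:
{
  imports = [
    ./configuration.nix        # host-specific settings (networking, hardware, users)
    ../../services/monitoring  # a selectively imported service module
  ];
}
```

How the shared system configuration layers on top of this is shown in the next section.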
### Configuration Inheritance

Each host follows this import pattern:

```
hosts/<hostname>/default.nix
└─> configuration.nix (host-specific)
    ├─> ../../system (ALL shared system configs - applied to every host)
    ├─> ../../services/ (selective service imports)
    └─> ../../common/vm (if VM)
```

All hosts automatically get:

- Nix binary cache (nix-cache.home.2rjus.net)
- SSH with root login enabled
- SOPS secrets management with auto-generated age keys
- Internal ACME CA integration (ca.home.2rjus.net)
- Daily auto-upgrades with auto-reboot
- Prometheus node-exporter + Promtail (logs to monitoring01)
- Monitoring scrape target auto-registration via `homelab.monitoring` options
- Custom root CA trust
- DNS zone auto-registration via `homelab.dns` options

### Active Hosts

Production servers managed by `rebuild-all.sh`:

- `ns1`, `ns2` - Primary/secondary DNS servers (10.69.13.5/6)
- `ca` - Internal Certificate Authority
- `ha1` - Home Assistant + Zigbee2MQTT + Mosquitto
- `http-proxy` - Reverse proxy
- `monitoring01` - Full observability stack (Prometheus, Grafana, Loki, Tempo, Pyroscope)
- `jelly01` - Jellyfin media server
- `nix-cache01` - Binary cache server
- `pgdb1` - PostgreSQL database
- `nats1` - NATS messaging server
- `auth01` - Authentication service

Template/test hosts:

- `template1` - Base template for cloning new hosts

### Flake Inputs

- `nixpkgs` - NixOS 25.11 stable (primary)
- `nixpkgs-unstable` - Unstable channel (available via overlay as `pkgs.unstable.<package>`)
- `sops-nix` - Secrets management
- Custom packages from git.t-juice.club:
  - `alerttonotify` - Alert routing
  - `labmon` - Lab monitoring

### Network Architecture

- Domain: `home.2rjus.net`
- Infrastructure subnet: `10.69.13.x`
- DNS: ns1/ns2 provide authoritative DNS with a primary-secondary setup
- Internal CA for ACME certificates (no Let's Encrypt)
- Centralized monitoring at monitoring01
- Static networking via systemd-networkd

### Secrets Management

- Uses SOPS with age encryption
- Each server has a unique age key in `.sops.yaml`
- Keys auto-generated at `/var/lib/sops-nix/key.txt` on first boot
- Shared secrets: `/secrets/secrets.yaml`
- Per-host secrets: `/secrets/<hostname>/`
- All production servers can decrypt shared secrets; host-specific secrets require specific host keys

### Auto-Upgrade System

All hosts pull updates daily from:

```
git+https://git.t-juice.club/torjus/nixos-servers.git
```

Configured in `/system/autoupgrade.nix`:

- Random delay to avoid simultaneous upgrades
- Auto-reboot after successful upgrade
- Systemd service: `nixos-upgrade.service`
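As a rough sketch of what `/system/autoupgrade.nix` might contain, assuming it uses the standard `system.autoUpgrade` module (the specific values below are illustrative, not copied from the repository):

```nix
# Illustrative sketch only -- assumes the stock system.autoUpgrade module is used.
{ ... }:
{
  system.autoUpgrade = {
    enable = true;
    flake = "git+https://git.t-juice.club/torjus/nixos-servers.git";
    dates = "daily";                # pull and rebuild once per day
    randomizedDelaySec = "45min";   # random delay to avoid simultaneous upgrades (value illustrative)
    allowReboot = true;             # auto-reboot after a successful upgrade
  };
}
```

If so, this is the module that provides the `nixos-upgrade.service` unit mentioned above; triggering that unit on a host after merging to master is how deployments are normally rolled out.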
### Proxmox VM Provisioning with OpenTofu

The repository includes automated workflows for building Proxmox VM templates and deploying VMs using OpenTofu (Terraform).

#### Building and Deploying Templates

Template VMs are built from `hosts/template2` and deployed to Proxmox using Ansible:

```bash
# Build NixOS image and deploy to Proxmox as template
nix develop -c ansible-playbook -i playbooks/inventory.ini playbooks/build-and-deploy-template.yml
```

This playbook:

1. Builds the Proxmox image using `nixos-rebuild build-image --image-variant proxmox`
2. Uploads the `.vma.zst` image to Proxmox at `/var/lib/vz/dump`
3. Restores it as VM ID 9000
4. Converts it to a template

Template configuration (`hosts/template2`):

- Minimal base system with essential packages (age, vim, wget, git)
- Cloud-init configured for NoCloud datasource (no EC2 metadata timeout)
- DHCP networking on ens18
- SSH key-based root login
- `prepare-host.sh` script for cleaning machine-id, SSH keys, and regenerating age keys

#### Deploying VMs with OpenTofu

VMs are deployed from templates using OpenTofu in the `/terraform` directory:

```bash
cd terraform
tofu init    # First time only
tofu apply   # Deploy VMs
```

Configuration files:

- `main.tf` - Proxmox provider configuration
- `variables.tf` - Provider variables (API credentials)
- `vm.tf` - VM resource definitions
- `terraform.tfvars` - Actual credentials (gitignored)

An example VM deployment includes:

- Clone from template VM
- Cloud-init configuration (SSH keys, network, DNS)
- Custom CPU/memory/disk sizing
- VLAN tagging
- QEMU guest agent

OpenTofu outputs the VM's IP address after deployment for easy SSH access.

#### Template Rebuilding and Terraform State

When the Proxmox template is rebuilt (via `build-and-deploy-template.yml`), the template name may change. This would normally cause Terraform to want to recreate all existing VMs, but that is unnecessary since VMs are independent once cloned.

**Solution**: The `terraform/vms.tf` file includes a lifecycle rule to ignore certain attributes that don't need management:

```hcl
lifecycle {
  ignore_changes = [
    clone,             # Template name can change without recreating VMs
    startup_shutdown,  # Proxmox sets defaults (-1) that we don't need to manage
  ]
}
```

This means:

- **clone**: Existing VMs are not affected by template name changes; only new VMs use the updated template
- **startup_shutdown**: Proxmox sets default startup order/delay values (-1) that Terraform would otherwise try to remove
- You can safely update `default_template_name` in `terraform/variables.tf` without recreating VMs
- `tofu plan` won't show spurious changes for Proxmox-managed defaults

**When rebuilding the template:**

1. Run `nix develop -c ansible-playbook -i playbooks/inventory.ini playbooks/build-and-deploy-template.yml`
2. Update `default_template_name` in `terraform/variables.tf` if the name changed
3. Run `tofu plan` - it should show no VM recreations (only the template name in state)
4. Run `tofu apply` - this updates state without touching existing VMs
5. New VMs created after this point will use the new template

### Adding a New Host

1. Create a `/hosts/<hostname>/` directory
2. Copy the structure from `template1` or a similar host
3. Add a host entry to `flake.nix` nixosConfigurations
4. Configure networking in `configuration.nix` (static IP via `systemd.network.networks`, DNS servers); see the sketch after this list
5. (Optional) Add `homelab.dns.cnames` if the host needs CNAME aliases
6. The user clones the template host
7. The user runs `prepare-host.sh` on the new host; this deletes files that should be regenerated (SSH host keys, machine-id, etc.), creates a new age key, and prints the public key
8. Add this public key to `.sops.yaml`
9. Create `/secrets/<hostname>/` if needed
10. Commit the changes and merge to master
11. Deploy by running `nixos-rebuild boot --flake URL#<hostname>` on the host
12. Run the auto-upgrade on the DNS servers (ns1, ns2) to pick up the new host's DNS entry

**Note:** DNS A records and Prometheus node-exporter scrape targets are auto-generated from the host's `systemd.network.networks` static IP configuration. No manual zone file or Prometheus config editing is required.
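As a rough illustration of step 4 and the note above, the static addressing in a new host's `configuration.nix` might look like the following; the host name, interface, addresses, and alias are hypothetical, and the exact attribute layout used in this repository may differ:

```nix
# Hypothetical host "example01" at 10.69.13.42 -- all values are illustrative.
{ ... }:
{
  systemd.network.networks."10-lan" = {
    matchConfig.Name = "ens18";
    address = [ "10.69.13.42/24" ];
    gateway = [ "10.69.13.1" ];            # assumed gateway, not taken from the repository
    dns = [ "10.69.13.5" "10.69.13.6" ];   # ns1 / ns2
  };

  # Optional CNAME aliases picked up by DNS zone auto-generation (step 5).
  homelab.dns.cnames = [ "example" ];
}
```

With a static address like this in place, the A record and node-exporter scrape target for the host are generated automatically, as described in the note above.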
### Important Patterns

**Overlay usage**: Access unstable packages via `pkgs.unstable.<package>` (defined in the flake.nix overlay-unstable)

**Service composition**: Services in `/services/` are designed to be imported by multiple hosts. Keep them modular and reusable.

**Hardware configuration reuse**: Multiple hosts share `/hosts/template/hardware-configuration.nix` for VM instances.

**State version**: All hosts use stateVersion `"23.11"` - do not change this on existing hosts.

**Firewall**: Disabled on most hosts (trusted network). Enable selectively in the host configuration if needed.

### Monitoring Stack

All hosts ship metrics and logs to `monitoring01`:

- **Metrics**: Prometheus scrapes node-exporter from all hosts
- **Logs**: Promtail ships logs to Loki on monitoring01
- **Access**: Grafana at monitoring01 for visualization
- **Tracing**: Tempo for distributed tracing
- **Profiling**: Pyroscope for continuous profiling

**Scrape Target Auto-Generation:** Prometheus scrape targets are automatically generated from host configurations, following the same pattern as DNS zone generation:

- **Node-exporter**: All flake hosts with static IPs are automatically added as node-exporter targets
- **Service targets**: Defined via `homelab.monitoring.scrapeTargets` in service modules
- **External targets**: Non-flake hosts defined in `/services/monitoring/external-targets.nix`
- **Library**: `lib/monitoring.nix` provides `generateNodeExporterTargets` and `generateScrapeConfigs`

Host monitoring options (`homelab.monitoring.*`):

- `enable` (default: `true`) - Include the host in Prometheus node-exporter scrape targets
- `scrapeTargets` (default: `[]`) - Additional scrape targets exposed by this host (job_name, port, metrics_path, scheme, scrape_interval, honor_labels)

Service modules declare their scrape targets directly (e.g., `services/ca/default.nix` declares step-ca on port 9000). The Prometheus config on monitoring01 auto-generates scrape configs from all hosts. To add monitoring targets for non-NixOS hosts, edit `/services/monitoring/external-targets.nix`.

### DNS Architecture

- `ns1` (10.69.13.5) - Primary authoritative DNS + resolver
- `ns2` (10.69.13.6) - Secondary authoritative DNS (AXFR from ns1)
- All hosts point to ns1/ns2 for DNS resolution

**Zone Auto-Generation:** DNS zone entries are automatically generated from host configurations:

- **Flake-managed hosts**: A records extracted from `systemd.network.networks` static IPs
- **CNAMEs**: Defined via the `homelab.dns.cnames` option in host configs
- **External hosts**: Non-flake hosts defined in `/services/ns/external-hosts.nix`
- **Serial number**: Uses `self.sourceInfo.lastModified` (git commit timestamp)

Host DNS options (`homelab.dns.*`):

- `enable` (default: `true`) - Include the host in DNS zone generation
- `cnames` (default: `[]`) - List of CNAME aliases pointing to this host

Hosts are automatically excluded from DNS if:

- `homelab.dns.enable = false` (e.g., template hosts)
- No static IP is configured (e.g., DHCP-only hosts)
- The network interface is a VPN/tunnel (wg*, tun*, tap*)

To add DNS entries for non-NixOS hosts, edit `/services/ns/external-hosts.nix`.
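As a final illustration, here is a hedged sketch of how a service module might declare an extra scrape target and how a template host might opt out of zone generation, using the `homelab.*` options described above. The attribute names mirror the field list in the monitoring section, but the exact option types defined in `/modules/homelab` may differ:

```nix
# Hedged sketch -- the exact option shapes in /modules/homelab may differ.
{ ... }:
{
  # In a service module: advertise an additional scrape target on this host
  # (job name and port follow the step-ca example mentioned above).
  homelab.monitoring.scrapeTargets = [
    {
      job_name = "step-ca";
      port = 9000;
      metrics_path = "/metrics";   # illustrative; a default may apply
      scheme = "https";
    }
  ];

  # In a template or DHCP-only host: exclude the host from DNS zone generation.
  homelab.dns.enable = false;
}
```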