# CLAUDE.md
This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
## Repository Overview
This is a Nix Flake-based NixOS configuration repository for managing a homelab infrastructure consisting of 16 server configurations. The repository uses a modular architecture with shared system configurations, reusable service modules, and per-host customization.
## Common Commands

### Building Configurations

```bash
# List all available configurations
nix flake show

# Build a specific host configuration locally (without deploying)
nixos-rebuild build --flake .#<hostname>

# Build and check a configuration
nix build .#nixosConfigurations.<hostname>.config.system.build.toplevel
```
**Important:** Do NOT pipe `nix build` commands to other commands like `tail` or `head`. Piping can hide errors and make builds appear successful when they actually failed. Always run `nix build` without piping to see the full output.
```bash
# BAD - hides errors
nix build .#create-host 2>&1 | tail -20

# GOOD - shows all output and errors
nix build .#create-host
```
### Deployment

Do not automatically deploy changes. Deployments are usually done by updating the master branch and then triggering the auto-update on the specific host.
### Flake Management

```bash
# Check flake for errors
nix flake check
```

Do not run `nix flake update`; it should only be done manually by the user.
### Development Environment

```bash
# Enter development shell (provides ansible, python3)
nix develop
```
## Secrets Management

Secrets are managed by OpenBao (Vault) using AppRole authentication. Most hosts use the `vault.secrets` option defined in `system/vault-secrets.nix` to fetch secrets at boot. Terraform manages the secrets and AppRole policies in `terraform/vault/`.

Legacy sops-nix is still present but only actively used by the `ca` host. Do not edit `.sops.yaml` or any file within `secrets/`; ask the user to make changes there if necessary.
## Git Workflow

**Important:** Never commit directly to master unless the user explicitly asks for it. Always create a feature branch for changes.

When starting a new plan or task, the first step should typically be to create and check out a new branch with an appropriate name (e.g., `git checkout -b dns-automation` or `git checkout -b fix-nginx-config`).
## Plan Management

When creating plans for large features, follow this workflow:

- When implementation begins, save a copy of the plan to `docs/plans/` (e.g., `docs/plans/feature-name.md`)
- Once the feature is fully implemented, move the plan to `docs/plans/completed/`
## Git Commit Messages

Commit messages should follow the format: `topic: short description`

Examples:

- `flake: add opentofu to devshell`
- `template2: add proxmox image configuration`
- `terraform: add VM deployment configuration`
## Clipboard

To copy text to the clipboard, pipe to `wl-copy` (Wayland):

```bash
echo "text" | wl-copy
```
## NixOS Options and Packages Lookup

Two MCP servers are available for searching NixOS options and packages:

- `nixpkgs-options` - Search and look up NixOS configuration option documentation
- `nixpkgs-packages` - Search and look up Nix packages from nixpkgs

**Session Setup:** At the start of each session, index the nixpkgs revision from flake.lock to ensure documentation matches the project's nixpkgs version:

- Read `flake.lock` and find the `nixpkgs` node's `rev` field
- Call `index_revision` with that git hash (both servers share the same index)

Options Tools (`nixpkgs-options`):

- `search_options` - Search for options by name or description (e.g., query "nginx" or "postgresql")
- `get_option` - Get full details for a specific option (e.g., `services.loki.configuration`)
- `get_file` - Fetch the source file from nixpkgs that declares an option

Package Tools (`nixpkgs-packages`):

- `search_packages` - Search for packages by name or description (e.g., query "nginx" or "python")
- `get_package` - Get full details for a specific package by attribute path (e.g., `firefox`, `python312Packages.requests`)
- `get_file` - Fetch the source file from nixpkgs that defines a package
This ensures documentation matches the exact nixpkgs version (currently NixOS 25.11) used by this flake.
## Lab Monitoring Log Queries

The lab-monitoring MCP server can query logs from Loki. All hosts ship systemd journal logs via Promtail.

Loki Label Reference:

- `host` - Hostname (e.g., `ns1`, `ns2`, `monitoring01`, `ha1`). Use this label, not `hostname`.
- `systemd_unit` - Systemd unit name (e.g., `nsd.service`, `prometheus.service`, `nixos-upgrade.service`)
- `job` - Either `systemd-journal` (most logs) or `varlog` (file-based logs like caddy access logs)
- `filename` - For the `varlog` job, the log file path (e.g., `/var/log/caddy/nix-cache.log`)

Journal log entries are JSON-formatted, with the actual log message in the `MESSAGE` field. Other useful fields include `PRIORITY` and `SYSLOG_IDENTIFIER`.
Example LogQL queries:
```logql
# Logs from a specific service on a host
{host="ns2", systemd_unit="nsd.service"}

# Substring match on log content
{host="ns1", systemd_unit="nsd.service"} |= "error"

# File-based logs (e.g., caddy access logs)
{job="varlog", host="nix-cache01"}
```

Default lookback is 1 hour. Use the `start` parameter with relative durations (e.g., `24h`, `168h`) for older logs.
## Lab Monitoring Prometheus Queries

The lab-monitoring MCP server can query Prometheus metrics via PromQL. The `instance` label uses the FQDN format `<host>.home.2rjus.net:<port>`.

Prometheus Job Names:

- `node-exporter` - System metrics from all hosts (CPU, memory, disk, network)
- `caddy` - Reverse proxy metrics (http-proxy)
- `nix-cache_caddy` - Nix binary cache metrics
- `home-assistant` - Home automation metrics
- `jellyfin` - Media server metrics
- `loki` / `prometheus` / `grafana` - Monitoring stack self-metrics
- `step-ca` - Internal CA metrics
- `pve-exporter` - Proxmox hypervisor metrics
- `smartctl` - Disk SMART health (gunter)
- `wireguard` - VPN metrics (http-proxy)
- `pushgateway` - Push-based metrics (e.g., backup results)
- `restic_rest` - Backup server metrics
- `labmon` / `ghettoptt` / `alertmanager` - Other service metrics
Example PromQL queries:
```promql
# Check all targets are up
up

# CPU usage for a specific host
rate(node_cpu_seconds_total{instance=~"ns1.*", mode!="idle"}[5m])

# Memory usage across all hosts
node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes

# Disk space
node_filesystem_avail_bytes{mountpoint="/"}
```
## Architecture

### Directory Structure

- `/flake.nix` - Central flake defining all NixOS configurations
- `/hosts/<hostname>/` - Per-host configurations
  - `default.nix` - Entry point, imports configuration.nix and services
  - `configuration.nix` - Host-specific settings (networking, hardware, users)
- `/system/` - Shared system-level configurations applied to ALL hosts
  - Core modules: nix.nix, sshd.nix, sops.nix (legacy), vault-secrets.nix, acme.nix, autoupgrade.nix
  - Monitoring: node-exporter and promtail on every host
- `/modules/` - Custom NixOS modules
  - `homelab/` - Homelab-specific options (DNS automation, monitoring scrape targets)
- `/lib/` - Nix library functions
  - `dns-zone.nix` - DNS zone generation functions
  - `monitoring.nix` - Prometheus scrape target generation functions
- `/services/` - Reusable service modules, selectively imported by hosts
  - `home-assistant/` - Home automation stack
  - `monitoring/` - Observability stack (Prometheus, Grafana, Loki, Tempo)
  - `ns/` - DNS services (authoritative, resolver, zone generation)
  - `http-proxy/`, `ca/`, `postgres/`, `nats/`, `jellyfin/`, etc.
- `/secrets/` - SOPS-encrypted secrets with age encryption (legacy, only used by ca)
- `/common/` - Shared configurations (e.g., VM guest agent)
- `/docs/` - Documentation and plans
  - `plans/` - Future plans and proposals
  - `plans/completed/` - Completed plans (moved here when done)
- `/playbooks/` - Ansible playbooks for fleet management
- `/.sops.yaml` - SOPS configuration with age keys (legacy, only used by ca)
### Configuration Inheritance

Each host follows this import pattern:

```
hosts/<hostname>/default.nix
└─> configuration.nix (host-specific)
    ├─> ../../system (ALL shared system configs - applied to every host)
    ├─> ../../services/<service> (selective service imports)
    └─> ../../common/vm (if VM)
```
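
As a rough illustration of this pattern (not copied from any real host; the specific service import and the hardware file are examples only), a host's `configuration.nix` might wire it up like:

```nix
# Hypothetical excerpt of hosts/<hostname>/configuration.nix; actual imports vary per host
{
  imports = [
    ../../system              # shared system configs applied to every host
    ../../services/monitoring # example of a selective service import
    ../../common/vm           # VM guest agent settings (VMs only)
  ];
}
```
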
All hosts automatically get:
- Nix binary cache (nix-cache.home.2rjus.net)
- SSH with root login enabled
- OpenBao (Vault) secrets management via AppRole
- Internal ACME CA integration (ca.home.2rjus.net)
- Daily auto-upgrades with auto-reboot
- Prometheus node-exporter + Promtail (logs to monitoring01)
- Monitoring scrape target auto-registration via `homelab.monitoring` options
- Custom root CA trust
- DNS zone auto-registration via `homelab.dns` options
### Active Hosts

Production servers managed by `rebuild-all.sh`:

- `ns1`, `ns2` - Primary/secondary DNS servers (10.69.13.5/6)
- `ca` - Internal Certificate Authority
- `ha1` - Home Assistant + Zigbee2MQTT + Mosquitto
- `http-proxy` - Reverse proxy
- `monitoring01` - Full observability stack (Prometheus, Grafana, Loki, Tempo, Pyroscope)
- `jelly01` - Jellyfin media server
- `nix-cache01` - Binary cache server
- `pgdb1` - PostgreSQL database
- `nats1` - NATS messaging server

Template/test hosts:

- `template1` - Base template for cloning new hosts

### Flake Inputs

- `nixpkgs` - NixOS 25.11 stable (primary)
- `nixpkgs-unstable` - Unstable channel (available via overlay as `pkgs.unstable.<package>`)
- `sops-nix` - Secrets management (legacy, only used by ca)
- Custom packages from git.t-juice.club:
  - `alerttonotify` - Alert routing
  - `labmon` - Lab monitoring
### Network Architecture

- Domain: `home.2rjus.net`
- Infrastructure subnet: `10.69.13.x`
- DNS: ns1/ns2 provide authoritative DNS with a primary-secondary setup
- Internal CA for ACME certificates (no Let's Encrypt)
- Centralized monitoring at monitoring01
- Static networking via systemd-networkd
### Secrets Management

Most hosts use OpenBao (Vault) for secrets:

- Vault server at `vault01.home.2rjus.net:8200`
- AppRole authentication with credentials at `/var/lib/vault/approle/`
- Secrets defined in Terraform (`terraform/vault/secrets.tf`)
- AppRole policies in Terraform (`terraform/vault/approle.tf`)
- NixOS module: `system/vault-secrets.nix` with `vault.secrets.<name>` options (see the sketch after this list)
  - The `extractKey` option extracts a single key from the vault JSON as a plain file
- Secrets fetched at boot by `vault-secret-<name>.service` systemd units
- Fallback to cached secrets in `/var/lib/vault/cache/` when Vault is unreachable
- Provision AppRole credentials: `nix develop -c ansible-playbook playbooks/provision-approle.yml -e hostname=<host>`
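
A minimal sketch of what a host-side secret declaration might look like, assuming the option shape described above (the secret name and key are hypothetical, and the exact sub-options may differ from what `system/vault-secrets.nix` actually defines):

```nix
# Hypothetical usage of the vault-secrets module; check system/vault-secrets.nix for the real option set
{
  vault.enable = true;

  vault.secrets."grafana-admin" = {
    # extractKey pulls a single key out of the Vault JSON response and writes it as a plain file
    extractKey = "password";  # hypothetical key name
  };
}
```
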
Legacy SOPS (only used by the `ca` host):

- SOPS with age encryption, keys in `.sops.yaml`
- Shared secrets: `/secrets/secrets.yaml`
- Per-host secrets: `/secrets/<hostname>/`
### Auto-Upgrade System

All hosts pull updates daily from `git+https://git.t-juice.club/torjus/nixos-servers.git`.

Configured in `/system/autoupgrade.nix` (see the sketch below):

- Random delay to avoid simultaneous upgrades
- Auto-reboot after successful upgrade
- Systemd service: `nixos-upgrade.service`
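
A sketch of roughly what `/system/autoupgrade.nix` configures, using the standard NixOS `system.autoUpgrade` options; the concrete values (e.g., the randomized delay) are assumptions, not copied from the actual file:

```nix
# Approximation of /system/autoupgrade.nix using standard NixOS options; values are illustrative
{
  system.autoUpgrade = {
    enable = true;
    flake = "git+https://git.t-juice.club/torjus/nixos-servers.git";
    dates = "daily";
    randomizedDelaySec = "45min";  # source only says "random delay"; exact value assumed
    allowReboot = true;            # auto-reboot after successful upgrade
  };
}
```
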
## Proxmox VM Provisioning with OpenTofu

The repository includes automated workflows for building Proxmox VM templates and deploying VMs using OpenTofu (Terraform).

### Building and Deploying Templates

Template VMs are built from `hosts/template2` and deployed to Proxmox using Ansible:

```bash
# Build NixOS image and deploy to Proxmox as template
nix develop -c ansible-playbook -i playbooks/inventory.ini playbooks/build-and-deploy-template.yml
```
This playbook:

- Builds the Proxmox image using `nixos-rebuild build-image --image-variant proxmox`
- Uploads the `.vma.zst` image to Proxmox at `/var/lib/vz/dump`
- Restores it as VM ID 9000
- Converts it to a template

Template configuration (`hosts/template2`):

- Minimal base system with essential packages (age, vim, wget, git)
- Cloud-init configured for NoCloud datasource (no EC2 metadata timeout)
- DHCP networking on ens18
- SSH key-based root login
- `prepare-host.sh` script for cleaning machine-id, SSH keys, and regenerating age keys
### Deploying VMs with OpenTofu

VMs are deployed from templates using OpenTofu in the `/terraform` directory:

```bash
cd terraform
tofu init   # First time only
tofu apply  # Deploy VMs
```

Configuration files:

- `main.tf` - Proxmox provider configuration
- `variables.tf` - Provider variables (API credentials)
- `vm.tf` - VM resource definitions
- `terraform.tfvars` - Actual credentials (gitignored)
Example VM deployment includes:
- Clone from template VM
- Cloud-init configuration (SSH keys, network, DNS)
- Custom CPU/memory/disk sizing
- VLAN tagging
- QEMU guest agent
OpenTofu outputs the VM's IP address after deployment for easy SSH access.
### Template Rebuilding and Terraform State

When the Proxmox template is rebuilt (via `build-and-deploy-template.yml`), the template name may change. This would normally cause Terraform to want to recreate all existing VMs, but that is unnecessary since VMs are independent once cloned.

Solution: the `terraform/vms.tf` file includes a lifecycle rule to ignore certain attributes that don't need management:

```hcl
lifecycle {
  ignore_changes = [
    clone,            # Template name can change without recreating VMs
    startup_shutdown, # Proxmox sets defaults (-1) that we don't need to manage
  ]
}
```
This means:

- `clone`: Existing VMs are not affected by template name changes; only new VMs use the updated template
- `startup_shutdown`: Proxmox sets default startup order/delay values (-1) that Terraform would otherwise try to remove
- You can safely update `default_template_name` in `terraform/variables.tf` without recreating VMs
- `tofu plan` won't show spurious changes for Proxmox-managed defaults
When rebuilding the template:

- Run `nix develop -c ansible-playbook -i playbooks/inventory.ini playbooks/build-and-deploy-template.yml`
- Update `default_template_name` in `terraform/variables.tf` if the name changed
- Run `tofu plan` - it should show no VM recreations (only the template name in state)
- Run `tofu apply` - updates state without touching existing VMs
- New VMs created after this point will use the new template
## Adding a New Host

- Create the `/hosts/<hostname>/` directory
- Copy structure from `template1` or a similar host
- Add a host entry to `flake.nix` `nixosConfigurations`
- Configure networking in `configuration.nix` (static IP via `systemd.network.networks`, DNS servers)
- (Optional) Add `homelab.dns.cnames` if the host needs CNAME aliases
- Add `vault.enable = true;` to the host configuration
- Add an AppRole policy in `terraform/vault/approle.tf` and any secrets in `secrets.tf`
- Run `tofu apply` in `terraform/vault/`
- User clones the template host
- User runs `prepare-host.sh` on the new host
- Provision AppRole credentials: `nix develop -c ansible-playbook playbooks/provision-approle.yml -e hostname=<host>`
- Commit changes and merge to master
- Deploy by running `nixos-rebuild boot --flake URL#<hostname>` on the host
- Run auto-upgrade on the DNS servers (ns1, ns2) to pick up the new host's DNS entry

Note: DNS A records and Prometheus node-exporter scrape targets are auto-generated from the host's `systemd.network.networks` static IP configuration. No manual zone file or Prometheus config editing is required. A sketch of a typical host excerpt follows.
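
For illustration only (hostname, addresses, interface name, and alias are invented; the real host files may lay this out differently):

```nix
# Hypothetical excerpt of a new host's configuration.nix; all concrete values are examples
{
  networking.hostName = "newhost01";

  # Static IP - the DNS A record and node-exporter scrape target are derived from this
  systemd.network.networks."10-lan" = {
    matchConfig.Name = "ens18";            # interface name assumed (matches the template's interface)
    address = [ "10.69.13.99/24" ];        # example address on the infrastructure subnet
    gateway = [ "10.69.13.1" ];            # assumed gateway
    dns = [ "10.69.13.5" "10.69.13.6" ];   # ns1/ns2
  };

  # Optional CNAME alias and Vault integration
  homelab.dns.cnames = [ "newservice" ];   # hypothetical alias
  vault.enable = true;
}
```
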
## Important Patterns

**Overlay usage:** Access unstable packages via `pkgs.unstable.<package>` (defined in the `overlay-unstable` overlay in flake.nix)
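
For example (the package is chosen arbitrarily to illustrate the pattern):

```nix
# Pull a single package from the unstable overlay; opentofu is just an illustrative choice
{ pkgs, ... }:
{
  environment.systemPackages = [ pkgs.unstable.opentofu ];
}
```
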
**Service composition:** Services in `/services/` are designed to be imported by multiple hosts. Keep them modular and reusable.

**Hardware configuration reuse:** Multiple hosts share `/hosts/template/hardware-configuration.nix` for VM instances.

**State version:** All hosts use `stateVersion "23.11"` - do not change this on existing hosts.

**Firewall:** Disabled on most hosts (trusted network). Enable selectively in host configuration if needed.
## Monitoring Stack
All hosts ship metrics and logs to monitoring01:
- Metrics: Prometheus scrapes node-exporter from all hosts
- Logs: Promtail ships logs to Loki on monitoring01
- Access: Grafana at monitoring01 for visualization
- Tracing: Tempo for distributed tracing
- Profiling: Pyroscope for continuous profiling
Scrape Target Auto-Generation:
Prometheus scrape targets are automatically generated from host configurations, following the same pattern as DNS zone generation:
- Node-exporter: All flake hosts with static IPs are automatically added as node-exporter targets
- Service targets: Defined via `homelab.monitoring.scrapeTargets` in service modules
- External targets: Non-flake hosts defined in `/services/monitoring/external-targets.nix`
- Library: `lib/monitoring.nix` provides `generateNodeExporterTargets` and `generateScrapeConfigs`

Host monitoring options (`homelab.monitoring.*`):

- `enable` (default: `true`) - Include host in Prometheus node-exporter scrape targets
- `scrapeTargets` (default: `[]`) - Additional scrape targets exposed by this host (job_name, port, metrics_path, scheme, scrape_interval, honor_labels)
Service modules declare their scrape targets directly (e.g., `services/ca/default.nix` declares step-ca on port 9000). The Prometheus config on monitoring01 auto-generates scrape configs from all hosts.
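
A sketch of what such a declaration might look like, based on the option fields listed above; the exact attribute names and types accepted by `homelab.monitoring.scrapeTargets` should be checked against the module in `modules/homelab/`:

```nix
# Hypothetical scrape target declaration in a service module; field names follow the option list above
{
  homelab.monitoring.scrapeTargets = [
    {
      job_name = "step-ca";  # the ca service exposes step-ca metrics on port 9000
      port = 9000;
      # metrics_path, scheme, scrape_interval and honor_labels can also be set
    }
  ];
}
```
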
To add monitoring targets for non-NixOS hosts, edit `/services/monitoring/external-targets.nix`.
## DNS Architecture

- `ns1` (10.69.13.5) - Primary authoritative DNS + resolver
- `ns2` (10.69.13.6) - Secondary authoritative DNS (AXFR from ns1)
- All hosts point to ns1/ns2 for DNS resolution
Zone Auto-Generation:
DNS zone entries are automatically generated from host configurations:
- Flake-managed hosts: A records extracted from `systemd.network.networks` static IPs
- CNAMEs: Defined via the `homelab.dns.cnames` option in host configs
- External hosts: Non-flake hosts defined in `/services/ns/external-hosts.nix`
- Serial number: Uses `self.sourceInfo.lastModified` (git commit timestamp)

Host DNS options (`homelab.dns.*`):

- `enable` (default: `true`) - Include host in DNS zone generation
- `cnames` (default: `[]`) - List of CNAME aliases pointing to this host
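
A minimal example of these options in a host config; the alias is illustrative and `enable` is shown only for clarity since it defaults to `true`:

```nix
# Example homelab.dns usage (alias name is hypothetical)
{
  homelab.dns = {
    enable = true;
    cnames = [ "grafana" ];  # CNAME alias pointing at this host
  };
}
```
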
Hosts are automatically excluded from DNS if:

- `homelab.dns.enable = false` (e.g., template hosts)
- No static IP is configured (e.g., DHCP-only hosts)
- The network interface is a VPN/tunnel (wg*, tun*, tap*)
To add DNS entries for non-NixOS hosts, edit `/services/ns/external-hosts.nix`.