# CLAUDE.md
This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
## Repository Overview
This is a Nix Flake-based NixOS configuration repository for managing a homelab infrastructure consisting of 16 server configurations. The repository uses a modular architecture with shared system configurations, reusable service modules, and per-host customization.
## Common Commands

### Building Configurations

```bash
# List all available configurations
nix flake show

# Build a specific host configuration locally (without deploying)
nixos-rebuild build --flake .#<hostname>

# Build and check a configuration
nix build .#nixosConfigurations.<hostname>.config.system.build.toplevel
```
**Important:** Do NOT pipe `nix build` commands to other commands like `tail` or `head`. Piping can hide errors and make builds appear successful when they actually failed. Always run `nix build` without piping to see the full output.
```bash
# BAD - hides errors
nix build .#create-host 2>&1 | tail -20

# GOOD - shows all output and errors
nix build .#create-host
```
### Deployment

Do not automatically deploy changes. Deployments are usually done by updating the master branch and then triggering the auto-update on the specific host.
### Flake Management

```bash
# Check flake for errors
nix flake check
```

Do not run `nix flake update`; it should only be done manually by the user.
### Development Environment

```bash
# Enter development shell (provides ansible, python3)
nix develop
```
## Secrets Management

Secrets are managed by OpenBao (Vault) using AppRole authentication. Most hosts use the `vault.secrets` option defined in `system/vault-secrets.nix` to fetch secrets at boot. Terraform manages the secrets and AppRole policies in `terraform/vault/`.

Legacy sops-nix is still present but only actively used by the `ca` host. Do not edit `.sops.yaml` or any file within `secrets/`; ask the user to make changes there if necessary.
## Git Workflow

**Important:** Never commit directly to master unless the user explicitly asks for it. Always create a feature branch for changes.

When starting a new plan or task, the first step should typically be to create and check out a new branch with an appropriate name (e.g., `git checkout -b dns-automation` or `git checkout -b fix-nginx-config`).
## Plan Management

When creating plans for large features, follow this workflow:

- When implementation begins, save a copy of the plan to `docs/plans/` (e.g., `docs/plans/feature-name.md`)
- Once the feature is fully implemented, move the plan to `docs/plans/completed/`
## Git Commit Messages

Commit messages should follow the format: `topic: short description`

Examples:

- `flake: add opentofu to devshell`
- `template2: add proxmox image configuration`
- `terraform: add VM deployment configuration`
## Clipboard

To copy text to the clipboard, pipe to `wl-copy` (Wayland):

```bash
echo "text" | wl-copy
```
## NixOS Options and Packages Lookup

Two MCP servers are available for searching NixOS options and packages:

- `nixpkgs-options` - Search and look up NixOS configuration option documentation
- `nixpkgs-packages` - Search and look up Nix packages from nixpkgs

**Session Setup:** At the start of each session, index the nixpkgs revision from flake.lock to ensure documentation matches the project's nixpkgs version:

- Read `flake.lock` and find the `nixpkgs` node's `rev` field
- Call `index_revision` with that git hash (both servers share the same index)

Options Tools (`nixpkgs-options`):

- `search_options` - Search for options by name or description (e.g., query "nginx" or "postgresql")
- `get_option` - Get full details for a specific option (e.g., `services.loki.configuration`)
- `get_file` - Fetch the source file from nixpkgs that declares an option

Package Tools (`nixpkgs-packages`):

- `search_packages` - Search for packages by name or description (e.g., query "nginx" or "python")
- `get_package` - Get full details for a specific package by attribute path (e.g., `firefox`, `python312Packages.requests`)
- `get_file` - Fetch the source file from nixpkgs that defines a package
This ensures documentation matches the exact nixpkgs version (currently NixOS 25.11) used by this flake.
## Lab Monitoring Log Queries

The lab-monitoring MCP server can query logs from Loki. All hosts ship systemd journal logs via Promtail.

Loki Label Reference:

- `host` - Hostname (e.g., `ns1`, `ns2`, `monitoring01`, `ha1`). Use this label, not `hostname`.
- `systemd_unit` - Systemd unit name (e.g., `nsd.service`, `prometheus.service`, `nixos-upgrade.service`)
- `job` - Either `systemd-journal` (most logs) or `varlog` (file-based logs like caddy access logs)
- `filename` - For the `varlog` job, the log file path (e.g., `/var/log/caddy/nix-cache.log`)

Journal log entries are JSON-formatted, with the actual log message in the `MESSAGE` field. Other useful fields include `PRIORITY` and `SYSLOG_IDENTIFIER`.
Example LogQL queries:
```logql
# Logs from a specific service on a host
{host="ns2", systemd_unit="nsd.service"}

# Substring match on log content
{host="ns1", systemd_unit="nsd.service"} |= "error"

# File-based logs (e.g., caddy access logs)
{job="varlog", host="nix-cache01"}
```

Default lookback is 1 hour. Use the `start` parameter with relative durations (e.g., `24h`, `168h`) for older logs.
## Lab Monitoring Prometheus Queries

The lab-monitoring MCP server can query Prometheus metrics via PromQL. The `instance` label uses the FQDN format `<host>.home.2rjus.net:<port>`.

Prometheus Job Names:

- `node-exporter` - System metrics from all hosts (CPU, memory, disk, network)
- `caddy` - Reverse proxy metrics (http-proxy)
- `nix-cache_caddy` - Nix binary cache metrics
- `home-assistant` - Home automation metrics
- `jellyfin` - Media server metrics
- `loki` / `prometheus` / `grafana` - Monitoring stack self-metrics
- `step-ca` - Internal CA metrics
- `pve-exporter` - Proxmox hypervisor metrics
- `smartctl` - Disk SMART health (gunter)
- `wireguard` - VPN metrics (http-proxy)
- `pushgateway` - Push-based metrics (e.g., backup results)
- `restic_rest` - Backup server metrics
- `labmon` / `ghettoptt` / `alertmanager` - Other service metrics
Example PromQL queries:
```promql
# Check all targets are up
up

# CPU usage for a specific host
rate(node_cpu_seconds_total{instance=~"ns1.*", mode!="idle"}[5m])

# Memory usage across all hosts
node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes

# Disk space
node_filesystem_avail_bytes{mountpoint="/"}
```
## Architecture

### Directory Structure

- `/flake.nix` - Central flake defining all NixOS configurations
- `/hosts/<hostname>/` - Per-host configurations
  - `default.nix` - Entry point, imports configuration.nix and services
  - `configuration.nix` - Host-specific settings (networking, hardware, users)
- `/system/` - Shared system-level configurations applied to ALL hosts
  - Core modules: nix.nix, sshd.nix, sops.nix (legacy), vault-secrets.nix, acme.nix, autoupgrade.nix
  - Monitoring: node-exporter and promtail on every host
- `/modules/` - Custom NixOS modules
  - `homelab/` - Homelab-specific options (DNS automation, monitoring scrape targets)
- `/lib/` - Nix library functions
  - `dns-zone.nix` - DNS zone generation functions
  - `monitoring.nix` - Prometheus scrape target generation functions
- `/services/` - Reusable service modules, selectively imported by hosts
  - `home-assistant/` - Home automation stack
  - `monitoring/` - Observability stack (Prometheus, Grafana, Loki, Tempo)
  - `ns/` - DNS services (authoritative, resolver, zone generation)
  - `http-proxy/`, `ca/`, `postgres/`, `nats/`, `jellyfin/`, etc.
- `/secrets/` - SOPS-encrypted secrets with age encryption (legacy, only used by ca)
- `/common/` - Shared configurations (e.g., VM guest agent)
- `/docs/` - Documentation and plans
  - `plans/` - Future plans and proposals
  - `plans/completed/` - Completed plans (moved here when done)
- `/playbooks/` - Ansible playbooks for fleet management
- `/.sops.yaml` - SOPS configuration with age keys (legacy, only used by ca)
### Configuration Inheritance

Each host follows this import pattern:

```
hosts/<hostname>/default.nix
└─> configuration.nix (host-specific)
    ├─> ../../system (ALL shared system configs - applied to every host)
    ├─> ../../services/<service> (selective service imports)
    └─> ../../common/vm (if VM)
```
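
As a rough illustration of this pattern (not copied from any real host; the specific service import and the hardware file are examples only), a host's `configuration.nix` might wire it up like:

```nix
# Hypothetical excerpt of hosts/<hostname>/configuration.nix; actual imports vary per host
{
  imports = [
    ../../system              # shared system configs applied to every host
    ../../services/monitoring # example of a selective service import
    ../../common/vm           # VM guest agent settings (VMs only)
  ];
}
```
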
All hosts automatically get:
- Nix binary cache (nix-cache.home.2rjus.net)
- SSH with root login enabled
- OpenBao (Vault) secrets management via AppRole
- Internal ACME CA integration (ca.home.2rjus.net)
- Daily auto-upgrades with auto-reboot
- Prometheus node-exporter + Promtail (logs to monitoring01)
- Monitoring scrape target auto-registration via `homelab.monitoring` options
- Custom root CA trust
- DNS zone auto-registration via `homelab.dns` options
### Active Hosts

Production servers managed by `rebuild-all.sh`:

- `ns1`, `ns2` - Primary/secondary DNS servers (10.69.13.5/6)
- `ca` - Internal Certificate Authority
- `ha1` - Home Assistant + Zigbee2MQTT + Mosquitto
- `http-proxy` - Reverse proxy
- `monitoring01` - Full observability stack (Prometheus, Grafana, Loki, Tempo, Pyroscope)
- `jelly01` - Jellyfin media server
- `nix-cache01` - Binary cache server
- `pgdb1` - PostgreSQL database
- `nats1` - NATS messaging server

Template/test hosts:

- `template1` - Base template for cloning new hosts

### Flake Inputs

- `nixpkgs` - NixOS 25.11 stable (primary)
- `nixpkgs-unstable` - Unstable channel (available via overlay as `pkgs.unstable.<package>`)
- `sops-nix` - Secrets management (legacy, only used by ca)
- Custom packages from git.t-juice.club:
  - `alerttonotify` - Alert routing
  - `labmon` - Lab monitoring
### Network Architecture

- Domain: `home.2rjus.net`
- Infrastructure subnet: `10.69.13.x`
- DNS: ns1/ns2 provide authoritative DNS with a primary-secondary setup
- Internal CA for ACME certificates (no Let's Encrypt)
- Centralized monitoring at monitoring01
- Static networking via systemd-networkd
### Secrets Management

Most hosts use OpenBao (Vault) for secrets:

- Vault server at `vault01.home.2rjus.net:8200`
- AppRole authentication with credentials at `/var/lib/vault/approle/`
- Secrets defined in Terraform (`terraform/vault/secrets.tf`)
- AppRole policies in Terraform (`terraform/vault/approle.tf`)
- NixOS module: `system/vault-secrets.nix` with `vault.secrets.<name>` options (see the sketch after this list)
  - The `extractKey` option extracts a single key from the vault JSON as a plain file
- Secrets fetched at boot by `vault-secret-<name>.service` systemd units
- Fallback to cached secrets in `/var/lib/vault/cache/` when Vault is unreachable
- Provision AppRole credentials: `nix develop -c ansible-playbook playbooks/provision-approle.yml -e hostname=<host>`
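
A minimal sketch of what a host-side secret declaration might look like, assuming the option shape described above (the secret name and key are hypothetical, and the exact sub-options may differ from what `system/vault-secrets.nix` actually defines):

```nix
# Hypothetical usage of the vault-secrets module; check system/vault-secrets.nix for the real option set
{
  vault.enable = true;

  vault.secrets."grafana-admin" = {
    # extractKey pulls a single key out of the Vault JSON response and writes it as a plain file
    extractKey = "password";  # hypothetical key name
  };
}
```
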
Legacy SOPS (only used by the `ca` host):

- SOPS with age encryption, keys in `.sops.yaml`
- Shared secrets: `/secrets/secrets.yaml`
- Per-host secrets: `/secrets/<hostname>/`
### Auto-Upgrade System

All hosts pull updates daily from `git+https://git.t-juice.club/torjus/nixos-servers.git`.

Configured in `/system/autoupgrade.nix` (see the sketch below):

- Random delay to avoid simultaneous upgrades
- Auto-reboot after successful upgrade
- Systemd service: `nixos-upgrade.service`
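
A sketch of roughly what `/system/autoupgrade.nix` configures, using the standard NixOS `system.autoUpgrade` options; the concrete values (e.g., the randomized delay) are assumptions, not copied from the actual file:

```nix
# Approximation of /system/autoupgrade.nix using standard NixOS options; values are illustrative
{
  system.autoUpgrade = {
    enable = true;
    flake = "git+https://git.t-juice.club/torjus/nixos-servers.git";
    dates = "daily";
    randomizedDelaySec = "45min";  # source only says "random delay"; exact value assumed
    allowReboot = true;            # auto-reboot after successful upgrade
  };
}
```
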
## Proxmox VM Provisioning with OpenTofu

The repository includes automated workflows for building Proxmox VM templates and deploying VMs using OpenTofu (Terraform).

### Building and Deploying Templates

Template VMs are built from `hosts/template2` and deployed to Proxmox using Ansible:

```bash
# Build NixOS image and deploy to Proxmox as template
nix develop -c ansible-playbook -i playbooks/inventory.ini playbooks/build-and-deploy-template.yml
```
This playbook:

- Builds the Proxmox image using `nixos-rebuild build-image --image-variant proxmox`
- Uploads the `.vma.zst` image to Proxmox at `/var/lib/vz/dump`
- Restores it as VM ID 9000
- Converts it to a template

Template configuration (`hosts/template2`):

- Minimal base system with essential packages (age, vim, wget, git)
- Cloud-init configured for NoCloud datasource (no EC2 metadata timeout)
- DHCP networking on ens18
- SSH key-based root login
- `prepare-host.sh` script for cleaning machine-id, SSH keys, and regenerating age keys
### Deploying VMs with OpenTofu

VMs are deployed from templates using OpenTofu in the `/terraform` directory:

```bash
cd terraform
tofu init   # First time only
tofu apply  # Deploy VMs
```

Configuration files:

- `main.tf` - Proxmox provider configuration
- `variables.tf` - Provider variables (API credentials)
- `vm.tf` - VM resource definitions
- `terraform.tfvars` - Actual credentials (gitignored)
Example VM deployment includes:
- Clone from template VM
- Cloud-init configuration (SSH keys, network, DNS)
- Custom CPU/memory/disk sizing
- VLAN tagging
- QEMU guest agent
OpenTofu outputs the VM's IP address after deployment for easy SSH access.
### Template Rebuilding and Terraform State

When the Proxmox template is rebuilt (via `build-and-deploy-template.yml`), the template name may change. This would normally cause Terraform to want to recreate all existing VMs, but that is unnecessary since VMs are independent once cloned.

Solution: the `terraform/vms.tf` file includes a lifecycle rule to ignore certain attributes that don't need management:

```hcl
lifecycle {
  ignore_changes = [
    clone,            # Template name can change without recreating VMs
    startup_shutdown, # Proxmox sets defaults (-1) that we don't need to manage
  ]
}
```
This means:

- `clone`: Existing VMs are not affected by template name changes; only new VMs use the updated template
- `startup_shutdown`: Proxmox sets default startup order/delay values (-1) that Terraform would otherwise try to remove
- You can safely update `default_template_name` in `terraform/variables.tf` without recreating VMs
- `tofu plan` won't show spurious changes for Proxmox-managed defaults
When rebuilding the template:

- Run `nix develop -c ansible-playbook -i playbooks/inventory.ini playbooks/build-and-deploy-template.yml`
- Update `default_template_name` in `terraform/variables.tf` if the name changed
- Run `tofu plan` - it should show no VM recreations (only the template name in state)
- Run `tofu apply` - updates state without touching existing VMs
- New VMs created after this point will use the new template
## Adding a New Host

- Create the `/hosts/<hostname>/` directory
- Copy structure from `template1` or a similar host
- Add a host entry to `flake.nix` `nixosConfigurations`
- Configure networking in `configuration.nix` (static IP via `systemd.network.networks`, DNS servers)
- (Optional) Add `homelab.dns.cnames` if the host needs CNAME aliases
- Add `vault.enable = true;` to the host configuration
- Add an AppRole policy in `terraform/vault/approle.tf` and any secrets in `secrets.tf`
- Run `tofu apply` in `terraform/vault/`
- User clones the template host
- User runs `prepare-host.sh` on the new host
- Provision AppRole credentials: `nix develop -c ansible-playbook playbooks/provision-approle.yml -e hostname=<host>`
- Commit changes and merge to master
- Deploy by running `nixos-rebuild boot --flake URL#<hostname>` on the host
- Run auto-upgrade on the DNS servers (ns1, ns2) to pick up the new host's DNS entry

Note: DNS A records and Prometheus node-exporter scrape targets are auto-generated from the host's `systemd.network.networks` static IP configuration. No manual zone file or Prometheus config editing is required. A sketch of a typical host excerpt follows.
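
For illustration only (hostname, addresses, interface name, and alias are invented; the real host files may lay this out differently):

```nix
# Hypothetical excerpt of a new host's configuration.nix; all concrete values are examples
{
  networking.hostName = "newhost01";

  # Static IP - the DNS A record and node-exporter scrape target are derived from this
  systemd.network.networks."10-lan" = {
    matchConfig.Name = "ens18";            # interface name assumed (matches the template's interface)
    address = [ "10.69.13.99/24" ];        # example address on the infrastructure subnet
    gateway = [ "10.69.13.1" ];            # assumed gateway
    dns = [ "10.69.13.5" "10.69.13.6" ];   # ns1/ns2
  };

  # Optional CNAME alias and Vault integration
  homelab.dns.cnames = [ "newservice" ];   # hypothetical alias
  vault.enable = true;
}
```
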
## Important Patterns

**Overlay usage:** Access unstable packages via `pkgs.unstable.<package>` (defined in the `overlay-unstable` overlay in flake.nix)
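
For example (the package is chosen arbitrarily to illustrate the pattern):

```nix
# Pull a single package from the unstable overlay; opentofu is just an illustrative choice
{ pkgs, ... }:
{
  environment.systemPackages = [ pkgs.unstable.opentofu ];
}
```
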
**Service composition:** Services in `/services/` are designed to be imported by multiple hosts. Keep them modular and reusable.

**Hardware configuration reuse:** Multiple hosts share `/hosts/template/hardware-configuration.nix` for VM instances.

**State version:** All hosts use `stateVersion "23.11"` - do not change this on existing hosts.

**Firewall:** Disabled on most hosts (trusted network). Enable selectively in host configuration if needed.
## Monitoring Stack
All hosts ship metrics and logs to monitoring01:
- Metrics: Prometheus scrapes node-exporter from all hosts
- Logs: Promtail ships logs to Loki on monitoring01
- Access: Grafana at monitoring01 for visualization
- Tracing: Tempo for distributed tracing
- Profiling: Pyroscope for continuous profiling
Scrape Target Auto-Generation:
Prometheus scrape targets are automatically generated from host configurations, following the same pattern as DNS zone generation:
- Node-exporter: All flake hosts with static IPs are automatically added as node-exporter targets
- Service targets: Defined via `homelab.monitoring.scrapeTargets` in service modules
- External targets: Non-flake hosts defined in `/services/monitoring/external-targets.nix`
- Library: `lib/monitoring.nix` provides `generateNodeExporterTargets` and `generateScrapeConfigs`

Host monitoring options (`homelab.monitoring.*`):

- `enable` (default: `true`) - Include host in Prometheus node-exporter scrape targets
- `scrapeTargets` (default: `[]`) - Additional scrape targets exposed by this host (job_name, port, metrics_path, scheme, scrape_interval, honor_labels)
Service modules declare their scrape targets directly (e.g., `services/ca/default.nix` declares step-ca on port 9000). The Prometheus config on monitoring01 auto-generates scrape configs from all hosts.
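
A sketch of what such a declaration might look like, based on the option fields listed above; the exact attribute names and types accepted by `homelab.monitoring.scrapeTargets` should be checked against the module in `modules/homelab/`:

```nix
# Hypothetical scrape target declaration in a service module; field names follow the option list above
{
  homelab.monitoring.scrapeTargets = [
    {
      job_name = "step-ca";  # the ca service exposes step-ca metrics on port 9000
      port = 9000;
      # metrics_path, scheme, scrape_interval and honor_labels can also be set
    }
  ];
}
```
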
To add monitoring targets for non-NixOS hosts, edit `/services/monitoring/external-targets.nix`.
## DNS Architecture

- `ns1` (10.69.13.5) - Primary authoritative DNS + resolver
- `ns2` (10.69.13.6) - Secondary authoritative DNS (AXFR from ns1)
- All hosts point to ns1/ns2 for DNS resolution
Zone Auto-Generation:
DNS zone entries are automatically generated from host configurations:
- Flake-managed hosts: A records extracted from `systemd.network.networks` static IPs
- CNAMEs: Defined via the `homelab.dns.cnames` option in host configs
- External hosts: Non-flake hosts defined in `/services/ns/external-hosts.nix`
- Serial number: Uses `self.sourceInfo.lastModified` (git commit timestamp)

Host DNS options (`homelab.dns.*`):

- `enable` (default: `true`) - Include host in DNS zone generation
- `cnames` (default: `[]`) - List of CNAME aliases pointing to this host
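
A minimal example of these options in a host config; the alias is illustrative and `enable` is shown only for clarity since it defaults to `true`:

```nix
# Example homelab.dns usage (alias name is hypothetical)
{
  homelab.dns = {
    enable = true;
    cnames = [ "grafana" ];  # CNAME alias pointing at this host
  };
}
```
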
Hosts are automatically excluded from DNS if:

- `homelab.dns.enable = false` (e.g., template hosts)
- No static IP is configured (e.g., DHCP-only hosts)
- The network interface is a VPN/tunnel (wg*, tun*, tap*)
To add DNS entries for non-NixOS hosts, edit `/services/ns/external-hosts.nix`.