Add a systemd timer that triggers builds for all hosts every 2 hours
via NATS, keeping the binary cache warm.
- Add scheduler.nix with timer (every 2h) and oneshot service
- Add scheduler NATS user to DEPLOY account
- Add Vault secret and variable for scheduler NKey
- Increase nix-cache02 memory from 16GB to 20GB
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Phase 4 fully complete. nix-cache01 has been:
- Removed from repo (host config, build scripts, flake entry)
- Vault resources cleaned up
- VM deleted from Proxmox
nix-cache02 is now the sole binary cache host.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Phase 4 now in progress
- Harmonia configured on nix-cache02 with new signing key
- Trusted public key deployed to all hosts
- Cache tested successfully from testvm01
- Actions runner removed from scope
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Parameterize harmonia.nix to use hostname-based Vault paths
- Add nix-cache services to nix-cache02
- Add Vault secret and variable for nix-cache02 signing key
- Add nix-cache02 public key to trusted-public-keys on all hosts
- Update plan doc to remove actions runner references
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add Phase 6 for alerting and Grafana dashboards
- Document available Prometheus metrics
- Include example alerting rules for build failures
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Mark Phase 1 (new build host) and Phase 2 (NATS build triggering) complete
- Document nix-cache02 configuration and tested build times
- Add remaining work for Harmonia, Actions runner, and DNS cutover
- Enable --enable-builds flag in MCP config
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Document findings from false positive host_reboot alert caused by
NTP clock adjustment affecting node_boot_time_seconds metric.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- auth-system-replacement.md: Mark OAuth2 client (Grafana) as completed,
document key findings (PKCE, attribute paths, user requirements)
- monitoring-migration-victoriametrics.md: Note Grafana deployment on
monitoring02 with Kanidm OIDC as test instance
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- CLI workflows for creating users and groups
- Troubleshooting guide (nscd, cache invalidation)
- Home directory behavior (UUID-based with symlinks)
- Update auth-system-replacement plan with progress
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add homelab.kanidm.enable option for central authentication via Kanidm.
The module configures:
- PAM/NSS integration with kanidm-unixd
- Client connection to auth.home.2rjus.net
- Login authorization for ssh-users group
Enable on testvm01-03 for testing.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Based on security review findings, covering SSH hardening, firewall
enablement, log transport TLS, security alerting, and secrets management.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Document the end-to-end host creation workflow including:
- Prerequisites and step-by-step process
- Tier specification (test vs prod)
- Bootstrap observability via Loki
- Verification steps
- Troubleshooting guide
- Related files reference
Update CLAUDE.md to reference the new document.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Plan for migrating from Prometheus to VictoriaMetrics on new monitoring02
host with parallel operation, declarative Grafana dashboards, and CNAME-based
cutover.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Mark completed implementation steps
- Document deployed kanidm01 configuration
- Record UID/GID range decision (65,536-69,999)
- Add verified working items (WebUI, LDAP, certs)
- Update next steps and resolved questions
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Old VM had incorrect hardware-configuration.nix with hardcoded UUIDs
that didn't match actual disk layout, causing boot failure (emergency mode).
Recreated using template2-based configuration for OpenTofu provisioning.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Remove pgdb1 host configuration and postgres service module.
The only consumer (Open WebUI on gunter) has migrated to local PostgreSQL.
Removed:
- hosts/pgdb1/ - host configuration
- services/postgres/ - service module (only used by pgdb1)
- postgres_rules from monitoring rules
- rebuild-all.sh (obsolete script)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Only consumer was Open WebUI on gunter, which will migrate to local
PostgreSQL. Removed pgdb1 backup/migration phases and added to
decommission list.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- ns2 migrated to OpenTofu
- testvm02, testvm03 added to managed hosts
- Remove vaulttest01 (no longer exists)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Bootstrap times can be improved by configuring the base template
to use the local nix cache during initial builds.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Remove secrets/ directory (sops-nix no longer in use, all hosts use Vault)
- Move TODO.md to docs/plans/completed/automated-host-deployment-pipeline.md
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Switch all ACME certificate issuance from step-ca (ca.home.2rjus.net)
to OpenBao PKI (vault.home.2rjus.net:8200/v1/pki_int/acme/directory).
- Update default ACME server in system/acme.nix
- Update Caddy acme_ca in http-proxy and nix-cache services
- Remove labmon service from monitoring01 (step-ca monitoring)
- Remove labmon scrape target and certificate_rules alerts
- Remove alloy.nix (only used for labmon profiling)
- Add docs/plans/cert-monitoring.md for future cert monitoring needs
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Extract homelab.host metadata (tier, priority, role, labels) from host
configurations and propagate them to Prometheus scrape targets. This
enables semantic alert filtering using labels instead of hardcoded
instance names.
Changes:
- lib/monitoring.nix: Extract host metadata, group targets by labels
- prometheus.nix: Use structured static_configs with labels
- rules.yml: Replace instance filters with role-based filters
Example labels in Prometheus:
- ns1/ns2: role=dns, dns_role=primary/secondary
- nix-cache01: role=build-host
- testvm*: tier=test
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Move nats-deploy-service.md to completed/ folder
- Update prometheus-scrape-target-labels.md with implementation status
- Add status table showing which steps are complete/partial/not started
- Update cross-references to point to new location
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Compare VictoriaMetrics and Thanos as options for extending
metrics retention beyond 30 days while managing disk usage.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add a shared `homelab.host` module that provides host metadata for
multiple consumers:
- tier: deployment tier (test/prod) for future homelab-deploy service
- priority: alerting priority (high/low) for Prometheus label filtering
- role: primary role of the host (dns, database, monitoring, etc.)
- labels: free-form labels for additional metadata
Host configurations updated with appropriate values:
- ns1, ns2: role=dns with dns_role labels
- nix-cache01: priority=low, role=build-host
- vault01: role=vault
- jump: role=bastion
- template, template2, testvm01, vaulttest01: tier=test, priority=low
The module is now imported via commonModules in flake.nix, making it
available to all hosts including minimal configurations like template2.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
MCP exposes two tools:
- deploy: test-tier only, always available
- deploy_admin: all tiers, requires --enable-admin flag
Three security layers: CLI flag, NATS authz, Claude Code permissions.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Support deploying to all hosts in a tier or all hosts with a role:
- deploy.<tier>.all - broadcast to all hosts in tier
- deploy.<tier>.role.<role> - broadcast to hosts with matching role
MCP can deploy to all test hosts at once, admin can deploy to any group.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add plan for NATS-based deployment service (homelab-deploy) that enables
on-demand NixOS configuration updates via messaging. Features tiered
permissions (test/prod) enforced at NATS layer.
Update prometheus-scrape-target-labels plan to share the homelab.host
module for host metadata (tier, priority, role, labels) - single source
of truth for both deployment tiers and prometheus labels.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>