- Switch vmalert from blackhole mode to sending alerts to local
Alertmanager
- Import alerttonotify service so alerts route to NATS notifications
- Move alertmanager and grafana CNAMEs from http-proxy to monitoring02
- Add monitoring CNAME to monitoring02
- Add Caddy reverse proxy entries for alertmanager and grafana
- Remove prometheus, alertmanager, and grafana Caddy entries from
http-proxy (now served directly by monitoring02)
- Move monitoring02 Vault AppRole to hosts-generated.tf with
extra_policies support and prometheus-metrics policy
- Update Promtail to use authenticated loki.home.2rjus.net endpoint
only (remove unauthenticated monitoring01 client)
- Update pipe-to-loki and bootstrap to use loki.home.2rjus.net with
basic auth from Vault secret
- Move migration plan to completed
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add standalone Loki service module (services/loki/) with same config as
monitoring01 and import it on monitoring02. Update Grafana Loki datasource
to localhost. Defer Tempo and Pyroscope migration (not actively used).
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add metrics.home.2rjus.net and vmalert.home.2rjus.net CNAMEs with
Caddy TLS termination via internal ACME CA.
Refactors Grafana's Caddy config from configFile to globalConfig +
virtualHosts so both modules can contribute routes to the same
Caddy instance.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Set up the core metrics stack on monitoring02 as Phase 2 of the
monitoring migration. VictoriaMetrics replaces Prometheus with
identical scrape configs (22 jobs including auto-generated targets).
- VictoriaMetrics with 3-month retention and all scrape configs
- vmalert evaluating existing rules.yml (notifier disabled)
- Alertmanager with same routing config (no alerts during parallel op)
- Grafana datasources updated: local VictoriaMetrics as default
- Static user override for credential file access (OpenBao, Apiary)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Update all LogQL examples, agent instructions, and scripts to use
the hostname label instead of host, matching the Prometheus label
naming convention. Also update pipe-to-loki and bootstrap scripts
to push hostname instead of host.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Configure Garage object storage on garage01 with S3 API, Vault secrets
for RPC secret and admin token, and Caddy reverse proxy for HTTPS access
at s3.home.2rjus.net via internal ACME CA. Includes flake entry, VM
definition, and Vault policy for the host.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add a systemd timer that triggers builds for all hosts every 2 hours
via NATS, keeping the binary cache warm.
- Add scheduler.nix with timer (every 2h) and oneshot service
- Add scheduler NATS user to DEPLOY account
- Add Vault secret and variable for scheduler NKey
- Increase nix-cache02 memory from 16GB to 20GB
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Prevents lock conflicts when multiple backup jobs targeting the same
repository run concurrently. Jobs will now retry acquiring the lock
every 10 seconds for up to 5 minutes before failing.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Removed:
- hosts/nix-cache01/ directory
- services/nix-cache/build-flakes.{nix,sh} (replaced by NATS builder)
- Vault secret and AppRole for nix-cache01
- Old signing key variable from terraform
- Old trusted public key from system/nix.nix
Updated:
- flake.nix: removed nixosConfiguration
- README.md: nix-cache01 -> nix-cache02
- Monitoring rules: removed build-flakes alerts, updated harmonia to nix-cache02
- Simplified proxy.nix (no longer needs hostname conditional)
nix-cache02 is now the sole binary cache host.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Move nix-cache CNAME from nix-cache01 to nix-cache02
- Remove actions1 CNAME (service removed)
- Update proxy.nix to serve canonical domain on nix-cache02
- Promote nix-cache02 to prod tier with build-host role
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Parameterize harmonia.nix to use hostname-based Vault paths
- Add nix-cache services to nix-cache02
- Add Vault secret and variable for nix-cache02 signing key
- Add nix-cache02 public key to trusted-public-keys on all hosts
- Update plan doc to remove actions runner references
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The actions runner on nix-cache01 was never actively used.
Removing it before migrating to nix-cache02.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Configure builder to build nixos-servers and nixos (gunter) repos
- Add builder NKey to Vault secrets
- Update NATS permissions for builder, test-deployer, and admin-deployer
- Grant nix-cache02 access to shared homelab-deploy secrets
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
New build host to replace nix-cache01 with:
- 8 CPU cores, 16GB RAM, 200GB disk
- Static IP 10.69.13.25
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Move playbooks/ to ansible/playbooks/
- Add dynamic inventory script that extracts hosts from flake
- Groups by tier (tier_test, tier_prod) and role (role_dns, etc.)
- Reads homelab.host.* options for metadata
- Add static inventory for non-flake hosts (Proxmox)
- Add ansible.cfg with inventory path and SSH optimizations
- Add group_vars/all.yml for common variables
- Add restart-service.yml playbook for restarting systemd services
- Update provision-approle.yml with single-host safeguard
- Add ANSIBLE_CONFIG to devshell for automatic inventory discovery
- Add ansible = "false" label to template2 to exclude from inventory
- Update CLAUDE.md to reference ansible/README.md for details
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Assign roles to hosts for better organization and filtering:
- ha1: home-automation
- monitoring01, monitoring02: monitoring
- jelly01: media
- nats1: messaging
- http-proxy: proxy
- testvm01-03: test
Also promote kanidm01 and monitoring02 from test to prod tier.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Deploy Grafana test instance on monitoring02 with:
- Kanidm OIDC authentication (admins -> Admin role, others -> Viewer)
- PKCE enabled for secure OAuth2 flow (required by Kanidm)
- Declarative datasources for Prometheus and Loki on monitoring01
- Local Caddy for TLS termination via internal ACME CA
- DNS CNAME grafana-test.home.2rjus.net
Terraform changes add OAuth2 client secret and AppRole policies for
kanidm01 and monitoring02.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
New test-tier host for monitoring stack expansion with:
- Static IP 10.69.13.24
- 4 CPU cores, 4GB RAM, 20GB disk
- Vault integration and NATS-based deployment enabled
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add homelab.kanidm.enable option for central authentication via Kanidm.
The module configures:
- PAM/NSS integration with kanidm-unixd
- Client connection to auth.home.2rjus.net
- Login authorization for ssh-users group
Enable on testvm01-03 for testing.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Provides compressed swap in RAM to prevent OOM kills during
nixos-rebuild on low-memory VMs (2GB). Removes duplicate zram
configs from jelly01 and nix-cache01.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Enable Linux audit to log execve syscalls from interactive SSH sessions.
Uses auid filter to exclude system services and nix builds.
Logs forwarded to journald for Loki ingestion. Query with:
{host="testvmXX"} |= "EXECVE"
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- New test-tier VM at 10.69.13.23 with role=auth
- Kanidm 1.8 server with HTTPS (443) and LDAPS (636)
- ACME certificate from internal CA (auth.home.2rjus.net)
- Provisioned groups: admins, users, ssh-users
- Provisioned user: torjus
- Daily backups at 22:00 (7 versions)
- Prometheus monitoring scrape target
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
New VMs bootstrapped from template2 will now use the local nix cache
during initial nixos-rebuild, speeding up bootstrap times.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Old VM had incorrect hardware-configuration.nix with hardcoded UUIDs
that didn't match actual disk layout, causing boot failure (emergency mode).
Recreated using template2-based configuration for OpenTofu provisioning.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Remove pgdb1 host configuration and postgres service module.
The only consumer (Open WebUI on gunter) has migrated to local PostgreSQL.
Removed:
- hosts/pgdb1/ - host configuration
- services/postgres/ - service module (only used by pgdb1)
- postgres_rules from monitoring rules
- rebuild-all.sh (obsolete script)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Remove hosts/template/ (legacy template1) and give each legacy host
its own hardware-configuration.nix copy
- Recreate ns2 using create-host with template2 base
- Add secondary DNS services (NSD + Unbound resolver)
- Configure Vault policy for shared DNS secrets
- Fix create-host IP uniqueness validator to check CIDR notation
(prevents false positives from DNS resolver entries)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
All secrets are now managed by OpenBao (Vault). Remove the legacy
sops-nix infrastructure that is no longer in use.
Removed:
- sops-nix flake input
- system/sops.nix module
- .sops.yaml configuration file
- Age key generation from template prepare-host scripts
Updated:
- flake.nix - removed sops-nix references from all hosts
- flake.lock - removed sops-nix input
- scripts/create-host/ - removed sops references
- CLAUDE.md - removed SOPS documentation
Note: secrets/ directory should be manually removed by the user.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Remove the step-ca host and labmon flake input now that ACME has been
migrated to OpenBao PKI.
Removed:
- hosts/ca/ - step-ca host configuration
- services/ca/ - step-ca service module
- labmon flake input and module (no longer used)
Updated:
- flake.nix - removed ca host and labmon references
- flake.lock - removed labmon input
- rebuild-all.sh - removed ca from host list
- CLAUDE.md - updated documentation
Note: secrets/ca/ should be manually removed by the user.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Set up a simple nginx server with an ACME certificate from the new
OpenBao PKI infrastructure. This allows testing the ACME migration
before deploying to production hosts.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Switch all ACME certificate issuance from step-ca (ca.home.2rjus.net)
to OpenBao PKI (vault.home.2rjus.net:8200/v1/pki_int/acme/directory).
- Update default ACME server in system/acme.nix
- Update Caddy acme_ca in http-proxy and nix-cache services
- Remove labmon service from monitoring01 (step-ca monitoring)
- Remove labmon scrape target and certificate_rules alerts
- Remove alloy.nix (only used for labmon profiling)
- Add docs/plans/cert-monitoring.md for future cert monitoring needs
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Enable vault.enable and homelab.deploy.enable on vault01 so it can
receive NATS-based remote deployments. Vault fetches secrets from
itself using AppRole after auto-unseal.
Add systemd ordering to ensure vault-secret services wait for openbao
to be unsealed before attempting to fetch secrets.
Also adds vault01 AppRole entry to Terraform.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Enable vault.enable and homelab.deploy.enable for these hosts to
allow NATS-based remote deployments and expose metrics on port 9972.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Adds log_to_loki function that pushes structured log entries to Loki
at key bootstrap stages (starting, network_ok, vault_*, building,
success, failed). Enables querying bootstrap state via LogQL without
console access.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
TTY output was causing nixos-rebuild to fail. Keep the custom
greeting line to indicate bootstrap image, but use journal+console
for reliable logging.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Display bootstrap banner and live progress on tty1 instead of login prompt
- Add custom getty greeting on other ttys indicating this is a bootstrap image
- Disable getty on tty1 during bootstrap so output is visible
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add vault.enable = true to testvm01, testvm02, testvm03
- Add homelab.deploy.enable = true for remote deployment via NATS
- Update create-host template to include these by default
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Three permanent test hosts for validating deployment and bootstrapping
workflow. Each host configured with:
- Static IP (10.69.13.20-22/24)
- Vault AppRole integration
- Bootstrap from deploy-test-hosts branch
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add homelab.deploy.enable option (requires vault.enable)
- Create shared homelab-deploy Vault policy for all hosts
- Enable homelab.deploy on all vault-enabled hosts
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add system/homelab-deploy.nix module that automatically enables the
listener on all hosts with vault.enable=true. Uses homelab.host.tier
and homelab.host.role for NATS subject subscriptions.
- Add homelab-deploy access to all host AppRole policies
- Remove manual listener config from vaulttest01 (now handled by system module)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Ensure homelab-deploy-listener waits for the NKey secret to be
fetched from Vault before starting.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add homelab-deploy flake input and NixOS module for message-based
deployments across the fleet. Configure DEPLOY account in NATS with
tiered access control (listener, test-deployer, admin-deployer).
Enable listener on vaulttest01 as initial test host.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>