Commit Graph

16 Commits

Author SHA1 Message Date
fa4a418007 restic: add --retry-lock=5m to all backup jobs
Some checks failed
Run nix flake check / flake-check (push) Failing after 23m42s
Prevents lock conflicts when multiple backup jobs targeting the same
repository run concurrently. Jobs will now retry acquiring the lock
every 10 seconds for up to 5 minutes before failing.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-11 01:22:00 +01:00
287141c623 hosts: add role metadata to all hosts
Some checks failed
Run nix flake check / flake-check (push) Failing after 13m51s
Assign roles to hosts for better organization and filtering:
- ha1: home-automation
- monitoring01, monitoring02: monitoring
- jelly01: media
- nats1: messaging
- http-proxy: proxy
- testvm01-03: test

Also promote kanidm01 and monitoring02 from test to prod tier.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-09 16:21:08 +01:00
536daee4c7 ns2: migrate to OpenTofu management
Some checks failed
Run nix flake check / flake-check (push) Failing after 1s
- Remove hosts/template/ (legacy template1) and give each legacy host
  its own hardware-configuration.nix copy
- Recreate ns2 using create-host with template2 base
- Add secondary DNS services (NSD + Unbound resolver)
- Configure Vault policy for shared DNS secrets
- Fix create-host IP uniqueness validator to check CIDR notation
  (prevents false positives from DNS resolver entries)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 19:28:35 +01:00
21db7e9573 acme: migrate from step-ca to OpenBao PKI
Switch all ACME certificate issuance from step-ca (ca.home.2rjus.net)
to OpenBao PKI (vault.home.2rjus.net:8200/v1/pki_int/acme/directory).

- Update default ACME server in system/acme.nix
- Update Caddy acme_ca in http-proxy and nix-cache services
- Remove labmon service from monitoring01 (step-ca monitoring)
- Remove labmon scrape target and certificate_rules alerts
- Remove alloy.nix (only used for labmon profiling)
- Add docs/plans/cert-monitoring.md for future cert monitoring needs

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 18:20:10 +01:00
c214f8543c homelab: add deploy.enable option with assertion
All checks were successful
Run nix flake check / flake-check (push) Successful in 2m6s
Run nix flake check / flake-check (pull_request) Successful in 2m7s
- Add homelab.deploy.enable option (requires vault.enable)
- Create shared homelab-deploy Vault policy for all hosts
- Enable homelab.deploy on all vault-enabled hosts

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 06:54:42 +01:00
eee3dde04f restic: add randomized delay to backup timers
Some checks failed
Run nix flake check / flake-check (push) Has been cancelled
Backups to the shared restic repository were all scheduled at exactly
midnight, causing lock conflicts. Adding RandomizedDelaySec spreads
them out over a 2-hour window to prevent simultaneous access.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-06 01:09:38 +01:00
0700033c0a secrets: migrate all hosts from sops to OpenBao vault
Replace sops-nix secrets with OpenBao vault secrets across all hosts.
Hardcode root password hash, add extractKey option to vault-secrets
module, update Terraform with secrets/policies for all hosts, and
create AppRole provisioning playbook.

Hosts migrated: ha1, monitoring01, ns1, ns2, http-proxy, nix-cache01
Wave 1 hosts (nats1, jelly01, pgdb1) get AppRole policies only.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-05 18:43:09 +01:00
d25fc99e1d backup: migrate to native services.restic.backups
Some checks failed
Run nix flake check / flake-check (push) Has been cancelled
Run nix flake check / flake-check (pull_request) Successful in 4m0s
Replace custom backup-helper flake input with NixOS native
services.restic.backups module for ha1, monitoring01, and nixos-test1.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-04 00:41:40 +01:00
ebcdefd0ca Add alloy
Some checks failed
Run nix flake check / flake-check (push) Has been cancelled
2025-05-24 12:40:39 +02:00
c32e288273 Add pyroscope to labmon cert monitoring
Some checks failed
Run nix flake check / flake-check (push) Failing after 10m30s
2025-05-24 12:05:14 +02:00
2a46da3761 Add labmon to scrape config
Some checks failed
Run nix flake check / flake-check (push) Failing after 14m32s
2025-05-24 03:37:52 +02:00
6fda081dc8 Add labmon to monitoring01
Some checks failed
Run nix flake check / flake-check (push) Has been cancelled
2025-05-24 03:27:59 +02:00
4af1bded61 Add backups for monitoring01
Some checks failed
Run nix flake check / flake-check (push) Failing after 3m30s
2025-01-27 23:03:45 +01:00
02ef7e861b Add qemu guest agent to all VMs 2024-12-05 18:35:06 +01:00
8700e78752 Remove deprecated routeConfig
Some checks failed
Run nix flake check / flake-check (push) Failing after 11m42s
2024-12-01 02:00:57 +01:00
3c3eaaa042 Add monitoring host 2024-12-01 01:51:34 +01:00