Commit Graph

51 Commits

Author SHA1 Message Date
bab59665fd system: fix kanidm PAM user mismatch
All checks were successful
Run nix flake check / flake-check (push) Successful in 2m1s
Configure uid_attr_map and gid_attr_map to use short names instead of
SPN format. This fixes SSH failing with "PAM user mismatch" because
getent returned "torjus@home.2rjus.net" instead of "torjus".

Also add user-management documentation.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-08 13:55:11 +01:00
1d7eec7ad3 system: add kanidm PAM/NSS client module
Some checks failed
Run nix flake check / flake-check (push) Has been cancelled
Add homelab.kanidm.enable option for central authentication via Kanidm.
The module configures:
- PAM/NSS integration with kanidm-unixd
- Client connection to auth.home.2rjus.net
- Login authorization for ssh-users group

Enable on testvm01-03 for testing.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-08 13:43:41 +01:00
1674b6a844 system: enable zram swap for all hosts
Some checks failed
Run nix flake check / flake-check (push) Failing after 12m6s
Provides compressed swap in RAM to prevent OOM kills during
nixos-rebuild on low-memory VMs (2GB). Removes duplicate zram
configs from jelly01 and nix-cache01.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-08 13:02:58 +01:00
aedccbd9a0 flake: remove sops-nix (no longer used)
Some checks failed
Run nix flake check / flake-check (push) Failing after 1s
All secrets are now managed by OpenBao (Vault). Remove the legacy
sops-nix infrastructure that is no longer in use.

Removed:
- sops-nix flake input
- system/sops.nix module
- .sops.yaml configuration file
- Age key generation from template prepare-host scripts

Updated:
- flake.nix - removed sops-nix references from all hosts
- flake.lock - removed sops-nix input
- scripts/create-host/ - removed sops references
- CLAUDE.md - removed SOPS documentation

Note: secrets/ directory should be manually removed by the user.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 18:46:24 +01:00
21db7e9573 acme: migrate from step-ca to OpenBao PKI
Switch all ACME certificate issuance from step-ca (ca.home.2rjus.net)
to OpenBao PKI (vault.home.2rjus.net:8200/v1/pki_int/acme/directory).

- Update default ACME server in system/acme.nix
- Update Caddy acme_ca in http-proxy and nix-cache services
- Remove labmon service from monitoring01 (step-ca monitoring)
- Remove labmon scrape target and certificate_rules alerts
- Remove alloy.nix (only used for labmon profiling)
- Add docs/plans/cert-monitoring.md for future cert monitoring needs

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 18:20:10 +01:00
26ca6817f0 homelab-deploy: enable prometheus metrics
Some checks failed
Run nix flake check / flake-check (push) Failing after 3m57s
- Update homelab-deploy input to get metrics support
- Enable metrics endpoint on port 9972
- Add scrape target for prometheus auto-discovery

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 08:04:23 +01:00
c214f8543c homelab: add deploy.enable option with assertion
All checks were successful
Run nix flake check / flake-check (push) Successful in 2m6s
Run nix flake check / flake-check (pull_request) Successful in 2m7s
- Add homelab.deploy.enable option (requires vault.enable)
- Create shared homelab-deploy Vault policy for all hosts
- Enable homelab.deploy on all vault-enabled hosts

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 06:54:42 +01:00
7933127d77 system: enable homelab-deploy listener for all vault hosts
Add system/homelab-deploy.nix module that automatically enables the
listener on all hosts with vault.enable=true. Uses homelab.host.tier
and homelab.host.role for NATS subject subscriptions.

- Add homelab-deploy access to all host AppRole policies
- Remove manual listener config from vaulttest01 (now handled by system module)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 06:54:42 +01:00
12bf0683f5 modules: add homelab.host for host metadata
Add a shared `homelab.host` module that provides host metadata for
multiple consumers:
- tier: deployment tier (test/prod) for future homelab-deploy service
- priority: alerting priority (high/low) for Prometheus label filtering
- role: primary role of the host (dns, database, monitoring, etc.)
- labels: free-form labels for additional metadata

Host configurations updated with appropriate values:
- ns1, ns2: role=dns with dns_role labels
- nix-cache01: priority=low, role=build-host
- vault01: role=vault
- jump: role=bastion
- template, template2, testvm01, vaulttest01: tier=test, priority=low

The module is now imported via commonModules in flake.nix, making it
available to all hosts including minimal configurations like template2.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 02:49:58 +01:00
97ff774d3f monitoring: add nixos-exporter to all hosts
All checks were successful
Run nix flake check / flake-check (push) Successful in 3m16s
Run nix flake check / flake-check (pull_request) Successful in 3m14s
Add nixos-exporter prometheus exporter to track NixOS generation metrics
and flake revision status across all hosts.

Changes:
- Add nixos-exporter flake input
- Add commonModules list in flake.nix for modules shared by all hosts
- Enable nixos-exporter in system/monitoring/metrics.nix
- Configure Prometheus to scrape nixos-exporter on all hosts

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-06 23:55:29 +01:00
1f5b7b13e2 monitoring: enable restart-count and ip-accounting collectors
All checks were successful
Run nix flake check / flake-check (push) Successful in 2m11s
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-06 21:30:47 +01:00
c53e36c3f3 Revert "monitoring: enable additional systemd-exporter collectors"
This reverts commit 04a252b857.
2026-02-06 21:30:05 +01:00
04a252b857 monitoring: enable additional systemd-exporter collectors
Some checks failed
Run nix flake check / flake-check (push) Has been cancelled
Enables restart-count, file-descriptor-size, and ip-accounting collectors.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-06 21:28:44 +01:00
5d26f52e0d Revert "monitoring: enable cpu, memory, io collectors for systemd-exporter"
This reverts commit 506a692548.
2026-02-06 21:26:20 +01:00
506a692548 monitoring: enable cpu, memory, io collectors for systemd-exporter
Some checks failed
Run nix flake check / flake-check (push) Has been cancelled
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-06 21:23:19 +01:00
b6c41aa910 system: add UTC suffix to MOTD commit timestamp
Some checks failed
Run nix flake check / flake-check (push) Failing after 7m32s
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-06 00:34:24 +01:00
258e350b89 system: add MOTD banner with hostname and commit info
Some checks failed
Run nix flake check / flake-check (pull_request) Successful in 2m8s
Run nix flake check / flake-check (push) Failing after 3m53s
Displays FQDN and flake commit hash with timestamp on login.
Templates can override with their own MOTD via mkDefault.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-06 00:26:01 +01:00
bbb22e588e system: replace writeShellScript with writeShellApplication
Some checks failed
Run nix flake check / flake-check (pull_request) Successful in 2m3s
Run nix flake check / flake-check (push) Failing after 5m57s
Convert remaining writeShellScript usages to writeShellApplication for
shellcheck validation and strict bash options.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-06 00:17:24 +01:00
39a4ea98ab system: add nixos-rebuild-test helper script
Adds a helper script deployed to all hosts for testing feature branches.
Usage: nixos-rebuild-test <action> <branch>

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-06 00:12:16 +01:00
3cccfc0487 monitoring: implement monitoring gaps coverage
Some checks failed
Run nix flake check / flake-check (push) Failing after 7m36s
Add exporters and scrape targets for services lacking monitoring:
- PostgreSQL: postgres-exporter on pgdb1
- Authelia: native telemetry metrics on auth01
- Unbound: unbound-exporter with remote-control on ns1/ns2
- NATS: HTTP monitoring endpoint on nats1
- OpenBao: telemetry config and Prometheus scrape with token auth
- Systemd: systemd-exporter on all hosts for per-service metrics

Add alert rules for postgres, auth (authelia + lldap), jellyfin,
vault (openbao), plus extend existing nats and unbound rules.

Add Terraform config for Prometheus metrics policy and token. The
token is created via vault_token resource and stored in KV, so no
manual token creation is needed.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-05 21:44:13 +01:00
0700033c0a secrets: migrate all hosts from sops to OpenBao vault
Replace sops-nix secrets with OpenBao vault secrets across all hosts.
Hardcode root password hash, add extractKey option to vault-secrets
module, update Terraform with secrets/policies for all hosts, and
create AppRole provisioning playbook.

Hosts migrated: ha1, monitoring01, ns1, ns2, http-proxy, nix-cache01
Wave 1 hosts (nats1, jelly01, pgdb1) get AppRole policies only.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-05 18:43:09 +01:00
64f2688349 nix: configure gc to delete generations older than 14d
All checks were successful
Run nix flake check / flake-check (push) Successful in 2m27s
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-05 02:21:19 +01:00
cee1b264cd dns: auto-generate zone entries from host configurations
Replace static zone file with dynamically generated records:
- Add homelab.dns module with enable/cnames options
- Extract IPs from systemd.network configs (filters VPN interfaces)
- Use git commit timestamp as zone serial number
- Move external hosts to separate external-hosts.nix

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-04 21:43:44 +01:00
7ae474fd3e pki: add new vault root ca to pki 2026-02-03 06:53:59 +01:00
01d4812280 vault: implement bootstrap integration
Some checks failed
Run nix flake check / flake-check (push) Successful in 2m31s
Run nix flake check / flake-check (pull_request) Failing after 14m16s
2026-02-03 01:10:36 +01:00
cba1821f3b Add lldap to auth01 host 2025-04-01 22:23:59 +02:00
dd86298253 Change substituter override
Some checks failed
Run nix flake check / flake-check (push) Failing after 4m4s
Periodic flake update / flake-update (push) Successful in 1m20s
2025-02-26 18:44:45 +01:00
844449b899 Disable using itself as substituter for nix-cache
Some checks failed
Run nix flake check / flake-check (push) Has been cancelled
2025-02-26 18:34:44 +01:00
298f2372ca Add some default packages
Some checks failed
Run nix flake check / flake-check (push) Has been cancelled
2025-02-24 18:54:59 +01:00
4d2fbff6d0 Fix error in journald config
Some checks failed
Run nix flake check / flake-check (push) Failing after 4m0s
2025-02-07 13:22:50 +01:00
f29edfe34a Configure journald storage
Some checks failed
Run nix flake check / flake-check (push) Failing after 34s
2025-02-07 13:21:43 +01:00
002f934c70 Add ansible and playbook to trigger upgrade
Some checks failed
Run nix flake check / flake-check (push) Failing after 27m26s
Periodic flake update / flake-update (push) Successful in 1m24s
2025-02-07 00:28:05 +01:00
fbcb81291b Enable gc and optimise
Some checks failed
Run nix flake check / flake-check (push) Failing after 4m52s
2025-02-06 23:39:54 +01:00
44d4dc6cdf Remove weekly-rebuild
Some checks failed
Run nix flake check / flake-check (push) Failing after 11m1s
2025-02-06 20:00:22 +01:00
5866a2be8f Add autoupgrade
Some checks failed
Run nix flake check / flake-check (push) Has been cancelled
2025-02-06 19:58:01 +01:00
60b2a24271 Add kitty terminfo
Some checks failed
Run nix flake check / flake-check (push) Failing after 15m6s
2025-02-06 11:38:07 +01:00
e366a05204 Fix caddy logging
Some checks failed
Run nix flake check / flake-check (push) Failing after 9m1s
Periodic flake update / flake-update (push) Successful in 1m36s
2025-01-28 00:49:22 +01:00
006d0b9213 Finish nix-cache
Some checks failed
Run nix flake check / flake-check (push) Failing after 4m3s
2025-01-24 15:48:03 +01:00
8545807dd8 Add job label to promtail journald logs
Some checks failed
Run nix flake check / flake-check (push) Failing after 4m51s
2025-01-23 19:50:25 +01:00
02ef7e861b Add qemu guest agent to all VMs 2024-12-05 18:35:06 +01:00
a4592ffda3 Improve monitoring stuff
Some checks failed
Run nix flake check / flake-check (push) Failing after 23m19s
2024-12-01 20:51:14 +01:00
32425807fc Add promtail for journal
Some checks failed
Run nix flake check / flake-check (push) Failing after 7m47s
2024-12-01 03:00:07 +01:00
5844e7b32b Add internal CA
All checks were successful
Run nix flake check / flake-check (push) Successful in 3m31s
2024-11-30 20:24:43 +01:00
1da20471a8 Add jq to system packages
All checks were successful
Run nix flake check / flake-check (push) Successful in 3m45s
2024-11-30 12:53:20 +01:00
c089cbedee Remove rebuild switch, messes with running unit
All checks were successful
Run nix flake check / flake-check (push) Successful in 1m45s
Periodic flake update / flake-update (push) Successful in 1m57s
2024-10-12 21:59:28 +02:00
b7d9a12786 Collect garbage after rebuild
All checks were successful
Run nix flake check / flake-check (push) Successful in 1m36s
2024-10-12 21:53:34 +02:00
c4e1026d5e Add weekly-rebuild timer
All checks were successful
Run nix flake check / flake-check (push) Successful in 1m37s
2024-10-12 21:38:37 +02:00
07f519bf36 Add monitoring services 2024-06-03 04:08:16 +02:00
2576748c38 Add prometheus monitoring 2024-06-03 03:44:34 +02:00
7ba862f21d Add template host 2024-03-08 20:10:50 +01:00