53 Commits

Author SHA1 Message Date
35924c7b01 mcp: move config to .mcp.json.example, gitignore real config
Some checks failed
Run nix flake check / flake-check (push) Failing after 15m57s
Run nix flake check / flake-check (pull_request) Failing after 16m45s
The real .mcp.json now contains Loki credentials for basic auth,
so it should not be committed. The example file has placeholders.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-17 20:35:14 +01:00
87d8571d62 promtail: fix vault secret ownership for loki auth
Some checks failed
Run nix flake check / flake-check (push) Failing after 12m24s
The secret file needs to be owned by promtail since Promtail runs
as a dedicated user and can't read root-owned files.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-17 20:17:02 +01:00
43c81f6688 terraform: fix loki-push policy for generated hosts
Some checks failed
Run nix flake check / flake-check (push) Has been cancelled
Revert ns1/ns2 from approle.tf (they're in hosts-generated.tf) and add
loki-push policy to generated AppRoles instead.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-17 20:13:22 +01:00
58f901ad3e terraform: add ns1 and ns2 to AppRole policies
Some checks failed
Run nix flake check / flake-check (push) Has been cancelled
They were missing from the host_policies map, so they didn't get
shared policies like loki-push.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-17 20:10:37 +01:00
c13921d302 loki: add basic auth for log push and dual-ship promtail
Some checks failed
Run nix flake check / flake-check (push) Failing after 4m36s
- Loki bound to localhost, Caddy reverse proxy with basic_auth
- Vault secret (shared/loki/push-auth) for password, bcrypt hash
  generated at boot for Caddy environment
- Promtail dual-ships to monitoring01 (direct) and loki.home.2rjus.net
  (with basic auth), conditional on vault.enable
- Terraform: new shared loki-push policy added to all AppRoles

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-17 20:00:08 +01:00
2903873d52 monitoring02: add loki CNAME and Caddy reverse proxy
Some checks failed
Run nix flake check / flake-check (push) Has been cancelled
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-17 19:48:06 +01:00
74e7c9faa4 monitoring02: add Loki service
Some checks failed
Run nix flake check / flake-check (push) Failing after 3m19s
Add standalone Loki service module (services/loki/) with same config as
monitoring01 and import it on monitoring02. Update Grafana Loki datasource
to localhost. Defer Tempo and Pyroscope migration (not actively used).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-17 19:42:19 +01:00
471f536f1f Merge pull request 'victoriametrics-monitoring02' (#40) from victoriametrics-monitoring02 into master
Some checks failed
Run nix flake check / flake-check (push) Failing after 4m3s
Periodic flake update / flake-update (push) Successful in 3m29s
Reviewed-on: #40
2026-02-16 23:56:04 +00:00
a013e80f1a terraform: grant monitoring02 access to apiary-token secret
Some checks failed
Run nix flake check / flake-check (push) Failing after 3m59s
Run nix flake check / flake-check (pull_request) Failing after 4m20s
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-17 00:55:08 +01:00
4cbaa33475 monitoring02: add Caddy reverse proxy for VictoriaMetrics and vmalert
Add metrics.home.2rjus.net and vmalert.home.2rjus.net CNAMEs with
Caddy TLS termination via internal ACME CA.

Refactors Grafana's Caddy config from configFile to globalConfig +
virtualHosts so both modules can contribute routes to the same
Caddy instance.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-17 00:55:08 +01:00
e329f87b0b monitoring02: add VictoriaMetrics, vmalert, and Alertmanager
Set up the core metrics stack on monitoring02 as Phase 2 of the
monitoring migration. VictoriaMetrics replaces Prometheus with
identical scrape configs (22 jobs including auto-generated targets).

- VictoriaMetrics with 3-month retention and all scrape configs
- vmalert evaluating existing rules.yml (notifier disabled)
- Alertmanager with same routing config (no alerts during parallel op)
- Grafana datasources updated: local VictoriaMetrics as default
- Static user override for credential file access (OpenBao, Apiary)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-17 00:55:08 +01:00
c151f31011 grafana: fix apiary dashboard panels empty on short time ranges
Some checks failed
Run nix flake check / flake-check (push) Failing after 3m54s
Set interval=60s on rate() panels to match the actual Prometheus scrape
interval, so Grafana calculates $__rate_interval correctly.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-16 20:03:26 +01:00
f5362d6936 flake.lock: Update
Flake lock file updates:

• Updated input 'nixpkgs':
    'github:nixos/nixpkgs/6c5e707c6b5339359a9a9e215c5e66d6d802fd7a?narHash=sha256-iKZMkr6Cm9JzWlRYW/VPoL0A9jVKtZYiU4zSrVeetIs%3D' (2026-02-11)
  → 'github:nixos/nixpkgs/3aadb7ca9eac2891d52a9dec199d9580a6e2bf44?narHash=sha256-O1XDr7EWbRp%2BkHrNNgLWgIrB0/US5wvw9K6RERWAj6I%3D' (2026-02-14)
• Updated input 'nixpkgs-unstable':
    'github:nixos/nixpkgs/ec7c70d12ce2fc37cb92aff673dcdca89d187bae?narHash=sha256-9xejG0KoqsoKEGp2kVbXRlEYtFFcDTHjidiuX8hGO44%3D' (2026-02-11)
  → 'github:nixos/nixpkgs/a82ccc39b39b621151d6732718e3e250109076fa?narHash=sha256-gf2AmWVTs8lEq7z/3ZAsgnZDhWIckkb%2BZnAo5RzSxJg%3D' (2026-02-13)
2026-02-16 00:07:10 +00:00
3e7aabc73a grafana: fix apiary geomap and make it full-width
Some checks failed
Run nix flake check / flake-check (push) Failing after 5m6s
Periodic flake update / flake-update (push) Successful in 5m25s
Add gazetteer reference for country code lookup resolution.
Remove unnecessary reduce transformation. Make geomap panel
full-width (24 cols) and taller (h=10) on its own row.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-15 21:36:24 +01:00
361e7f2a1b grafana: add apiary honeypot dashboard
Some checks failed
Run nix flake check / flake-check (push) Has been cancelled
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-15 21:31:06 +01:00
1942591d2e monitoring: add apiary metrics scraping with bearer token auth
Some checks failed
Run nix flake check / flake-check (push) Failing after 12m52s
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-15 16:36:26 +01:00
4d614d8716 docs: add new service candidates and NixOS router plans
Some checks failed
Run nix flake check / flake-check (push) Failing after 3m22s
Periodic flake update / flake-update (push) Failing after 1s
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-14 13:21:34 +01:00
fd7caf7f00 flake.lock: Update
Flake lock file updates:

• Updated input 'nixpkgs-unstable':
    'github:nixos/nixpkgs/d6c71932130818840fc8fe9509cf50be8c64634f?narHash=sha256-ub1gpAONMFsT/GU2hV6ZWJjur8rJ6kKxdm9IlCT0j84%3D' (2026-02-08)
  → 'github:nixos/nixpkgs/ec7c70d12ce2fc37cb92aff673dcdca89d187bae?narHash=sha256-9xejG0KoqsoKEGp2kVbXRlEYtFFcDTHjidiuX8hGO44%3D' (2026-02-11)
2026-02-14 00:01:24 +00:00
af8e385b6e docs: finalize remote access plan with WireGuard gateway design
Some checks failed
Run nix flake check / flake-check (push) Failing after 21m7s
Periodic flake update / flake-update (push) Successful in 2m16s
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-14 00:31:52 +01:00
0db9fc6802 docs: update Loki improvements plan with implementation status
Some checks failed
Run nix flake check / flake-check (push) Failing after 13m55s
Mark retention, limits, labels, and level mapping as done. Add
JSON logging audit results with per-service details. Update current
state and disk usage notes.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-14 00:04:16 +01:00
5d68662035 loki: add 30-day retention policy and ingestion limits
Some checks failed
Run nix flake check / flake-check (push) Has been cancelled
Enable compactor-based retention with 30-day period to prevent
unbounded disk growth. Add basic rate limits and stream guards
to protect against runaway log generators.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-13 23:55:27 +01:00
d485948df0 docs: update Loki queries from host to hostname label
Some checks failed
Run nix flake check / flake-check (push) Has been cancelled
Update all LogQL examples, agent instructions, and scripts to use
the hostname label instead of host, matching the Prometheus label
naming convention. Also update pipe-to-loki and bootstrap scripts
to push hostname instead of host.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-13 23:43:47 +01:00
7b804450a3 promtail: add hostname/tier/role labels and journal priority level mapping
Align Promtail labels with Prometheus by adding hostname, tier, and role
static labels to both journal and varlog scrape configs. Add pipeline
stages to map journal PRIORITY field to a level label for reliable
severity filtering across the fleet.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-13 23:40:14 +01:00
2f0dad1acc docs: add JSON logging audit to Loki improvements plan
Some checks failed
Run nix flake check / flake-check (push) Failing after 15m38s
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-13 22:44:05 +01:00
1544415ef3 docs: add Loki improvements plan
Some checks failed
Run nix flake check / flake-check (push) Has been cancelled
Covers retention policy, limits config, Promtail label improvements
(tier/role/level), and journal PRIORITY extraction. Also adds Alloy
consideration to VictoriaMetrics migration plan.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-13 22:39:16 +01:00
5babd7f507 docs: move garage S3 storage plan to completed
Some checks failed
Run nix flake check / flake-check (push) Failing after 15m36s
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-13 21:54:23 +01:00
7e0c5fbf0f garage01: fix Caddy metrics deprecation warning
Some checks failed
Run nix flake check / flake-check (push) Has been cancelled
Use handle directive instead of path in site address for the metrics
endpoint, as the latter is deprecated in Caddy 2.10.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-13 21:53:48 +01:00
ffaf95d109 terraform: add Vault secret for garage01 environment
Some checks failed
Run nix flake check / flake-check (push) Failing after 3m13s
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-13 21:27:43 +01:00
b2b6ab4799 garage01: add Garage S3 service with Caddy HTTPS proxy
Configure Garage object storage on garage01 with S3 API, Vault secrets
for RPC secret and admin token, and Caddy reverse proxy for HTTPS access
at s3.home.2rjus.net via internal ACME CA. Includes flake entry, VM
definition, and Vault policy for the host.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-13 21:24:25 +01:00
5d3d93b280 docs: move completed plans to completed folder
Some checks failed
Run nix flake check / flake-check (push) Failing after 13m22s
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-13 21:08:17 +01:00
ae823e439d monitoring: lower unbound cache hit ratio alert threshold to 20%
Some checks failed
Run nix flake check / flake-check (push) Failing after 9m2s
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-13 18:55:03 +01:00
0d9f49a3b4 flake.lock: Update homelab-deploy
Some checks failed
Run nix flake check / flake-check (push) Failing after 12m25s
Improves builder logging: build failure output is now logged as
individual lines instead of a single JSON blob, making errors
readable in Loki/Grafana.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-13 18:36:18 +01:00
08d9e1ec3f docs: add garage S3 storage plan
Some checks failed
Run nix flake check / flake-check (push) Failing after 3m26s
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-13 18:06:53 +01:00
fa8d65b612 nix-cache02: increase builder timeout to 2 hours
Some checks failed
Run nix flake check / flake-check (push) Failing after 14m21s
Periodic flake update / flake-update (push) Successful in 5m17s
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-12 14:44:55 +01:00
6726f111e3 flake.lock: Update homelab-deploy
Some checks failed
Run nix flake check / flake-check (push) Has been cancelled
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-12 14:42:23 +01:00
3a083285cb flake.lock: Update
Flake lock file updates:

• Updated input 'nixpkgs':
    'github:nixos/nixpkgs/2db38e08fdadcc0ce3232f7279bab59a15b94482?narHash=sha256-1jZvgZoAagZZB6NwGRv2T2ezPy%2BX6EFDsJm%2BYSlsvEs%3D' (2026-02-09)
  → 'github:nixos/nixpkgs/6c5e707c6b5339359a9a9e215c5e66d6d802fd7a?narHash=sha256-iKZMkr6Cm9JzWlRYW/VPoL0A9jVKtZYiU4zSrVeetIs%3D' (2026-02-11)
2026-02-12 00:01:27 +00:00
ed1821b073 nix-cache02: add scheduled builds timer
Some checks failed
Run nix flake check / flake-check (push) Failing after 5m7s
Periodic flake update / flake-update (push) Successful in 2m18s
Add a systemd timer that triggers builds for all hosts every 2 hours
via NATS, keeping the binary cache warm.

- Add scheduler.nix with timer (every 2h) and oneshot service
- Add scheduler NATS user to DEPLOY account
- Add Vault secret and variable for scheduler NKey
- Increase nix-cache02 memory from 16GB to 20GB

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-12 00:50:09 +01:00
fa4a418007 restic: add --retry-lock=5m to all backup jobs
Some checks failed
Run nix flake check / flake-check (push) Failing after 23m42s
Prevents lock conflicts when multiple backup jobs targeting the same
repository run concurrently. Jobs will now retry acquiring the lock
every 10 seconds for up to 5 minutes before failing.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-11 01:22:00 +01:00
963e5f6d3c flake.lock: Update
Flake lock file updates:

• Updated input 'homelab-deploy':
    'git+https://git.t-juice.club/torjus/homelab-deploy?ref=master&rev=a8aab16d0e7400aaa00500d08c12734da3b638e0' (2026-02-10)
  → 'git+https://git.t-juice.club/torjus/homelab-deploy?ref=master&rev=c13914bf5acdcda33de63ad5ed9d661e4dc3118c' (2026-02-10)
• Updated input 'nixpkgs':
    'github:nixos/nixpkgs/23d72dabcb3b12469f57b37170fcbc1789bd7457?narHash=sha256-z5NJPSBwsLf/OfD8WTmh79tlSU8XgIbwmk6qB1/TFzY%3D' (2026-02-07)
  → 'github:nixos/nixpkgs/2db38e08fdadcc0ce3232f7279bab59a15b94482?narHash=sha256-1jZvgZoAagZZB6NwGRv2T2ezPy%2BX6EFDsJm%2BYSlsvEs%3D' (2026-02-09)
2026-02-11 00:01:28 +00:00
0bc10cb1fe grafana: add build service panels to nixos-fleet dashboard
Some checks failed
Run nix flake check / flake-check (push) Failing after 4m48s
Periodic flake update / flake-update (push) Successful in 2m20s
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-11 00:49:50 +01:00
b03e2e8ee4 monitoring: add alerts for homelab-deploy build failures
Some checks failed
Run nix flake check / flake-check (push) Has been cancelled
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-11 00:45:07 +01:00
ddcbc30665 docs: mark nix-cache01 decommission complete
Some checks failed
Run nix flake check / flake-check (push) Failing after 16m38s
Phase 4 fully complete. nix-cache01 has been:
- Removed from repo (host config, build scripts, flake entry)
- Vault resources cleaned up
- VM deleted from Proxmox

nix-cache02 is now the sole binary cache host.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-10 23:43:12 +01:00
75210805d5 nix-cache01: decommission and remove all references
Some checks failed
Run nix flake check / flake-check (push) Has been cancelled
Removed:
- hosts/nix-cache01/ directory
- services/nix-cache/build-flakes.{nix,sh} (replaced by NATS builder)
- Vault secret and AppRole for nix-cache01
- Old signing key variable from terraform
- Old trusted public key from system/nix.nix

Updated:
- flake.nix: removed nixosConfiguration
- README.md: nix-cache01 -> nix-cache02
- Monitoring rules: removed build-flakes alerts, updated harmonia to nix-cache02
- Simplified proxy.nix (no longer needs hostname conditional)

nix-cache02 is now the sole binary cache host.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-10 23:40:51 +01:00
ade0538717 docs: mark nix-cache DNS cutover complete
Some checks are pending
Run nix flake check / flake-check (push) Has started running
nix-cache.home.2rjus.net now served by nix-cache02.
nix-cache01 ready for decommission.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-10 23:34:04 +01:00
83fce5f927 nix-cache: switch DNS to nix-cache02
Some checks failed
Run nix flake check / flake-check (push) Has been cancelled
- Move nix-cache CNAME from nix-cache01 to nix-cache02
- Remove actions1 CNAME (service removed)
- Update proxy.nix to serve canonical domain on nix-cache02
- Promote nix-cache02 to prod tier with build-host role

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-10 23:22:23 +01:00
afff3f28ca docs: update nix-cache-reprovision plan with Harmonia progress
Some checks failed
Run nix flake check / flake-check (push) Failing after 52s
- Phase 4 now in progress
- Harmonia configured on nix-cache02 with new signing key
- Trusted public key deployed to all hosts
- Cache tested successfully from testvm01
- Actions runner removed from scope

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-10 23:17:51 +01:00
49f7e3ae2e nix-cache: use hostname-based domain for Caddy proxy
All checks were successful
Run nix flake check / flake-check (push) Successful in 2m18s
nix-cache01 serves nix-cache.home.2rjus.net (canonical)
nix-cache02 serves nix-cache02.home.2rjus.net (for testing)

This allows testing nix-cache02 independently before DNS cutover.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-10 23:14:14 +01:00
751edfc11d nix-cache02: add Harmonia binary cache service
Some checks failed
Run nix flake check / flake-check (push) Has been cancelled
- Parameterize harmonia.nix to use hostname-based Vault paths
- Add nix-cache services to nix-cache02
- Add Vault secret and variable for nix-cache02 signing key
- Add nix-cache02 public key to trusted-public-keys on all hosts
- Update plan doc to remove actions runner references

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-10 23:08:48 +01:00
98a7301985 nix-cache: remove unused Gitea Actions runner
All checks were successful
Run nix flake check / flake-check (push) Successful in 2m23s
The actions runner on nix-cache01 was never actively used.
Removing it before migrating to nix-cache02.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-10 22:57:08 +01:00
34efa58cfe Merge pull request 'nix-cache02-builder' (#39) from nix-cache02-builder into master
All checks were successful
Run nix flake check / flake-check (push) Successful in 2m27s
Reviewed-on: #39
2026-02-10 21:47:58 +00:00
5bfb51a497 docs: add observability phase to nix-cache plan
Some checks failed
Run nix flake check / flake-check (push) Successful in 2m35s
Run nix flake check / flake-check (pull_request) Failing after 16m1s
- Add Phase 6 for alerting and Grafana dashboards
- Document available Prometheus metrics
- Include example alerting rules for build failures

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-10 22:46:38 +01:00
f83145d97a docs: update nix-cache-reprovision plan with progress
Some checks failed
Run nix flake check / flake-check (push) Has been cancelled
- Mark Phase 1 (new build host) and Phase 2 (NATS build triggering) complete
- Document nix-cache02 configuration and tested build times
- Add remaining work for Harmonia, Actions runner, and DNS cutover
- Enable --enable-builds flag in MCP config

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-10 22:43:48 +01:00
47747329c4 nix-cache02: add homelab-deploy builder service
Some checks failed
Run nix flake check / flake-check (push) Failing after 4m51s
- Configure builder to build nixos-servers and nixos (gunter) repos
- Add builder NKey to Vault secrets
- Update NATS permissions for builder, test-deployer, and admin-deployer
- Grant nix-cache02 access to shared homelab-deploy secrets

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-10 22:26:40 +01:00
57 changed files with 2400 additions and 737 deletions

View File

@@ -19,7 +19,7 @@ You may receive:
## Audit Log Structure
Logs are shipped to Loki via promtail. Audit events use these labels:
- `host` - hostname
- `hostname` - hostname
- `systemd_unit` - typically `auditd.service` for audit logs
- `job` - typically `systemd-journal`
@@ -36,7 +36,7 @@ Audit log entries contain structured data:
Find SSH logins and session activity:
```logql
{host="<hostname>", systemd_unit="sshd.service"}
{hostname="<hostname>", systemd_unit="sshd.service"}
```
Look for:
@@ -48,7 +48,7 @@ Look for:
Query executed commands (filter out noise):
```logql
{host="<hostname>"} |= "EXECVE" != "PATH item" != "PROCTITLE" != "SYSCALL" != "BPF"
{hostname="<hostname>"} |= "EXECVE" != "PATH item" != "PROCTITLE" != "SYSCALL" != "BPF"
```
Further filtering:
@@ -60,28 +60,28 @@ Further filtering:
Check for privilege escalation:
```logql
{host="<hostname>"} |= "sudo" |= "COMMAND"
{hostname="<hostname>"} |= "sudo" |= "COMMAND"
```
Or via audit:
```logql
{host="<hostname>"} |= "USER_CMD"
{hostname="<hostname>"} |= "USER_CMD"
```
### 4. Service Manipulation
Check if services were manually stopped/started:
```logql
{host="<hostname>"} |= "EXECVE" |= "systemctl"
{hostname="<hostname>"} |= "EXECVE" |= "systemctl"
```
### 5. File Operations
Look for file modifications (if auditd rules are configured):
```logql
{host="<hostname>"} |= "EXECVE" |= "vim"
{host="<hostname>"} |= "EXECVE" |= "nano"
{host="<hostname>"} |= "EXECVE" |= "rm"
{hostname="<hostname>"} |= "EXECVE" |= "vim"
{hostname="<hostname>"} |= "EXECVE" |= "nano"
{hostname="<hostname>"} |= "EXECVE" |= "rm"
```
## Query Guidelines
@@ -99,7 +99,7 @@ Look for file modifications (if auditd rules are configured):
**Time-bounded queries:**
When investigating around a specific event:
```logql
{host="<hostname>"} |= "EXECVE" != "systemd"
{hostname="<hostname>"} |= "EXECVE" != "systemd"
```
With `start: "2026-02-08T14:30:00Z"` and `end: "2026-02-08T14:35:00Z"`

View File

@@ -41,13 +41,13 @@ Search for relevant log entries using `query_logs`. Focus on service-specific lo
**Query strategies (start narrow, expand if needed):**
- Start with `limit: 20-30`, increase only if needed
- Use tight time windows: `start: "15m"` or `start: "30m"` initially
- Filter to specific services: `{host="<hostname>", systemd_unit="<service>.service"}`
- Search for errors: `{host="<hostname>"} |= "error"` or `|= "failed"`
- Filter to specific services: `{hostname="<hostname>", systemd_unit="<service>.service"}`
- Search for errors: `{hostname="<hostname>"} |= "error"` or `|= "failed"`
**Common patterns:**
- Service logs: `{host="<hostname>", systemd_unit="<service>.service"}`
- All errors on host: `{host="<hostname>"} |= "error"`
- Journal for a unit: `{host="<hostname>", systemd_unit="nginx.service"} |= "failed"`
- Service logs: `{hostname="<hostname>", systemd_unit="<service>.service"}`
- All errors on host: `{hostname="<hostname>"} |= "error"`
- Journal for a unit: `{hostname="<hostname>", systemd_unit="nginx.service"} |= "failed"`
**Avoid:**
- Using `start: "1h"` with no filters on busy hosts

View File

@@ -30,11 +30,13 @@ Use the `lab-monitoring` MCP server tools:
### Label Reference
Available labels for log queries:
- `host` - Hostname (e.g., `ns1`, `monitoring01`, `ha1`)
- `hostname` - Hostname (e.g., `ns1`, `monitoring01`, `ha1`) - matches the Prometheus `hostname` label
- `systemd_unit` - Systemd unit name (e.g., `nsd.service`, `nixos-upgrade.service`)
- `job` - Either `systemd-journal` (most logs), `varlog` (file-based logs), or `bootstrap` (VM bootstrap logs)
- `filename` - For `varlog` job, the log file path
- `hostname` - Alternative to `host` for some streams
- `tier` - Deployment tier (`test` or `prod`)
- `role` - Host role (e.g., `dns`, `vault`, `monitoring`) - matches the Prometheus `role` label
- `level` - Log level mapped from journal PRIORITY (`critical`, `error`, `warning`, `notice`, `info`, `debug`) - journal scrape only
### Log Format
@@ -47,12 +49,12 @@ Journal logs are JSON-formatted. Key fields:
**Logs from a specific service on a host:**
```logql
{host="ns1", systemd_unit="nsd.service"}
{hostname="ns1", systemd_unit="nsd.service"}
```
**All logs from a host:**
```logql
{host="monitoring01"}
{hostname="monitoring01"}
```
**Logs from a service across all hosts:**
@@ -62,12 +64,12 @@ Journal logs are JSON-formatted. Key fields:
**Substring matching (case-sensitive):**
```logql
{host="ha1"} |= "error"
{hostname="ha1"} |= "error"
```
**Exclude pattern:**
```logql
{host="ns1"} != "routine"
{hostname="ns1"} != "routine"
```
**Regex matching:**
@@ -75,6 +77,20 @@ Journal logs are JSON-formatted. Key fields:
{systemd_unit="prometheus.service"} |~ "scrape.*failed"
```
**Filter by level (journal scrape only):**
```logql
{level="error"} # All errors across the fleet
{level=~"critical|error", tier="prod"} # Prod errors and criticals
{hostname="ns1", level="warning"} # Warnings from a specific host
```
**Filter by tier/role:**
```logql
{tier="prod"} |= "error" # All errors on prod hosts
{role="dns"} # All DNS server logs
{tier="test", job="systemd-journal"} # Journal logs from test hosts
```
**File-based logs (caddy access logs, etc):**
```logql
{job="varlog", hostname="nix-cache01"}
@@ -106,7 +122,7 @@ Useful systemd units for troubleshooting:
VMs provisioned from template2 send bootstrap progress directly to Loki via curl (before promtail is available). These logs use `job="bootstrap"` with additional labels:
- `host` - Target hostname
- `hostname` - Target hostname
- `branch` - Git branch being deployed
- `stage` - Bootstrap stage (see table below)
@@ -127,7 +143,7 @@ VMs provisioned from template2 send bootstrap progress directly to Loki via curl
```logql
{job="bootstrap"} # All bootstrap logs
{job="bootstrap", host="myhost"} # Specific host
{job="bootstrap", hostname="myhost"} # Specific host
{job="bootstrap", stage="failed"} # All failures
{job="bootstrap", stage=~"building|success"} # Track build progress
```
@@ -308,8 +324,8 @@ Current host labels:
1. Check `up{job="<service>"}` or `up{hostname="<host>"}` for scrape failures
2. Use `list_targets` to see target health details
3. Query service logs: `{host="<host>", systemd_unit="<service>.service"}`
4. Search for errors: `{host="<host>"} |= "error"`
3. Query service logs: `{hostname="<host>", systemd_unit="<service>.service"}`
4. Search for errors: `{hostname="<host>"} |= "error"`
5. Check `list_alerts` for related alerts
6. Use role filters for group issues: `up{role="dns"}` to check all DNS servers
@@ -324,17 +340,17 @@ Current host labels:
When provisioning new VMs, track bootstrap progress:
1. Watch bootstrap logs: `{job="bootstrap", host="<hostname>"}`
2. Check for failures: `{job="bootstrap", host="<hostname>", stage="failed"}`
1. Watch bootstrap logs: `{job="bootstrap", hostname="<hostname>"}`
2. Check for failures: `{job="bootstrap", hostname="<hostname>", stage="failed"}`
3. After success, verify host appears in metrics: `up{hostname="<hostname>"}`
4. Check logs are flowing: `{host="<hostname>"}`
4. Check logs are flowing: `{hostname="<hostname>"}`
See [docs/host-creation.md](../../../docs/host-creation.md) for the full host creation pipeline.
### Debug SSH/Access Issues
```logql
{host="<host>", systemd_unit="sshd.service"}
{hostname="<host>", systemd_unit="sshd.service"}
```
### Check Recent Upgrades

.gitignore (3 changed lines)
View File

@@ -2,6 +2,9 @@
result
result-*
# MCP config (contains secrets)
.mcp.json
# Terraform/OpenTofu
terraform/.terraform/
terraform/.terraform.lock.hcl

View File

@@ -20,7 +20,9 @@
"env": {
"PROMETHEUS_URL": "https://prometheus.home.2rjus.net",
"ALERTMANAGER_URL": "https://alertmanager.home.2rjus.net",
"LOKI_URL": "http://monitoring01.home.2rjus.net:3100"
"LOKI_URL": "https://loki.home.2rjus.net",
"LOKI_USERNAME": "promtail",
"LOKI_PASSWORD": "<password from: bao kv get -field=password secret/shared/loki/push-auth>"
}
},
"homelab-deploy": {
@@ -31,7 +33,8 @@
"--",
"mcp",
"--nats-url", "nats://nats1.home.2rjus.net:4222",
"--nkey-file", "/home/torjus/.config/homelab-deploy/test-deployer.nkey"
"--nkey-file", "/home/torjus/.config/homelab-deploy/test-deployer.nkey",
"--enable-builds"
]
},
"git-explorer": {
@@ -43,4 +46,3 @@
}
}
}

View File

@@ -59,7 +59,7 @@ The script prints the session ID which the user can share. Query results with:
```logql
{job="pipe-to-loki"} # All entries
{job="pipe-to-loki", id="my-test"} # Specific ID
{job="pipe-to-loki", host="testvm01"} # From specific host
{job="pipe-to-loki", hostname="testvm01"} # From specific host
{job="pipe-to-loki", type="session"} # Only sessions
```

View File

@@ -12,7 +12,7 @@ NixOS Flake-based configuration repository for a homelab infrastructure. All hos
| `http-proxy` | Reverse proxy |
| `monitoring01` | Prometheus, Grafana, Loki, Tempo, Pyroscope |
| `jelly01` | Jellyfin media server |
| `nix-cache01` | Nix binary cache |
| `nix-cache02` | Nix binary cache + NATS-based build service |
| `nats1` | NATS messaging |
| `vault01` | OpenBao (Vault) secrets management |
| `template1`, `template2` | VM templates for cloning new hosts |

View File

@@ -50,7 +50,7 @@ homelab.host.tier = "test"; # or "prod"
During the bootstrap process, status updates are sent to Loki. Query bootstrap logs with:
```
{job="bootstrap", host="<hostname>"}
{job="bootstrap", hostname="<hostname>"}
```
### Bootstrap Stages
@@ -72,7 +72,7 @@ The bootstrap process reports these stages via the `stage` label:
```
# All bootstrap activity for a host
{job="bootstrap", host="myhost"}
{job="bootstrap", hostname="myhost"}
# Track all failures
{job="bootstrap", stage="failed"}
@@ -87,7 +87,7 @@ Once the VM reboots with its full configuration, it will start publishing metric
1. Check bootstrap completed successfully:
```
{job="bootstrap", host="<hostname>", stage="success"}
{job="bootstrap", hostname="<hostname>", stage="success"}
```
2. Verify the host is up and reporting metrics:
@@ -102,7 +102,7 @@ Once the VM reboots with its full configuration, it will start publishing metric
4. Check logs are flowing:
```
{host="<hostname>"}
{hostname="<hostname>"}
```
5. Confirm expected services are running and producing logs
@@ -119,7 +119,7 @@ Once the VM reboots with its full configuration, it will start publishing metric
1. Check bootstrap logs in Loki - if they never progress past `building`, the rebuild likely consumed all resources:
```
{job="bootstrap", host="<hostname>"}
{job="bootstrap", hostname="<hostname>"}
```
2. **USER**: SSH into the host and check the bootstrap service:
@@ -149,7 +149,7 @@ Usually caused by running the `create-host` script without proper credentials, o
2. Check bootstrap logs for vault-related stages:
```
{job="bootstrap", host="<hostname>", stage=~"vault.*"}
{job="bootstrap", hostname="<hostname>", stage=~"vault.*"}
```
3. **USER**: Regenerate and provision credentials manually:

View File

@@ -0,0 +1,46 @@
# Garage S3 Storage Server
## Overview
Deploy a Garage instance for self-hosted S3-compatible object storage.
## Garage Basics
- S3-compatible distributed object storage designed for self-hosting
- Supports per-key, per-bucket permissions (read/write/owner)
- Keys without explicit grants have no access
## NixOS Module
Available as `services.garage` with these key options:
- `services.garage.enable` - Enable the service
- `services.garage.package` - Must be set explicitly
- `services.garage.settings` - Freeform TOML config (replication mode, ports, RPC, etc.)
- `services.garage.settings.metadata_dir` - Metadata storage (SSD recommended)
- `services.garage.settings.data_dir` - Data block storage (supports multiple dirs since v0.9)
- `services.garage.environmentFile` - For secrets like `GARAGE_RPC_SECRET`
- `services.garage.logLevel` - error/warn/info/debug/trace
The NixOS module only manages the server daemon. Buckets and keys are managed externally.
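As a rough orientation, a single-node setup using only the options listed above might look like the sketch below. The paths, ports, replication setting, and secrets file are illustrative assumptions, not the eventual garage01 configuration.

```nix
{ pkgs, ... }:
{
  services.garage = {
    enable = true;
    package = pkgs.garage;                    # must be set explicitly
    logLevel = "info";
    # GARAGE_RPC_SECRET (and admin token) supplied out of band, e.g. rendered from Vault
    environmentFile = "/run/secrets/garage.env";
    settings = {
      metadata_dir = "/var/lib/garage/meta";  # SSD recommended
      data_dir = "/var/lib/garage/data";
      replication_factor = 1;                 # single node (older Garage: replication_mode)
      rpc_bind_addr = "[::]:3901";
      s3_api = {
        s3_region = "garage";
        api_bind_addr = "127.0.0.1:3900";     # Caddy would terminate TLS in front
        root_domain = ".s3.home.2rjus.net";
      };
      admin.api_bind_addr = "127.0.0.1:3903"; # needed for the Terraform provider
    };
  };
}
```

Buckets and keys would still be created out of band, as described in the next section.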
## Bucket/Key Management
No declarative NixOS options for buckets or keys. Two options:
1. **Terraform provider** - `jkossis/terraform-provider-garage` manages buckets, keys, and permissions via the Garage Admin API v2. Could live in `terraform/garage/` similar to `terraform/vault/`.
2. **CLI** - `garage key create`, `garage bucket create`, `garage bucket allow`
## Integration Ideas
- Store Garage API keys in Vault, fetch via `vault.secrets` on consuming hosts
- Terraform manages both Vault secrets and Garage buckets/keys
- Enable admin API with token for Terraform provider access
- Add Prometheus metrics scraping (Garage exposes metrics endpoint)
## Open Questions
- Single-node or multi-node replication?
- Which host to deploy on?
- What to store? (backups, media, app data)
- Expose via HTTP proxy or direct S3 API only?

View File

@@ -0,0 +1,156 @@
# Nix Cache Host Reprovision
## Overview
Reprovision `nix-cache01` using the OpenTofu workflow, and improve the build/cache system with:
1. NATS-based remote build triggering (replacing the current bash script)
2. Safer flake update workflow that validates builds before pushing to master
## Status
**Phase 1: New Build Host** - COMPLETE
**Phase 2: NATS Build Triggering** - COMPLETE
**Phase 3: Safe Flake Update Workflow** - NOT STARTED
**Phase 4: Complete Migration** - COMPLETE
**Phase 5: Scheduled Builds** - COMPLETE
## Completed Work
### New Build Host (nix-cache02)
Instead of reprovisioning nix-cache01 in-place, we created a new host `nix-cache02` at 10.69.13.25:
- **Specs**: 8 CPU cores, 16GB RAM (temporary; will be increased to 24GB once nix-cache01 is decommissioned), 200GB disk
- **Provisioned via OpenTofu** with automatic Vault credential bootstrapping
- **Builder service** configured with two repos:
- `nixos-servers` → `git+https://git.t-juice.club/torjus/nixos-servers.git`
- `nixos` (gunter) → `git+https://git.t-juice.club/torjus/nixos.git`
### NATS-Based Build Triggering
The `homelab-deploy` tool was extended with a builder mode:
**NATS Subjects:**
- `build.<repo>.<target>` - e.g., `build.nixos-servers.all` or `build.nixos-servers.ns1`
**NATS Permissions (in DEPLOY account):**
| User | Publish | Subscribe |
|------|---------|-----------|
| Builder | `build.responses.>` | `build.>` |
| Test deployer | `deploy.test.>`, `deploy.discover`, `build.>` | `deploy.responses.>`, `deploy.discover`, `build.responses.>` |
| Admin deployer | `deploy.>`, `build.>` | `deploy.>`, `build.responses.>` |
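For orientation, the permission table above maps onto the NATS server configuration roughly as in this sketch (assuming the freeform `services.nats.settings` option; the NKeys are placeholders and the real `services/nats/default.nix` layout may differ):

```nix
{
  services.nats.settings.accounts.DEPLOY.users = [
    {
      nkey = "U...BUILDER";                    # builder public NKey (placeholder)
      permissions = {
        publish = [ "build.responses.>" ];
        subscribe = [ "build.>" ];
      };
    }
    {
      nkey = "U...TESTDEPLOYER";               # test deployer public NKey (placeholder)
      permissions = {
        publish = [ "deploy.test.>" "deploy.discover" "build.>" ];
        subscribe = [ "deploy.responses.>" "deploy.discover" "build.responses.>" ];
      };
    }
    # admin deployer entry omitted; same shape with deploy.> and build.> publish rights
  ];
}
```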
**Vault Secrets:**
- `shared/homelab-deploy/builder-nkey` - NKey seed for builder authentication
**NixOS Configuration:**
- `hosts/nix-cache02/builder.nix` - Builder service configuration
- `services/nats/default.nix` - Updated with builder NATS user
**MCP Integration:**
- `.mcp.json` updated with `--enable-builds` flag
- Build tool available via MCP for Claude Code
**Tested:**
- Single host build: `build nixos-servers testvm01` (~30s)
- All hosts build: `build nixos-servers all` (16 hosts in ~226s)
### Harmonia Binary Cache
- Parameterized `services/nix-cache/harmonia.nix` to use hostname-based Vault paths
- Parameterized `services/nix-cache/proxy.nix` for hostname-based domain
- New signing key: `nix-cache02.home.2rjus.net-1`
- Vault secret: `hosts/nix-cache02/cache-secret`
- Removed unused Gitea Actions runner from nix-cache01
## Current State
### nix-cache02 (Active)
- Running at 10.69.13.25
- Serving `https://nix-cache.home.2rjus.net` (canonical URL)
- Builder service active, responding to NATS build requests
- Metrics exposed on port 9973 (`homelab-deploy-builder` job)
- Harmonia binary cache server running
- Signing key: `nix-cache02.home.2rjus.net-1`
- Prod tier with `build-host` role
### nix-cache01 (Decommissioned)
- VM deleted from Proxmox
- Host configuration removed from repo
- Vault AppRole and secrets removed
- Old signing key removed from trusted-public-keys
## Remaining Work
### Phase 3: Safe Flake Update Workflow
1. Create `.github/workflows/flake-update-safe.yaml`
2. Disable or remove old `flake-update.yaml`
3. Test manually with `workflow_dispatch`
4. Monitor first automated run
### Phase 4: Complete Migration ✅
1. ~~**Add Harmonia to nix-cache02**~~ ✅ Done - new signing key, parameterized service
2. ~~**Add trusted public key to all hosts**~~ ✅ Done - `system/nix.nix` updated
3. ~~**Test cache from other hosts**~~ ✅ Done - verified from testvm01
4. ~~**Update proxy and DNS**~~ ✅ Done - `nix-cache.home.2rjus.net` CNAME now points to nix-cache02
5. ~~**Deploy to all hosts**~~ ✅ Done - all hosts have new trusted key
6. ~~**Decommission nix-cache01**~~ ✅ Done - 2026-02-10:
- Removed `hosts/nix-cache01/` directory
- Removed `services/nix-cache/build-flakes.{nix,sh}`
- Removed Vault AppRole and secrets
- Removed old signing key from `system/nix.nix`
- Removed from `flake.nix`
- Deleted VM from Proxmox
### Phase 5: Scheduled Builds ✅
Implemented a systemd timer on nix-cache02 that triggers builds every 2 hours:
- **Timer**: `scheduled-build.timer` runs every 2 hours with 5m random jitter
- **Service**: `scheduled-build.service` calls `homelab-deploy build` for both repos
- **Authentication**: Dedicated scheduler NKey stored in Vault
- **NATS user**: Added to DEPLOY account with publish `build.>` and subscribe `build.responses.>`
Files:
- `hosts/nix-cache02/scheduler.nix` - Timer and service configuration
- `services/nats/default.nix` - Scheduler NATS user
- `terraform/vault/secrets.tf` - Scheduler NKey secret
- `terraform/vault/variables.tf` - Variable for scheduler NKey
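A hedged sketch of what the timer/service pair in `hosts/nix-cache02/scheduler.nix` might look like. The `homelab-deploy` invocation, its flags, and the NKey path are assumptions based on the MCP config elsewhere in this plan, not the actual implementation:

```nix
{ config, pkgs, ... }:
{
  systemd.timers.scheduled-build = {
    wantedBy = [ "timers.target" ];
    timerConfig = {
      OnBootSec = "15m";          # first run shortly after boot
      OnUnitActiveSec = "2h";     # then every 2 hours
      RandomizedDelaySec = "5m";  # 5 minute jitter
    };
  };

  systemd.services.scheduled-build = {
    serviceConfig.Type = "oneshot";
    # homelab-deploy comes from the flake input and is assumed to be on the service path
    script = ''
      for repo in nixos-servers nixos; do
        homelab-deploy build \
          --nats-url nats://nats1.home.2rjus.net:4222 \
          --nkey-file /run/secrets/scheduler.nkey \
          "$repo" all
      done
    '';
  };
}
```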
## Resolved Questions
- **Parallel vs sequential builds?** Sequential - hosts share packages, so subsequent builds are fast after the first
- **What about gunter?** Configured as `nixos` repo in builder settings
- **Disk size?** 200GB for new host
- **Build host specs?** 8 cores, 16-24GB RAM matches current nix-cache01
### Phase 6: Observability
1. **Alerting rules** for build failures:
```promql
# Alert if any build fails
increase(homelab_deploy_build_host_total{status="failure"}[1h]) > 0
# Alert if no successful builds in 24h (scheduled builds stopped)
time() - homelab_deploy_build_last_success_timestamp > 86400
```
2. **Grafana dashboard** for build metrics:
- Build success/failure rate over time
- Average build duration per host (histogram)
- Build frequency (builds per hour/day)
- Last successful build timestamp per repo
Available metrics:
- `homelab_deploy_builds_total{repo, status}` - total builds by repo and status
- `homelab_deploy_build_host_total{repo, host, status}` - per-host build counts
- `homelab_deploy_build_duration_seconds_{bucket,sum,count}` - build duration histogram
- `homelab_deploy_build_last_timestamp{repo}` - last build attempt
- `homelab_deploy_build_last_success_timestamp{repo}` - last successful build
## Open Questions
- [x] ~~When to cut over DNS from nix-cache01 to nix-cache02?~~ Done - 2026-02-10
- [ ] Implement safe flake update workflow before or after full migration?

View File

@@ -0,0 +1,196 @@
# Loki Setup Improvements
## Overview
The current Loki deployment on monitoring01 is functional but minimal. It lacks retention policies and rate limiting, and it uses local filesystem storage. This plan evaluates improvement options across several dimensions: retention management, storage backend, resource limits, and operational improvements.
## Current State
**Loki** on monitoring01 (`services/monitoring/loki.nix`):
- Single-node deployment, no HA
- Filesystem storage at `/var/lib/loki/chunks` (~6.8 GB as of 2026-02-13)
- TSDB index (v13 schema, 24h period)
- 30-day compactor-based retention with basic rate limits
- No caching layer
- Auth disabled (trusted network)
**Promtail** on all 16 hosts (`system/monitoring/logs.nix`):
- Ships systemd journal (JSON) + `/var/log/**/*.log`
- Labels: `hostname`, `tier`, `role`, `level`, `job` (systemd-journal/varlog), `systemd_unit`
- `level` label mapped from journal PRIORITY (critical/error/warning/notice/info/debug)
- Hardcoded to `http://monitoring01.home.2rjus.net:3100`
**Additional log sources:**
- `pipe-to-loki` script (manual log submission, `job=pipe-to-loki`)
- Bootstrap logs from template2 (`job=bootstrap`)
**Context:** The VictoriaMetrics migration plan (`docs/plans/monitoring-migration-victoriametrics.md`) includes moving Loki to monitoring02 with "same configuration as current". These improvements could be applied either before or after that migration.
## Improvement Areas
### 1. Retention Policy
**Implemented.** Compactor-based retention with 30-day period. Note: Loki 3.6.3 requires `delete_request_store = "filesystem"` when retention is enabled (not documented in older guides).
```nix
compactor = {
working_directory = "/var/lib/loki/compactor";
compaction_interval = "10m";
retention_enabled = true;
retention_delete_delay = "2h";
retention_delete_worker_count = 150;
delete_request_store = "filesystem";
};
limits_config = {
retention_period = "30d";
};
```
### 2. Storage Backend
**Decision:** Stay with filesystem storage for now. Garage S3 was considered but ruled out - the current single-node Garage (replication_factor=1) offers no real durability benefit over local disk. S3 storage can be revisited after the NAS migration, when a more robust S3-compatible solution will likely be available.
### 3. Limits Configuration
**Implemented.** Basic guardrails added alongside retention in `limits_config`:
```nix
limits_config = {
retention_period = "30d";
ingestion_rate_mb = 10; # MB/s per tenant
ingestion_burst_size_mb = 20; # Burst allowance
max_streams_per_user = 10000; # Prevent label explosion
max_query_series = 500; # Limit query resource usage
max_query_parallelism = 8;
};
```
### 4. Promtail Label Improvements
**Problem:** Label inconsistencies and missing useful metadata:
- The `varlog` scrape config uses `hostname` while journal uses `host` (different label name)
- No `tier` or `role` labels, making it hard to filter logs by deployment tier or host function
**Implemented:** Standardized on `hostname` to match Prometheus labels. The journal scrape previously used a relabel from `__journal__hostname` to `host`; now both scrape configs use a static `hostname` label from `config.networking.hostName`. Also updated `pipe-to-loki` and bootstrap scripts to use `hostname` instead of `host`.
1. **Standardized label:** Both scrape configs use `hostname` (matching Prometheus) via shared `hostLabels`
2. **Added `tier` label:** Static label from `config.homelab.host.tier` (`test`/`prod`) on both scrape configs
3. **Added `role` label:** Static label from `config.homelab.host.role` on both scrape configs (conditionally, only when non-null)
No cardinality impact - `tier` and `role` are 1:1 with `hostname`, so they add metadata to existing streams without creating new ones.
This enables queries like:
- `{tier="prod"} |= "error"` - all errors on prod hosts
- `{role="dns"}` - all DNS server logs
- `{tier="test", job="systemd-journal"}` - journal logs from test hosts
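A minimal sketch of how the shared labels might be attached in the Promtail scrape configs (assuming the `services.promtail.configuration` option; the real `system/monitoring/logs.nix` also carries relabel and pipeline stages omitted here):

```nix
{ config, lib, ... }:
let
  # one set of static labels, reused by both scrape configs
  hostLabels = {
    hostname = config.networking.hostName;
    tier = config.homelab.host.tier;
  } // lib.optionalAttrs (config.homelab.host.role != null) {
    role = config.homelab.host.role;
  };
in
{
  services.promtail.configuration.scrape_configs = [
    {
      job_name = "journal";
      journal.labels = { job = "systemd-journal"; } // hostLabels;
    }
    {
      job_name = "varlog";
      static_configs = [{
        targets = [ "localhost" ];
        labels = { job = "varlog"; __path__ = "/var/log/**/*.log"; } // hostLabels;
      }];
    }
  ];
}
```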
### 5. Journal Priority → Level Label
**Implemented.** Promtail pipeline stages map journal `PRIORITY` to a `level` label:
| PRIORITY | level |
|----------|-------|
| 0-2 | critical |
| 3 | error |
| 4 | warning |
| 5 | notice |
| 6 | info |
| 7 | debug |
Uses a `json` stage to extract PRIORITY, a `template` stage to map it to a level name, and a `labels` stage to attach it. This gives reliable level filtering for all journal logs, unlike Loki's `detected_level`, which only works for apps that embed level keywords in the message text.
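A sketch of the pipeline as it might appear under the journal scrape config (stage names follow Promtail's documented stages; the exact mapping in `system/monitoring/logs.nix` may differ):

```nix
{
  pipeline_stages = [
    # Journal entries are shipped as JSON; extract the numeric PRIORITY field.
    { json.expressions.priority = "PRIORITY"; }
    # Map 0-7 onto a level name (0-2 critical, 3 error, 4 warning, 5 notice, 6 info, 7 debug).
    {
      template = {
        source = "priority";
        template = ''{{ if le .Value "2" }}critical{{ else if eq .Value "3" }}error{{ else if eq .Value "4" }}warning{{ else if eq .Value "5" }}notice{{ else if eq .Value "6" }}info{{ else }}debug{{ end }}'';
      };
    }
    # Attach the mapped value as the "level" label.
    { labels.level = "priority"; }
  ];
}
```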
Example queries:
- `{level="error"}` - all errors across the fleet
- `{level=~"critical|error", tier="prod"}` - prod errors and criticals
- `{level="warning", role="dns"}` - warnings from DNS servers
### 6. Enable JSON Logging on Services
**Problem:** Many services support structured JSON log output but may be using plain text by default. JSON logs are significantly easier to query in Loki - `| json` cleanly extracts all fields, whereas plain text requires fragile regex or pattern matching.
**Audit results (2026-02-13):**
**Already logging JSON:**
- Caddy (all instances) - JSON by default for access logs
- homelab-deploy (listener/builder) - Go app, logs structured JSON
**Supports JSON, not configured (high value):**
| Service | How to enable | Config file |
|---------|--------------|-------------|
| Prometheus | `--log.format=json` | `services/monitoring/prometheus.nix` |
| Alertmanager | `--log.format=json` | `services/monitoring/prometheus.nix` |
| Loki | `--log.format=json` | `services/monitoring/loki.nix` |
| Grafana | `log.console.format = "json"` | `services/monitoring/grafana.nix` |
| Tempo | `log_format: json` in config | `services/monitoring/tempo.nix` |
| OpenBao | `log_format = "json"` | `services/vault/default.nix` |
**Supports JSON, not configured (lower value - minimal log output):**
| Service | How to enable |
|---------|--------------|
| Pyroscope | `--log.format=json` (OCI container) |
| Blackbox Exporter | `--log.format=json` |
| Node Exporter | `--log.format=json` (all 16 hosts) |
| Systemd Exporter | `--log.format=json` (all 16 hosts) |
**No JSON support (syslog/text only):**
- NSD, Unbound, OpenSSH, Mosquitto
**Needs verification:**
- Kanidm, Jellyfin, Home Assistant, Harmonia, Zigbee2MQTT, NATS
**Recommendation:** Start with the monitoring stack (Prometheus, Alertmanager, Loki, Grafana, Tempo) since they're all Go apps with the same `--log.format=json` flag. Then OpenBao. The exporters are lower priority since they produce minimal log output.
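For the first batch, the changes are likely one-liners per service. A hedged sketch only; the NixOS option names have not been verified against each module, and Alertmanager/Tempo would need the equivalent flag or setting in their own modules:

```nix
{
  services.prometheus.extraFlags = [ "--log.format=json" ];
  services.loki.extraFlags = [ "--log.format=json" ];
  services.grafana.settings."log.console".format = "json";
  # OpenBao: log_format = "json" in its HCL settings (exact option path depends on the module)
}
```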
### 7. Monitoring CNAME for Promtail Target
**Problem:** Promtail hardcodes `monitoring01.home.2rjus.net:3100`. The VictoriaMetrics migration plan already addresses this by switching to a `monitoring` CNAME.
**Recommendation:** This should happen as part of the monitoring02 migration, not independently. If we do Loki improvements before that migration, keep pointing to monitoring01.
## Priority Ranking
| # | Improvement | Effort | Impact | Status |
|---|-------------|--------|--------|--------|
| 1 | **Retention policy** | Low | High | Done (30d compactor retention) |
| 2 | **Limits config** | Low | Medium | Done (rate limits + stream guards) |
| 3 | **Promtail labels** | Trivial | Low | Done (hostname/tier/role/level) |
| 4 | **Journal priority → level** | Low-medium | Medium | Done (pipeline stages) |
| 5 | **JSON logging audit** | Low-medium | Medium | Audited, not yet enabled |
| 6 | **Monitoring CNAME** | Low | Medium | Part of monitoring02 migration |
## Implementation Steps
### Phase 1: Retention + Labels (done 2026-02-13)
1. ~~Add `compactor` section to `services/monitoring/loki.nix`~~ Done
2. ~~Add `limits_config` with 30-day retention and basic rate limits~~ Done
3. ~~Update `system/monitoring/logs.nix`~~ Done:
- Standardized on `hostname` label (matching Prometheus) for both scrape configs
- Added `tier` and `role` static labels from `homelab.host` options
- Added pipeline stages for journal PRIORITY → `level` label mapping
4. ~~Update `pipe-to-loki` and bootstrap scripts to use `hostname`~~ Done
5. ~~Deploy and verify labels~~ Done - all 15 hosts reporting with correct labels
### Phase 2: JSON Logging (not started)
Enable JSON logging on services that support it, starting with the monitoring stack:
1. Prometheus, Alertmanager, Loki, Grafana, Tempo (`--log.format=json`)
2. OpenBao (`log_format = "json"`)
3. Lower priority: exporters (node-exporter, systemd-exporter, blackbox)
### Phase 3 (future): S3 Storage Migration
Revisit after NAS migration when a proper S3-compatible storage solution is available. At that point, add a new schema period with `object_store = "s3"` - the old filesystem period will continue serving historical data until it ages out past retention.
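When that time comes, the change would look roughly like the sketch below: a second schema period with `object_store = "s3"` plus an S3 client config. Dates, bucket name, and endpoint are placeholders, and credentials would be supplied via an environment file with `-config.expand-env`:

```nix
{
  services.loki.extraFlags = [ "--config.expand-env=true" ];
  services.loki.configuration = {
    schema_config.configs = [
      {
        from = "2025-01-01";              # existing filesystem period (placeholder date)
        store = "tsdb";
        object_store = "filesystem";
        schema = "v13";
        index = { prefix = "index_"; period = "24h"; };
      }
      {
        from = "2026-06-01";              # future cutover date; new chunks go to S3
        store = "tsdb";
        object_store = "s3";
        schema = "v13";
        index = { prefix = "index_"; period = "24h"; };
      }
    ];
    storage_config.aws = {
      endpoint = "s3.home.2rjus.net";
      region = "garage";
      bucketnames = "loki-chunks";
      access_key_id = "\${LOKI_S3_ACCESS_KEY}";
      secret_access_key = "\${LOKI_S3_SECRET_KEY}";
      s3forcepathstyle = true;
    };
  };
}
```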
## Open Questions
- [ ] Do we want per-stream retention (e.g., keep bootstrap/pipe-to-loki longer)?
## Notes
- Loki schema changes require adding a new period entry (not modifying existing ones). The old period continues serving historical data.
- Loki 3.6.3 requires `delete_request_store = "filesystem"` in the compactor config when retention is enabled.
- S3 storage deferred until post-NAS migration when a proper solution is available.
- As of 2026-02-13, Loki uses ~6.8 GB for ~30 days of logs from 16 hosts. Prometheus uses ~7.6 GB on the same disk (33 GB total, ~8 GB free).

View File

@@ -14,8 +14,8 @@ a `monitoring` CNAME for seamless transition.
- Alertmanager (routes to alerttonotify webhook)
- Grafana (dashboards, datasources)
- Loki (log aggregation from all hosts via Promtail)
- Tempo (distributed tracing)
- Pyroscope (continuous profiling)
- Tempo (distributed tracing) - not actively used
- Pyroscope (continuous profiling) - not actively used
**Hardcoded References to monitoring01:**
- `system/monitoring/logs.nix` - Promtail sends logs to `http://monitoring01.home.2rjus.net:3100`
@@ -44,9 +44,7 @@ If multi-year retention with downsampling becomes necessary later, Thanos can be
│ VictoriaMetrics│
│ + Grafana │
monitoring │ + Loki │
CNAME ──────────│ + Tempo
│ + Pyroscope │
│ + Alertmanager │
CNAME ──────────│ + Alertmanager
│ (vmalert) │
└─────────────────┘
@@ -61,53 +59,48 @@ If multi-year retention with downsampling becomes necessary later, Thanos can be
## Implementation Plan
### Phase 1: Create monitoring02 Host
### Phase 1: Create monitoring02 Host [COMPLETE]
Use `create-host` script which handles flake.nix and terraform/vms.tf automatically.
1. **Run create-host**: `nix develop -c create-host monitoring02 10.69.13.24`
2. **Update VM resources** in `terraform/vms.tf`:
- 4 cores (same as monitoring01)
- 8GB RAM (double, for VictoriaMetrics headroom)
- 100GB disk (for 3+ months retention with compression)
3. **Update host configuration**: Import monitoring services
4. **Create Vault AppRole**: Add to `terraform/vault/approle.tf`
Host created and deployed at 10.69.13.24 (prod tier) with:
- 4 CPU cores, 8GB RAM, 60GB disk
- Vault integration enabled
- NATS-based remote deployment enabled
- Grafana with Kanidm OIDC deployed as test instance (`grafana-test.home.2rjus.net`)
### Phase 2: Set Up VictoriaMetrics Stack
Create new service module at `services/monitoring/victoriametrics/` for testing alongside existing
Prometheus config. Once validated, this can replace the Prometheus module.
New service module at `services/victoriametrics/` for VictoriaMetrics + vmalert + Alertmanager.
Imported by monitoring02 alongside the existing Grafana service.
1. **VictoriaMetrics** (port 8428):
1. **VictoriaMetrics** (port 8428): [DONE]
- `services.victoriametrics.enable = true`
- `services.victoriametrics.retentionPeriod = "3m"` (3 months, increase later based on disk usage)
- Migrate scrape configs via `prometheusConfig`
- Use native push support (replaces Pushgateway)
- `retentionPeriod = "3"` (3 months)
- All scrape configs migrated from Prometheus (22 jobs including auto-generated)
- Static user override (DynamicUser disabled) for credential file access
- OpenBao token fetch service + 30min refresh timer
- Apiary bearer token via vault.secrets
2. **vmalert** for alerting rules:
- `services.vmalert.enable = true`
- Point to VictoriaMetrics for metrics evaluation
- Keep rules in separate `rules.yml` file (same format as Prometheus)
- No receiver configured during parallel operation (prevents duplicate alerts)
2. **vmalert** for alerting rules: [DONE]
- Points to VictoriaMetrics datasource at localhost:8428
- Reuses existing `services/monitoring/rules.yml` directly via `settings.rule`
- No notifier configured during parallel operation (prevents duplicate alerts)
3. **Alertmanager** (port 9093):
- Keep existing configuration (alerttonotify webhook routing)
- Only enable receiver after cutover from monitoring01
3. **Alertmanager** (port 9093): [DONE]
- Same configuration as monitoring01 (alerttonotify webhook routing)
- Will only receive alerts after cutover (vmalert notifier disabled)
4. **Loki** (port 3100):
- Same configuration as current
4. **Grafana** (port 3000): [DONE]
- VictoriaMetrics datasource (localhost:8428) as default
- monitoring01 Prometheus datasource kept for comparison during parallel operation
- Loki datasource pointing to localhost (after Loki migrated to monitoring02)
5. **Grafana** (port 3000):
- Define dashboards declaratively via NixOS options (not imported from monitoring01)
- Reference existing dashboards on monitoring01 for content inspiration
- Configure VictoriaMetrics datasource (port 8428)
- Configure Loki datasource
5. **Loki** (port 3100): [DONE]
- Same configuration as monitoring01 in standalone `services/loki/` module
- Grafana datasource updated to localhost:3100
6. **Tempo** (ports 3200, 3201):
- Same configuration
7. **Pyroscope** (port 4040):
- Same Docker-based deployment
**Note:** pve-exporter and pushgateway scrape targets are not included on monitoring02.
pve-exporter requires a local exporter instance; pushgateway is replaced by VictoriaMetrics
native push support.
### Phase 3: Parallel Operation
@@ -147,7 +140,6 @@ Update hardcoded references to use the CNAME:
- prometheus.home.2rjus.net -> monitoring.home.2rjus.net:8428
- alertmanager.home.2rjus.net -> monitoring.home.2rjus.net:9093
- grafana.home.2rjus.net -> monitoring.home.2rjus.net:3000
- pyroscope.home.2rjus.net -> monitoring.home.2rjus.net:4040
Note: `hosts/template2/bootstrap.nix` stays pointed at monitoring01 until decommission.
@@ -171,57 +163,26 @@ Once ready to cut over:
## Current Progress
### monitoring02 Host Created (2026-02-08)
Host deployed at 10.69.13.24 (test tier) with:
- 4 CPU cores, 8GB RAM, 60GB disk
- Vault integration enabled
- NATS-based remote deployment enabled
### Grafana with Kanidm OIDC (2026-02-08)
Grafana deployed on monitoring02 as a test instance (`grafana-test.home.2rjus.net`):
- Kanidm OIDC authentication (PKCE enabled)
- Role mapping: `admins` → Admin, others → Viewer
- Declarative datasources pointing to monitoring01 (Prometheus, Loki)
- Local Caddy for TLS termination via internal ACME CA
This validates the Grafana + OIDC pattern before the full VictoriaMetrics migration. The existing
`services/monitoring/grafana.nix` on monitoring01 can be replaced with the new `services/grafana/`
module once monitoring02 becomes the primary monitoring host.
- **Phase 1** complete (2026-02-08): monitoring02 host created, Grafana with Kanidm OIDC validated
- **Phase 2** complete (2026-02-17): VictoriaMetrics, vmalert, Alertmanager, Loki, Grafana datasources configured
- Tempo and Pyroscope deferred (not actively used; can be added later if needed)
## Open Questions
- [ ] What disk size for monitoring02? Current 60GB may need expansion for 3+ months with VictoriaMetrics
- [ ] Which dashboards to recreate declaratively? (Review monitoring01 Grafana for current set)
- [ ] Consider replacing Promtail with Grafana Alloy (`services.alloy`, v1.12.2 in nixpkgs). Promtail is in maintenance mode and Grafana recommends Alloy as the successor. Alloy is a unified collector (logs, metrics, traces, profiles) but uses its own "River" config format instead of YAML, so less Nix-native ergonomics. Could bundle the migration with monitoring02 to consolidate disruption.
## VictoriaMetrics Service Configuration
Example NixOS configuration for monitoring02:
Implemented in `services/victoriametrics/default.nix`. Key design decisions:
```nix
# VictoriaMetrics replaces Prometheus
services.victoriametrics = {
enable = true;
retentionPeriod = "3m"; # 3 months, increase based on disk usage
prometheusConfig = {
global.scrape_interval = "15s";
scrape_configs = [
# Auto-generated node-exporter targets
# Service-specific scrape targets
# External targets
];
};
};
# vmalert for alerting rules (no receiver during parallel operation)
services.vmalert = {
enable = true;
datasource.url = "http://localhost:8428";
# notifier.alertmanager.url = "http://localhost:9093"; # Enable after cutover
rule = [ ./rules.yml ];
};
```
- **Static user**: VictoriaMetrics NixOS module uses `DynamicUser`, overridden with a static
`victoriametrics` user so vault.secrets and credential files work correctly
- **Shared rules**: vmalert reuses `services/monitoring/rules.yml` via `settings.rule` path
reference (no YAML-to-Nix conversion needed)
- **Scrape config reuse**: Uses the same `lib/monitoring.nix` functions and
`services/monitoring/external-targets.nix` as Prometheus for auto-generated targets
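The static-user override described above could be expressed roughly as follows (a sketch; the actual module code may differ in user/group names and extra hardening):

```nix
{ lib, ... }:
{
  users.users.victoriametrics = {
    isSystemUser = true;
    group = "victoriametrics";
  };
  users.groups.victoriametrics = { };

  systemd.services.victoriametrics.serviceConfig = {
    DynamicUser = lib.mkForce false;   # the upstream module defaults to DynamicUser
    User = "victoriametrics";
    Group = "victoriametrics";
  };
}
```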
## Rollback Plan

docs/plans/new-services.md (new file, 145 lines)
View File

@@ -0,0 +1,145 @@
# New Service Candidates
Ideas for additional services to deploy in the homelab. These lean more enterprise/obscure
than the typical self-hosted fare.
## Litestream
Continuous SQLite replication to S3-compatible storage. Streams WAL changes in near-real-time,
providing point-in-time recovery without scheduled backup jobs.
**Why:** Several services use SQLite (Home Assistant, potentially others). Litestream would
give continuous backup to Garage S3 with minimal resource overhead and near-zero configuration.
Replaces cron-based backup scripts with a small daemon per database.
**Integration points:**
- Garage S3 as replication target (already deployed)
- Home Assistant SQLite database is the primary candidate
- Could also cover any future SQLite-backed services
**Complexity:** Low. Single Go binary, minimal config (source DB path + S3 endpoint).
**NixOS packaging:** Available in nixpkgs as `litestream`.
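A rough sketch of what this could look like, assuming the nixpkgs `services.litestream` module (with `settings` and `environmentFile` options) and following the repo's existing `vault.secrets` pattern; the database path, bucket, and secret path are placeholders:

```nix
{ config, ... }:
{
  # S3 credentials (LITESTREAM_ACCESS_KEY_ID / LITESTREAM_SECRET_ACCESS_KEY)
  vault.secrets.litestream-env = {
    secretPath = "hosts/${config.networking.hostName}/litestream";
    extractKey = "env";
    outputDir = "/run/secrets/litestream-env";
    services = [ "litestream" ];
  };

  services.litestream = {
    enable = true;
    environmentFile = "/run/secrets/litestream-env";
    settings.dbs = [
      {
        path = "/var/lib/hass/home-assistant_v2.db"; # primary candidate
        replicas = [
          {
            type = "s3";
            endpoint = "https://s3.home.2rjus.net";
            bucket = "litestream";
            path = "home-assistant";
          }
        ];
      }
    ];
  };
}
```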
---
## ntopng
Deep network traffic analysis and flow monitoring. Provides real-time visibility into bandwidth
usage, protocol distribution, top talkers, and anomaly detection via a web UI.
**Why:** We have host-level metrics (node-exporter) and logs (Loki) but no network-level
visibility. ntopng would show traffic patterns across the infrastructure — NFS throughput to
the NAS, DNS query volume, inter-host traffic, and bandwidth anomalies. Useful for capacity
planning and debugging network issues.
**Integration points:**
- Could export metrics to Prometheus via its built-in exporter
- Web UI behind http-proxy with Kanidm OIDC (if supported) or Pomerium
- NetFlow/sFlow from managed switches (if available)
- Passive traffic capture on a mirror port or the monitoring host itself
**Complexity:** Medium. Needs network tap or mirror port for full visibility, or can run
in host-local mode. May need a dedicated interface or VLAN mirror.
**NixOS packaging:** Available in nixpkgs as `ntopng`.
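A minimal sketch of host-local mode, assuming the nixpkgs `services.ntopng` module exposes `interfaces` and `httpPort` (option names unverified; the capture interface is a placeholder):

```nix
{
  services.ntopng = {
    enable = true;
    # Host-local capture on the monitoring host's own interface;
    # a mirror port or NetFlow input would replace this later
    interfaces = [ "ens18" ];
    httpPort = 3001;
  };
}
```

The web UI would then sit behind the usual Caddy reverse proxy on an internal CNAME.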
---
## Renovate
Automated dependency update bot that understands Nix flakes natively. Creates branches/PRs
to bump flake inputs on a configurable schedule.
**Why:** Currently `nix flake update` is manual. Renovate can automatically propose updates
to individual flake inputs (nixpkgs, homelab-deploy, nixos-exporter, etc.), group related
updates, and respect schedules. More granular than updating everything at once — can bump
nixpkgs weekly but hold back other inputs, auto-merge patch-level changes, etc.
**Integration points:**
- Runs against git.t-juice.club repositories
- Understands `flake.lock` format natively
- Could target both `nixos-servers` and `nixos` repos
- Update branches would be validated by homelab-deploy builder
**Complexity:** Medium. Needs git forge integration (Gitea/Forgejo API). Self-hosted runner
mode available. Configuration via `renovate.json` in each repo.
**NixOS packaging:** Available in nixpkgs as `renovate`.
---
## Pomerium
Identity-aware reverse proxy implementing zero-trust access. Every request is authenticated
and authorized based on identity, device, and context — not just network location.
**Why:** Currently Caddy terminates TLS but doesn't enforce authentication on most services.
Pomerium would put Kanidm OIDC authentication in front of every internal service, with
per-route authorization policies (e.g., "only admins can access Prometheus," "require re-auth
for Vault UI"). Directly addresses the security hardening plan's goals.
**Integration points:**
- Kanidm as OIDC identity provider (already deployed)
- Could replace or sit in front of Caddy for internal services
- Per-route policies based on Kanidm groups (admins, users, ssh-users)
- Centralizes access logging and audit trail
**Complexity:** Medium-high. Needs careful integration with existing Caddy reverse proxy.
Decision needed on whether Pomerium replaces Caddy or works alongside it (Pomerium for
auth, Caddy for TLS termination and routing, or Pomerium handles everything).
**NixOS packaging:** Available in nixpkgs as `pomerium`.
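A very rough sketch of the "Pomerium for auth, Caddy keeps routing" variant, assuming the nixpkgs `services.pomerium` module passes raw Pomerium config through `settings` and secrets through `secretsFile`; the Kanidm URLs, client ID, and route are placeholders, and the per-route policy syntax needs checking against the Pomerium docs:

```nix
{
  services.pomerium = {
    enable = true;
    secretsFile = "/run/secrets/pomerium-env"; # shared secret, cookie secret, IdP client secret
    settings = {
      authenticate_service_url = "https://auth.home.2rjus.net";
      idp_provider = "oidc";
      idp_provider_url = "https://kanidm01.home.2rjus.net/oauth2/openid/pomerium";
      idp_client_id = "pomerium";
      routes = [
        {
          from = "https://prometheus.home.2rjus.net";
          to = "http://monitoring01.home.2rjus.net:9090";
          # "only admins can access Prometheus" policy would be attached here
        }
      ];
    };
  };
}
```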
---
## Apache Guacamole
Clientless remote desktop and SSH gateway. Provides browser-based access to hosts via
RDP, VNC, SSH, and Telnet with no client software required. Supports session recording
and playback.
**Why:** Provides an alternative remote access path that doesn't require VPN software or
SSH keys on the client device. Useful for accessing hosts from untrusted machines (phone,
borrowed laptop) or providing temporary access to others. Session recording gives an audit
trail. Could complement the WireGuard remote access plan rather than replace it.
**Integration points:**
- Kanidm for authentication (OIDC or LDAP)
- Behind http-proxy or Pomerium for TLS
- SSH access to all hosts in the fleet
- Session recordings could be stored on Garage S3
- Could serve as the "emergency access" path when VPN is unavailable
**Complexity:** Medium. Java-based (guacd + web app), typically needs PostgreSQL for
connection/user storage (already available). Docker is the common deployment method but
native packaging exists.
**NixOS packaging:** Available in nixpkgs as `guacamole-server` and `guacamole-client`.
---
## CrowdSec
Collaborative intrusion prevention system with crowd-sourced threat intelligence.
Parses logs to detect attack patterns, applies remediation (firewall bans, CAPTCHA),
and shares/receives threat signals from a global community network.
**Why:** Goes beyond fail2ban with behavioral detection, crowd-sourced IP reputation,
and a scenario-based engine. Fits the security hardening plan. The community blocklist
means we benefit from threat intelligence gathered across thousands of deployments.
Could parse SSH logs, HTTP access logs, and other service logs to detect and block
malicious activity.
**Integration points:**
- Could consume logs from Loki or directly from journald/log files
- Firewall bouncer for iptables/nftables remediation
- Caddy bouncer for HTTP-level blocking
- Prometheus metrics exporter for alert integration
- Scenarios available for SSH brute force, HTTP scanning, and more
- Feeds into existing alerting pipeline (Alertmanager -> alerttonotify)
**Complexity:** Medium. Agent (log parser + decision engine) on each host or centralized.
Bouncers (enforcement) on edge hosts. Free community tier includes threat intel access.
**NixOS packaging:** Available in nixpkgs as `crowdsec`.


@@ -1,212 +0,0 @@
# Nix Cache Host Reprovision
## Overview
Reprovision `nix-cache01` using the OpenTofu workflow, and improve the build/cache system with:
1. NATS-based remote build triggering (replacing the current bash script)
2. Safer flake update workflow that validates builds before pushing to master
## Current State
### Host Configuration
- `nix-cache01` at 10.69.13.15 serves the binary cache via Harmonia
- Runs Gitea Actions runner for CI workflows
- Has `homelab.deploy.enable = true` (already supports NATS-based deployment)
- Uses a dedicated XFS volume at `/nix` for cache storage
### Current Build System (`services/nix-cache/build-flakes.sh`)
- Runs every 30 minutes via systemd timer
- Clones/pulls two repos: `nixos-servers` and `nixos` (gunter)
- Builds all hosts with `nixos-rebuild build` (no blacklist despite docs mentioning it)
- Pushes success/failure metrics to pushgateway
- Simple but has no filtering, no parallelism, no remote triggering
### Current Flake Update Workflow (`.github/workflows/flake-update.yaml`)
- Runs daily at midnight via cron
- Runs `nix flake update --commit-lock-file`
- Pushes directly to master
- No build validation — can push broken inputs
## Improvement 1: NATS-Based Remote Build Triggering
### Design
Extend the existing `homelab-deploy` tool to support a "build" command that triggers builds on the cache host. This reuses the NATS infrastructure already in place.
| Approach | Pros | Cons |
|----------|------|------|
| Extend homelab-deploy | Reuses existing NATS auth, NKey handling, CLI | Adds scope to existing tool |
| New nix-cache-tool | Clean separation | Duplicate NATS boilerplate, new credentials |
| Gitea Actions webhook | No custom tooling | Less flexible, tied to Gitea |
**Recommendation:** Extend `homelab-deploy` with a build subcommand. The tool already has NATS client code, authentication handling, and a listener module in NixOS.
### Implementation
1. Add new message type to homelab-deploy: `build.<host>` subject
2. Listener on nix-cache01 subscribes to `build.>` wildcard
3. On message receipt, builds the specified host and returns success/failure
4. CLI command: `homelab-deploy build <hostname>` or `homelab-deploy build --all`
### Benefits
- Trigger rebuild for specific host to ensure it's cached
- Could be called from CI after merging PRs
- Reuses existing NATS infrastructure and auth
- Progress/status could stream back via NATS reply
## Improvement 2: Smarter Flake Update Workflow
### Current Problems
1. Updates can push breaking changes to master
2. No visibility into what broke when it does
3. Hosts that auto-update can pull broken configs
### Proposed Workflow
```
┌─────────────────────────────────────────────────────────────────┐
│ Flake Update Workflow │
├─────────────────────────────────────────────────────────────────┤
│ 1. nix flake update (on feature branch) │
│ 2. Build ALL hosts locally │
│ 3. If all pass → fast-forward merge to master │
│ 4. If any fail → create PR with failure logs attached │
└─────────────────────────────────────────────────────────────────┘
```
### Implementation Options
| Option | Description | Pros | Cons |
|--------|-------------|------|------|
| **A: Self-hosted runner** | Build on nix-cache01 | Fast (local cache), simple | Ties up cache host during build |
| **B: Gitea Actions only** | Use container runner | Clean separation | Slow (no cache), resource limits |
| **C: Hybrid** | Trigger builds on nix-cache01 via NATS from Actions | Best of both | More complex |
**Recommendation:** Option A with nix-cache01 as the runner. The host is already running Gitea Actions runner and has the cache. Building all ~16 hosts is disk I/O heavy but feasible on dedicated hardware.
### Workflow Steps
1. Workflow runs on schedule (daily or weekly)
2. Creates branch `flake-update/YYYY-MM-DD`
3. Runs `nix flake update --commit-lock-file`
4. Builds each host: `nix build .#nixosConfigurations.<host>.config.system.build.toplevel`
5. If all succeed:
- Fast-forward merge to master
- Delete feature branch
6. If any fail:
- Create PR from the update branch
- Attach build logs as PR comment
- Label PR with `needs-review` or `build-failure`
- Do NOT merge automatically
### Workflow File Changes
```yaml
# New: .github/workflows/flake-update-safe.yaml
name: Safe flake update
on:
  schedule:
    - cron: "0 2 * * 0" # Weekly on Sunday at 2 AM
  workflow_dispatch: # Manual trigger
jobs:
  update-and-validate:
    runs-on: homelab # Use self-hosted runner on nix-cache01
    steps:
      - uses: actions/checkout@v4
        with:
          ref: master
          fetch-depth: 0 # Need full history for merge
      - name: Create update branch
        run: |
          BRANCH="flake-update/$(date +%Y-%m-%d)"
          git checkout -b "$BRANCH"
          # Persist the branch name so later steps can reference it
          echo "BRANCH=$BRANCH" >> "$GITHUB_ENV"
      - name: Update flake
        run: nix flake update --commit-lock-file
      - name: Build all hosts
        id: build
        run: |
          set -o pipefail # don't let tee mask nix build failures
          FAILED=""
          for host in $(nix flake show --json | jq -r '.nixosConfigurations | keys[]'); do
            echo "Building $host..."
            if ! nix build ".#nixosConfigurations.$host.config.system.build.toplevel" 2>&1 | tee "build-$host.log"; then
              FAILED="$FAILED $host"
            fi
          done
          echo "failed=$FAILED" >> "$GITHUB_OUTPUT"
      - name: Merge to master (if all pass)
        if: steps.build.outputs.failed == ''
        run: |
          git checkout master
          git merge --ff-only "$BRANCH"
          git push origin master
          git push origin --delete "$BRANCH"
      - name: Create PR (if any fail)
        if: steps.build.outputs.failed != ''
        run: |
          git push origin "$BRANCH"
          # Create PR via Gitea API with build logs
          # ... (PR creation with log attachment)
```
## Migration Steps
### Phase 1: Reprovision Host via OpenTofu
1. Add `nix-cache01` to `terraform/vms.tf`:
```hcl
"nix-cache01" = {
ip = "10.69.13.15/24"
cpu_cores = 4
memory = 8192
disk_size = "100G" # Larger for nix store
}
```
2. Shut down existing nix-cache01 VM
3. Run `tofu apply` to provision new VM
4. Verify bootstrap completes and cache is serving
**Note:** The cache will be cold after reprovision. Run initial builds to populate.
### Phase 2: Add Build Triggering to homelab-deploy
1. Add `build` command to homelab-deploy CLI
2. Add listener handler in NixOS module for `build.*` subjects
3. Update nix-cache01 config to enable build listener
4. Test with `homelab-deploy build testvm01`
### Phase 3: Implement Safe Flake Update Workflow
1. Create `.github/workflows/flake-update-safe.yaml`
2. Disable or remove old `flake-update.yaml`
3. Test manually with `workflow_dispatch`
4. Monitor first automated run
### Phase 4: Remove Old Build Script
1. After new workflow is stable, remove:
- `services/nix-cache/build-flakes.nix`
- `services/nix-cache/build-flakes.sh`
2. The new workflow handles scheduled builds
## Open Questions
- [ ] What runner labels should the self-hosted runner use for the update workflow?
- [ ] Should we build hosts in parallel (faster) or sequentially (easier to debug)?
- [ ] How long to keep flake-update PRs open before auto-closing stale ones?
- [ ] Should successful updates trigger a NATS notification to rebuild all hosts?
- [ ] What to do about `gunter` (external nixos repo) - include in validation?
- [ ] Disk size for new nix-cache01 - is 100G enough for cache + builds?
## Notes
- The existing `homelab.deploy.enable = true` on nix-cache01 means it already has NATS connectivity
- The Harmonia service and cache signing key will work the same after reprovision
- Actions runner token is in Vault, will be provisioned automatically
- Consider adding a `homelab.host.role = "build-host"` label for monitoring/filtering

docs/plans/nixos-router.md Normal file

@@ -0,0 +1,162 @@
# NixOS Router — Replace EdgeRouter
Replace the aging Ubiquiti EdgeRouter (gw, 10.69.10.1) with a NixOS-based router.
The EdgeRouter is suspected to be a throughput bottleneck. A NixOS router integrates
naturally with the existing fleet: same config management, same monitoring pipeline,
same deployment workflow.
## Goals
- Eliminate the EdgeRouter throughput bottleneck
- Full integration with existing monitoring (node-exporter, promtail, Prometheus, Loki)
- Declarative firewall and routing config managed in the flake
- Inter-VLAN routing for all existing subnets
- DHCP server for client subnets
- NetFlow/traffic accounting for future ntopng integration
- Foundation for WireGuard remote access (see remote-access.md)
## Current Network Topology
**Subnets (known VLANs):**
| VLAN/Subnet | Purpose | Notable hosts |
|----------------|------------------|----------------------------------------|
| 10.69.10.0/24 | Gateway | gw (10.69.10.1) |
| 10.69.12.0/24 | Core services | nas, pve1, arr jails, restic |
| 10.69.13.0/24 | Infrastructure | All NixOS servers (static IPs) |
| 10.69.22.0/24 | WLAN | unifi-ctrl |
| 10.69.30.0/24 | Workstations | gunter |
| 10.69.31.0/24 | Media | media |
| 10.69.99.0/24 | Management | sw1 (MikroTik CRS326-24G-2S+) |
**DNS:** ns1 (10.69.13.5) and ns2 (10.69.13.6) handle all resolution. Upstream is
Cloudflare/Google over DoT via Unbound.
**Switch:** MikroTik CRS326-24G-2S+ — L2 switching with VLAN trunking. Capable of
L3 routing via RouterOS but not ideal for sustained routing throughput.
## Hardware
Needs a small x86 box with:
- At least 2 NICs (WAN + LAN trunk). Dual 2.5GbE preferred.
- Enough CPU for nftables NAT at line rate (any modern x86 is fine)
- 4-8 GB RAM (plenty for routing + DHCP + NetFlow accounting)
- Low power consumption, fanless preferred for always-on use
Candidates:
- Topton / CWWK mini PC with dual/quad Intel 2.5GbE (~100-150 EUR)
- Protectli Vault (more expensive, ~200-300 EUR, proven in pfSense/OPNsense community)
- Any mini PC with one onboard NIC + one USB 2.5GbE adapter (cheapest, less ideal)
The LAN port would carry a VLAN trunk to the MikroTik switch, with sub-interfaces
for each VLAN. WAN port connects to the ISP uplink.
## NixOS Configuration
### Stability Policy
The router is treated differently from the rest of the fleet:
- **No auto-upgrade** — `system.autoUpgrade.enable = false`
- **No homelab-deploy listener** — `homelab.deploy.enable = false`
- **Manual updates only** — update every few months, test-build first
- **Use `nixos-rebuild boot`** — changes take effect on next deliberate reboot
- **Tier: prod, priority: high** — alerts treated with highest priority
### Core Services
**Routing & NAT:**
- `systemd-networkd` for all interface config (consistent with rest of fleet)
- VLAN sub-interfaces on the LAN trunk (one per subnet)
- `networking.nftables` for stateful firewall and NAT
- IP forwarding enabled (`net.ipv4.ip_forward = 1`)
- Masquerade outbound traffic on WAN interface
**DHCP:**
- Kea or dnsmasq for DHCP on client subnets (WLAN, workstations, media)
- Infrastructure subnet (10.69.13.0/24) stays static — no DHCP needed
- Static leases for known devices
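A sketch for one client subnet using Kea (`services.kea.dhcp4`); the VLAN interface name, pool range, and reservation are placeholders:

```nix
{
  services.kea.dhcp4 = {
    enable = true;
    settings = {
      interfaces-config.interfaces = [ "vlan30" ]; # workstations VLAN, name assumed
      valid-lifetime = 86400;
      subnet4 = [
        {
          id = 30;
          subnet = "10.69.30.0/24";
          pools = [ { pool = "10.69.30.100 - 10.69.30.199"; } ];
          option-data = [
            { name = "routers"; data = "10.69.30.1"; }
            { name = "domain-name-servers"; data = "10.69.13.5, 10.69.13.6"; }
          ];
          # Static leases for known devices
          reservations = [
            { hw-address = "aa:bb:cc:dd:ee:ff"; ip-address = "10.69.30.10"; }
          ];
        }
      ];
    };
  };
}
```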
**Firewall (nftables):**
- Default deny between VLANs
- Explicit allow rules for known cross-VLAN traffic:
- All subnets → ns1/ns2 (DNS)
- All subnets → monitoring01 (metrics/logs)
- Infrastructure → all (management access)
- Workstations → media, core services
- NAT masquerade on WAN
- Rate limiting on WAN-facing services
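Tying the routing, NAT, and firewall pieces together, a minimal sketch (interface names, VLAN IDs, and addresses are assumptions; the default-deny inter-VLAN rules would go in `networking.nftables.ruleset` on top of this):

```nix
{
  boot.kernel.sysctl."net.ipv4.ip_forward" = 1;

  systemd.network = {
    # One VLAN sub-interface per subnet on the LAN trunk
    netdevs."10-vlan13" = {
      netdevConfig = { Kind = "vlan"; Name = "vlan13"; };
      vlanConfig.Id = 13;
    };
    networks."20-lan-trunk" = {
      matchConfig.Name = "enp2s0"; # trunk to the MikroTik, name assumed
      vlan = [ "vlan13" ];
    };
    networks."30-vlan13" = {
      matchConfig.Name = "vlan13";
      address = [ "10.69.13.1/24" ]; # gateway address for the infra subnet
    };
  };

  # Stateful firewall + NAT via nftables; masquerade outbound on WAN
  networking.nftables.enable = true;
  networking.nat = {
    enable = true;
    externalInterface = "enp1s0"; # WAN, name assumed
    internalInterfaces = [ "vlan13" ];
  };
}
```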
**Traffic Accounting:**
- nftables flow accounting or softflowd for NetFlow export
- Export to future ntopng instance (see new-services.md)
### Monitoring Integration
Since this is a NixOS host in the flake, it gets the standard monitoring stack for free:
- node-exporter for system metrics (CPU, memory, NIC throughput per interface)
- promtail shipping logs to Loki
- Prometheus scrape target auto-registration
- Alertmanager alerts for host-down, high CPU, etc.
Additional router-specific monitoring:
- Per-VLAN interface traffic metrics via node-exporter (automatic for all interfaces)
- NAT connection tracking table size
- WAN uplink status and throughput
- DHCP lease metrics (if Kea, it has a Prometheus exporter)
This is a significant advantage over the EdgeRouter — full observability through
the existing Grafana dashboards and Loki log search, debuggable via the monitoring
MCP tools.
### WireGuard Integration
The remote access plan (remote-access.md) currently proposes a separate `extgw01`
gateway host. With a NixOS router, there's a decision to make:
**Option A:** WireGuard terminates on the router itself. Simplest topology — the
router is already the gateway, so VPN traffic doesn't need extra hops or firewall
rules. But adds complexity to the router, which should stay simple.
**Option B:** Keep extgw01 as a separate host (original plan). Router just routes
traffic to it. Better separation of concerns, router stays minimal.
Recommendation: Start with option B (keep it separate). The router should do routing
and nothing else. WireGuard can move to the router later if extgw01 feels redundant.
## Migration Plan
### Phase 1: Build and lab test
- Acquire hardware
- Create host config in the flake (routing, NAT, DHCP, firewall)
- Test-build on workstation: `nix build .#nixosConfigurations.router01.config.system.build.toplevel`
- Lab test with a temporary setup if possible (two NICs, isolated VLAN)
### Phase 2: Prepare cutover
- Pre-configure the MikroTik switch trunk port for the new router
- Document current EdgeRouter config (port forwarding, NAT rules, DHCP leases)
- Replicate all rules in the NixOS config
- Verify DNS, DHCP, and inter-VLAN routing work in test
### Phase 3: Cutover
- Schedule a maintenance window (brief downtime expected)
- Swap WAN cable from EdgeRouter to new router
- Swap LAN trunk from EdgeRouter to new router
- Verify connectivity from each VLAN
- Verify internet access, DNS resolution, inter-VLAN routing
- Monitor via Prometheus/Loki (immediately available since it's a fleet host)
### Phase 4: Decommission EdgeRouter
- Keep EdgeRouter available as fallback for a few weeks
- Remove `gw` entry from external-hosts.nix, replace with flake-managed host
- Update any references to 10.69.10.1 if the router IP changes
## Open Questions
- **Router IP:** Keep 10.69.10.1 or move to a different address? Each VLAN
sub-interface needs an IP (the gateway address for that subnet).
- **ISP uplink:** What type of WAN connection? PPPoE, DHCP, static IP?
- **Port forwarding:** What ports are currently forwarded on the EdgeRouter?
These need to be replicated in nftables.
- **DHCP scope:** Which subnets currently get DHCP from the EdgeRouter vs
other sources (UniFi controller for WLAN?)?
- **UPnP/NAT-PMP:** Needed for any devices? (gaming consoles, etc.)
- **Hardware preference:** Fanless mini PC budget and preferred vendor?


@@ -4,119 +4,127 @@
## Goal
Enable remote access to some or all homelab services from outside the internal network, without exposing anything directly to the internet.
Enable personal remote access to selected homelab services from outside the internal network, without exposing anything directly to the internet.
## Current State
- All services are only accessible from the internal 10.69.13.x network
- Exception: jelly01 has a WireGuard link to an external VPS
- No services are directly exposed to the public internet
- http-proxy has a WireGuard tunnel (`wg0`, `10.69.222.0/24`) to a VPS (`docker2.t-juice.club`) on an OpenStack cluster
- VPS runs Traefik which proxies selected services (including Jellyfin) back through the tunnel to http-proxy's Caddy
- No other services are directly exposed to the public internet
## Constraints
## Decision: WireGuard Gateway
- Nothing should be directly accessible from the outside
- Must use VPN or overlay network (no port forwarding of services)
- Self-hosted solutions preferred over managed services
After evaluating WireGuard gateway vs Headscale (self-hosted Tailscale), the **WireGuard gateway** approach was chosen:
## Options
- Only 2 client devices (laptop + phone), so Headscale's device management UX isn't needed
- Split DNS works fine on the Linux laptop via systemd-resolved; all-or-nothing DNS on the phone is acceptable for occasional use
- Simpler infrastructure - no control server to maintain
- Builds on existing WireGuard experience and setup
### 1. WireGuard Gateway (Internal Router)
## Architecture
A dedicated NixOS host on the internal network with a WireGuard tunnel out to the VPS. The VPS becomes the public entry point, and the gateway routes traffic to internal services. Firewall rules on the gateway control which services are reachable.
```
┌─────────────────────────────────┐
│ VPS (OpenStack) │
Laptop/Phone ──→ │ WireGuard endpoint │
(WireGuard) │ Client peers: laptop, phone │
│ Routes 10.69.13.0/24 via tunnel│
└──────────┬──────────────────────┘
│ WireGuard tunnel
┌─────────────────────────────────┐
│ extgw01 (gateway + bastion) │
│ - WireGuard tunnel to VPS │
│ - Firewall (allowlist only) │
│ - SSH + 2FA (full access) │
└──────────┬──────────────────────┘
│ allowed traffic only
┌─────────────────────────────────┐
│ Internal network 10.69.13.0/24 │
│ - monitoring01:3000 (Grafana) │
│ - jelly01:8096 (Jellyfin) │
│ - *-jail hosts (arr stack) │
└─────────────────────────────────┘
```
**Pros:**
- Simple, well-understood technology
- Already running WireGuard for jelly01
- Full control over routing and firewall rules
- Excellent NixOS module support
- No extra dependencies
### Existing path (unchanged)
**Cons:**
- Hub-and-spoke topology (all traffic goes through VPS)
- Manual peer management
- Adding a new client device means editing configs on both VPS and gateway
The current public access path stays as-is:
### 2. WireGuard Mesh (No Relay)
```
Internet → VPS (Traefik) → WireGuard → http-proxy (Caddy) → internal services
```
Each client device connects directly to a WireGuard endpoint. Could be on the VPS which forwards to the homelab, or if there is a routable IP at home, directly to an internal host.
This handles public Jellyfin access and any other publicly-exposed services.
**Pros:**
- Simple and fast
- No extra software
### New path (personal VPN)
**Cons:**
- Manual key and endpoint management for every peer
- Doesn't scale well
- If behind CGNAT, still needs the VPS as intermediary
A separate WireGuard tunnel for personal remote access with restricted firewall rules:
### 3. Headscale (Self-Hosted Tailscale)
```
Laptop/Phone → VPS (WireGuard peers) → tunnel → extgw01 (firewall) → allowed services
```
Run a Headscale control server (on the VPS or internally) and install the Tailscale client on homelab hosts and personal devices. Gets the Tailscale mesh networking UX without depending on Tailscale's infrastructure.
### Access tiers
**Pros:**
- Mesh topology - devices communicate directly via NAT traversal (DERP relay as fallback)
- Easy to add/remove devices
- ACL support for granular access control
- MagicDNS for service discovery
- Good NixOS support for both headscale server and tailscale client
- Subnet routing lets you expose the entire 10.69.13.x network or specific hosts without installing tailscale on every host
1. **VPN (default)**: Laptop/phone connect to the VPS WireGuard endpoint; traffic is routed through the extgw01 firewall. Only allowlisted services are reachable.
2. **SSH + 2FA (escalated)**: SSH into extgw01 for full network access when needed.
**Cons:**
- More moving parts than plain WireGuard
- Headscale is a third-party reimplementation, can lag behind Tailscale features
- Need to run and maintain the control server
## New Host: extgw01
### 4. Tailscale (Managed)
A NixOS host on the internal network acting as both WireGuard gateway and SSH bastion.
Same as Headscale but using Tailscale's hosted control plane.
### Responsibilities
**Pros:**
- Zero infrastructure to manage on the control plane side
- Polished UX, well-maintained clients
- Free tier covers personal use
- **WireGuard tunnel** to the VPS for client traffic
- **Firewall** with allowlist controlling which internal services are reachable through the VPN
- **SSH bastion** with 2FA for full network access when needed
- **DNS**: Clients get split DNS config (laptop via systemd-resolved routing domain, phone uses internal DNS for all queries)
**Cons:**
- Dependency on Tailscale's service
- Less aligned with self-hosting preference
- Coordination metadata goes through their servers (data plane is still peer-to-peer)
### Firewall allowlist (initial)
### 5. Netbird (Self-Hosted)
| Service | Destination | Port |
|------------|------------------------------|-------|
| Grafana | monitoring01.home.2rjus.net | 3000 |
| Jellyfin | jelly01.home.2rjus.net | 8096 |
| Sonarr | sonarr-jail.home.2rjus.net | 8989 |
| Radarr | radarr-jail.home.2rjus.net | 7878 |
| NZBget | nzbget-jail.home.2rjus.net | 6789 |
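A sketch of how this allowlist could be expressed on extgw01 via `networking.nftables.ruleset`, assuming the VPN tunnel interface is named `wg-ext`; the destination IPs are placeholders for the hostnames in the table:

```nix
{
  networking.nftables = {
    enable = true;
    ruleset = ''
      table inet vpn-filter {
        chain forward {
          type filter hook forward priority 0; policy drop;
          ct state established,related accept

          # Traffic arriving from the VPN tunnel, limited to the allowlist
          iifname "wg-ext" ip daddr 10.69.13.10 tcp dport 3000 accept comment "Grafana"
          iifname "wg-ext" ip daddr 10.69.13.11 tcp dport 8096 accept comment "Jellyfin"
          iifname "wg-ext" ip daddr 10.69.13.12 tcp dport { 8989, 7878, 6789 } accept comment "arr stack"
        }
      }
    '';
  };
}
```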
Open-source alternative to Tailscale with a self-hostable management server. WireGuard-based, supports ACLs and NAT traversal.
### SSH 2FA options (to be decided)
**Pros:**
- Fully self-hostable
- Web UI for management
- ACL and peer grouping support
- **Kanidm**: Already deployed on kanidm01, supports RADIUS/OAuth2 for PAM integration
- **SSH certificates via OpenBao**: Fits existing Vault infrastructure, short-lived certs
- **TOTP via PAM**: Simplest fallback, Google Authenticator / similar
**Cons:**
- Heavier to self-host (needs multiple components: management server, signal server, TURN relay)
- Less mature NixOS module support compared to Tailscale/Headscale
## VPS Configuration
### 6. Nebula (by Defined Networking)
The VPS needs a new WireGuard interface (separate from the existing http-proxy tunnel):
Certificate-based mesh VPN. Each node gets a certificate from a CA you control. No central coordination server needed at runtime.
- WireGuard endpoint listening on a public UDP port
- 2 peers: laptop, phone
- Routes client traffic through tunnel to extgw01
- Minimal config - just routing, no firewall policy (that lives on extgw01)
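On the extgw01 side, the tunnel itself could be a plain `networking.wireguard.interfaces` entry; the keys, tunnel subnet, and port below are placeholders:

```nix
{
  networking.wireguard.interfaces.wg-ext = {
    ips = [ "10.69.223.2/24" ]; # new tunnel subnet, separate from the existing 10.69.222.0/24
    privateKeyFile = "/run/secrets/wg-ext-key"; # e.g. fetched via vault.secrets
    peers = [
      {
        # The VPS endpoint; laptop and phone peer with the VPS, not with extgw01 directly
        publicKey = "<vps-public-key>";
        allowedIPs = [ "10.69.223.0/24" ];
        endpoint = "docker2.t-juice.club:51821"; # port assumed
        persistentKeepalive = 25;
      }
    ];
  };
}
```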
**Pros:**
- No always-on control plane
- Certificate-based identity
- Lightweight
## Implementation Steps
**Cons:**
- Less convenient for ad-hoc device addition (need to issue certs)
- NAT traversal less mature than Tailscale's
- Smaller community/ecosystem
## Key Decision Points
- **Static public IP vs CGNAT?** Determines whether clients can connect directly to home network or need VPS relay.
- **Number of client devices?** If it's just a phone and a laptop, plain WireGuard via the VPS is fine. More devices would favor Headscale.
- **Per-service vs per-network access?** Gateway with firewall rules gives per-service control. Headscale ACLs can also do this. Plain WireGuard gives network-level access with gateway firewall for finer control.
- **Subnet routing vs per-host agents?** With Headscale/Tailscale, can either install client on every host, or use a single subnet router that advertises the 10.69.13.x range. The latter is closer to the gateway approach and avoids touching every host.
## Leading Candidates
Based on existing WireGuard experience, self-hosting preference, and NixOS stack:
1. **Headscale with a subnet router** - Best balance of convenience and self-hosting
2. **WireGuard gateway via VPS** - Simplest, most transparent, builds on existing setup
1. **Create extgw01 host configuration** in this repo
- VM provisioned via OpenTofu (same as other hosts)
- WireGuard interface for VPS tunnel
- nftables/iptables firewall with service allowlist
- IP forwarding enabled
2. **Configure VPS WireGuard** for client peers
- New WireGuard interface with laptop + phone peers
- Routing for 10.69.13.0/24 through extgw01 tunnel
3. **Set up client configs**
- Laptop: WireGuard config + systemd-resolved split DNS for `home.2rjus.net`
- Phone: WireGuard app config with DNS pointing at internal nameservers
4. **Set up SSH 2FA** on extgw01
- Evaluate Kanidm integration vs OpenBao SSH certs vs TOTP
5. **Test and verify**
- VPN access to allowed services only
- Firewall blocks everything else
- SSH + 2FA grants full access
- Existing public access path unaffected

flake.lock generated

@@ -28,11 +28,11 @@
]
},
"locked": {
"lastModified": 1770648258,
"narHash": "sha256-sExxD8N9Q0RrHIoppOV6qp4jcJirLVjpQd20C72V78I=",
"lastModified": 1771004123,
"narHash": "sha256-Jw36EzL4IGIc2TmeZGphAAUrJXoWqfvCbybF8bTHgMA=",
"ref": "master",
"rev": "277a49a666347e2e2ae67128cf732956a9c3be56",
"revCount": 27,
"rev": "e5e8be86ecdcae8a5962ba3bddddfe91b574792b",
"revCount": 36,
"type": "git",
"url": "https://git.t-juice.club/torjus/homelab-deploy"
},
@@ -64,11 +64,11 @@
},
"nixpkgs": {
"locked": {
"lastModified": 1770464364,
"narHash": "sha256-z5NJPSBwsLf/OfD8WTmh79tlSU8XgIbwmk6qB1/TFzY=",
"lastModified": 1771043024,
"narHash": "sha256-O1XDr7EWbRp+kHrNNgLWgIrB0/US5wvw9K6RERWAj6I=",
"owner": "nixos",
"repo": "nixpkgs",
"rev": "23d72dabcb3b12469f57b37170fcbc1789bd7457",
"rev": "3aadb7ca9eac2891d52a9dec199d9580a6e2bf44",
"type": "github"
},
"original": {
@@ -80,11 +80,11 @@
},
"nixpkgs-unstable": {
"locked": {
"lastModified": 1770562336,
"narHash": "sha256-ub1gpAONMFsT/GU2hV6ZWJjur8rJ6kKxdm9IlCT0j84=",
"lastModified": 1771008912,
"narHash": "sha256-gf2AmWVTs8lEq7z/3ZAsgnZDhWIckkb+ZnAo5RzSxJg=",
"owner": "nixos",
"repo": "nixpkgs",
"rev": "d6c71932130818840fc8fe9509cf50be8c64634f",
"rev": "a82ccc39b39b621151d6732718e3e250109076fa",
"type": "github"
},
"original": {


@@ -110,15 +110,6 @@
./hosts/jelly01
];
};
nix-cache01 = nixpkgs.lib.nixosSystem {
inherit system;
specialArgs = {
inherit inputs self;
};
modules = commonModules ++ [
./hosts/nix-cache01
];
};
nats1 = nixpkgs.lib.nixosSystem {
inherit system;
specialArgs = {
@@ -209,6 +200,15 @@
./hosts/nix-cache02
];
};
garage01 = nixpkgs.lib.nixosSystem {
inherit system;
specialArgs = {
inherit inputs self;
};
modules = commonModules ++ [
./hosts/garage01
];
};
};
packages = forAllSystems (
{ pkgs }:


@@ -1,33 +1,37 @@
{
config,
lib,
pkgs,
...
}:
{
imports = [
./hardware-configuration.nix
../template2/hardware-configuration.nix
../../system
../../common/vm
];
homelab.dns.cnames = [ "nix-cache" "actions1" ];
homelab.host.role = "build-host";
fileSystems."/nix" = {
device = "/dev/disk/by-label/nixcache";
fsType = "xfs";
# Host metadata (adjust as needed)
homelab.host = {
tier = "test"; # Start in test tier, move to prod after validation
role = "storage";
};
homelab.dns.cnames = [ "s3" ];
# Enable Vault integration
vault.enable = true;
# Enable remote deployment via NATS
homelab.deploy.enable = true;
nixpkgs.config.allowUnfree = true;
# Use the systemd-boot EFI boot loader.
boot.loader.grub = {
enable = true;
device = "/dev/sda";
configurationLimit = 3;
};
boot.loader.grub.enable = true;
boot.loader.grub.device = "/dev/vda";
networking.hostName = "nix-cache01";
networking.hostName = "garage01";
networking.domain = "home.2rjus.net";
networking.useNetworkd = true;
networking.useDHCP = false;
@@ -41,7 +45,7 @@
systemd.network.networks."ens18" = {
matchConfig.Name = "ens18";
address = [
"10.69.13.15/24"
"10.69.13.26/24"
];
routes = [
{ Gateway = "10.69.13.1"; }
@@ -54,9 +58,6 @@
"nix-command"
"flakes"
];
vault.enable = true;
homelab.deploy.enable = true;
nix.settings.tarball-ttl = 0;
environment.systemPackages = with pkgs; [
vim
@@ -64,13 +65,11 @@
git
];
services.qemuGuest.enable = true;
# Open ports in the firewall.
# networking.firewall.allowedTCPPorts = [ ... ];
# networking.firewall.allowedUDPPorts = [ ... ];
# Or disable the firewall altogether.
networking.firewall.enable = false;
system.stateVersion = "24.05"; # Did you read the comment?
}
system.stateVersion = "25.11"; # Did you read the comment?
}


@@ -0,0 +1,6 @@
{ ... }: {
imports = [
./configuration.nix
../../services/garage
];
}


@@ -87,6 +87,7 @@
"--keep-monthly 6"
"--keep-within 1d"
];
extraOptions = [ "--retry-lock=5m" ];
};
# Open ports in the firewall.


@@ -83,6 +83,7 @@
"--keep-monthly 6"
"--keep-within 1d"
];
extraOptions = [ "--retry-lock=5m" ];
};
services.restic.backups.grafana-db = {
@@ -100,6 +101,7 @@
"--keep-monthly 6"
"--keep-within 1d"
];
extraOptions = [ "--retry-lock=5m" ];
};
# Open ports in the firewall.


@@ -18,8 +18,7 @@
role = "monitoring";
};
# DNS CNAME for Grafana test instance
homelab.dns.cnames = [ "grafana-test" ];
homelab.dns.cnames = [ "grafana-test" "metrics" "vmalert" "loki" ];
# Enable Vault integration
vault.enable = true;


@@ -2,5 +2,7 @@
imports = [
./configuration.nix
../../services/grafana
../../services/victoriametrics
../../services/loki
];
}


@@ -1,8 +0,0 @@
{ ... }:
{
imports = [
./configuration.nix
../../services/nix-cache
../../services/actions-runner
];
}


@@ -1,42 +0,0 @@
{
config,
lib,
pkgs,
modulesPath,
...
}:
{
imports = [
(modulesPath + "/profiles/qemu-guest.nix")
];
boot.initrd.availableKernelModules = [
"ata_piix"
"uhci_hcd"
"virtio_pci"
"virtio_scsi"
"sd_mod"
"sr_mod"
];
boot.initrd.kernelModules = [ "dm-snapshot" ];
boot.kernelModules = [
"ptp_kvm"
];
boot.extraModulePackages = [ ];
fileSystems."/" = {
device = "/dev/disk/by-label/root";
fsType = "xfs";
};
swapDevices = [ { device = "/dev/disk/by-label/swap"; } ];
# Enables DHCP on each ethernet and wireless interface. In case of scripted networking
# (the default) this is the recommended approach. When using systemd-networkd it's
# still possible to use this option, but it's recommended to use it in conjunction
# with explicit per-interface declarations with `networking.interfaces.<interface>.useDHCP`.
networking.useDHCP = lib.mkDefault true;
# networking.interfaces.ens18.useDHCP = lib.mkDefault true;
nixpkgs.hostPlatform = lib.mkDefault "x86_64-linux";
}


@@ -0,0 +1,45 @@
{ config, ... }:
{
# Fetch builder NKey from Vault
vault.secrets.builder-nkey = {
secretPath = "shared/homelab-deploy/builder-nkey";
extractKey = "nkey";
outputDir = "/run/secrets/builder-nkey";
services = [ "homelab-deploy-builder" ];
};
# Configure the builder service
services.homelab-deploy.builder = {
enable = true;
natsUrl = "nats://nats1.home.2rjus.net:4222";
nkeyFile = "/run/secrets/builder-nkey";
settings.repos = {
nixos-servers = {
url = "git+https://git.t-juice.club/torjus/nixos-servers.git";
defaultBranch = "master";
};
nixos = {
url = "git+https://git.t-juice.club/torjus/nixos.git";
defaultBranch = "master";
};
};
timeout = 7200;
metrics.enable = true;
};
# Expose builder metrics for Prometheus scraping
homelab.monitoring.scrapeTargets = [
{
job_name = "homelab-deploy-builder";
port = 9973;
}
];
# Ensure builder starts after vault secret is available
systemd.services.homelab-deploy-builder = {
after = [ "vault-secret-builder-nkey.service" ];
requires = [ "vault-secret-builder-nkey.service" ];
};
}


@@ -13,11 +13,13 @@
../../common/vm
];
# Host metadata (adjust as needed)
homelab.host = {
tier = "test"; # Start in test tier, move to prod after validation
tier = "prod";
role = "build-host";
};
homelab.dns.cnames = [ "nix-cache" ];
# Enable Vault integration
vault.enable = true;


@@ -1,5 +1,8 @@
{ ... }: {
imports = [
./configuration.nix
./builder.nix
./scheduler.nix
../../services/nix-cache
];
}


@@ -0,0 +1,61 @@
{ config, pkgs, lib, inputs, ... }:
let
homelab-deploy = inputs.homelab-deploy.packages.${pkgs.system}.default;
scheduledBuildScript = pkgs.writeShellApplication {
name = "scheduled-build";
runtimeInputs = [ homelab-deploy ];
text = ''
NATS_URL="nats://nats1.home.2rjus.net:4222"
NKEY_FILE="/run/secrets/scheduler-nkey"
echo "Starting scheduled builds at $(date)"
# Build all nixos-servers hosts
homelab-deploy build \
--nats-url "$NATS_URL" \
--nkey-file "$NKEY_FILE" \
nixos-servers --all
# Build all nixos (gunter) hosts
homelab-deploy build \
--nats-url "$NATS_URL" \
--nkey-file "$NKEY_FILE" \
nixos --all
echo "Scheduled builds completed at $(date)"
'';
};
in
{
# Fetch scheduler NKey from Vault
vault.secrets.scheduler-nkey = {
secretPath = "shared/homelab-deploy/scheduler-nkey";
extractKey = "nkey";
outputDir = "/run/secrets/scheduler-nkey";
services = [ "scheduled-build" ];
};
# Timer: every 2 hours
systemd.timers.scheduled-build = {
description = "Trigger scheduled Nix builds";
wantedBy = [ "timers.target" ];
timerConfig = {
OnCalendar = "*-*-* 00/2:00:00"; # Every 2 hours at :00
Persistent = true; # Run missed builds on boot
RandomizedDelaySec = "5m"; # Slight jitter
};
};
# Service: oneshot that triggers builds
systemd.services.scheduled-build = {
description = "Trigger builds for all hosts via NATS";
after = [ "network-online.target" "vault-secret-scheduler-nkey.service" ];
requires = [ "vault-secret-scheduler-nkey.service" ];
wants = [ "network-online.target" ];
serviceConfig = {
Type = "oneshot";
ExecStart = lib.getExe scheduledBuildScript;
};
};
}


@@ -28,7 +28,7 @@ let
streams: [{
stream: {
job: "bootstrap",
host: $host,
hostname: $host,
stage: $stage,
branch: $branch
},


@@ -1,57 +0,0 @@
{ pkgs, config, ... }:
{
vault.secrets.actions-token = {
secretPath = "hosts/nix-cache01/actions-token";
extractKey = "token";
outputDir = "/run/secrets/actions-token-1";
services = [ "gitea-runner-actions1" ];
};
virtualisation.podman = {
enable = true;
dockerCompat = true;
};
services.gitea-actions-runner.instances = {
actions1 = {
enable = true;
tokenFile = "/run/secrets/actions-token-1";
name = "actions1.home.2rjus.net";
settings = {
log = {
level = "debug";
};
runner = {
file = ".runner";
capacity = 4;
timeout = "2h";
shutdown_timeout = "10m";
insecure = false;
fetch_timeout = "10s";
fetch_interval = "30s";
};
cache = {
enabled = true;
dir = "/var/cache/gitea-actions1";
};
container = {
privileged = false;
};
};
labels =
builtins.map (n: "${n}:docker://gitea/runner-images:${n}") [
"ubuntu-latest"
"ubuntu-latest-slim"
"ubuntu-latest-full"
]
++ [
"homelab"
];
url = "https://git.t-juice.club";
};
};
}


@@ -0,0 +1,64 @@
{ config, pkgs, ... }:
{
homelab.monitoring.scrapeTargets = [
{
job_name = "garage";
port = 3903;
metrics_path = "/metrics";
}
{
job_name = "caddy";
port = 9117;
}
];
vault.secrets.garage-env = {
secretPath = "hosts/${config.networking.hostName}/garage";
extractKey = "env";
outputDir = "/run/secrets/garage-env";
services = [ "garage" ];
};
services.garage = {
enable = true;
package = pkgs.garage;
environmentFile = "/run/secrets/garage-env";
settings = {
metadata_dir = "/var/lib/garage/meta";
data_dir = "/var/lib/garage/data";
replication_factor = 1;
rpc_bind_addr = "[::]:3901";
rpc_public_addr = "garage01.home.2rjus.net:3901";
s3_api = {
api_bind_addr = "[::]:3900";
s3_region = "garage";
root_domain = ".s3.home.2rjus.net";
};
admin = {
api_bind_addr = "[::]:3903";
};
};
};
services.caddy = {
enable = true;
package = pkgs.unstable.caddy;
configFile = pkgs.writeText "Caddyfile" ''
{
acme_ca https://vault.home.2rjus.net:8200/v1/pki_int/acme/directory
metrics
}
s3.home.2rjus.net {
reverse_proxy http://localhost:3900
}
http://garage01.home.2rjus.net:9117 {
handle /metrics {
metrics
}
respond 404
}
'';
};
}


@@ -0,0 +1,391 @@
{
"uid": "apiary-homelab",
"title": "Apiary - Honeypot",
"tags": ["apiary", "honeypot", "prometheus", "homelab"],
"timezone": "browser",
"schemaVersion": 39,
"version": 1,
"refresh": "1m",
"time": {
"from": "now-24h",
"to": "now"
},
"templating": {
"list": []
},
"panels": [
{
"id": 1,
"title": "SSH Connections",
"type": "stat",
"gridPos": {"h": 4, "w": 6, "x": 0, "y": 0},
"datasource": {"type": "prometheus", "uid": "prometheus"},
"targets": [
{
"expr": "sum(oubliette_ssh_connections_total{job=\"apiary\"})",
"legendFormat": "Total",
"refId": "A",
"instant": true
}
],
"fieldConfig": {
"defaults": {
"thresholds": {
"mode": "absolute",
"steps": [
{"color": "blue", "value": null}
]
}
}
},
"options": {
"reduceOptions": {"calcs": ["lastNotNull"]},
"colorMode": "value",
"graphMode": "none",
"textMode": "auto"
},
"description": "Total SSH connections across all outcomes"
},
{
"id": 2,
"title": "Active Sessions",
"type": "stat",
"gridPos": {"h": 4, "w": 6, "x": 6, "y": 0},
"datasource": {"type": "prometheus", "uid": "prometheus"},
"targets": [
{
"expr": "oubliette_sessions_active{job=\"apiary\"}",
"legendFormat": "Active",
"refId": "A",
"instant": true
}
],
"fieldConfig": {
"defaults": {
"thresholds": {
"mode": "absolute",
"steps": [
{"color": "green", "value": null},
{"color": "yellow", "value": 5},
{"color": "red", "value": 20}
]
},
"noValue": "0"
}
},
"options": {
"reduceOptions": {"calcs": ["lastNotNull"]},
"colorMode": "value",
"graphMode": "none",
"textMode": "auto"
},
"description": "Currently active honeypot sessions"
},
{
"id": 3,
"title": "Unique IPs",
"type": "stat",
"gridPos": {"h": 4, "w": 6, "x": 12, "y": 0},
"datasource": {"type": "prometheus", "uid": "prometheus"},
"targets": [
{
"expr": "oubliette_storage_unique_ips{job=\"apiary\"}",
"legendFormat": "IPs",
"refId": "A",
"instant": true
}
],
"fieldConfig": {
"defaults": {
"thresholds": {
"mode": "absolute",
"steps": [
{"color": "purple", "value": null}
]
}
}
},
"options": {
"reduceOptions": {"calcs": ["lastNotNull"]},
"colorMode": "value",
"graphMode": "none",
"textMode": "auto"
},
"description": "Total unique source IPs observed"
},
{
"id": 4,
"title": "Total Login Attempts",
"type": "stat",
"gridPos": {"h": 4, "w": 6, "x": 18, "y": 0},
"datasource": {"type": "prometheus", "uid": "prometheus"},
"targets": [
{
"expr": "oubliette_storage_login_attempts_total{job=\"apiary\"}",
"legendFormat": "Attempts",
"refId": "A",
"instant": true
}
],
"fieldConfig": {
"defaults": {
"thresholds": {
"mode": "absolute",
"steps": [
{"color": "orange", "value": null}
]
}
}
},
"options": {
"reduceOptions": {"calcs": ["lastNotNull"]},
"colorMode": "value",
"graphMode": "none",
"textMode": "auto"
},
"description": "Total login attempts stored"
},
{
"id": 5,
"title": "SSH Connections Over Time",
"type": "timeseries",
"gridPos": {"h": 8, "w": 12, "x": 0, "y": 4},
"datasource": {"type": "prometheus", "uid": "prometheus"},
"interval": "60s",
"targets": [
{
"expr": "rate(oubliette_ssh_connections_total{job=\"apiary\"}[$__rate_interval])",
"legendFormat": "{{outcome}}",
"refId": "A"
}
],
"fieldConfig": {
"defaults": {
"unit": "cps",
"custom": {
"drawStyle": "line",
"lineInterpolation": "smooth",
"fillOpacity": 20,
"pointSize": 5,
"showPoints": "auto",
"stacking": {"mode": "none"}
}
}
},
"options": {
"legend": {"displayMode": "list", "placement": "bottom"},
"tooltip": {"mode": "multi", "sort": "desc"}
},
"description": "SSH connection rate by outcome"
},
{
"id": 6,
"title": "Auth Attempts Over Time",
"type": "timeseries",
"gridPos": {"h": 8, "w": 12, "x": 12, "y": 4},
"datasource": {"type": "prometheus", "uid": "prometheus"},
"interval": "60s",
"targets": [
{
"expr": "rate(oubliette_auth_attempts_total{job=\"apiary\"}[$__rate_interval])",
"legendFormat": "{{reason}} - {{result}}",
"refId": "A"
}
],
"fieldConfig": {
"defaults": {
"unit": "cps",
"custom": {
"drawStyle": "line",
"lineInterpolation": "smooth",
"fillOpacity": 20,
"pointSize": 5,
"showPoints": "auto",
"stacking": {"mode": "none"}
}
}
},
"options": {
"legend": {"displayMode": "list", "placement": "bottom"},
"tooltip": {"mode": "multi", "sort": "desc"}
},
"description": "Authentication attempt rate by reason and result"
},
{
"id": 7,
"title": "Sessions by Shell",
"type": "timeseries",
"gridPos": {"h": 8, "w": 12, "x": 0, "y": 22},
"datasource": {"type": "prometheus", "uid": "prometheus"},
"interval": "60s",
"targets": [
{
"expr": "rate(oubliette_sessions_total{job=\"apiary\"}[$__rate_interval])",
"legendFormat": "{{shell}}",
"refId": "A"
}
],
"fieldConfig": {
"defaults": {
"unit": "cps",
"custom": {
"drawStyle": "line",
"lineInterpolation": "smooth",
"fillOpacity": 20,
"pointSize": 5,
"showPoints": "auto",
"stacking": {"mode": "normal"}
}
}
},
"options": {
"legend": {"displayMode": "list", "placement": "bottom"},
"tooltip": {"mode": "multi", "sort": "desc"}
},
"description": "Session creation rate by shell type"
},
{
"id": 8,
"title": "Attempts by Country",
"type": "geomap",
"gridPos": {"h": 10, "w": 24, "x": 0, "y": 12},
"datasource": {"type": "prometheus", "uid": "prometheus"},
"targets": [
{
"expr": "oubliette_auth_attempts_by_country_total{job=\"apiary\"}",
"legendFormat": "{{country}}",
"refId": "A",
"instant": true,
"format": "table"
}
],
"fieldConfig": {
"defaults": {
"thresholds": {
"mode": "absolute",
"steps": [
{"color": "green", "value": null},
{"color": "yellow", "value": 10},
{"color": "orange", "value": 50},
{"color": "red", "value": 200}
]
}
}
},
"options": {
"view": {
"id": "zero",
"lat": 30,
"lon": 10,
"zoom": 2
},
"basemap": {
"type": "default"
},
"layers": [
{
"type": "markers",
"name": "Auth Attempts",
"config": {
"showLegend": true,
"style": {
"size": {
"field": "Value",
"min": 3,
"max": 20
},
"color": {
"field": "Value"
},
"symbol": {
"mode": "fixed",
"fixed": "img/icons/marker/circle.svg"
}
}
},
"location": {
"mode": "lookup",
"lookup": "country",
"gazetteer": "public/gazetteer/countries.json"
}
}
]
},
"description": "Authentication attempts by country (geo lookup from country code)"
},
{
"id": 9,
"title": "Session Duration Distribution",
"type": "heatmap",
"gridPos": {"h": 8, "w": 24, "x": 0, "y": 30},
"datasource": {"type": "prometheus", "uid": "prometheus"},
"interval": "60s",
"targets": [
{
"expr": "rate(oubliette_session_duration_seconds_bucket{job=\"apiary\"}[$__rate_interval])",
"legendFormat": "{{le}}",
"refId": "A",
"format": "heatmap"
}
],
"fieldConfig": {
"defaults": {
"custom": {
"scaleDistribution": {
"type": "log",
"log": 2
}
}
}
},
"options": {
"calculate": false,
"yAxis": {
"unit": "s"
},
"color": {
"scheme": "Oranges",
"mode": "scheme"
},
"cellGap": 1,
"tooltip": {
"show": true
}
},
"description": "Distribution of session durations"
},
{
"id": 10,
"title": "Commands Executed by Shell",
"type": "timeseries",
"gridPos": {"h": 8, "w": 12, "x": 12, "y": 22},
"datasource": {"type": "prometheus", "uid": "prometheus"},
"interval": "60s",
"targets": [
{
"expr": "rate(oubliette_commands_executed_total{job=\"apiary\"}[$__rate_interval])",
"legendFormat": "{{shell}}",
"refId": "A"
}
],
"fieldConfig": {
"defaults": {
"unit": "cps",
"custom": {
"drawStyle": "line",
"lineInterpolation": "smooth",
"fillOpacity": 20,
"pointSize": 5,
"showPoints": "auto",
"stacking": {"mode": "normal"}
}
}
},
"options": {
"legend": {"displayMode": "list", "placement": "bottom"},
"tooltip": {"mode": "multi", "sort": "desc"}
},
"description": "Rate of commands executed in honeypot shells"
}
]
}


@@ -628,6 +628,322 @@
}
],
"description": "Distribution of hosts by tier (test vs prod)"
},
{
"id": 15,
"title": "Build Service",
"type": "row",
"gridPos": {"h": 1, "w": 24, "x": 0, "y": 36},
"collapsed": false
},
{
"id": 16,
"title": "Builds (24h)",
"type": "stat",
"gridPos": {"h": 4, "w": 4, "x": 0, "y": 37},
"datasource": {"type": "prometheus", "uid": "prometheus"},
"targets": [
{
"expr": "sum(increase(homelab_deploy_build_host_total{status=\"success\"}[24h]))",
"legendFormat": "Builds",
"refId": "A"
}
],
"fieldConfig": {
"defaults": {
"thresholds": {
"mode": "absolute",
"steps": [{"color": "green", "value": null}]
},
"noValue": "0",
"decimals": 0
}
},
"options": {
"reduceOptions": {"calcs": ["lastNotNull"]},
"colorMode": "value",
"graphMode": "none"
},
"description": "Successful host builds in the last 24 hours"
},
{
"id": 17,
"title": "Failed Builds (24h)",
"type": "stat",
"gridPos": {"h": 4, "w": 4, "x": 4, "y": 37},
"datasource": {"type": "prometheus", "uid": "prometheus"},
"targets": [
{
"expr": "sum(increase(homelab_deploy_build_host_total{status=\"failure\"}[24h])) or vector(0)",
"legendFormat": "Failed",
"refId": "A"
}
],
"fieldConfig": {
"defaults": {
"thresholds": {
"mode": "absolute",
"steps": [
{"color": "green", "value": null},
{"color": "yellow", "value": 1},
{"color": "red", "value": 5}
]
},
"noValue": "0",
"decimals": 0
}
},
"options": {
"reduceOptions": {"calcs": ["lastNotNull"]},
"colorMode": "value",
"graphMode": "none"
},
"description": "Failed host builds in the last 24 hours"
},
{
"id": 18,
"title": "Last Build",
"type": "stat",
"gridPos": {"h": 4, "w": 4, "x": 8, "y": 37},
"datasource": {"type": "prometheus", "uid": "prometheus"},
"targets": [
{
"expr": "time() - max(homelab_deploy_build_last_timestamp)",
"legendFormat": "Last Build",
"refId": "A"
}
],
"fieldConfig": {
"defaults": {
"unit": "s",
"thresholds": {
"mode": "absolute",
"steps": [
{"color": "green", "value": null},
{"color": "yellow", "value": 86400},
{"color": "red", "value": 604800}
]
},
"noValue": "-"
}
},
"options": {
"reduceOptions": {"calcs": ["lastNotNull"]},
"colorMode": "value",
"graphMode": "none"
},
"description": "Time since last build attempt (yellow >1d, red >7d)"
},
{
"id": 19,
"title": "Avg Build Time",
"type": "stat",
"gridPos": {"h": 4, "w": 4, "x": 12, "y": 37},
"datasource": {"type": "prometheus", "uid": "prometheus"},
"targets": [
{
"expr": "sum(increase(homelab_deploy_build_duration_seconds_sum[24h])) / sum(increase(homelab_deploy_build_duration_seconds_count[24h]))",
"legendFormat": "Avg Time",
"refId": "A"
}
],
"fieldConfig": {
"defaults": {
"unit": "s",
"thresholds": {
"mode": "absolute",
"steps": [
{"color": "green", "value": null},
{"color": "yellow", "value": 30},
{"color": "red", "value": 60}
]
},
"noValue": "-"
}
},
"options": {
"reduceOptions": {"calcs": ["lastNotNull"]},
"colorMode": "value",
"graphMode": "none"
},
"description": "Average build duration per host over the last 24 hours"
},
{
"id": 20,
"title": "Total Hosts Built",
"type": "stat",
"gridPos": {"h": 4, "w": 4, "x": 16, "y": 37},
"datasource": {"type": "prometheus", "uid": "prometheus"},
"targets": [
{
"expr": "count(homelab_deploy_build_duration_seconds_count)",
"legendFormat": "Hosts",
"refId": "A"
}
],
"fieldConfig": {
"defaults": {
"thresholds": {
"mode": "absolute",
"steps": [{"color": "blue", "value": null}]
},
"noValue": "0"
}
},
"options": {
"reduceOptions": {"calcs": ["lastNotNull"]},
"colorMode": "value",
"graphMode": "none"
},
"description": "Total number of unique hosts that have been built"
},
{
"id": 21,
"title": "Build Jobs (24h)",
"type": "stat",
"gridPos": {"h": 4, "w": 4, "x": 20, "y": 37},
"datasource": {"type": "prometheus", "uid": "prometheus"},
"targets": [
{
"expr": "sum(increase(homelab_deploy_builds_total[24h]))",
"legendFormat": "Jobs",
"refId": "A"
}
],
"fieldConfig": {
"defaults": {
"thresholds": {
"mode": "absolute",
"steps": [{"color": "purple", "value": null}]
},
"noValue": "0",
"decimals": 0
}
},
"options": {
"reduceOptions": {"calcs": ["lastNotNull"]},
"colorMode": "value",
"graphMode": "none"
},
"description": "Total build jobs (each job may build multiple hosts) in the last 24 hours"
},
{
"id": 22,
"title": "Build Time by Host",
"type": "bargauge",
"gridPos": {"h": 8, "w": 12, "x": 0, "y": 41},
"datasource": {"type": "prometheus", "uid": "prometheus"},
"targets": [
{
"expr": "sort_desc(homelab_deploy_build_duration_seconds_sum / homelab_deploy_build_duration_seconds_count)",
"legendFormat": "{{host}}",
"refId": "A",
"instant": true
}
],
"fieldConfig": {
"defaults": {
"unit": "s",
"thresholds": {
"mode": "absolute",
"steps": [
{"color": "green", "value": null},
{"color": "yellow", "value": 15},
{"color": "orange", "value": 25},
{"color": "red", "value": 45}
]
},
"min": 0
}
},
"options": {
"reduceOptions": {"calcs": ["lastNotNull"]},
"orientation": "horizontal",
"displayMode": "gradient",
"showUnfilled": true
},
"description": "Average build time per host (green <15s, yellow <25s, orange <45s, red >45s)"
},
{
"id": 23,
"title": "Build Count by Host",
"type": "bargauge",
"gridPos": {"h": 8, "w": 12, "x": 12, "y": 41},
"datasource": {"type": "prometheus", "uid": "prometheus"},
"targets": [
{
"expr": "sort_desc(sum by (host) (homelab_deploy_build_host_total))",
"legendFormat": "{{host}}",
"refId": "A",
"instant": true
}
],
"fieldConfig": {
"defaults": {
"thresholds": {
"mode": "absolute",
"steps": [
{"color": "blue", "value": null},
{"color": "purple", "value": 10}
]
},
"min": 0
}
},
"options": {
"reduceOptions": {"calcs": ["lastNotNull"]},
"orientation": "horizontal",
"displayMode": "gradient",
"showUnfilled": true
},
"description": "Total build count per host (all time)"
},
{
"id": 24,
"title": "Build Activity",
"type": "timeseries",
"gridPos": {"h": 8, "w": 24, "x": 0, "y": 49},
"datasource": {"type": "prometheus", "uid": "prometheus"},
"targets": [
{
"expr": "sum(increase(homelab_deploy_build_host_total{status=\"success\"}[1h]))",
"legendFormat": "Successful",
"refId": "A"
},
{
"expr": "sum(increase(homelab_deploy_build_host_total{status=\"failure\"}[1h]))",
"legendFormat": "Failed",
"refId": "B"
}
],
"fieldConfig": {
"defaults": {
"custom": {
"lineWidth": 1,
"fillOpacity": 30,
"showPoints": "never",
"stacking": {"mode": "none"}
}
},
"overrides": [
{
"matcher": {"id": "byName", "options": "Successful"},
"properties": [{"id": "color", "value": {"mode": "fixed", "fixedColor": "green"}}]
},
{
"matcher": {"id": "byName", "options": "Failed"},
"properties": [{"id": "color", "value": {"mode": "fixed", "fixedColor": "red"}}]
}
]
},
"options": {
"legend": {
"displayMode": "list",
"placement": "bottom"
},
"tooltip": {"mode": "multi", "sort": "desc"}
},
"description": "Build activity over time (successful vs failed builds per hour)"
}
]
}


@@ -34,21 +34,27 @@
};
};
# Declarative datasources pointing to monitoring01
# Declarative datasources
provision.datasources.settings = {
apiVersion = 1;
datasources = [
{
name = "Prometheus";
name = "VictoriaMetrics";
type = "prometheus";
url = "http://localhost:8428";
isDefault = true;
uid = "victoriametrics";
}
{
name = "Prometheus (monitoring01)";
type = "prometheus";
url = "http://monitoring01.home.2rjus.net:9090";
isDefault = true;
uid = "prometheus";
}
{
name = "Loki";
type = "loki";
url = "http://monitoring01.home.2rjus.net:3100";
url = "http://localhost:3100";
uid = "loki";
}
];
@@ -81,22 +87,20 @@
services.caddy = {
enable = true;
package = pkgs.unstable.caddy;
configFile = pkgs.writeText "Caddyfile" ''
{
acme_ca https://vault.home.2rjus.net:8200/v1/pki_int/acme/directory
metrics
}
grafana-test.home.2rjus.net {
log {
output file /var/log/caddy/grafana.log {
mode 644
}
globalConfig = ''
acme_ca https://vault.home.2rjus.net:8200/v1/pki_int/acme/directory
metrics
'';
virtualHosts."grafana-test.home.2rjus.net".extraConfig = ''
log {
output file /var/log/caddy/grafana.log {
mode 644
}
reverse_proxy http://127.0.0.1:3000
}
reverse_proxy http://127.0.0.1:3000
'';
# Metrics endpoint on plain HTTP for Prometheus scraping
extraConfig = ''
http://${config.networking.hostName}.home.2rjus.net/metrics {
metrics
}

services/loki/default.nix Normal file

@@ -0,0 +1,104 @@
{ config, lib, pkgs, ... }:
let
# Script to generate bcrypt hash from Vault password for Caddy basic_auth
generateCaddyAuth = pkgs.writeShellApplication {
name = "generate-caddy-loki-auth";
runtimeInputs = [ config.services.caddy.package ];
text = ''
PASSWORD=$(cat /run/secrets/loki-push-auth)
HASH=$(caddy hash-password --plaintext "$PASSWORD")
echo "LOKI_PUSH_HASH=$HASH" > /run/secrets/caddy-loki-auth.env
chmod 0400 /run/secrets/caddy-loki-auth.env
'';
};
in
{
# Fetch Loki push password from Vault
vault.secrets.loki-push-auth = {
secretPath = "shared/loki/push-auth";
extractKey = "password";
services = [ "caddy" ];
};
# Generate bcrypt hash for Caddy before it starts
systemd.services.caddy-loki-auth = {
description = "Generate Caddy basic auth hash for Loki";
after = [ "vault-secret-loki-push-auth.service" ];
requires = [ "vault-secret-loki-push-auth.service" ];
before = [ "caddy.service" ];
requiredBy = [ "caddy.service" ];
serviceConfig = {
Type = "oneshot";
RemainAfterExit = true;
ExecStart = lib.getExe generateCaddyAuth;
};
};
  # Load the bcrypt hash as an environment variable for Caddy
services.caddy.environmentFile = "/run/secrets/caddy-loki-auth.env";
# Caddy reverse proxy for Loki with basic auth
services.caddy.virtualHosts."loki.home.2rjus.net".extraConfig = ''
basic_auth {
promtail {env.LOKI_PUSH_HASH}
}
reverse_proxy http://127.0.0.1:3100
'';
services.loki = {
enable = true;
configuration = {
auth_enabled = false;
server = {
http_listen_address = "127.0.0.1";
http_listen_port = 3100;
};
common = {
ring = {
instance_addr = "127.0.0.1";
kvstore = {
store = "inmemory";
};
};
replication_factor = 1;
path_prefix = "/var/lib/loki";
};
schema_config = {
configs = [
{
from = "2024-01-01";
store = "tsdb";
object_store = "filesystem";
schema = "v13";
index = {
prefix = "loki_index_";
period = "24h";
};
}
];
};
storage_config = {
filesystem = {
directory = "/var/lib/loki/chunks";
};
};
compactor = {
working_directory = "/var/lib/loki/compactor";
compaction_interval = "10m";
retention_enabled = true;
retention_delete_delay = "2h";
retention_delete_worker_count = 150;
delete_request_store = "filesystem";
};
limits_config = {
retention_period = "30d";
ingestion_rate_mb = 10;
ingestion_burst_size_mb = 20;
max_streams_per_user = 10000;
max_query_series = 500;
max_query_parallelism = 8;
};
};
};
}
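With the module deployed, the authenticated push path can be smoke-tested end to end: read the shared password from Vault, then POST a test entry through Caddy to Loki's push API. A minimal sketch, assuming an authenticated Vault CLI and a resolvable loki.home.2rjus.net:

  # Password is stored at shared/loki/push-auth (KV v2 under the secret/ mount)
  PASS=$(vault kv get -field=password secret/shared/loki/push-auth)
  # Push one test log line through the basic-auth reverse proxy
  curl -s -u "promtail:$PASS" -H 'Content-Type: application/json' \
    -X POST 'https://loki.home.2rjus.net/loki/api/v1/push' \
    -d "{\"streams\":[{\"stream\":{\"job\":\"smoke-test\"},\"values\":[[\"$(date +%s%N)\",\"hello from curl\"]]}]}"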


@@ -21,7 +21,7 @@ let
"https://pyroscope.home.2rjus.net"
"https://pushgw.home.2rjus.net"
# Caddy auto-TLS on nix-cache01
# Caddy auto-TLS on nix-cache02
"https://nix-cache.home.2rjus.net"
# Caddy auto-TLS on grafana01


@@ -37,6 +37,22 @@
directory = "/var/lib/loki/chunks";
};
};
compactor = {
working_directory = "/var/lib/loki/compactor";
compaction_interval = "10m";
retention_enabled = true;
retention_delete_delay = "2h";
retention_delete_worker_count = 150;
delete_request_store = "filesystem";
};
limits_config = {
retention_period = "30d";
ingestion_rate_mb = 10;
ingestion_burst_size_mb = 20;
max_streams_per_user = 10000;
max_query_series = 500;
max_query_parallelism = 8;
};
};
};
}


@@ -73,6 +73,15 @@ in
};
};
# Fetch apiary bearer token from Vault
vault.secrets.prometheus-apiary-token = {
secretPath = "hosts/monitoring01/apiary-token";
extractKey = "password";
owner = "prometheus";
group = "prometheus";
services = [ "prometheus" ];
};
services.prometheus = {
enable = true;
# syntax-only check because we use external credential files (e.g., openbao-token)
@@ -178,9 +187,7 @@ in
}
];
}
# TODO: nix-cache_caddy can't be auto-generated because the cert is issued
# for nix-cache.home.2rjus.net (service CNAME), not nix-cache01 (hostname).
# Consider adding a target override to homelab.monitoring.scrapeTargets.
# Caddy metrics from nix-cache02 (serves nix-cache.home.2rjus.net)
{
job_name = "nix-cache_caddy";
scheme = "https";
@@ -235,6 +242,19 @@ in
credentials_file = "/run/secrets/prometheus/openbao-token";
};
}
# Apiary external service
{
job_name = "apiary";
scheme = "https";
scrape_interval = "60s";
static_configs = [{
targets = [ "apiary.t-juice.club" ];
}];
authorization = {
type = "Bearer";
credentials_file = "/run/secrets/prometheus-apiary-token";
};
}
] ++ autoScrapeConfigs;
pushgateway = {

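The apiary job reads its bearer token from the Vault-provisioned file, so the scrape can be reproduced by hand to validate the token before Prometheus reloads; a rough sketch from monitoring01, assuming the exporter serves the default /metrics path:

  # Reproduce the apiary scrape with the same credentials file Prometheus uses
  TOKEN=$(cat /run/secrets/prometheus-apiary-token)
  curl -s -H "Authorization: Bearer $TOKEN" 'https://apiary.t-juice.club/metrics' | head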

@@ -118,13 +118,13 @@ groups:
description: "NSD has been down on {{ $labels.instance }} more than 5 minutes."
# Only alert on primary DNS (secondary has cold cache after failover)
- alert: unbound_low_cache_hit_ratio
expr: (rate(unbound_cache_hits_total{dns_role="primary"}[5m]) / (rate(unbound_cache_hits_total{dns_role="primary"}[5m]) + rate(unbound_cache_misses_total{dns_role="primary"}[5m]))) < 0.5
expr: (rate(unbound_cache_hits_total{dns_role="primary"}[5m]) / (rate(unbound_cache_hits_total{dns_role="primary"}[5m]) + rate(unbound_cache_misses_total{dns_role="primary"}[5m]))) < 0.2
for: 15m
labels:
severity: warning
annotations:
summary: "Low DNS cache hit ratio on {{ $labels.instance }}"
description: "Unbound cache hit ratio is below 50% on {{ $labels.instance }}."
description: "Unbound cache hit ratio is below 20% on {{ $labels.instance }}."
- name: http_proxy_rules
rules:
- alert: caddy_down
@@ -171,37 +171,14 @@ groups:
description: "NATS has {{ $value }} slow consumers on {{ $labels.instance }}."
- name: nix_cache_rules
rules:
- alert: build_flakes_service_not_active_recently
expr: count_over_time(node_systemd_unit_state{instance="nix-cache01.home.2rjus.net:9100", name="build-flakes.service", state="active"}[1h]) < 1
for: 0m
labels:
severity: critical
annotations:
summary: "The build-flakes service on {{ $labels.instance }} has not run recently"
description: "The build-flakes service on {{ $labels.instance }} has not run recently"
- alert: build_flakes_error
expr: build_flakes_error == 1
labels:
severity: warning
annotations:
summary: "The build-flakes job has failed for host {{ $labels.host }}."
description: "The build-flakes job has failed for host {{ $labels.host }}."
- alert: harmonia_down
expr: node_systemd_unit_state {instance="nix-cache01.home.2rjus.net:9100", name = "harmonia.service", state = "active"} == 0
expr: node_systemd_unit_state{instance="nix-cache02.home.2rjus.net:9100", name="harmonia.service", state="active"} == 0
for: 5m
labels:
severity: critical
annotations:
summary: "Harmonia not running on {{ $labels.instance }}"
description: "Harmonia has been down on {{ $labels.instance }} more than 5 minutes."
- alert: low_disk_space_nix
expr: node_filesystem_free_bytes{instance="nix-cache01.home.2rjus.net:9100", mountpoint="/nix"} / node_filesystem_size_bytes{instance="nix-cache01.home.2rjus.net:9100", mountpoint="/nix"} * 100 < 10
for: 5m
labels:
severity: warning
annotations:
summary: "Disk space low on /nix for {{ $labels.instance }}"
description: "Disk space is low on /nix for host {{ $labels.instance }}. Please check."
- name: home_assistant_rules
rules:
- alert: home_assistant_down
@@ -418,3 +395,21 @@ groups:
annotations:
summary: "TLS probe failed for {{ $labels.instance }}"
description: "Cannot connect to {{ $labels.instance }} to check TLS certificate. The service may be down or unreachable."
- name: homelab_deploy_rules
rules:
- alert: homelab_deploy_build_failed
expr: increase(homelab_deploy_build_host_total{status="failure"}[1h]) > 0
for: 0m
labels:
severity: warning
annotations:
summary: "Build failed for {{ $labels.host }} in repo {{ $labels.repo }}"
description: "Host {{ $labels.host }} failed to build from {{ $labels.repo }} repository."
- alert: homelab_deploy_builder_down
expr: up{job="homelab-deploy-builder"} == 0
for: 5m
labels:
severity: warning
annotations:
summary: "Homelab deploy builder not responding on {{ $labels.instance }}"
description: "Cannot scrape homelab-deploy-builder metrics from {{ $labels.instance }} for 5 minutes."
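Rule edits like the threshold change and the new homelab_deploy group are easy to break syntactically, so validating the file offline before deploying is cheap insurance; a minimal sketch with promtool (the repo path is an assumption):

  # Static validation of the alerting rules file
  promtool check rules services/monitoring/rules.yml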


@@ -74,10 +74,12 @@
publish = [
"deploy.test.>"
"deploy.discover"
"build.>"
];
subscribe = [
"deploy.responses.>"
"deploy.discover"
"build.responses.>"
];
};
}
@@ -85,8 +87,30 @@
{
nkey = "UD2BFB7DLM67P5UUVCKBUJMCHADIZLGGVUNSRLZE2ZC66FW2XT44P73Y";
permissions = {
publish = [ "deploy.>" ];
subscribe = [ "deploy.>" ];
publish = [
"deploy.>"
"build.>"
];
subscribe = [
"deploy.>"
"build.responses.>"
];
};
}
# Builder (subscribes to build requests, publishes responses)
{
nkey = "UB4PUHGKAWAK6OS62FX7DOQTPFFJTLZZBTKCOCAXDP75H3NSMWAEDJ7E";
permissions = {
subscribe = [ "build.>" ];
publish = [ "build.responses.>" ];
};
}
# Scheduler (publishes build requests, subscribes to responses)
{
nkey = "UDQ5SFEGDM66AQGLK7KQDW6ZOC2QCXE2P6EJQ6VPBSR2CRCABPOVWRI4";
permissions = {
publish = [ "build.>" ];
subscribe = [ "build.responses.>" ];
};
}
];
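The new nkeys are deliberately asymmetric: the scheduler publishes to build.> and only reads build.responses.>, while the builder may only subscribe to build.> and publish responses. A rough smoke test with the nats CLI, assuming the seed files are on disk (server URL, subject names, and file paths are illustrative):

  # Builder side: should be allowed to listen on build subjects
  nats --server nats://localhost:4222 --nkey /path/to/builder.nk sub 'build.>'
  # Scheduler side: should be allowed to publish a request, but not to deploy.>
  nats --server nats://localhost:4222 --nkey /path/to/scheduler.nk pub build.requests 'ping'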


@@ -1,29 +0,0 @@
{ pkgs, ... }:
let
build-flake-script = pkgs.writeShellApplication {
name = "build-flake-script";
runtimeInputs = with pkgs; [
git
nix
nixos-rebuild
jq
curl
];
text = builtins.readFile ./build-flakes.sh;
};
in
{
systemd.services."build-flakes" = {
serviceConfig = {
Type = "exec";
ExecStart = "${build-flake-script}/bin/build-flake-script";
};
};
systemd.timers."build-flakes" = {
enable = true;
wantedBy = [ "timers.target" ];
timerConfig = {
OnCalendar = "*-*-* *:30:00";
};
};
}


@@ -1,44 +0,0 @@
JOB_NAME="build_flakes"
cd /root/nixos-servers
git pull
echo "Starting nixos-servers builds"
for host in $(nix flake show --json| jq -r '.nixosConfigurations | keys[]'); do
echo "Building $host"
if ! nixos-rebuild --verbose -L --flake ".#$host" build; then
echo "Build failed for $host"
cat <<EOF | curl -sS -X PUT --data-binary @- "https://pushgw.home.2rjus.net/metrics/job/$JOB_NAME/host/$host"
# TYPE build_flakes_error gauge
# HELP build_flakes_error 0 if the build was successful, 1 if it failed
build_flakes_error{instance="$HOSTNAME"} 1
EOF
else
echo "Build successful for $host"
cat <<EOF | curl -sS -X PUT --data-binary @- "https://pushgw.home.2rjus.net/metrics/job/$JOB_NAME/host/$host"
# TYPE build_flakes_error gauge
# HELP build_flakes_error 0 if the build was successful, 1 if it failed
build_flakes_error{instance="$HOSTNAME"} 0
EOF
fi
done
echo "All nixos-servers builds complete"
echo "Building gunter"
cd /root/nixos
git pull
host="gunter"
if ! nixos-rebuild --verbose -L --flake ".#gunter" build; then
echo "Build failed for $host"
cat <<EOF | curl -sS -X PUT --data-binary @- "https://pushgw.home.2rjus.net/metrics/job/$JOB_NAME/host/$host"
# TYPE build_flakes_error gauge
# HELP build_flakes_error 0 if the build was successful, 1 if it failed
build_flakes_error{instance="$HOSTNAME"} 1
EOF
else
echo "Build successful for $host"
cat <<EOF | curl -sS -X PUT --data-binary @- "https://pushgw.home.2rjus.net/metrics/job/$JOB_NAME/host/$host"
# TYPE build_flakes_error gauge
# HELP build_flakes_error 0 if the build was successful, 1 if it failed
build_flakes_error{instance="$HOSTNAME"} 0
EOF
fi


@@ -1,10 +1,8 @@
{ ... }:
{
imports = [
./build-flakes.nix
./harmonia.nix
./proxy.nix
./nix.nix
];
}


@@ -1,7 +1,7 @@
{ pkgs, config, ... }:
{
vault.secrets.cache-secret = {
secretPath = "hosts/nix-cache01/cache-secret";
secretPath = "hosts/${config.networking.hostName}/cache-secret";
extractKey = "key";
outputDir = "/run/secrets/cache-secret";
services = [ "harmonia" ];


@@ -0,0 +1,219 @@
{ self, config, lib, pkgs, ... }:
let
monLib = import ../../lib/monitoring.nix { inherit lib; };
externalTargets = import ../monitoring/external-targets.nix;
nodeExporterTargets = monLib.generateNodeExporterTargets self externalTargets;
autoScrapeConfigs = monLib.generateScrapeConfigs self externalTargets;
# Script to fetch AppRole token for VictoriaMetrics to use when scraping OpenBao metrics
fetchOpenbaoToken = pkgs.writeShellApplication {
name = "fetch-openbao-token-vm";
runtimeInputs = [ pkgs.curl pkgs.jq ];
text = ''
VAULT_ADDR="https://vault01.home.2rjus.net:8200"
APPROLE_DIR="/var/lib/vault/approle"
OUTPUT_FILE="/run/secrets/victoriametrics/openbao-token"
# Read AppRole credentials
if [ ! -f "$APPROLE_DIR/role-id" ] || [ ! -f "$APPROLE_DIR/secret-id" ]; then
echo "AppRole credentials not found at $APPROLE_DIR" >&2
exit 1
fi
ROLE_ID=$(cat "$APPROLE_DIR/role-id")
SECRET_ID=$(cat "$APPROLE_DIR/secret-id")
# Authenticate to Vault
AUTH_RESPONSE=$(curl -sf -k -X POST \
-d "{\"role_id\":\"$ROLE_ID\",\"secret_id\":\"$SECRET_ID\"}" \
"$VAULT_ADDR/v1/auth/approle/login")
# Extract token
VAULT_TOKEN=$(echo "$AUTH_RESPONSE" | jq -r '.auth.client_token')
if [ -z "$VAULT_TOKEN" ] || [ "$VAULT_TOKEN" = "null" ]; then
echo "Failed to extract Vault token from response" >&2
exit 1
fi
# Write token to file
mkdir -p "$(dirname "$OUTPUT_FILE")"
echo -n "$VAULT_TOKEN" > "$OUTPUT_FILE"
chown victoriametrics:victoriametrics "$OUTPUT_FILE"
chmod 0400 "$OUTPUT_FILE"
echo "Successfully fetched OpenBao token"
'';
};
scrapeConfigs = [
# Auto-generated node-exporter targets from flake hosts + external
{
job_name = "node-exporter";
static_configs = nodeExporterTargets;
}
# Systemd exporter on all hosts (same targets, different port)
{
job_name = "systemd-exporter";
static_configs = map
(cfg: cfg // {
targets = map (t: builtins.replaceStrings [ ":9100" ] [ ":9558" ] t) cfg.targets;
})
nodeExporterTargets;
}
# Local monitoring services
{
job_name = "victoriametrics";
static_configs = [{ targets = [ "localhost:8428" ]; }];
}
{
job_name = "loki";
static_configs = [{ targets = [ "localhost:3100" ]; }];
}
{
job_name = "grafana";
static_configs = [{ targets = [ "localhost:3000" ]; }];
}
{
job_name = "alertmanager";
static_configs = [{ targets = [ "localhost:9093" ]; }];
}
# Caddy metrics from nix-cache02
{
job_name = "nix-cache_caddy";
scheme = "https";
static_configs = [{ targets = [ "nix-cache.home.2rjus.net" ]; }];
}
# OpenBao metrics with bearer token auth
{
job_name = "openbao";
scheme = "https";
metrics_path = "/v1/sys/metrics";
params = { format = [ "prometheus" ]; };
static_configs = [{ targets = [ "vault01.home.2rjus.net:8200" ]; }];
authorization = {
type = "Bearer";
credentials_file = "/run/secrets/victoriametrics/openbao-token";
};
}
# Apiary external service
{
job_name = "apiary";
scheme = "https";
scrape_interval = "60s";
static_configs = [{ targets = [ "apiary.t-juice.club" ]; }];
authorization = {
type = "Bearer";
credentials_file = "/run/secrets/victoriametrics-apiary-token";
};
}
] ++ autoScrapeConfigs;
in
{
# Static user for VictoriaMetrics (overrides DynamicUser) so vault.secrets
# and credential files can be owned by this user
users.users.victoriametrics = {
isSystemUser = true;
group = "victoriametrics";
};
users.groups.victoriametrics = { };
# Override DynamicUser since we need a static user for credential file access
systemd.services.victoriametrics.serviceConfig = {
DynamicUser = lib.mkForce false;
User = "victoriametrics";
Group = "victoriametrics";
};
# Systemd service to fetch AppRole token for OpenBao scraping
systemd.services.victoriametrics-openbao-token = {
description = "Fetch OpenBao token for VictoriaMetrics metrics scraping";
after = [ "network-online.target" ];
wants = [ "network-online.target" ];
before = [ "victoriametrics.service" ];
requiredBy = [ "victoriametrics.service" ];
serviceConfig = {
Type = "oneshot";
ExecStart = lib.getExe fetchOpenbaoToken;
};
};
# Timer to periodically refresh the token (AppRole tokens have 1-hour TTL)
systemd.timers.victoriametrics-openbao-token = {
description = "Refresh OpenBao token for VictoriaMetrics";
wantedBy = [ "timers.target" ];
timerConfig = {
OnBootSec = "5min";
OnUnitActiveSec = "30min";
RandomizedDelaySec = "5min";
};
};
# Fetch apiary bearer token from Vault
vault.secrets.victoriametrics-apiary-token = {
secretPath = "hosts/monitoring01/apiary-token";
extractKey = "password";
owner = "victoriametrics";
group = "victoriametrics";
services = [ "victoriametrics" ];
};
services.victoriametrics = {
enable = true;
retentionPeriod = "3"; # 3 months
# Disable config check since we reference external credential files
checkConfig = false;
prometheusConfig = {
global.scrape_interval = "15s";
scrape_configs = scrapeConfigs;
};
};
# vmalert for alerting rules - no notifier during parallel operation
services.vmalert.instances.default = {
enable = true;
settings = {
"datasource.url" = "http://localhost:8428";
# Blackhole notifications during parallel operation to prevent duplicate alerts.
# Replace with notifier.url after cutover from monitoring01:
# "notifier.url" = [ "http://localhost:9093" ];
"notifier.blackhole" = true;
"rule" = [ ../monitoring/rules.yml ];
};
};
# Caddy reverse proxy for VictoriaMetrics and vmalert
services.caddy.virtualHosts."metrics.home.2rjus.net".extraConfig = ''
reverse_proxy http://127.0.0.1:8428
'';
services.caddy.virtualHosts."vmalert.home.2rjus.net".extraConfig = ''
reverse_proxy http://127.0.0.1:8880
'';
# Alertmanager - same config as monitoring01 but will only receive
# alerts after cutover (vmalert notifier is disabled above)
services.prometheus.alertmanager = {
enable = true;
configuration = {
global = { };
route = {
receiver = "webhook_natstonotify";
group_wait = "30s";
group_interval = "5m";
repeat_interval = "1h";
group_by = [ "alertname" ];
};
receivers = [
{
name = "webhook_natstonotify";
webhook_configs = [
{
url = "http://localhost:5001/alert";
}
];
}
];
};
};
}
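VictoriaMetrics implements the Prometheus query API, so the parallel instance can be spot-checked against monitoring01 with the same instant queries, and vmalert's rule loading can be confirmed through its API; a minimal sketch from monitoring02:

  # Targets currently up, as seen by VictoriaMetrics
  curl -sG 'http://localhost:8428/api/v1/query' --data-urlencode 'query=count(up == 1)' | jq '.data.result'
  # Rule groups vmalert loaded from rules.yml (notifications stay blackholed for now)
  curl -s 'http://localhost:8880/api/v1/rules' | jq '.data.groups | length'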

View File

@@ -1,4 +1,12 @@
{ config, ... }:
{ config, lib, ... }:
let
hostLabels = {
hostname = config.networking.hostName;
tier = config.homelab.host.tier;
} // lib.optionalAttrs (config.homelab.host.role != null) {
role = config.homelab.host.role;
};
in
{
# Configure journald
services.journald = {
@@ -8,6 +16,16 @@
SystemKeepFree=1G
'';
};
# Fetch Loki push password from Vault (only on hosts with Vault enabled)
vault.secrets.promtail-loki-auth = lib.mkIf config.vault.enable {
secretPath = "shared/loki/push-auth";
extractKey = "password";
owner = "promtail";
group = "promtail";
services = [ "promtail" ];
};
# Configure promtail
services.promtail = {
enable = true;
@@ -23,6 +41,14 @@
{
url = "http://monitoring01.home.2rjus.net:3100/loki/api/v1/push";
}
] ++ lib.optionals config.vault.enable [
{
url = "https://loki.home.2rjus.net/loki/api/v1/push";
basic_auth = {
username = "promtail";
password_file = "/run/secrets/promtail-loki-auth";
};
}
];
scrape_configs = [
@@ -32,17 +58,26 @@
json = true;
labels = {
job = "systemd-journal";
};
} // hostLabels;
};
relabel_configs = [
{
source_labels = [ "__journal__systemd_unit" ];
target_label = "systemd_unit";
}
];
pipeline_stages = [
# Extract PRIORITY from journal JSON
{ json.expressions.priority = "PRIORITY"; }
# Map numeric PRIORITY to level name
{
source_labels = [ "__journal__hostname" ];
target_label = "host";
template = {
source = "priority";
template = ''{{ if or (eq .Value "0") (eq .Value "1") (eq .Value "2") }}critical{{ else if eq .Value "3" }}error{{ else if eq .Value "4" }}warning{{ else if eq .Value "5" }}notice{{ else if eq .Value "6" }}info{{ else if eq .Value "7" }}debug{{ end }}'';
};
}
# Attach as level label
{ labels.level = "priority"; }
];
}
{
@@ -53,8 +88,7 @@
labels = {
job = "varlog";
__path__ = "/var/log/**/*.log";
hostname = "${config.networking.hostName}";
};
} // hostLabels;
}
];
}
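After a dual-shipping host deploys, the new hostname/tier labels can be confirmed on the monitoring02 Loki by querying its journal stream; a rough sketch run on monitoring02 itself (the hostname value is illustrative):

  # Look for recent journal entries carrying the new host labels
  curl -sG 'http://localhost:3100/loki/api/v1/query_range' \
    --data-urlencode 'query={job="systemd-journal", hostname="monitoring02"}' \
    --data-urlencode 'limit=5' | jq '.data.result | length'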

View File

@@ -42,7 +42,7 @@ in
"https://cuda-maintainers.cachix.org"
];
trusted-public-keys = [
"nix-cache.home.2rjus.net-1:2kowZOG6pvhoK4AHVO3alBlvcghH20wchzoR0V86UWI="
"nix-cache02.home.2rjus.net-1:QyT5FAvJtV+EPQrgQQ6iV9JMg1kRiWuIAJftM35QMls="
"cache.nixos.org-1:6NCHdD59X431o0gWypbMrAURkbJ16ZPMQFGspcDShjY="
"cuda-maintainers.cachix.org-1:0dq3bujKpuEPMCX6U4WylrUDZ9JyUG0VpVZa7CNfq5E="
];
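Clients keep pointing at the nix-cache.home.2rjus.net CNAME, so after swapping the trusted key the cache should still answer on the standard binary-cache endpoint; a quick check:

  # Binary caches advertise StoreDir and priority here; harmonia serves it as well
  curl -s https://nix-cache.home.2rjus.net/nix-cache-info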

View File

@@ -61,7 +61,7 @@ let
streams: [{
stream: {
job: $job,
host: $host,
hostname: $host,
type: $type,
id: $id
},


@@ -26,6 +26,17 @@ path "secret/data/shared/nixos-exporter/*" {
EOT
}
# Shared policy for Loki push authentication (all hosts push logs)
resource "vault_policy" "loki_push" {
name = "loki-push"
policy = <<EOT
path "secret/data/shared/loki/*" {
capabilities = ["read", "list"]
}
EOT
}
# Define host access policies
locals {
host_policies = {
@@ -78,7 +89,7 @@ locals {
]
}
# Wave 3: DNS servers
# Wave 3: DNS servers (managed in hosts-generated.tf)
# Wave 4: http-proxy
"http-proxy" = {
@@ -87,13 +98,6 @@ locals {
]
}
# Wave 5: nix-cache01
"nix-cache01" = {
paths = [
"secret/data/hosts/nix-cache01/*",
]
}
# vault01: Vault server itself (fetches secrets from itself)
"vault01" = {
paths = [
@@ -111,10 +115,11 @@ locals {
]
}
# monitoring02: Grafana test instance
# monitoring02: Grafana + VictoriaMetrics
"monitoring02" = {
paths = [
"secret/data/hosts/monitoring02/*",
"secret/data/hosts/monitoring01/apiary-token",
"secret/data/services/grafana/*",
]
}
@@ -144,7 +149,7 @@ resource "vault_approle_auth_backend_role" "hosts" {
backend = vault_auth_backend.approle.path
role_name = each.key
token_policies = concat(
["${each.key}-policy", "homelab-deploy", "nixos-exporter"],
["${each.key}-policy", "homelab-deploy", "nixos-exporter", "loki-push"],
lookup(each.value, "extra_policies", [])
)
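After terraform apply, any host AppRole should be able to read the shared push secret via the loki-push policy. This can be checked by logging in with a host's role-id/secret-id from /var/lib/vault/approle, the same files the token-fetch scripts use; a hedged sketch:

  # Log in as a host AppRole and confirm the loki-push policy grants the read
  TOKEN=$(vault write -field=token auth/approle/login \
    role_id=@/var/lib/vault/approle/role-id \
    secret_id=@/var/lib/vault/approle/secret-id)
  VAULT_TOKEN="$TOKEN" vault kv get -field=password secret/shared/loki/push-auth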


@@ -36,6 +36,12 @@ locals {
"nix-cache02" = {
paths = [
"secret/data/hosts/nix-cache02/*",
"secret/data/shared/homelab-deploy/*",
]
}
"garage01" = {
paths = [
"secret/data/hosts/garage01/*",
]
}
@@ -68,7 +74,7 @@ resource "vault_approle_auth_backend_role" "generated_hosts" {
backend = vault_auth_backend.approle.path
role_name = each.key
token_policies = ["host-${each.key}", "homelab-deploy", "nixos-exporter"]
token_policies = ["host-${each.key}", "homelab-deploy", "nixos-exporter", "loki-push"]
secret_id_ttl = 0 # Never expire (wrapped tokens provide time limit)
token_ttl = 3600
token_max_ttl = 3600


@@ -76,15 +76,9 @@ locals {
}
# Nix cache signing key
"hosts/nix-cache01/cache-secret" = {
"hosts/nix-cache02/cache-secret" = {
auto_generate = false
data = { key = var.cache_signing_key }
}
# Gitea Actions runner token
"hosts/nix-cache01/actions-token" = {
auto_generate = false
data = { token = var.actions_token_1 }
data = { key = var.cache_signing_key_02 }
}
# Homelab-deploy NKeys
@@ -103,6 +97,22 @@ locals {
data = { nkey = var.homelab_deploy_admin_deployer_nkey }
}
"shared/homelab-deploy/builder-nkey" = {
auto_generate = false
data = { nkey = var.homelab_deploy_builder_nkey }
}
"shared/homelab-deploy/scheduler-nkey" = {
auto_generate = false
data = { nkey = var.homelab_deploy_scheduler_nkey }
}
# Garage S3 environment (RPC secret + admin token)
"hosts/garage01/garage" = {
auto_generate = false
data = { env = var.garage_env }
}
# Kanidm idm_admin password
"kanidm/idm-admin-password" = {
auto_generate = true
@@ -137,6 +147,18 @@ locals {
auto_generate = false
data = { api_key = var.sonarr_api_key }
}
# Bearer token for scraping apiary metrics
"hosts/monitoring01/apiary-token" = {
auto_generate = true
password_length = 64
}
# Loki push authentication (used by Promtail on all hosts)
"shared/loki/push-auth" = {
auto_generate = true
password_length = 32
}
}
}


@@ -40,14 +40,8 @@ variable "wireguard_private_key" {
sensitive = true
}
variable "cache_signing_key" {
description = "Nix binary cache signing key"
type = string
sensitive = true
}
variable "actions_token_1" {
description = "Gitea Actions runner token"
variable "cache_signing_key_02" {
description = "Nix binary cache signing key (nix-cache02)"
type = string
sensitive = true
}
@@ -73,6 +67,20 @@ variable "homelab_deploy_admin_deployer_nkey" {
sensitive = true
}
variable "homelab_deploy_builder_nkey" {
description = "NKey seed for homelab-deploy builder"
type = string
default = "PLACEHOLDER"
sensitive = true
}
variable "homelab_deploy_scheduler_nkey" {
description = "NKey seed for scheduled build triggering"
type = string
default = "PLACEHOLDER"
sensitive = true
}
variable "nixos_exporter_nkey" {
description = "NKey seed for nixos-exporter NATS authentication"
type = string
@@ -80,6 +88,13 @@ variable "nixos_exporter_nkey" {
sensitive = true
}
variable "garage_env" {
description = "Garage environment file contents (GARAGE_RPC_SECRET and GARAGE_ADMIN_TOKEN)"
type = string
default = "PLACEHOLDER"
sensitive = true
}
variable "radarr_api_key" {
description = "Radarr API key for exportarr metrics"
type = string


@@ -89,10 +89,17 @@ locals {
"nix-cache02" = {
ip = "10.69.13.25/24"
cpu_cores = 8
memory = 16384
memory = 20480
disk_size = "200G"
vault_wrapped_token = "s.C5EuHFyULACEqZgsLqMC3cJB"
}
"garage01" = {
ip = "10.69.13.26/24"
cpu_cores = 2
memory = 2048
disk_size = "30G"
vault_wrapped_token = "s.dtMKPT35AIrbyEiHf9c2UcsB"
}
}
# Compute VM configurations with defaults applied