Commit Graph

115 Commits

Author SHA1 Message Date
c3691a39a3 grafana: add Grafana on monitoring02 with Kanidm OIDC
Some checks failed
Run nix flake check / flake-check (push) Has been cancelled
Deploy Grafana test instance on monitoring02 with:
- Kanidm OIDC authentication (admins -> Admin role, others -> Viewer)
- Declarative datasources for Prometheus and Loki on monitoring01
- Local Caddy for TLS termination via internal ACME CA
- DNS CNAME grafana-test.home.2rjus.net

Terraform changes add OAuth2 client secret and AppRole policies for
kanidm01 and monitoring02.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-08 19:58:19 +01:00
0b977808ca hosts: add monitoring02 configuration
Some checks failed
Run nix flake check / flake-check (push) Has been cancelled
New test-tier host for monitoring stack expansion with:
- Static IP 10.69.13.24
- 4 CPU cores, 4GB RAM, 20GB disk
- Vault integration and NATS-based deployment enabled

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-08 19:19:38 +01:00
b845a8bb8b system: add kanidm PAM/NSS client module
Add homelab.kanidm.enable option for central authentication via Kanidm.
The module configures:
- PAM/NSS integration with kanidm-unixd
- Client connection to auth.home.2rjus.net
- Login authorization for ssh-users group

Enable on testvm01-03 for testing.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-08 15:12:19 +01:00
bfbf0cea68 template2: enable zram for bootstrap
Some checks failed
Run nix flake check / flake-check (push) Failing after 3m34s
Prevents OOM during initial nixos-rebuild on 2GB VMs.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-08 13:34:08 +01:00
1674b6a844 system: enable zram swap for all hosts
Some checks failed
Run nix flake check / flake-check (push) Failing after 12m6s
Provides compressed swap in RAM to prevent OOM kills during
nixos-rebuild on low-memory VMs (2GB). Removes duplicate zram
configs from jelly01 and nix-cache01.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-08 13:02:58 +01:00
7fcc043a4d testvm: add SSH session command auditing
Some checks failed
Run nix flake check / flake-check (push) Failing after 1s
Enable Linux audit to log execve syscalls from interactive SSH sessions.
Uses auid filter to exclude system services and nix builds.

Logs forwarded to journald for Loki ingestion. Query with:
{host="testvmXX"} |= "EXECVE"

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-08 03:07:10 +01:00
ca0e3fd629 kanidm01: add kanidm authentication server
Some checks failed
Run nix flake check / flake-check (push) Failing after 1s
- New test-tier VM at 10.69.13.23 with role=auth
- Kanidm 1.8 server with HTTPS (443) and LDAPS (636)
- ACME certificate from internal CA (auth.home.2rjus.net)
- Provisioned groups: admins, users, ssh-users
- Provisioned user: torjus
- Daily backups at 22:00 (7 versions)
- Prometheus monitoring scrape target

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-08 00:13:59 +01:00
3a14ffd6b5 template2: add nix cache configuration
Some checks failed
Run nix flake check / flake-check (push) Failing after 1s
New VMs bootstrapped from template2 will now use the local nix cache
during initial nixos-rebuild, speeding up bootstrap times.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 23:40:53 +01:00
94feae82a0 ns1: recreate with OpenTofu workflow
Some checks failed
Run nix flake check / flake-check (push) Failing after 1s
Old VM had incorrect hardware-configuration.nix with hardcoded UUIDs
that didn't match actual disk layout, causing boot failure (emergency mode).

Recreated using template2-based configuration for OpenTofu provisioning.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 23:18:08 +01:00
8ec2a083bd pgdb1: decommission postgresql host
Remove pgdb1 host configuration and postgres service module.
The only consumer (Open WebUI on gunter) has migrated to local PostgreSQL.

Removed:
- hosts/pgdb1/ - host configuration
- services/postgres/ - service module (only used by pgdb1)
- postgres_rules from monitoring rules
- rebuild-all.sh (obsolete script)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 22:54:50 +01:00
ba9f47f914 jump: remove unused host configuration
Some checks failed
Run nix flake check / flake-check (push) Failing after 1s
Host was decommissioned and not in flake.nix.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 22:30:37 +01:00
536daee4c7 ns2: migrate to OpenTofu management
Some checks failed
Run nix flake check / flake-check (push) Failing after 1s
- Remove hosts/template/ (legacy template1) and give each legacy host
  its own hardware-configuration.nix copy
- Recreate ns2 using create-host with template2 base
- Add secondary DNS services (NSD + Unbound resolver)
- Configure Vault policy for shared DNS secrets
- Fix create-host IP uniqueness validator to check CIDR notation
  (prevents false positives from DNS resolver entries)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 19:28:35 +01:00
aedccbd9a0 flake: remove sops-nix (no longer used)
Some checks failed
Run nix flake check / flake-check (push) Failing after 1s
All secrets are now managed by OpenBao (Vault). Remove the legacy
sops-nix infrastructure that is no longer in use.

Removed:
- sops-nix flake input
- system/sops.nix module
- .sops.yaml configuration file
- Age key generation from template prepare-host scripts

Updated:
- flake.nix - removed sops-nix references from all hosts
- flake.lock - removed sops-nix input
- scripts/create-host/ - removed sops references
- CLAUDE.md - removed SOPS documentation

Note: secrets/ directory should be manually removed by the user.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 18:46:24 +01:00
bdc6057689 hosts: decommission ca host and remove labmon
Some checks failed
Run nix flake check / flake-check (push) Failing after 1s
Remove the step-ca host and labmon flake input now that ACME has been
migrated to OpenBao PKI.

Removed:
- hosts/ca/ - step-ca host configuration
- services/ca/ - step-ca service module
- labmon flake input and module (no longer used)

Updated:
- flake.nix - removed ca host and labmon references
- flake.lock - removed labmon input
- rebuild-all.sh - removed ca from host list
- CLAUDE.md - updated documentation

Note: secrets/ca/ should be manually removed by the user.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 18:41:49 +01:00
9d019f2b9a testvm01: add nginx with ACME certificate for PKI testing
Some checks failed
Run nix flake check / flake-check (push) Failing after 1s
Set up a simple nginx server with an ACME certificate from the new
OpenBao PKI infrastructure. This allows testing the ACME migration
before deploying to production hosts.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 18:22:28 +01:00
21db7e9573 acme: migrate from step-ca to OpenBao PKI
Switch all ACME certificate issuance from step-ca (ca.home.2rjus.net)
to OpenBao PKI (vault.home.2rjus.net:8200/v1/pki_int/acme/directory).

- Update default ACME server in system/acme.nix
- Update Caddy acme_ca in http-proxy and nix-cache services
- Remove labmon service from monitoring01 (step-ca monitoring)
- Remove labmon scrape target and certificate_rules alerts
- Remove alloy.nix (only used for labmon profiling)
- Add docs/plans/cert-monitoring.md for future cert monitoring needs

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 18:20:10 +01:00
979040aaf7 vault01: enable homelab-deploy listener
Some checks failed
Run nix flake check / flake-check (push) Failing after 1s
Enable vault.enable and homelab.deploy.enable on vault01 so it can
receive NATS-based remote deployments. Vault fetches secrets from
itself using AppRole after auto-unseal.

Add systemd ordering to ensure vault-secret services wait for openbao
to be unsealed before attempting to fetch secrets.

Also adds vault01 AppRole entry to Terraform.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 17:55:09 +01:00
8791c29402 hosts: enable homelab-deploy listener on pgdb1, nats1, jelly01
Some checks failed
Run nix flake check / flake-check (push) Failing after 1s
Enable vault.enable and homelab.deploy.enable for these hosts to
allow NATS-based remote deployments and expose metrics on port 9972.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 17:43:06 +01:00
ae3039af19 template2: send bootstrap status to Loki for remote monitoring
Adds log_to_loki function that pushes structured log entries to Loki
at key bootstrap stages (starting, network_ok, vault_*, building,
success, failed). Enables querying bootstrap state via LogQL without
console access.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 15:34:47 +01:00
11261c4636 template2: revert to journal+console output for bootstrap
Some checks failed
Run nix flake check / flake-check (push) Failing after 1s
TTY output was causing nixos-rebuild to fail. Keep the custom
greeting line to indicate bootstrap image, but use journal+console
for reliable logging.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 15:24:39 +01:00
78e8d7a600 template2: add ncurses for clear command in bootstrap
Some checks failed
Run nix flake check / flake-check (push) Failing after 1s
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 15:10:25 +01:00
a1ae766eb8 template2: show bootstrap progress on tty1
- Display bootstrap banner and live progress on tty1 instead of login prompt
- Add custom getty greeting on other ttys indicating this is a bootstrap image
- Disable getty on tty1 during bootstrap so output is visible

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 14:49:58 +01:00
370cf2b03a hosts: enable vault and deploy listener on test VMs
Some checks failed
Run nix flake check / flake-check (push) Failing after 1s
- Add vault.enable = true to testvm01, testvm02, testvm03
- Add homelab.deploy.enable = true for remote deployment via NATS
- Update create-host template to include these by default

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 13:55:33 +01:00
7bc465b414 hosts: add testvm01, testvm02, testvm03 test hosts
Some checks failed
Run nix flake check / flake-check (push) Failing after 1s
Three permanent test hosts for validating deployment and bootstrapping
workflow. Each host configured with:
- Static IP (10.69.13.20-22/24)
- Vault AppRole integration
- Bootstrap from deploy-test-hosts branch

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 13:34:16 +01:00
8d7bc50108 hosts: remove testvm01
Some checks failed
Run nix flake check / flake-check (push) Failing after 1s
Test host no longer needed.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 12:58:24 +01:00
03e70ac094 hosts: remove vaulttest01
Test host no longer needed.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 12:55:38 +01:00
f6eca9decc vaulttest01: add htop for deploy verification test
All checks were successful
Run nix flake check / flake-check (push) Successful in 2m3s
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 07:23:22 +01:00
c214f8543c homelab: add deploy.enable option with assertion
All checks were successful
Run nix flake check / flake-check (push) Successful in 2m6s
Run nix flake check / flake-check (pull_request) Successful in 2m7s
- Add homelab.deploy.enable option (requires vault.enable)
- Create shared homelab-deploy Vault policy for all hosts
- Enable homelab.deploy on all vault-enabled hosts

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 06:54:42 +01:00
7933127d77 system: enable homelab-deploy listener for all vault hosts
Add system/homelab-deploy.nix module that automatically enables the
listener on all hosts with vault.enable=true. Uses homelab.host.tier
and homelab.host.role for NATS subject subscriptions.

- Add homelab-deploy access to all host AppRole policies
- Remove manual listener config from vaulttest01 (now handled by system module)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 06:54:42 +01:00
0643f23281 vaulttest01: add vault secret dependency to listener
Some checks failed
Run nix flake check / flake-check (push) Failing after 15m32s
Ensure homelab-deploy-listener waits for the NKey secret to be
fetched from Vault before starting.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 05:29:29 +01:00
ad8570f8db homelab-deploy: add NATS-based deployment system
Some checks failed
Run nix flake check / flake-check (push) Failing after 3m45s
Add homelab-deploy flake input and NixOS module for message-based
deployments across the fleet. Configure DEPLOY account in NATS with
tiered access control (listener, test-deployer, admin-deployer).
Enable listener on vaulttest01 as initial test host.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 05:22:06 +01:00
a926d34287 nix-cache01: set priority to high
All checks were successful
Run nix flake check / flake-check (pull_request) Successful in 2m14s
Run nix flake check / flake-check (push) Successful in 2m17s
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 02:54:32 +01:00
12bf0683f5 modules: add homelab.host for host metadata
Add a shared `homelab.host` module that provides host metadata for
multiple consumers:
- tier: deployment tier (test/prod) for future homelab-deploy service
- priority: alerting priority (high/low) for Prometheus label filtering
- role: primary role of the host (dns, database, monitoring, etc.)
- labels: free-form labels for additional metadata

Host configurations updated with appropriate values:
- ns1, ns2: role=dns with dns_role labels
- nix-cache01: priority=low, role=build-host
- vault01: role=vault
- jump: role=bastion
- template, template2, testvm01, vaulttest01: tier=test, priority=low

The module is now imported via commonModules in flake.nix, making it
available to all hosts including minimal configurations like template2.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 02:49:58 +01:00
eee3dde04f restic: add randomized delay to backup timers
Some checks failed
Run nix flake check / flake-check (push) Has been cancelled
Backups to the shared restic repository were all scheduled at exactly
midnight, causing lock conflicts. Adding RandomizedDelaySec spreads
them out over a 2-hour window to prevent simultaneous access.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-06 01:09:38 +01:00
bbb22e588e system: replace writeShellScript with writeShellApplication
Some checks failed
Run nix flake check / flake-check (pull_request) Successful in 2m3s
Run nix flake check / flake-check (push) Failing after 5m57s
Convert remaining writeShellScript usages to writeShellApplication for
shellcheck validation and strict bash options.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-06 00:17:24 +01:00
879e7aba60 templates: use writeShellApplication for prepare-host script
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-06 00:14:05 +01:00
59e1962d75 auth01: decommission host and remove authelia/lldap services
Some checks failed
Run nix flake check / flake-check (pull_request) Successful in 2m5s
Run nix flake check / flake-check (push) Failing after 18m1s
Remove auth01 host configuration and associated services in preparation
for new auth stack with different provisioning system.

Removed:
- hosts/auth01/ - host configuration
- services/authelia/ - authelia service module
- services/lldap/ - lldap service module
- secrets/auth01/ - sops secrets
- Reverse proxy entries for auth and lldap
- Monitoring alert rules for authelia and lldap
- SOPS configuration for auth01

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-05 23:35:45 +01:00
0700033c0a secrets: migrate all hosts from sops to OpenBao vault
Replace sops-nix secrets with OpenBao vault secrets across all hosts.
Hardcode root password hash, add extractKey option to vault-secrets
module, update Terraform with secrets/policies for all hosts, and
create AppRole provisioning playbook.

Hosts migrated: ha1, monitoring01, ns1, ns2, http-proxy, nix-cache01
Wave 1 hosts (nats1, jelly01, pgdb1) get AppRole policies only.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-05 18:43:09 +01:00
0ef63ad874 hosts: remove decommissioned media1, ns3, ns4, nixos-test1
Some checks failed
Run nix flake check / flake-check (push) Failing after 4m47s
Run nix flake check / flake-check (pull_request) Successful in 3m20s
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-05 01:36:57 +01:00
dd1b64de27 monitoring: auto-generate Prometheus scrape targets from host configs
Some checks failed
Run nix flake check / flake-check (pull_request) Successful in 2m49s
Run nix flake check / flake-check (push) Has been cancelled
Add homelab.monitoring NixOS options (enable, scrapeTargets) following
the same pattern as homelab.dns. Prometheus scrape configs are now
auto-generated from flake host configurations and external targets,
replacing hardcoded target lists.

Also cleans up alert rules: snake_case naming, fix zigbee2mqtt typo,
remove duplicate pushgateway alert, add for clauses to monitoring_rules,
remove hardcoded WireGuard public key, and add new alerts for
certificates, proxmox, caddy, smartctl temperature, filesystem
prediction, systemd state, file descriptors, and host reboots.

Fixes grafana scrape target port from 3100 to 3000.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-05 00:49:07 +01:00
cee1b264cd dns: auto-generate zone entries from host configurations
Replace static zone file with dynamically generated records:
- Add homelab.dns module with enable/cnames options
- Extract IPs from systemd.network configs (filters VPN interfaces)
- Use git commit timestamp as zone serial number
- Move external hosts to separate external-hosts.nix

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-04 21:43:44 +01:00
d25fc99e1d backup: migrate to native services.restic.backups
Some checks failed
Run nix flake check / flake-check (push) Has been cancelled
Run nix flake check / flake-check (pull_request) Successful in 4m0s
Replace custom backup-helper flake input with NixOS native
services.restic.backups module for ha1, monitoring01, and nixos-test1.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-04 00:41:40 +01:00
7ae474fd3e pki: add new vault root ca to pki 2026-02-03 06:53:59 +01:00
01d4812280 vault: implement bootstrap integration
Some checks failed
Run nix flake check / flake-check (push) Successful in 2m31s
Run nix flake check / flake-check (pull_request) Failing after 14m16s
2026-02-03 01:10:36 +01:00
4afb37d730 create-host: enable resolved in configuration.nix.j2
All checks were successful
Run nix flake check / flake-check (push) Successful in 2m15s
2026-02-01 20:37:36 +01:00
a2c798bc30 vault: add minimal vault config
Some checks failed
Run nix flake check / flake-check (push) Has been cancelled
2026-02-01 20:27:02 +01:00
6d64e53586 hosts: add vault01 host
All checks were successful
Run nix flake check / flake-check (push) Successful in 2m20s
2026-02-01 20:08:48 +01:00
7fe0aa0f54 test: add testvm01 for pipeline testing 2026-02-01 17:41:04 +01:00
83de9a3ffb pipeline: add testing improvements for branch-based workflows
Some checks failed
Run nix flake check / flake-check (push) Has been cancelled
Implement dual improvements to enable efficient testing of pipeline changes
without polluting master branch:

1. Add --force flag to create-host script
   - Skip hostname/IP uniqueness validation
   - Overwrite existing host configurations
   - Update entries in flake.nix and terraform/vms.tf (no duplicates)
   - Useful for iterating on configurations during testing

2. Add branch support to bootstrap mechanism
   - Bootstrap service reads NIXOS_FLAKE_BRANCH environment variable
   - Defaults to master if not set
   - Uses branch in git URL via ?ref= parameter
   - Service loads environment from /etc/environment

3. Add cloud-init disk support for branch configuration
   - VMs can specify flake_branch field in terraform/vms.tf
   - Automatically generates cloud-init snippet setting NIXOS_FLAKE_BRANCH
   - Uploads snippet to Proxmox via SSH
   - Production VMs omit flake_branch and use master

4. Update documentation
   - Document --force flag usage in create-host README
   - Add branch testing examples in terraform README
   - Update TODO.md with testing workflow
   - Add .generated/ to gitignore

Testing workflow: Create feature branch, set flake_branch in VM definition,
deploy with terraform, iterate with --force flag, clean up before merging.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2026-02-01 16:34:28 +01:00
2aeed8f231 template2: add filesystem definitions to support normal builds
Some checks failed
Run nix flake check / flake-check (pull_request) Successful in 2m17s
Run nix flake check / flake-check (push) Failing after 16m59s
Add filesystem configuration matching Proxmox image builder output
to allow template2 to build with both `nixos-rebuild build` and
`nixos-rebuild build-image --image-variant proxmox`.

Filesystem specs discovered from running VM:
- ext4 filesystem with label "nixos"
- x-systemd.growfs option for automatic partition growth
- No swap partition

Using lib.mkDefault ensures these definitions work for normal builds
while allowing the Proxmox image builder to override when needed.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2026-02-01 11:17:48 +01:00