New test-tier host for monitoring stack expansion with:
- Static IP 10.69.13.24
- 4 CPU cores, 4GB RAM, 20GB disk
- Vault integration and NATS-based deployment enabled
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Adds a system-wide script for sending command output or interactive
sessions to Loki for easy sharing with Claude.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- CLI workflows for creating users and groups
- Troubleshooting guide (nscd, cache invalidation)
- Home directory behavior (UUID-based with symlinks)
- Update auth-system-replacement plan with progress
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Keep base groups (admins, users, ssh-users) provisioned declaratively
but manage regular users via the kanidm CLI. This allows setting POSIX
attributes and passwords in a single workflow.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add homelab.kanidm.enable option for central authentication via Kanidm.
The module configures:
- PAM/NSS integration with kanidm-unixd
- Client connection to auth.home.2rjus.net
- Login authorization for ssh-users group
Enable on testvm01-03 for testing.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Provides compressed swap in RAM to prevent OOM kills during
nixos-rebuild on low-memory VMs (2GB). Removes duplicate zram
configs from jelly01 and nix-cache01.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Based on security review findings, covering SSH hardening, firewall
enablement, log transport TLS, security alerting, and secrets management.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Changed section 4 from "if needed" to always spawn auditor
- Added explicit "Do NOT query audit logs yourself" guidance
- Listed specific scenarios requiring auditor (service stopped, etc.)
- Added manual intervention as first common cause
- Updated guidelines to emphasize mandatory delegation
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add new auditor agent for security-focused audit log analysis:
- SSH session tracking, command execution, sudo usage
- Suspicious activity detection patterns
- Can be used standalone or as sub-agent by investigate-alarm
Update investigate-alarm to delegate audit analysis to auditor
and add git-explorer MCP for configuration drift detection.
Add git-explorer to .mcp.json for repository inspection.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Kanidm does not expose a Prometheus /metrics endpoint.
The scrape target was causing 404 errors after the TLS
certificate issue was fixed.
Also add SSH command restriction to CLAUDE.md.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Include both auth.home.2rjus.net (CNAME) and kanidm01.home.2rjus.net
(A record) as SANs in the TLS certificate. This fixes Prometheus
scraping, which connects via the host's own name rather than the CNAME.
Fixes: x509: certificate is valid for auth.home.2rjus.net, not kanidm01.home.2rjus.net
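For illustration, the failure mode can be sketched as a plain name check
against the certificate's DNS SANs. This is a simplified model, not how the
Go x509 verifier is implemented: the function name is hypothetical and
wildcard SANs are deliberately ignored.

```python
def hostname_matches_sans(hostname: str, sans: list[str]) -> bool:
    """Exact, case-insensitive match of a hostname against DNS SAN entries.

    Simplified on purpose: no wildcard handling, no IP SANs.
    """
    wanted = hostname.lower().rstrip(".")
    return any(wanted == san.lower().rstrip(".") for san in sans)


# Before the fix: only the CNAME is listed, so the scrape by host name fails.
old_sans = ["auth.home.2rjus.net"]
# After the fix: both names are listed.
new_sans = ["auth.home.2rjus.net", "kanidm01.home.2rjus.net"]

print(hostname_matches_sans("kanidm01.home.2rjus.net", old_sans))
print(hostname_matches_sans("kanidm01.home.2rjus.net", new_sans))
```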
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add best practices for querying Loki to avoid overwhelming responses:
- Start with narrow filters and small limits
- Filter audit logs to EXECVE only
- Exclude verbose noise (PATH, PROCTITLE, SYSCALL, BPF)
- Expand queries incrementally if needed
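A minimal sketch of the "narrow first" approach, assuming queries are
assembled as LogQL strings before being sent. The helper names, the default
limit of 20, and the host name testvm01 are illustrative, not taken from the
skill itself.

```python
def build_audit_query(host: str, limit: int = 20) -> dict:
    """Build a narrow query: one host, EXECVE records only, small limit."""
    # Label selector followed by a line filter that keeps only EXECVE lines.
    logql = f'{{host="{host}"}} |= "EXECVE"'
    return {"query": logql, "limit": limit}


def build_noise_filtered_query(
    host: str, noise=("PATH", "PROCTITLE", "SYSCALL", "BPF")
) -> str:
    """Alternative: keep all records except the verbose audit record types."""
    # Each != filter drops lines containing the given substring.
    filters = " ".join(f'!= "{n}"' for n in noise)
    return f'{{host="{host}"}} {filters}'


print(build_audit_query("testvm01"))
print(build_noise_filtered_query("testvm01"))
```

Raising the limit or dropping a filter is then a one-argument change, which
keeps the "expand incrementally" step explicit.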
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Enable Linux audit to log execve syscalls from interactive SSH sessions.
Uses auid filter to exclude system services and nix builds.
Logs forwarded to journald for Loki ingestion. Query with:
    {host="testvmXX"} |= "EXECVE"
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Sub-agent for investigating system alarms using Prometheus metrics
and Loki logs. Provides root cause analysis with timeline of events.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Move detailed Prometheus/Loki reference from CLAUDE.md to the
  observability skill
- Add complete list of Prometheus jobs organized by category
- Add bootstrap log documentation with stages table
- Add kanidm01 to host labels table
- CLAUDE.md now references the skill instead of duplicating info
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Document the end-to-end host creation workflow including:
- Prerequisites and step-by-step process
- Tier specification (test vs prod)
- Bootstrap observability via Loki
- Verification steps
- Troubleshooting guide
- Related files reference
Update CLAUDE.md to reference the new document.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Plan for migrating from Prometheus to VictoriaMetrics on new monitoring02
host with parallel operation, declarative Grafana dashboards, and CNAME-based
cutover.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Mark completed implementation steps
- Document deployed kanidm01 configuration
- Record UID/GID range decision (65,536-69,999)
- Add verified working items (WebUI, LDAP, certs)
- Update next steps and resolved questions
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Set owner/group to kanidm so the post-start provisioning
script can read the idm_admin password.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- New test-tier VM at 10.69.13.23 with role=auth
- Kanidm 1.8 server with HTTPS (443) and LDAPS (636)
- ACME certificate from internal CA (auth.home.2rjus.net)
- Provisioned groups: admins, users, ssh-users
- Provisioned user: torjus
- Daily backups at 22:00 (7 versions)
- Prometheus monitoring scrape target
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
New VMs bootstrapped from template2 will now use the local nix cache
during initial nixos-rebuild, speeding up bootstrap times.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
ns1 needs access to shared/dns/* for zone transfer key and
shared/homelab-deploy/* for the NATS listener.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Old VM had incorrect hardware-configuration.nix with hardcoded UUIDs
that didn't match actual disk layout, causing boot failure (emergency mode).
Recreated using template2-based configuration for OpenTofu provisioning.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Remove pgdb1 host configuration and postgres service module.
The only consumer (Open WebUI on gunter) has migrated to local PostgreSQL.
Removed:
- hosts/pgdb1/ - host configuration
- services/postgres/ - service module (only used by pgdb1)
- postgres_rules from monitoring rules
- rebuild-all.sh (obsolete script)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The only consumer was Open WebUI on gunter, which will migrate to local
PostgreSQL. Removed the pgdb1 backup/migration phases and added pgdb1 to
the decommission list.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- ns2 migrated to OpenTofu
- testvm02, testvm03 added to managed hosts
- Remove vaulttest01 (no longer exists)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Configure Unbound to query both ns1 and ns2 for the home.2rjus.net
zone, in addition to local NSD. This provides redundancy during
bootstrap or if local NSD is temporarily unavailable.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Bootstrap times can be improved by configuring the base template
to use the local nix cache during initial builds.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Remove hosts/template/ (legacy template1) and give each legacy host
  its own hardware-configuration.nix copy
- Recreate ns2 using create-host with template2 base
- Add secondary DNS services (NSD + Unbound resolver)
- Configure Vault policy for shared DNS secrets
- Fix create-host IP uniqueness validator to check CIDR notation
  (prevents false positives from DNS resolver entries)
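A rough sketch of the CIDR-only check, assuming the validator scans existing
configuration text for addresses. The function name, the regex, and the
sample addresses are hypothetical and are not copied from create-host.

```python
import ipaddress
import re

# Match only IPv4 addresses written with a prefix length, e.g. "10.69.13.5/24".
# Bare addresses (such as DNS resolver entries) do not match.
CIDR_RE = re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}/\d{1,2}\b")


def ip_in_use(candidate: str, config_text: str) -> bool:
    """Return True if candidate's address appears in CIDR form in config_text."""
    wanted = ipaddress.ip_interface(candidate).ip
    for match in CIDR_RE.finditer(config_text):
        try:
            found = ipaddress.ip_interface(match.group(0)).ip
        except ValueError:
            # Looks like CIDR but is not a valid address or prefix; skip it.
            continue
        if found == wanted:
            return True
    return False


# Made-up sample: one host address in CIDR form, one bare resolver address.
sample = 'address = "10.69.13.5/24";\nnameservers = [ "10.69.13.6" ];'
print(ip_in_use("10.69.13.5/24", sample))  # host address is taken
print(ip_in_use("10.69.13.6/24", sample))  # resolver entry is not a conflict
```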
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Remove secrets/ directory (sops-nix no longer in use, all hosts use Vault)
- Move TODO.md to docs/plans/completed/automated-host-deployment-pipeline.md
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
All secrets are now managed by OpenBao (Vault). Remove the legacy
sops-nix infrastructure that is no longer in use.
Removed:
- sops-nix flake input
- system/sops.nix module
- .sops.yaml configuration file
- Age key generation from template prepare-host scripts
Updated:
- flake.nix - removed sops-nix references from all hosts
- flake.lock - removed sops-nix input
- scripts/create-host/ - removed sops references
- CLAUDE.md - removed SOPS documentation
Note: the secrets/ directory must be removed manually by the user.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>