docs: move completed plans to completed folder

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-13 21:08:17 +01:00
parent ae823e439d
commit 5d3d93b280
3 changed files with 0 additions and 0 deletions


@@ -0,0 +1,183 @@
# Authentication System Replacement Plan
## Overview
Deploy a modern, unified authentication solution for the homelab that provides central user management, SSO for web services, and consistent UID/GID mapping for NAS permissions.
## Goals
1. **Central user database** - Manage users across all homelab hosts from a single source
2. **Linux PAM/NSS integration** - Users can SSH into hosts using central credentials
3. **UID/GID consistency** - Proper POSIX attributes for NAS share permissions
4. **OIDC provider** - Single sign-on for homelab web services (Grafana, etc.)
## Solution: Kanidm
Kanidm was chosen for the following reasons:
| Requirement | Kanidm Support |
|-------------|----------------|
| Central user database | Native |
| Linux PAM/NSS (host login) | Native NixOS module |
| UID/GID for NAS | POSIX attributes supported |
| OIDC for services | Built-in |
| Declarative config | Excellent NixOS provisioning |
| Simplicity | Modern API, LDAP optional |
| NixOS integration | First-class |
### Configuration Files
- **Host configuration:** `hosts/kanidm01/`
- **Service module:** `services/kanidm/default.nix`
## NAS Integration
### Current: TrueNAS CORE (FreeBSD)
TrueNAS CORE has a built-in LDAP client. Kanidm's read-only LDAP interface will work for NFS share permissions:
- **NFS shares**: Only need consistent UID/GID mapping - Kanidm's LDAP provides this
- **No SMB requirement**: SMB would need Samba schema attributes (deprecated in TrueNAS 13.0+), but we're NFS-only
Configuration approach:
1. Enable Kanidm's LDAP interface (`ldapbindaddress = "0.0.0.0:636"`)
2. Import internal CA certificate into TrueNAS
3. Configure TrueNAS LDAP client with Kanidm's Base DN and bind credentials
4. Users/groups appear in TrueNAS permission dropdowns
Note: Kanidm's LDAP is read-only and uses LDAPS only (no StartTLS). This is fine for our use case.
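The base DN needed in step 3 can be derived from the Kanidm domain. The following is a minimal sketch, assuming the Kanidm domain is `home.2rjus.net` (inferred from the SPN format used elsewhere in this plan) and that the default base DN is built from the domain's labels. The helper names `domain_to_basedn` and `ldap_search_cmd` are hypothetical; the sketch only prints an `ldapsearch` command and does not contact the server.

```bash
#!/usr/bin/env bash
# Sketch only: helper names are hypothetical, not part of Kanidm or this repo.

# Convert a DNS domain into a dc=... base DN (home.2rjus.net -> dc=home,dc=2rjus,dc=net).
domain_to_basedn() {
  local domain=$1 dn="" label
  local IFS=.
  for label in $domain; do
    dn+="${dn:+,}dc=$label"
  done
  echo "$dn"
}

# Print (do not run) an ldapsearch command for a single user over LDAPS.
# Assumption: anonymous simple bind (-x) is enough to read POSIX attributes.
ldap_search_cmd() {
  local host=$1 domain=$2 user=$3
  printf "ldapsearch -H ldaps://%s:636 -x -b '%s' '(name=%s)'\n" \
    "$host" "$(domain_to_basedn "$domain")" "$user"
}

ldap_search_cmd auth.home.2rjus.net home.2rjus.net torjus
```

The printed command can be run from any host that trusts the internal CA. If the TrueNAS client uses bind credentials, the same base DN applies.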
### Future: NixOS NAS
When the NAS is migrated to NixOS, it becomes a first-class citizen:
- Native Kanidm PAM/NSS integration (same as other hosts)
- No LDAP compatibility layer needed
- Full integration with the rest of the homelab
This future migration path is a strong argument for Kanidm over LDAP-only solutions.
## Implementation Steps
1. **Create kanidm01 host and service module**
   - Host: `kanidm01.home.2rjus.net` (10.69.13.23, test tier)
   - Service module: `services/kanidm/`
   - TLS via internal ACME (`auth.home.2rjus.net`)
   - Vault integration for idm_admin password
   - LDAPS on port 636
2. **Configure provisioning**
   - Groups provisioned declaratively: `admins`, `users`, `ssh-users`
   - Users managed imperatively via CLI (allows setting POSIX passwords in one step)
   - POSIX attributes enabled (UID/GID range 65,536-69,999)
3. **Test NAS integration** (in progress)
   - ✅ LDAP interface verified working
   - Configure TrueNAS LDAP client to connect to Kanidm
   - Verify UID/GID mapping works with NFS shares
4. **Add OIDC clients** for homelab services
   - Grafana
   - Other services as needed
5. **Create client module** in `system/` for PAM/NSS ✅
   - Module: `system/kanidm-client.nix`
   - `homelab.kanidm.enable = true` enables PAM/NSS
   - Short usernames (not SPN format)
   - Home directory symlinks via `home_alias`
   - Enabled on test tier: testvm01, testvm02, testvm03
6. **Documentation**
   - `docs/user-management.md` - CLI workflows, troubleshooting
   - User/group creation procedures verified working
## Progress
### Completed (2026-02-08)
**Kanidm server deployed on kanidm01 (test tier):**
- Host: `kanidm01.home.2rjus.net` (10.69.13.23)
- WebUI: `https://auth.home.2rjus.net`
- LDAPS: port 636
- Valid certificate from internal CA
**Configuration:**
- Kanidm 1.8 with secret provisioning support
- Daily backups at 22:00 (7 versions retained)
- Vault integration for idm_admin password
- Prometheus monitoring scrape target configured
**Provisioned entities:**
- Groups: `admins`, `users`, `ssh-users` (declarative)
- Users managed via CLI (imperative)
**Verified working:**
- WebUI login with idm_admin
- LDAP bind and search with POSIX-enabled user
- LDAPS with valid internal CA certificate
### Completed (2026-02-08) - PAM/NSS Client
**Client module deployed (`system/kanidm-client.nix`):**
- `homelab.kanidm.enable = true` enables PAM/NSS integration
- Connects to auth.home.2rjus.net
- Short usernames (`torjus` instead of `torjus@home.2rjus.net`)
- Home directory symlinks (`/home/torjus` → UUID-based dir)
- Login restricted to `ssh-users` group
**Enabled on test tier:**
- testvm01, testvm02, testvm03
**Verified working:**
- User/group resolution via `getent`
- SSH login with Kanidm unix passwords
- Home directory creation with symlinks
- Imperative user/group creation via CLI
**Documentation:**
- `docs/user-management.md` with full CLI workflows
- Password requirements (min 10 chars)
- Troubleshooting guide (nscd, cache invalidation)
### UID/GID Range (Resolved)
**Range: 65,536 - 69,999** (manually allocated)
- Users: 65,536 - 67,999 (up to ~2500 users)
- Groups: 68,000 - 69,999 (up to ~2000 groups)
Rationale:
- Starts at Kanidm's recommended minimum (65,536)
- Well above NixOS system users (typically <1000)
- Avoids Podman/container issues with very high GIDs
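Since the range is allocated by hand, a small check helps avoid assigning an ID from the wrong sub-range. This is a minimal sketch using the boundaries stated above. The function names `classify_id` and `range_size` are hypothetical and not part of Kanidm.

```bash
#!/usr/bin/env bash
# Sketch only: boundaries copied from the plan; function names are hypothetical.
USER_MIN=65536;  USER_MAX=67999
GROUP_MIN=68000; GROUP_MAX=69999

# Report which manually allocated sub-range a numeric ID falls into.
classify_id() {
  local id=$1
  if (( id >= USER_MIN && id <= USER_MAX )); then
    echo "user"
  elif (( id >= GROUP_MIN && id <= GROUP_MAX )); then
    echo "group"
  else
    echo "out-of-range"
  fi
}

# Number of IDs in an inclusive range.
range_size() {
  echo $(( $2 - $1 + 1 ))
}

echo "user IDs available:  $(range_size "$USER_MIN" "$USER_MAX")"
echo "group IDs available: $(range_size "$GROUP_MIN" "$GROUP_MAX")"
```

The inclusive arithmetic gives 2,464 user IDs and 2,000 group IDs, consistent with the approximate capacities listed above.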
### Completed (2026-02-08) - OAuth2/OIDC for Grafana
**OAuth2 client deployed for Grafana on monitoring02:**
- Client ID: `grafana`
- Redirect URL: `https://grafana-test.home.2rjus.net/login/generic_oauth`
- Scope maps: `openid`, `profile`, `email`, `groups` for `users` group
- Role mapping: `admins` group → Grafana Admin, others → Viewer
**Configuration locations:**
- Kanidm OAuth2 client: `services/kanidm/default.nix`
- Grafana OIDC config: `services/grafana/default.nix`
- Vault secret: `services/grafana/oauth2-client-secret`
**Key findings:**
- PKCE is required by Kanidm - enable `use_pkce = true` in Grafana
- Must set `email_attribute_path`, `login_attribute_path`, `name_attribute_path` to extract from userinfo
- Users need: primary credential (password + TOTP for MFA), membership in `users` group, email address set
- Unix password is separate from primary credential (web login requires primary credential)
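The findings above map onto Grafana's `auth.generic_oauth` settings. The following is a rough sketch expressed as Grafana environment overrides rather than the actual contents of `services/grafana/default.nix`. The endpoint paths follow Kanidm's documented OAuth2 URL layout. The attribute path values and the group name format in the role expression (short name versus SPN) are assumptions that should be checked against the real userinfo response. The client secret is omitted because it comes from Vault.

```bash
#!/usr/bin/env bash
# Sketch only: values marked "assumption" are illustrative, not taken from the repo.
KANIDM_URL="https://auth.home.2rjus.net"
CLIENT_ID="grafana"

export GF_AUTH_GENERIC_OAUTH_ENABLED=true
export GF_AUTH_GENERIC_OAUTH_CLIENT_ID="$CLIENT_ID"
export GF_AUTH_GENERIC_OAUTH_SCOPES="openid profile email groups"
# Kanidm rejects the flow without PKCE.
export GF_AUTH_GENERIC_OAUTH_USE_PKCE=true
export GF_AUTH_GENERIC_OAUTH_AUTH_URL="$KANIDM_URL/ui/oauth2"
export GF_AUTH_GENERIC_OAUTH_TOKEN_URL="$KANIDM_URL/oauth2/token"
export GF_AUTH_GENERIC_OAUTH_API_URL="$KANIDM_URL/oauth2/openid/$CLIENT_ID/userinfo"
# Assumption: standard OIDC claim names in the userinfo response.
export GF_AUTH_GENERIC_OAUTH_EMAIL_ATTRIBUTE_PATH="email"
export GF_AUTH_GENERIC_OAUTH_LOGIN_ATTRIBUTE_PATH="preferred_username"
export GF_AUTH_GENERIC_OAUTH_NAME_ATTRIBUTE_PATH="name"
# Assumption: the groups claim lists short names; use the SPN form if Kanidm returns it.
export GF_AUTH_GENERIC_OAUTH_ROLE_ATTRIBUTE_PATH="contains(groups[*], 'admins') && 'Admin' || 'Viewer'"
```

Environment overrides are used here only to keep the sketch self-contained. The same keys apply when the settings are written in the Nix module.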
### Next Steps
1. Enable PAM/NSS on production hosts (after test tier validation)
2. Configure TrueNAS LDAP client for NAS integration testing
3. Add OAuth2 clients for other services as needed
## References
- [Kanidm Documentation](https://kanidm.github.io/kanidm/stable/)
- [NixOS Kanidm Module](https://search.nixos.org/options?query=services.kanidm)
- [Kanidm PAM/NSS Integration](https://kanidm.github.io/kanidm/stable/pam_and_nsswitch.html)


@@ -0,0 +1,156 @@
# Nix Cache Host Reprovision
## Overview
Reprovision `nix-cache01` using the OpenTofu workflow, and improve the build/cache system with:
1. NATS-based remote build triggering (replacing the current bash script)
2. Safer flake update workflow that validates builds before pushing to master
## Status
**Phase 1: New Build Host** - COMPLETE
**Phase 2: NATS Build Triggering** - COMPLETE
**Phase 3: Safe Flake Update Workflow** - NOT STARTED
**Phase 4: Complete Migration** - COMPLETE
**Phase 5: Scheduled Builds** - COMPLETE
## Completed Work
### New Build Host (nix-cache02)
Instead of reprovisioning nix-cache01 in-place, we created a new host `nix-cache02` at 10.69.13.25:
- **Specs**: 8 CPU cores, 16GB RAM (temporary; to be increased to 24GB after nix-cache01 is decommissioned), 200GB disk
- **Provisioned via OpenTofu** with automatic Vault credential bootstrapping
- **Builder service** configured with two repos:
  - `nixos-servers` → `git+https://git.t-juice.club/torjus/nixos-servers.git`
  - `nixos` (gunter) → `git+https://git.t-juice.club/torjus/nixos.git`
### NATS-Based Build Triggering
The `homelab-deploy` tool was extended with a builder mode:
**NATS Subjects:**
- `build.<repo>.<target>` - e.g., `build.nixos-servers.all` or `build.nixos-servers.ns1`
**NATS Permissions (in DEPLOY account):**
| User | Publish | Subscribe |
|------|---------|-----------|
| Builder | `build.responses.>` | `build.>` |
| Test deployer | `deploy.test.>`, `deploy.discover`, `build.>` | `deploy.responses.>`, `deploy.discover`, `build.responses.>` |
| Admin deployer | `deploy.>`, `build.>` | `deploy.>`, `build.responses.>` |
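To make the wildcard semantics in the table concrete, here is a minimal Bash sketch of NATS subject matching, where `*` matches exactly one token and `>` matches one or more trailing tokens. The function name `nats_subject_matches` is hypothetical and not part of `homelab-deploy` or the NATS tooling. It only illustrates which subjects a given permission covers.

```bash
#!/usr/bin/env bash
# Sketch only: a simplified matcher for NATS subject patterns.

# Return 0 if SUBJECT is covered by PATTERN, 1 otherwise.
nats_subject_matches() {
  local pattern=$1 subject=$2
  local -a p s
  local i
  IFS=. read -r -a p <<< "$pattern"
  IFS=. read -r -a s <<< "$subject"
  for (( i = 0; i < ${#p[@]}; i++ )); do
    if [[ ${p[i]} == ">" ]]; then
      # '>' needs at least one remaining token in the subject.
      (( ${#s[@]} > i )) && return 0 || return 1
    fi
    # The subject ran out of tokens before the pattern did.
    (( i < ${#s[@]} )) || return 1
    # '*' matches any single token; otherwise tokens must be equal.
    [[ ${p[i]} == "*" || ${p[i]} == "${s[i]}" ]] || return 1
  done
  # Without a trailing '>', both must have the same number of tokens.
  (( ${#p[@]} == ${#s[@]} ))
}

# Which of these subjects would reach a subscriber on build.> ?
for subject in build.nixos-servers.all build.responses.abc deploy.test.ns1; do
  if nats_subject_matches "build.>" "$subject"; then
    echo "builder receives: $subject"
  fi
done
```

One consequence worth noting: a subscription on `build.>` also covers `build.responses.>`, so the builder can see response messages as well as requests.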
**Vault Secrets:**
- `shared/homelab-deploy/builder-nkey` - NKey seed for builder authentication
**NixOS Configuration:**
- `hosts/nix-cache02/builder.nix` - Builder service configuration
- `services/nats/default.nix` - Updated with builder NATS user
**MCP Integration:**
- `.mcp.json` updated with `--enable-builds` flag
- Build tool available via MCP for Claude Code
**Tested:**
- Single host build: `build nixos-servers testvm01` (~30s)
- All hosts build: `build nixos-servers all` (16 hosts in ~226s)
### Harmonia Binary Cache
- Parameterized `services/nix-cache/harmonia.nix` to use hostname-based Vault paths
- Parameterized `services/nix-cache/proxy.nix` for hostname-based domain
- New signing key: `nix-cache02.home.2rjus.net-1`
- Vault secret: `hosts/nix-cache02/cache-secret`
- Removed unused Gitea Actions runner from nix-cache01
## Current State
### nix-cache02 (Active)
- Running at 10.69.13.25
- Serving `https://nix-cache.home.2rjus.net` (canonical URL)
- Builder service active, responding to NATS build requests
- Metrics exposed on port 9973 (`homelab-deploy-builder` job)
- Harmonia binary cache server running
- Signing key: `nix-cache02.home.2rjus.net-1`
- Prod tier with `build-host` role
### nix-cache01 (Decommissioned)
- VM deleted from Proxmox
- Host configuration removed from repo
- Vault AppRole and secrets removed
- Old signing key removed from trusted-public-keys
## Remaining Work
### Phase 3: Safe Flake Update Workflow
1. Create `.github/workflows/flake-update-safe.yaml`
2. Disable or remove old `flake-update.yaml`
3. Test manually with `workflow_dispatch`
4. Monitor first automated run
### Phase 4: Complete Migration ✅
1. ~~**Add Harmonia to nix-cache02**~~ ✅ Done - new signing key, parameterized service
2. ~~**Add trusted public key to all hosts**~~ ✅ Done - `system/nix.nix` updated
3. ~~**Test cache from other hosts**~~ ✅ Done - verified from testvm01
4. ~~**Update proxy and DNS**~~ ✅ Done - `nix-cache.home.2rjus.net` CNAME now points to nix-cache02
5. ~~**Deploy to all hosts**~~ ✅ Done - all hosts have new trusted key
6. ~~**Decommission nix-cache01**~~ ✅ Done - 2026-02-10:
   - Removed `hosts/nix-cache01/` directory
   - Removed `services/nix-cache/build-flakes.{nix,sh}`
   - Removed Vault AppRole and secrets
   - Removed old signing key from `system/nix.nix`
   - Removed from `flake.nix`
   - Deleted VM from Proxmox
### Phase 5: Scheduled Builds ✅
Implemented a systemd timer on nix-cache02 that triggers builds every 2 hours:
- **Timer**: `scheduled-build.timer` runs every 2 hours with 5m random jitter
- **Service**: `scheduled-build.service` calls `homelab-deploy build` for both repos
- **Authentication**: Dedicated scheduler NKey stored in Vault
- **NATS user**: Added to DEPLOY account with publish `build.>` and subscribe `build.responses.>`
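The service's job can be pictured as a sequential loop over both repos. This is a hedged sketch, not the contents of `scheduler.nix`: the `all` target is assumed from the tested invocation earlier in this plan, the NATS connection and NKey flags are omitted because this plan does not show them, and `BUILD_CMD` and `run_scheduled_builds` are hypothetical names introduced so the loop can be dry-run.

```bash
#!/usr/bin/env bash
# Sketch only: the real service may pass extra flags (NATS URL, NKey file).
REPOS=(nixos-servers nixos)
# Hypothetical override so the loop can be exercised with e.g. BUILD_CMD=echo.
BUILD_CMD=${BUILD_CMD:-homelab-deploy}

# Trigger a build of every host for each repo, one repo at a time.
# Continues after a failure so one broken repo does not block the other.
run_scheduled_builds() {
  local repo failed=0
  for repo in "${REPOS[@]}"; do
    if ! $BUILD_CMD build "$repo" all; then
      echo "build failed for $repo" >&2
      failed=1
    fi
  done
  return "$failed"
}
```

Running `BUILD_CMD=echo` followed by `run_scheduled_builds` prints the commands instead of executing them. A non-zero return marks the systemd unit as failed, which keeps failures visible in monitoring.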
Files:
- `hosts/nix-cache02/scheduler.nix` - Timer and service configuration
- `services/nats/default.nix` - Scheduler NATS user
- `terraform/vault/secrets.tf` - Scheduler NKey secret
- `terraform/vault/variables.tf` - Variable for scheduler NKey
## Resolved Questions
- **Parallel vs sequential builds?** Sequential - hosts share packages, subsequent builds are fast after first
- **What about gunter?** Configured as `nixos` repo in builder settings
- **Disk size?** 200GB for new host
- **Build host specs?** 8 cores, 16-24GB RAM, matching the specs of the old nix-cache01
### Phase 6: Observability
1. **Alerting rules** for build failures:
   ```promql
   # Alert if any build fails
   increase(homelab_deploy_build_host_total{status="failure"}[1h]) > 0
   # Alert if no successful builds in 24h (scheduled builds stopped)
   time() - homelab_deploy_build_last_success_timestamp > 86400
   ```
2. **Grafana dashboard** for build metrics:
   - Build success/failure rate over time
   - Average build duration per host (histogram)
   - Build frequency (builds per hour/day)
   - Last successful build timestamp per repo
Available metrics:
- `homelab_deploy_builds_total{repo, status}` - total builds by repo and status
- `homelab_deploy_build_host_total{repo, host, status}` - per-host build counts
- `homelab_deploy_build_duration_seconds_{bucket,sum,count}` - build duration histogram
- `homelab_deploy_build_last_timestamp{repo}` - last build attempt
- `homelab_deploy_build_last_success_timestamp{repo}` - last successful build
## Open Questions
- [x] ~~When to cut over DNS from nix-cache01 to nix-cache02?~~ Done - 2026-02-10
- [ ] Implement safe flake update workflow before or after full migration?


@@ -0,0 +1,113 @@
# pgdb1 Decommissioning Plan
## Overview
Decommission the pgdb1 PostgreSQL server. The only consumer was Open WebUI on gunter, which has been migrated to use a local PostgreSQL instance.
## Pre-flight Verification
Before proceeding, verify that gunter is no longer using pgdb1:
1. Check Open WebUI on gunter is configured for local PostgreSQL (not 10.69.13.16)
2. Optionally: Check pgdb1 for recent connection activity:
   ```bash
   # Exclude this psql session itself so an idle server returns no rows
   ssh pgdb1 'sudo -u postgres psql -c "SELECT * FROM pg_stat_activity WHERE datname IS NOT NULL AND pid <> pg_backend_pid();"'
   ```
## Files to Remove
### Host Configuration
- `hosts/pgdb1/default.nix`
- `hosts/pgdb1/configuration.nix`
- `hosts/pgdb1/hardware-configuration.nix`
- `hosts/pgdb1/` (directory)
### Service Module
- `services/postgres/postgres.nix`
- `services/postgres/default.nix`
- `services/postgres/` (directory)
Note: This service module is only used by pgdb1, so it can be removed entirely.
### Flake Entry
Remove from `flake.nix` (lines 131-138):
```nix
pgdb1 = nixpkgs.lib.nixosSystem {
  inherit system;
  specialArgs = {
    inherit inputs self;
  };
  modules = commonModules ++ [
    ./hosts/pgdb1
  ];
};
```
### Vault AppRole
Remove from `terraform/vault/approle.tf` (lines 69-73):
```hcl
"pgdb1" = {
  paths = [
    "secret/data/hosts/pgdb1/*",
  ]
}
```
### Monitoring Rules
Remove from `services/monitoring/rules.yml` the `postgres_down` alert (lines 359-365):
```yaml
- name: postgres_rules
  rules:
    - alert: postgres_down
      expr: node_systemd_unit_state{instance="pgdb1.home.2rjus.net:9100", name="postgresql.service", state="active"} == 0
      for: 5m
      labels:
        severity: critical
```
### Utility Scripts
Delete `rebuild-all.sh` entirely (obsolete script).
## Execution Steps
### Phase 1: Verification
- [ ] Confirm Open WebUI on gunter uses local PostgreSQL
- [ ] Verify no active connections to pgdb1
### Phase 2: Code Cleanup
- [ ] Create feature branch: `git checkout -b decommission-pgdb1`
- [ ] Remove `hosts/pgdb1/` directory
- [ ] Remove `services/postgres/` directory
- [ ] Remove pgdb1 entry from `flake.nix`
- [ ] Remove postgres alert from `services/monitoring/rules.yml`
- [ ] Delete `rebuild-all.sh` (obsolete)
- [ ] Run `nix flake check` to verify no broken references
- [ ] Commit changes
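Before committing, a recursive search can catch references that `nix flake check` would not flag, such as comments or firewall rules that mention the IP. This is a minimal sketch. The function names are hypothetical, and excluding `docs` is an assumption made so that this plan itself does not count as a match.

```bash
#!/usr/bin/env bash
# Sketch only: function names and the docs/ exclusion are assumptions.

# List lines that still mention the host name or its IP address.
find_leftover_refs() {
  local root=${1:-.}
  grep -rnE --exclude-dir=.git --exclude-dir=docs \
    'pgdb1|10\.69\.13\.16([^0-9]|$)' "$root"
}

# Succeed only when nothing in the tree refers to pgdb1 any more.
check_no_leftovers() {
  if find_leftover_refs "${1:-.}"; then
    echo "leftover references found" >&2
    return 1
  fi
  echo "no references to pgdb1 remain"
}
```

Calling `check_no_leftovers .` from the repository root prints each remaining match with its file and line number, or a confirmation message when the tree is clean.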
### Phase 3: Terraform Cleanup
- [ ] Remove pgdb1 from `terraform/vault/approle.tf`
- [ ] Run `tofu plan` in `terraform/vault/` to preview changes
- [ ] Run `tofu apply` to remove the AppRole
- [ ] Commit terraform changes
### Phase 4: Infrastructure Cleanup
- [ ] Shut down pgdb1 VM in Proxmox
- [ ] Delete the VM from Proxmox
- [ ] (Optional) Remove any DNS entries if not auto-generated
### Phase 5: Finalize
- [ ] Merge feature branch to master
- [ ] Trigger auto-upgrade on DNS servers (ns1, ns2) to remove DNS entry
- [ ] Move this plan to `docs/plans/completed/`
## Rollback
If issues arise after decommissioning:
1. The VM can be recreated from the template, with its configuration restored from git history
2. Database data would need to be restored from backup (if any exists)
## Notes
- pgdb1 IP: 10.69.13.16
- The postgres service allowed connections from gunter (10.69.30.105)
- No restic backup was configured for this host