docs: move completed plans to completed folder
Some checks failed
Run nix flake check / flake-check (push) Failing after 13m22s
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
183
docs/plans/completed/auth-system-replacement.md
Normal file
@@ -0,0 +1,183 @@
# Authentication System Replacement Plan

## Overview

Deploy a modern, unified authentication solution for the homelab. Provides central user management, SSO for web services, and consistent UID/GID mapping for NAS permissions.

## Goals

1. **Central user database** - Manage users across all homelab hosts from a single source
2. **Linux PAM/NSS integration** - Users can SSH into hosts using central credentials
3. **UID/GID consistency** - Proper POSIX attributes for NAS share permissions
4. **OIDC provider** - Single sign-on for homelab web services (Grafana, etc.)

## Solution: Kanidm

Kanidm was chosen for the following reasons:

| Requirement | Kanidm Support |
|-------------|----------------|
| Central user database | Native |
| Linux PAM/NSS (host login) | Native NixOS module |
| UID/GID for NAS | POSIX attributes supported |
| OIDC for services | Built-in |
| Declarative config | Excellent NixOS provisioning |
| Simplicity | Modern API, LDAP optional |
| NixOS integration | First-class |

### Configuration Files

- **Host configuration:** `hosts/kanidm01/`
- **Service module:** `services/kanidm/default.nix`

## NAS Integration

### Current: TrueNAS CORE (FreeBSD)

TrueNAS CORE has a built-in LDAP client. Kanidm's read-only LDAP interface will work for NFS share permissions:

- **NFS shares**: Only need consistent UID/GID mapping - Kanidm's LDAP provides this
- **No SMB requirement**: SMB would need Samba schema attributes (deprecated in TrueNAS 13.0+), but we're NFS-only

Configuration approach:
1. Enable Kanidm's LDAP interface (`ldapbindaddress = "0.0.0.0:636"`)
2. Import internal CA certificate into TrueNAS
3. Configure TrueNAS LDAP client with Kanidm's Base DN and bind credentials
4. Users/groups appear in TrueNAS permission dropdowns

Note: Kanidm's LDAP is read-only and uses LDAPS only (no StartTLS). This is fine for our use case.
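
For step 3, the Base DN is normally derived from the Kanidm domain name, one `dc=` component per DNS label. The sketch below shows that derivation. It assumes the Kanidm domain is `home.2rjus.net` (inferred from the SPN format `torjus@home.2rjus.net` used later in this plan) and that no custom base DN override is configured; the function name is illustrative only.

```bash
#!/usr/bin/env bash
# Hypothetical helper: derive an LDAP base DN from a DNS domain name.
# Assumption: Kanidm uses its default base DN derived from the domain.
domain_to_basedn() {
  local domain="$1" label out=""
  local -a labels
  # Split the domain on dots into an array of labels.
  IFS='.' read -r -a labels <<< "$domain"
  for label in "${labels[@]}"; do
    # Prefix a comma for every component except the first.
    out+="${out:+,}dc=${label}"
  done
  printf '%s\n' "$out"
}
```

Calling `domain_to_basedn home.2rjus.net` prints `dc=home,dc=2rjus,dc=net`, which is the value to check against the server before entering it in the TrueNAS LDAP form.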

### Future: NixOS NAS

When the NAS is migrated to NixOS, it becomes a first-class citizen:

- Native Kanidm PAM/NSS integration (same as other hosts)
- No LDAP compatibility layer needed
- Full integration with the rest of the homelab

This future migration path is a strong argument for Kanidm over LDAP-only solutions.

## Implementation Steps

1. **Create kanidm01 host and service module** ✅
   - Host: `kanidm01.home.2rjus.net` (10.69.13.23, test tier)
   - Service module: `services/kanidm/`
   - TLS via internal ACME (`auth.home.2rjus.net`)
   - Vault integration for idm_admin password
   - LDAPS on port 636

2. **Configure provisioning** ✅
   - Groups provisioned declaratively: `admins`, `users`, `ssh-users`
   - Users managed imperatively via CLI (allows setting POSIX passwords in one step)
   - POSIX attributes enabled (UID/GID range 65,536-69,999)

3. **Test NAS integration** (in progress)
   - ✅ LDAP interface verified working
   - Configure TrueNAS LDAP client to connect to Kanidm
   - Verify UID/GID mapping works with NFS shares

4. **Add OIDC clients** for homelab services
   - Grafana
   - Other services as needed

5. **Create client module** in `system/` for PAM/NSS ✅
   - Module: `system/kanidm-client.nix`
   - `homelab.kanidm.enable = true` enables PAM/NSS
   - Short usernames (not SPN format)
   - Home directory symlinks via `home_alias`
   - Enabled on test tier: testvm01, testvm02, testvm03

6. **Documentation** ✅
   - `docs/user-management.md` - CLI workflows, troubleshooting
   - User/group creation procedures verified working

## Progress

### Completed (2026-02-08)

**Kanidm server deployed on kanidm01 (test tier):**
- Host: `kanidm01.home.2rjus.net` (10.69.13.23)
- WebUI: `https://auth.home.2rjus.net`
- LDAPS: port 636
- Valid certificate from internal CA

**Configuration:**
- Kanidm 1.8 with secret provisioning support
- Daily backups at 22:00 (7 versions retained)
- Vault integration for idm_admin password
- Prometheus monitoring scrape target configured

**Provisioned entities:**
- Groups: `admins`, `users`, `ssh-users` (declarative)
- Users managed via CLI (imperative)

**Verified working:**
- WebUI login with idm_admin
- LDAP bind and search with POSIX-enabled user
- LDAPS with valid internal CA certificate

### Completed (2026-02-08) - PAM/NSS Client

**Client module deployed (`system/kanidm-client.nix`):**
- `homelab.kanidm.enable = true` enables PAM/NSS integration
- Connects to auth.home.2rjus.net
- Short usernames (`torjus` instead of `torjus@home.2rjus.net`)
- Home directory symlinks (`/home/torjus` → UUID-based dir)
- Login restricted to `ssh-users` group

**Enabled on test tier:**
- testvm01, testvm02, testvm03

**Verified working:**
- User/group resolution via `getent`
- SSH login with Kanidm unix passwords
- Home directory creation with symlinks
- Imperative user/group creation via CLI

**Documentation:**
- `docs/user-management.md` with full CLI workflows
- Password requirements (min 10 chars)
- Troubleshooting guide (nscd, cache invalidation)

### UID/GID Range (Resolved)

**Range: 65,536 - 69,999** (manually allocated)

- Users: 65,536 - 67,999 (up to ~2500 users)
- Groups: 68,000 - 69,999 (up to ~2000 groups)

Rationale:
- Starts at Kanidm's recommended minimum (65,536)
- Well above NixOS system users (typically <1000)
- Avoids Podman/container issues with very high GIDs
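
Because IDs are allocated by hand, a quick range check before assigning one can catch mistakes. The following is a minimal sketch using only the boundaries listed above; the function and variable names are illustrative and not part of the repository.

```bash
#!/usr/bin/env bash
# Range boundaries copied from the allocation above.
USER_MIN=65536;  USER_MAX=67999
GROUP_MIN=68000; GROUP_MAX=69999

# Hypothetical helper: print which allocated range a numeric ID falls in.
classify_id() {
  local id="$1"
  # Reject anything that is not a plain non-negative integer.
  if ! [[ "$id" =~ ^[0-9]+$ ]]; then
    echo "invalid"
    return 1
  fi
  # 10# forces base 10 so leading zeros are not read as octal.
  if (( 10#$id >= USER_MIN && 10#$id <= USER_MAX )); then
    echo "user"
  elif (( 10#$id >= GROUP_MIN && 10#$id <= GROUP_MAX )); then
    echo "group"
  else
    echo "outside"
  fi
}
```

For example, `classify_id 65536` prints `user` and `classify_id 68000` prints `group`. Counted exactly, the user range holds 2,464 IDs and the group range 2,000.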

### Completed (2026-02-08) - OAuth2/OIDC for Grafana

**OAuth2 client deployed for Grafana on monitoring02:**
- Client ID: `grafana`
- Redirect URL: `https://grafana-test.home.2rjus.net/login/generic_oauth`
- Scope maps: `openid`, `profile`, `email`, `groups` for `users` group
- Role mapping: `admins` group → Grafana Admin, others → Viewer

**Configuration locations:**
- Kanidm OAuth2 client: `services/kanidm/default.nix`
- Grafana OIDC config: `services/grafana/default.nix`
- Vault secret: `services/grafana/oauth2-client-secret`

**Key findings:**
- PKCE is required by Kanidm - enable `use_pkce = true` in Grafana
- Must set `email_attribute_path`, `login_attribute_path`, `name_attribute_path` to extract from userinfo
- Users need: primary credential (password + TOTP for MFA), membership in `users` group, email address set
- Unix password is separate from primary credential (web login requires primary credential)
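
The first two findings map onto keys in Grafana's `[auth.generic_oauth]` section. The sketch below expresses them as Grafana environment-variable overrides (`GF_<SECTION>_<KEY>`). The repository configures Grafana through `services/grafana/default.nix`, so this only illustrates which keys are involved. The three attribute path values are assumptions based on standard OIDC claim names, not values taken from the actual configuration.

```bash
#!/usr/bin/env bash
# Illustrative only: Grafana generic OAuth settings as environment overrides.
export GF_AUTH_GENERIC_OAUTH_ENABLED=true
export GF_AUTH_GENERIC_OAUTH_CLIENT_ID=grafana
export GF_AUTH_GENERIC_OAUTH_SCOPES="openid profile email groups"

# Kanidm requires PKCE, so Grafana must send a code challenge.
export GF_AUTH_GENERIC_OAUTH_USE_PKCE=true

# Assumed claim names (standard OIDC claims); adjust to what userinfo returns.
export GF_AUTH_GENERIC_OAUTH_EMAIL_ATTRIBUTE_PATH=email
export GF_AUTH_GENERIC_OAUTH_LOGIN_ATTRIBUTE_PATH=preferred_username
export GF_AUTH_GENERIC_OAUTH_NAME_ATTRIBUTE_PATH=name
```

The endpoint URLs, client secret, and role mapping are left out on purpose, since their exact values live in the Nix module and Vault.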

### Next Steps

1. Enable PAM/NSS on production hosts (after test tier validation)
2. Configure TrueNAS LDAP client for NAS integration testing
3. Add OAuth2 clients for other services as needed

## References

- [Kanidm Documentation](https://kanidm.github.io/kanidm/stable/)
- [NixOS Kanidm Module](https://search.nixos.org/options?query=services.kanidm)
- [Kanidm PAM/NSS Integration](https://kanidm.github.io/kanidm/stable/pam_and_nsswitch.html)
156
docs/plans/completed/nix-cache-reprovision.md
Normal file
@@ -0,0 +1,156 @@
# Nix Cache Host Reprovision

## Overview

Reprovision `nix-cache01` using the OpenTofu workflow, and improve the build/cache system with:
1. NATS-based remote build triggering (replacing the current bash script)
2. Safer flake update workflow that validates builds before pushing to master

## Status

- **Phase 1: New Build Host** - COMPLETE
- **Phase 2: NATS Build Triggering** - COMPLETE
- **Phase 3: Safe Flake Update Workflow** - NOT STARTED
- **Phase 4: Complete Migration** - COMPLETE
- **Phase 5: Scheduled Builds** - COMPLETE

## Completed Work

### New Build Host (nix-cache02)

Instead of reprovisioning nix-cache01 in-place, we created a new host `nix-cache02` at 10.69.13.25:

- **Specs**: 8 CPU cores, 16GB RAM (temporary; to be increased to 24GB after nix-cache01 is decommissioned), 200GB disk
- **Provisioned via OpenTofu** with automatic Vault credential bootstrapping
- **Builder service** configured with two repos:
  - `nixos-servers` → `git+https://git.t-juice.club/torjus/nixos-servers.git`
  - `nixos` (gunter) → `git+https://git.t-juice.club/torjus/nixos.git`

### NATS-Based Build Triggering

The `homelab-deploy` tool was extended with a builder mode:

**NATS Subjects:**
- `build.<repo>.<target>` - e.g., `build.nixos-servers.all` or `build.nixos-servers.ns1`

**NATS Permissions (in DEPLOY account):**
| User | Publish | Subscribe |
|------|---------|-----------|
| Builder | `build.responses.>` | `build.>` |
| Test deployer | `deploy.test.>`, `deploy.discover`, `build.>` | `deploy.responses.>`, `deploy.discover`, `build.responses.>` |
| Admin deployer | `deploy.>`, `build.>` | `deploy.>`, `build.responses.>` |
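
The permission patterns above rely on NATS subject wildcards: `*` matches exactly one token and `>` matches one or more trailing tokens. To reason about which subjects a given user may publish or subscribe to, the matching rule can be sketched in a few lines of bash. This is a standalone illustration of the wildcard semantics, not code from `homelab-deploy`, and the function name is hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical helper: succeed if SUBJECT is covered by PATTERN under
# NATS wildcard rules ('*' = exactly one token, '>' = one or more
# trailing tokens, only valid as the last token of the pattern).
nats_subject_matches() {
  local pattern="$1" subject="$2" i
  local -a p s
  IFS='.' read -r -a p <<< "$pattern"
  IFS='.' read -r -a s <<< "$subject"
  for (( i = 0; i < ${#p[@]}; i++ )); do
    if [[ "${p[i]}" == ">" ]]; then
      # '>' must be last and needs at least one remaining subject token.
      if (( i == ${#p[@]} - 1 && i < ${#s[@]} )); then
        return 0
      fi
      return 1
    fi
    # Subject ran out of tokens before the pattern did.
    if (( i >= ${#s[@]} )); then
      return 1
    fi
    # Token must be the single-token wildcard or an exact literal match.
    if [[ "${p[i]}" != "*" && "${p[i]}" != "${s[i]}" ]]; then
      return 1
    fi
  done
  # Without a trailing '>', token counts must be equal.
  if (( ${#p[@]} == ${#s[@]} )); then
    return 0
  fi
  return 1
}
```

With this, `nats_subject_matches 'build.>' 'build.nixos-servers.all'` succeeds, while `nats_subject_matches 'deploy.test.>' 'deploy.prod.ns1'` fails. One consequence worth noting: `build.>` also covers subjects under `build.responses.>`, so the builder's subscription includes the response subjects it publishes to.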

**Vault Secrets:**
- `shared/homelab-deploy/builder-nkey` - NKey seed for builder authentication

**NixOS Configuration:**
- `hosts/nix-cache02/builder.nix` - Builder service configuration
- `services/nats/default.nix` - Updated with builder NATS user

**MCP Integration:**
- `.mcp.json` updated with `--enable-builds` flag
- Build tool available via MCP for Claude Code

**Tested:**
- Single host build: `build nixos-servers testvm01` (~30s)
- All hosts build: `build nixos-servers all` (16 hosts in ~226s)

### Harmonia Binary Cache

- Parameterized `services/nix-cache/harmonia.nix` to use hostname-based Vault paths
- Parameterized `services/nix-cache/proxy.nix` for hostname-based domain
- New signing key: `nix-cache02.home.2rjus.net-1`
- Vault secret: `hosts/nix-cache02/cache-secret`
- Removed unused Gitea Actions runner from nix-cache01

## Current State

### nix-cache02 (Active)
- Running at 10.69.13.25
- Serving `https://nix-cache.home.2rjus.net` (canonical URL)
- Builder service active, responding to NATS build requests
- Metrics exposed on port 9973 (`homelab-deploy-builder` job)
- Harmonia binary cache server running
- Signing key: `nix-cache02.home.2rjus.net-1`
- Prod tier with `build-host` role

### nix-cache01 (Decommissioned)
- VM deleted from Proxmox
- Host configuration removed from repo
- Vault AppRole and secrets removed
- Old signing key removed from trusted-public-keys

## Remaining Work

### Phase 3: Safe Flake Update Workflow

1. Create `.github/workflows/flake-update-safe.yaml`
2. Disable or remove old `flake-update.yaml`
3. Test manually with `workflow_dispatch`
4. Monitor first automated run

### Phase 4: Complete Migration ✅

1. ~~**Add Harmonia to nix-cache02**~~ ✅ Done - new signing key, parameterized service
2. ~~**Add trusted public key to all hosts**~~ ✅ Done - `system/nix.nix` updated
3. ~~**Test cache from other hosts**~~ ✅ Done - verified from testvm01
4. ~~**Update proxy and DNS**~~ ✅ Done - `nix-cache.home.2rjus.net` CNAME now points to nix-cache02
5. ~~**Deploy to all hosts**~~ ✅ Done - all hosts have new trusted key
6. ~~**Decommission nix-cache01**~~ ✅ Done - 2026-02-10:
   - Removed `hosts/nix-cache01/` directory
   - Removed `services/nix-cache/build-flakes.{nix,sh}`
   - Removed Vault AppRole and secrets
   - Removed old signing key from `system/nix.nix`
   - Removed from `flake.nix`
   - Deleted VM from Proxmox

### Phase 5: Scheduled Builds ✅

Implemented a systemd timer on nix-cache02 that triggers builds every 2 hours:

- **Timer**: `scheduled-build.timer` runs every 2 hours with 5m random jitter
- **Service**: `scheduled-build.service` calls `homelab-deploy build` for both repos
- **Authentication**: Dedicated scheduler NKey stored in Vault
- **NATS user**: Added to DEPLOY account with publish `build.>` and subscribe `build.responses.>`

Files:
- `hosts/nix-cache02/scheduler.nix` - Timer and service configuration
- `services/nats/default.nix` - Scheduler NATS user
- `terraform/vault/secrets.tf` - Scheduler NKey secret
- `terraform/vault/variables.tf` - Variable for scheduler NKey

## Resolved Questions

- **Parallel vs sequential builds?** Sequential - hosts share packages, so subsequent builds are fast after the first
- **What about gunter?** Configured as `nixos` repo in builder settings
- **Disk size?** 200GB for new host
- **Build host specs?** 8 cores and 16-24GB RAM, matching the old nix-cache01

### Phase 6: Observability

1. **Alerting rules** for build failures:
   ```promql
   # Alert if any build fails
   increase(homelab_deploy_build_host_total{status="failure"}[1h]) > 0

   # Alert if no successful builds in 24h (scheduled builds stopped)
   time() - homelab_deploy_build_last_success_timestamp > 86400
   ```

2. **Grafana dashboard** for build metrics:
   - Build success/failure rate over time
   - Average build duration per host (histogram)
   - Build frequency (builds per hour/day)
   - Last successful build timestamp per repo

Available metrics:
- `homelab_deploy_builds_total{repo, status}` - total builds by repo and status
- `homelab_deploy_build_host_total{repo, host, status}` - per-host build counts
- `homelab_deploy_build_duration_seconds_{bucket,sum,count}` - build duration histogram
- `homelab_deploy_build_last_timestamp{repo}` - last build attempt
- `homelab_deploy_build_last_success_timestamp{repo}` - last successful build
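
The second alert expression is a plain staleness check on `homelab_deploy_build_last_success_timestamp`. The same arithmetic can be sketched in bash for ad hoc checks against a scraped value. The function name and argument order are illustrative; nothing like this exists in the repository.

```bash
#!/usr/bin/env bash
# Hypothetical helper mirroring:
#   time() - homelab_deploy_build_last_success_timestamp > 86400
# Arguments: last success (epoch seconds), optional "now", optional max age.
build_is_stale() {
  local last_success="$1"
  local now="${2:-$(date +%s)}"
  local max_age="${3:-86400}"
  # Succeeds (exit 0) only when the last success is older than max_age.
  if (( now - last_success > max_age )); then
    return 0
  fi
  return 1
}
```

With builds scheduled every 2 hours, the 24h threshold means roughly twelve consecutive runs have to be missed or fail before the alert fires.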

## Open Questions

- [x] ~~When to cut over DNS from nix-cache01 to nix-cache02?~~ Done - 2026-02-10
- [ ] Implement safe flake update workflow before or after full migration?
113
docs/plans/completed/pgdb1-decommission.md
Normal file
@@ -0,0 +1,113 @@
# pgdb1 Decommissioning Plan

## Overview

Decommission the pgdb1 PostgreSQL server. The only consumer was Open WebUI on gunter, which has been migrated to use a local PostgreSQL instance.

## Pre-flight Verification

Before proceeding, verify that gunter is no longer using pgdb1:

1. Check Open WebUI on gunter is configured for local PostgreSQL (not 10.69.13.16)
2. Optionally: Check pgdb1 for recent connection activity:
   ```bash
   ssh pgdb1 'sudo -u postgres psql -c "SELECT * FROM pg_stat_activity WHERE datname IS NOT NULL;"'
   ```
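
To narrow the check to the one known consumer, the query can filter on `client_addr`. The sketch below builds such a query; it assumes gunter still connects from 10.69.30.105 (the address listed under Notes), and the function name is illustrative.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print a query listing sessions from one client IP.
pg_activity_sql() {
  local client_ip="$1"
  printf "SELECT pid, usename, datname, state, backend_start FROM pg_stat_activity WHERE client_addr = '%s';\n" "$client_ip"
}

# Usage (not executed here):
#   pg_activity_sql 10.69.30.105 | ssh pgdb1 'sudo -u postgres psql'
```

Keep in mind that `pg_stat_activity` only lists sessions that are open at the moment of the query, so an empty result shows there is no connection right now, not that there has been none recently.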

## Files to Remove

### Host Configuration
- `hosts/pgdb1/default.nix`
- `hosts/pgdb1/configuration.nix`
- `hosts/pgdb1/hardware-configuration.nix`
- `hosts/pgdb1/` (directory)

### Service Module
- `services/postgres/postgres.nix`
- `services/postgres/default.nix`
- `services/postgres/` (directory)

Note: This service module is only used by pgdb1, so it can be removed entirely.

### Flake Entry
Remove from `flake.nix` (lines 131-138):
```nix
pgdb1 = nixpkgs.lib.nixosSystem {
  inherit system;
  specialArgs = {
    inherit inputs self;
  };
  modules = commonModules ++ [
    ./hosts/pgdb1
  ];
};
```

### Vault AppRole
Remove from `terraform/vault/approle.tf` (lines 69-73):
```hcl
"pgdb1" = {
  paths = [
    "secret/data/hosts/pgdb1/*",
  ]
}
```

### Monitoring Rules
Remove from `services/monitoring/rules.yml` the `postgres_down` alert (lines 359-365):
```yaml
- name: postgres_rules
  rules:
    - alert: postgres_down
      expr: node_systemd_unit_state{instance="pgdb1.home.2rjus.net:9100", name="postgresql.service", state="active"} == 0
      for: 5m
      labels:
        severity: critical
```

### Utility Scripts
Delete `rebuild-all.sh` entirely (obsolete script).

## Execution Steps

### Phase 1: Verification
- [ ] Confirm Open WebUI on gunter uses local PostgreSQL
- [ ] Verify no active connections to pgdb1

### Phase 2: Code Cleanup
- [ ] Create feature branch: `git checkout -b decommission-pgdb1`
- [ ] Remove `hosts/pgdb1/` directory
- [ ] Remove `services/postgres/` directory
- [ ] Remove pgdb1 entry from `flake.nix`
- [ ] Remove postgres alert from `services/monitoring/rules.yml`
- [ ] Delete `rebuild-all.sh` (obsolete)
- [ ] Run `nix flake check` to verify no broken references
- [ ] Commit changes
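
`nix flake check` covers what the flake evaluates, but it may not notice stale mentions in files such as alert rules or Terraform definitions. A plain text search before committing can complement it. The following is a minimal sketch; the function name and the choice to skip `.git` and any `plans` directory are assumptions made for illustration.

```bash
#!/usr/bin/env bash
# Hypothetical helper: list files under DIR that still mention NAME,
# skipping .git and any directory named "plans" (where this document lives).
find_leftover_refs() {
  local name="$1" dir="${2:-.}"
  # -r recurse, -I skip binary files, -l print only matching file names.
  # '|| true' keeps an empty result from being treated as an error.
  grep -rIl --exclude-dir=.git --exclude-dir=plans -- "$name" "$dir" || true
}

# Usage (from the repository root, not executed here):
#   find_leftover_refs pgdb1
```

An empty result means no remaining textual references outside the excluded directories. Any file that is listed still needs a manual look, since a match may be intentional (for example, a changelog entry).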

### Phase 3: Terraform Cleanup
- [ ] Remove pgdb1 from `terraform/vault/approle.tf`
- [ ] Run `tofu plan` in `terraform/vault/` to preview changes
- [ ] Run `tofu apply` to remove the AppRole
- [ ] Commit terraform changes

### Phase 4: Infrastructure Cleanup
- [ ] Shut down pgdb1 VM in Proxmox
- [ ] Delete the VM from Proxmox
- [ ] (Optional) Remove any DNS entries if not auto-generated

### Phase 5: Finalize
- [ ] Merge feature branch to master
- [ ] Trigger auto-upgrade on DNS servers (ns1, ns2) to remove DNS entry
- [ ] Move this plan to `docs/plans/completed/`

## Rollback

If issues arise after decommissioning:
1. The VM can be recreated from the template, using the host configuration from git history
2. Database data would need to be restored from backup (if any exists)

## Notes

- pgdb1 IP: 10.69.13.16
- The postgres service allowed connections from gunter (10.69.30.105)
- No restic backup was configured for this host