43 Commits

Author SHA1 Message Date
9da57c6a2f flake.lock: Update
Flake lock file updates:

• Updated input 'nixpkgs-unstable':
    'github:nixos/nixpkgs/e6eae2ee2110f3d31110d5c222cd395303343b08?narHash=sha256-KHFT9UWOF2yRPlAnSXQJh6uVcgNcWlFqqiAZ7OVlHNc%3D' (2026-02-03)
  → 'github:nixos/nixpkgs/bf922a59c5c9998a6584645f7d0de689512e444c?narHash=sha256-ksTL7P9QC1WfZasNlaAdLOzqD8x5EPyods69YBqxSfk%3D' (2026-02-04)
2026-02-05 00:01:37 +00:00
da9dd02d10 Merge pull request 'monitoring: auto-generate Prometheus scrape targets from host configs' (#16) from monitoring-improvements into master
All checks were successful
Run nix flake check / flake-check (push) Successful in 6m32s
Periodic flake update / flake-update (push) Successful in 1m54s
Reviewed-on: #16
2026-02-04 23:53:46 +00:00
e7980978c7 docs: document monitoring auto-generation in CLAUDE.md
Some checks failed
Run nix flake check / flake-check (push) Failing after 5m33s
Run nix flake check / flake-check (pull_request) Successful in 6m48s
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-05 00:52:39 +01:00
dd1b64de27 monitoring: auto-generate Prometheus scrape targets from host configs
Some checks failed
Run nix flake check / flake-check (pull_request) Successful in 2m49s
Run nix flake check / flake-check (push) Has been cancelled
Add homelab.monitoring NixOS options (enable, scrapeTargets) following
the same pattern as homelab.dns. Prometheus scrape configs are now
auto-generated from flake host configurations and external targets,
replacing hardcoded target lists.

Also cleans up alert rules: snake_case naming, fix zigbee2mqtt typo,
remove duplicate pushgateway alert, add for clauses to monitoring_rules,
remove hardcoded WireGuard public key, and add new alerts for
certificates, proxmox, caddy, smartctl temperature, filesystem
prediction, systemd state, file descriptors, and host reboots.

Fixes grafana scrape target port from 3100 to 3000.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-05 00:49:07 +01:00
4e8cc124f2 docs: add plan management workflow and lab-monitoring MCP server
Some checks failed
Run nix flake check / flake-check (push) Failing after 11m30s
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-05 00:21:08 +01:00
a2a55f3955 docs: add docs directory info and nixos options improvement plan
Some checks failed
Run nix flake check / flake-check (push) Failing after 4m12s
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-04 22:27:11 +01:00
c38034ba41 docs: rewrite README with current infrastructure overview
Some checks failed
Run nix flake check / flake-check (push) Failing after 4m41s
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-04 22:20:49 +01:00
d7d4b0846c docs: move dns-automation plan to completed
All checks were successful
Run nix flake check / flake-check (push) Successful in 2m17s
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-04 22:13:38 +01:00
8ca7c4e402 Merge pull request 'dns-automation' (#15) from dns-automation into master
All checks were successful
Run nix flake check / flake-check (push) Successful in 2m18s
Reviewed-on: #15
2026-02-04 21:02:24 +00:00
106912499b docs: add git workflow note about not committing to master
Some checks failed
Run nix flake check / flake-check (pull_request) Successful in 2m16s
Run nix flake check / flake-check (push) Failing after 17m2s
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-04 21:57:40 +01:00
83af00458b dns: remove defunct external hosts
Remove hosts that no longer respond to ping:
- kube-blue1-10 (entire k8s cluster)
- virt-mini1, mpnzb, inc2, testing
- CNAMEs: rook, git (pointed to removed kube-blue nodes)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-04 21:50:56 +01:00
67d5de3eb8 docs: update CLAUDE.md for DNS automation
- Add /modules/ and /lib/ to directory structure
- Document homelab.dns options and zone auto-generation
- Update "Adding a New Host" workflow (no manual zone editing)
- Expand DNS Architecture section with auto-generation details

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-04 21:45:16 +01:00
cee1b264cd dns: auto-generate zone entries from host configurations
Replace static zone file with dynamically generated records:
- Add homelab.dns module with enable/cnames options
- Extract IPs from systemd.network configs (filters VPN interfaces)
- Use git commit timestamp as zone serial number
- Move external hosts to separate external-hosts.nix

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-04 21:43:44 +01:00
4ceee04308 docs: update MCP config for nixpkgs-options and add nixpkgs-packages
Some checks failed
Run nix flake check / flake-check (push) Failing after 14m50s
Rename nixos-options to nixpkgs-options and add new nixpkgs-packages
server for package search functionality. Update CLAUDE.md to document
both MCP servers and their available tools.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-04 20:50:36 +01:00
e3ced5bcda flake.lock: Update
Flake lock file updates:

• Updated input 'nixpkgs':
    'github:nixos/nixpkgs/41e216c0ca66c83b12ab7a98cc326b5db01db646?narHash=sha256-I7Lmgj3owOTBGuauy9FL6qdpeK2umDoe07lM4V%2BPnyA%3D' (2026-01-31)
  → 'github:nixos/nixpkgs/e576e3c9cf9bad747afcddd9e34f51d18c855b4e?narHash=sha256-tlFqNG/uzz2%2B%2BaAmn4v8J0vAkV3z7XngeIIB3rM3650%3D' (2026-02-03)
• Updated input 'nixpkgs-unstable':
    'github:nixos/nixpkgs/cb369ef2efd432b3cdf8622b0ffc0a97a02f3137?narHash=sha256-VKS4ZLNx4PNrABoB0L8KUpc1fE7CLpQXQs985tGfaCU%3D' (2026-02-02)
  → 'github:nixos/nixpkgs/e6eae2ee2110f3d31110d5c222cd395303343b08?narHash=sha256-KHFT9UWOF2yRPlAnSXQJh6uVcgNcWlFqqiAZ7OVlHNc%3D' (2026-02-03)
• Updated input 'sops-nix':
    'github:Mic92/sops-nix/1e89149dcfc229e7e2ae24a8030f124a31e4f24f?narHash=sha256-twBMKGQvaztZQxFxbZnkg7y/50BW9yjtCBWwdjtOZew%3D' (2026-02-01)
  → 'github:Mic92/sops-nix/17eea6f3816ba6568b8c81db8a4e6ca438b30b7c?narHash=sha256-ktjWTq%2BD5MTXQcL9N6cDZXUf9kX8JBLLBLT0ZyOTSYY%3D' (2026-02-03)
2026-02-04 00:01:04 +00:00
15459870cd Merge pull request 'backup: migrate to native services.restic.backups' (#14) from migrate-to-native-restic-backups into master
All checks were successful
Run nix flake check / flake-check (push) Successful in 4m4s
Periodic flake update / flake-update (push) Successful in 1m10s
Reviewed-on: #14
2026-02-03 23:47:11 +00:00
d1861eefb5 docs: add clipboard note and update flake inputs
Some checks failed
Run nix flake check / flake-check (push) Successful in 4m10s
Run nix flake check / flake-check (pull_request) Failing after 18m29s
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-04 00:45:37 +01:00
d25fc99e1d backup: migrate to native services.restic.backups
Some checks failed
Run nix flake check / flake-check (push) Has been cancelled
Run nix flake check / flake-check (pull_request) Successful in 4m0s
Replace custom backup-helper flake input with NixOS native
services.restic.backups module for ha1, monitoring01, and nixos-test1.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-04 00:41:40 +01:00
b5da9431aa docs: add nixos-options MCP configuration
Some checks failed
Run nix flake check / flake-check (push) Failing after 13m51s
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-04 00:01:00 +01:00
0e5dea635e Merge pull request 'create-host: add delete feature' (#13) from create-host-delete-feature into master
All checks were successful
Run nix flake check / flake-check (push) Successful in 2m20s
Reviewed-on: #13
2026-02-03 12:06:32 +00:00
86249c466b create-host: add delete feature
Some checks failed
Run nix flake check / flake-check (push) Failing after 21m31s
Run nix flake check / flake-check (pull_request) Failing after 15m17s
2026-02-03 12:11:41 +01:00
5d560267cf Merge pull request 'pki-migration' (#12) from pki-migration into master
All checks were successful
Run nix flake check / flake-check (push) Successful in 2m21s
Reviewed-on: #12
2026-02-03 05:56:53 +00:00
63662b89e0 docs: update TODO.md
All checks were successful
Run nix flake check / flake-check (push) Successful in 2m51s
Run nix flake check / flake-check (pull_request) Successful in 2m53s
2026-02-03 06:53:59 +01:00
7ae474fd3e pki: add new vault root ca to pki 2026-02-03 06:53:59 +01:00
f0525b5c74 ns: add vaulttest01 to zone
All checks were successful
Run nix flake check / flake-check (push) Successful in 2m19s
2026-02-03 06:42:05 +01:00
42c391b355 ns: add vault cname to zone
Some checks failed
Run nix flake check / flake-check (push) Failing after 4m7s
2026-02-03 06:00:59 +01:00
048536ba70 docs: move dns automation from TODO.md to nixos-improvements.md
All checks were successful
Run nix flake check / flake-check (push) Successful in 2m20s
2026-02-03 04:51:27 +01:00
cccce09406 Merge pull request 'vault: implement bootstrap integration' (#11) from vault-bootstrap-integration into master
Some checks failed
Run nix flake check / flake-check (push) Has been cancelled
Reviewed-on: #11
2026-02-03 03:46:25 +00:00
01d4812280 vault: implement bootstrap integration
Some checks failed
Run nix flake check / flake-check (push) Successful in 2m31s
Run nix flake check / flake-check (pull_request) Failing after 14m16s
2026-02-03 01:10:36 +01:00
b5364d2ccc flake.lock: Update
Flake lock file updates:

• Updated input 'nixpkgs-unstable':
    'github:nixos/nixpkgs/62c8382960464ceb98ea593cb8321a2cf8f9e3e5?narHash=sha256-kKB3bqYJU5nzYeIROI82Ef9VtTbu4uA3YydSk/Bioa8%3D' (2026-01-30)
  → 'github:nixos/nixpkgs/cb369ef2efd432b3cdf8622b0ffc0a97a02f3137?narHash=sha256-VKS4ZLNx4PNrABoB0L8KUpc1fE7CLpQXQs985tGfaCU%3D' (2026-02-02)
2026-02-03 00:01:39 +00:00
7fc69c40a6 docs: add truenas-migration plan
All checks were successful
Run nix flake check / flake-check (push) Successful in 2m18s
Periodic flake update / flake-update (push) Successful in 1m13s
2026-02-02 18:29:11 +01:00
34a2f2ab50 docs: add infrastructure documentation
Some checks failed
Run nix flake check / flake-check (push) Failing after 11m9s
2026-02-02 17:36:55 +01:00
16b3214982 Merge pull request 'vault-setup' (#10) from vault-setup into master
Some checks failed
Run nix flake check / flake-check (push) Failing after 6m19s
Reviewed-on: #10
2026-02-02 15:28:58 +00:00
244dd0c78b flake.lock: Update
Flake lock file updates:

• Updated input 'nixpkgs':
    'github:nixos/nixpkgs/fa83fd837f3098e3e678e6cf017b2b36102c7211?narHash=sha256-e7VO/kGLgRMbWtpBqdWl0uFg8Y2XWFMdz0uUJvlML8o%3D' (2026-01-28)
  → 'github:nixos/nixpkgs/41e216c0ca66c83b12ab7a98cc326b5db01db646?narHash=sha256-I7Lmgj3owOTBGuauy9FL6qdpeK2umDoe07lM4V%2BPnyA%3D' (2026-01-31)
• Updated input 'nixpkgs-unstable':
    'github:nixos/nixpkgs/bfc1b8a4574108ceef22f02bafcf6611380c100d?narHash=sha256-msG8SU5WsBUfVVa/9RPLaymvi5bI8edTavbIq3vRlhI%3D' (2026-01-26)
  → 'github:nixos/nixpkgs/62c8382960464ceb98ea593cb8321a2cf8f9e3e5?narHash=sha256-kKB3bqYJU5nzYeIROI82Ef9VtTbu4uA3YydSk/Bioa8%3D' (2026-01-30)
• Updated input 'sops-nix':
    'github:Mic92/sops-nix/c5eebd4eb2e3372fe12a8d70a248a6ee9dd02eff?narHash=sha256-wFcr32ZqspCxk4%2BFvIxIL0AZktRs6DuF8oOsLt59YBU%3D' (2026-01-26)
  → 'github:Mic92/sops-nix/1e89149dcfc229e7e2ae24a8030f124a31e4f24f?narHash=sha256-twBMKGQvaztZQxFxbZnkg7y/50BW9yjtCBWwdjtOZew%3D' (2026-02-01)
2026-02-02 00:00:56 +00:00
238ad45c14 chore: update TODO.md
Some checks failed
Run nix flake check / flake-check (push) Failing after 4m53s
Run nix flake check / flake-check (pull_request) Successful in 2m16s
2026-02-02 00:47:31 +01:00
c694b9889a vault: add auto-unseal
All checks were successful
Run nix flake check / flake-check (push) Successful in 2m16s
2026-02-02 00:28:24 +01:00
3f2f91aedd terraform: add vault pki management to terraform
Some checks failed
Run nix flake check / flake-check (push) Has been cancelled
2026-02-01 23:23:03 +01:00
5d513fd5af terraform: add vault secret managment to terraform 2026-02-01 23:07:47 +01:00
b6f1e80c2a chore: run tofu fmt 2026-02-01 23:04:02 +01:00
4133eafc4e flake: add openbao to devshell
Some checks failed
Run nix flake check / flake-check (push) Failing after 18m52s
2026-02-01 22:16:52 +01:00
ace848b29c vault: replace vault with openbao 2026-02-01 22:16:52 +01:00
b012df9f34 ns: add vault01 host to zone
Some checks failed
Run nix flake check / flake-check (push) Failing after 15m40s
Periodic flake update / flake-update (push) Successful in 1m7s
2026-02-01 20:54:22 +01:00
ab053c25bd opentofu: add tmp device to vms 2026-02-01 20:54:05 +01:00
72 changed files with 5550 additions and 544 deletions

9
.gitignore vendored
View File

@@ -10,3 +10,12 @@ terraform/terraform.tfvars
terraform/*.auto.tfvars
terraform/crash.log
terraform/crash.*.log
terraform/vault/.terraform/
terraform/vault/.terraform.lock.hcl
terraform/vault/*.tfstate
terraform/vault/*.tfstate.*
terraform/vault/terraform.tfvars
terraform/vault/*.auto.tfvars
terraform/vault/crash.log
terraform/vault/crash.*.log

27
.mcp.json Normal file
View File

@@ -0,0 +1,27 @@
{
"mcpServers": {
"nixpkgs-options": {
"command": "nix",
"args": ["run", "git+https://git.t-juice.club/torjus/labmcp#nixpkgs-search", "--", "options", "serve"],
"env": {
"NIXPKGS_SEARCH_DATABASE": "sqlite:///run/user/1000/labmcp/nixpkgs-search.db"
}
},
"nixpkgs-packages": {
"command": "nix",
"args": ["run", "git+https://git.t-juice.club/torjus/labmcp#nixpkgs-search", "--", "packages", "serve"],
"env": {
"NIXPKGS_SEARCH_DATABASE": "sqlite:///run/user/1000/labmcp/nixpkgs-search.db"
}
},
"lab-monitoring": {
"command": "nix",
"args": ["run", "git+https://git.t-juice.club/torjus/labmcp#lab-monitoring", "--", "serve", "--enable-silences"],
"env": {
"PROMETHEUS_URL": "https://prometheus.home.2rjus.net",
"ALERTMANAGER_URL": "https://alertmanager.home.2rjus.net"
}
}
}
}

153
CLAUDE.md
View File

@@ -21,6 +21,16 @@ nixos-rebuild build --flake .#<hostname>
nix build .#nixosConfigurations.<hostname>.config.system.build.toplevel
```
**Important:** Do NOT pipe `nix build` commands to other commands like `tail` or `head`. Piping can hide errors and make builds appear successful when they actually failed. Always run `nix build` without piping to see the full output.
```bash
# BAD - hides errors
nix build .#create-host 2>&1 | tail -20
# GOOD - shows all output and errors
nix build .#create-host
```
### Deployment
Do not automatically deploy changes. Deployments are usually done by updating the master branch, and then triggering the auto update on the specific host.
@@ -44,6 +54,19 @@ nix develop
Secrets are handled by sops. Do not edit any `.sops.yaml` or any file within `secrets/`. Ask the user to modify if necessary.
### Git Workflow
**Important:** Never commit directly to `master` unless the user explicitly asks for it. Always create a feature branch for changes.
When starting a new plan or task, the first step should typically be to create and checkout a new branch with an appropriate name (e.g., `git checkout -b dns-automation` or `git checkout -b fix-nginx-config`).
### Plan Management
When creating plans for large features, follow this workflow:
1. When implementation begins, save a copy of the plan to `docs/plans/` (e.g., `docs/plans/feature-name.md`)
2. Once the feature is fully implemented, move the plan to `docs/plans/completed/`
### Git Commit Messages
Commit messages should follow the format: `topic: short description`
@@ -53,24 +76,66 @@ Examples:
- `template2: add proxmox image configuration`
- `terraform: add VM deployment configuration`
### Clipboard
To copy text to the clipboard, pipe to `wl-copy` (Wayland):
```bash
echo "text" | wl-copy
```
### NixOS Options and Packages Lookup
Two MCP servers are available for searching NixOS options and packages:
- **nixpkgs-options** - Search and lookup NixOS configuration option documentation
- **nixpkgs-packages** - Search and lookup Nix packages from nixpkgs
**Session Setup:** At the start of each session, index the nixpkgs revision from `flake.lock` to ensure documentation matches the project's nixpkgs version:
1. Read `flake.lock` and find the `nixpkgs` node's `rev` field
2. Call `index_revision` with that git hash (both servers share the same index)
**Options Tools (nixpkgs-options):**
- `search_options` - Search for options by name or description (e.g., query "nginx" or "postgresql")
- `get_option` - Get full details for a specific option (e.g., `services.loki.configuration`)
- `get_file` - Fetch the source file from nixpkgs that declares an option
**Package Tools (nixpkgs-packages):**
- `search_packages` - Search for packages by name or description (e.g., query "nginx" or "python")
- `get_package` - Get full details for a specific package by attribute path (e.g., `firefox`, `python312Packages.requests`)
- `get_file` - Fetch the source file from nixpkgs that defines a package
This ensures documentation matches the exact nixpkgs version (currently NixOS 25.11) used by this flake.
## Architecture
### Directory Structure
- `/flake.nix` - Central flake defining all 16 NixOS configurations
- `/flake.nix` - Central flake defining all NixOS configurations
- `/hosts/<hostname>/` - Per-host configurations
- `default.nix` - Entry point, imports configuration.nix and services
- `configuration.nix` - Host-specific settings (networking, hardware, users)
- `/system/` - Shared system-level configurations applied to ALL hosts
- Core modules: nix.nix, sshd.nix, sops.nix, acme.nix, autoupgrade.nix
- Monitoring: node-exporter and promtail on every host
- `/modules/` - Custom NixOS modules
- `homelab/` - Homelab-specific options (DNS automation, monitoring scrape targets)
- `/lib/` - Nix library functions
- `dns-zone.nix` - DNS zone generation functions
- `monitoring.nix` - Prometheus scrape target generation functions
- `/services/` - Reusable service modules, selectively imported by hosts
- `home-assistant/` - Home automation stack
- `monitoring/` - Observability stack (Prometheus, Grafana, Loki, Tempo)
- `ns/` - DNS services (authoritative, resolver)
- `ns/` - DNS services (authoritative, resolver, zone generation)
- `http-proxy/`, `ca/`, `postgres/`, `nats/`, `jellyfin/`, etc.
- `/secrets/` - SOPS-encrypted secrets with age encryption
- `/common/` - Shared configurations (e.g., VM guest agent)
- `/docs/` - Documentation and plans
- `plans/` - Future plans and proposals
- `plans/completed/` - Completed plans (moved here when done)
- `/playbooks/` - Ansible playbooks for fleet management
- `/.sops.yaml` - SOPS configuration with age keys for all servers
@@ -92,7 +157,9 @@ All hosts automatically get:
- Internal ACME CA integration (ca.home.2rjus.net)
- Daily auto-upgrades with auto-reboot
- Prometheus node-exporter + Promtail (logs to monitoring01)
- Monitoring scrape target auto-registration via `homelab.monitoring` options
- Custom root CA trust
- DNS zone auto-registration via `homelab.dns` options
### Active Hosts
@@ -118,7 +185,6 @@ Template/test hosts:
- `nixpkgs-unstable` - Unstable channel (available via overlay as `pkgs.unstable.<package>`)
- `sops-nix` - Secrets management
- Custom packages from git.t-juice.club:
- `backup-helper` - Backup automation module
- `alerttonotify` - Alert routing
- `labmon` - Lab monitoring
@@ -203,19 +269,50 @@ Example VM deployment includes:
OpenTofu outputs the VM's IP address after deployment for easy SSH access.
#### Template Rebuilding and Terraform State
When the Proxmox template is rebuilt (via `build-and-deploy-template.yml`), the template name may change. This would normally cause Terraform to want to recreate all existing VMs, but that's unnecessary since VMs are independent once cloned.
**Solution**: The `terraform/vms.tf` file includes a lifecycle rule to ignore certain attributes that don't need management:
```hcl
lifecycle {
ignore_changes = [
clone, # Template name can change without recreating VMs
startup_shutdown, # Proxmox sets defaults (-1) that we don't need to manage
]
}
```
This means:
- **clone**: Existing VMs are not affected by template name changes; only new VMs use the updated template
- **startup_shutdown**: Proxmox sets default startup order/delay values (-1) that Terraform would otherwise try to remove
- You can safely update `default_template_name` in `terraform/variables.tf` without recreating VMs
- `tofu plan` won't show spurious changes for Proxmox-managed defaults
**When rebuilding the template:**
1. Run `nix develop -c ansible-playbook -i playbooks/inventory.ini playbooks/build-and-deploy-template.yml`
2. Update `default_template_name` in `terraform/variables.tf` if the name changed
3. Run `tofu plan` - should show no VM recreations (only template name in state)
4. Run `tofu apply` - updates state without touching existing VMs
5. New VMs created after this point will use the new template
### Adding a New Host
1. Create `/hosts/<hostname>/` directory
2. Copy structure from `template1` or similar host
3. Add host entry to `flake.nix` nixosConfigurations
4. Add hostname to dns zone files. Merge to master. Run auto-upgrade on dns servers.
5. User clones template host
6. User runs `prepare-host.sh` on new host, this deletes files which should be regenerated, like ssh host keys, machine-id etc. It also creates a new age key, and prints the public key
7. This key is then added to `.sops.yaml`
8. Create `/secrets/<hostname>/` if needed
9. Configure networking (static IP, DNS servers)
4. Configure networking in `configuration.nix` (static IP via `systemd.network.networks`, DNS servers)
5. (Optional) Add `homelab.dns.cnames` if the host needs CNAME aliases
6. User clones template host
7. User runs `prepare-host.sh` on new host, this deletes files which should be regenerated, like ssh host keys, machine-id etc. It also creates a new age key, and prints the public key
8. This key is then added to `.sops.yaml`
9. Create `/secrets/<hostname>/` if needed
10. Commit changes, and merge to master.
11. Deploy by running `nixos-rebuild boot --flake URL#<hostname>` on the host.
12. Run auto-upgrade on DNS servers (ns1, ns2) to pick up the new host's DNS entry
**Note:** DNS A records and Prometheus node-exporter scrape targets are auto-generated from the host's `systemd.network.networks` static IP configuration. No manual zone file or Prometheus config editing is required.
### Important Patterns
@@ -238,9 +335,45 @@ All hosts ship metrics and logs to `monitoring01`:
- **Tracing**: Tempo for distributed tracing
- **Profiling**: Pyroscope for continuous profiling
**Scrape Target Auto-Generation:**
Prometheus scrape targets are automatically generated from host configurations, following the same pattern as DNS zone generation:
- **Node-exporter**: All flake hosts with static IPs are automatically added as node-exporter targets
- **Service targets**: Defined via `homelab.monitoring.scrapeTargets` in service modules
- **External targets**: Non-flake hosts defined in `/services/monitoring/external-targets.nix`
- **Library**: `lib/monitoring.nix` provides `generateNodeExporterTargets` and `generateScrapeConfigs`
Host monitoring options (`homelab.monitoring.*`):
- `enable` (default: `true`) - Include host in Prometheus node-exporter scrape targets
- `scrapeTargets` (default: `[]`) - Additional scrape targets exposed by this host (job_name, port, metrics_path, scheme, scrape_interval, honor_labels)
Service modules declare their scrape targets directly (e.g., `services/ca/default.nix` declares step-ca on port 9000). The Prometheus config on monitoring01 auto-generates scrape configs from all hosts.
To add monitoring targets for non-NixOS hosts, edit `/services/monitoring/external-targets.nix`.
### DNS Architecture
- `ns1` (10.69.13.5) - Primary authoritative DNS + resolver
- `ns2` (10.69.13.6) - Secondary authoritative DNS (AXFR from ns1)
- Zone files managed in `/services/ns/`
- All hosts point to ns1/ns2 for DNS resolution
**Zone Auto-Generation:**
DNS zone entries are automatically generated from host configurations:
- **Flake-managed hosts**: A records extracted from `systemd.network.networks` static IPs
- **CNAMEs**: Defined via `homelab.dns.cnames` option in host configs
- **External hosts**: Non-flake hosts defined in `/services/ns/external-hosts.nix`
- **Serial number**: Uses `self.sourceInfo.lastModified` (git commit timestamp)
Host DNS options (`homelab.dns.*`):
- `enable` (default: `true`) - Include host in DNS zone generation
- `cnames` (default: `[]`) - List of CNAME aliases pointing to this host
Hosts are automatically excluded from DNS if:
- `homelab.dns.enable = false` (e.g., template hosts)
- No static IP configured (e.g., DHCP-only hosts)
- Network interface is a VPN/tunnel (wg*, tun*, tap*)
To add DNS entries for non-NixOS hosts, edit `/services/ns/external-hosts.nix`.

129
README.md
View File

@@ -1,11 +1,128 @@
# nixos-servers
Nixos configs for my homelab servers.
NixOS Flake-based configuration repository for a homelab infrastructure. All hosts run NixOS 25.11 and are managed declaratively through this single repository.
## Configurations in use
## Hosts
* ha1
* ns1
* ns2
* template1
| Host | Role |
|------|------|
| `ns1`, `ns2` | Primary/secondary authoritative DNS |
| `ns3`, `ns4` | Additional DNS servers |
| `ca` | Internal Certificate Authority |
| `ha1` | Home Assistant + Zigbee2MQTT + Mosquitto |
| `http-proxy` | Reverse proxy |
| `monitoring01` | Prometheus, Grafana, Loki, Tempo, Pyroscope |
| `jelly01` | Jellyfin media server |
| `nix-cache01` | Nix binary cache |
| `pgdb1` | PostgreSQL |
| `nats1` | NATS messaging |
| `auth01` | Authentication (LLDAP + Authelia) |
| `vault01` | OpenBao (Vault) secrets management |
| `media1` | Media services |
| `template1`, `template2` | VM templates for cloning new hosts |
## Directory Structure
```
flake.nix # Flake entry point, defines all host configurations
hosts/<hostname>/ # Per-host configuration
system/ # Shared modules applied to ALL hosts
services/ # Reusable service modules, selectively imported per host
modules/ # Custom NixOS module definitions
lib/ # Nix library functions (DNS zone generation, etc.)
secrets/ # SOPS-encrypted secrets (age encryption)
common/ # Shared configurations (e.g., VM guest agent)
terraform/ # OpenTofu configs for Proxmox VM provisioning
terraform/vault/ # OpenTofu configs for OpenBao (secrets, PKI, AppRoles)
playbooks/ # Ansible playbooks for template building and fleet ops
scripts/ # Helper scripts (create-host, vault-fetch)
```
## Key Features
**Automatic DNS zone generation** - A records are derived from each host's static IP configuration. CNAME aliases are defined via `homelab.dns.cnames`. No manual zone file editing required.
**SOPS secrets management** - Each host has a unique age key. Shared secrets live in `secrets/secrets.yaml`, per-host secrets in `secrets/<hostname>/`.
**Daily auto-upgrades** - All hosts pull from the master branch and automatically rebuild and reboot on a randomized schedule.
**Shared base configuration** - Every host automatically gets SSH, monitoring (node-exporter + Promtail), internal ACME certificates, and Nix binary cache access via the `system/` modules.
**Proxmox VM provisioning** - Build VM templates with Ansible and deploy VMs with OpenTofu from `terraform/`.
**OpenBao (Vault) secrets** - Centralized secrets management with AppRole authentication, PKI infrastructure, and automated bootstrap. Managed as code in `terraform/vault/`.
## Usage
```bash
# Enter dev shell (provides ansible, opentofu, openbao, create-host)
nix develop
# Build a host configuration locally
nix build .#nixosConfigurations.<hostname>.config.system.build.toplevel
# List all configurations
nix flake show
```
Deployments are done by merging to master and triggering the auto-upgrade on the target host.
## Provisioning New Hosts
The repository includes an automated pipeline for creating and deploying new hosts on Proxmox.
### 1. Generate host configuration
The `create-host` tool (available in the dev shell) generates all required files for a new host:
```bash
create-host \
--hostname myhost \
--ip 10.69.13.50/24 \
--cpu 4 \
--memory 4096 \
--disk 50G
```
This creates:
- `hosts/<hostname>/` - NixOS configuration (networking, imports, hardware)
- Entry in `flake.nix`
- VM definition in `terraform/vms.tf`
- Vault AppRole policy and wrapped bootstrap token
Omit `--ip` for DHCP. Use `--dry-run` to preview changes. Use `--force` to regenerate an existing host's config.
### 2. Build and deploy the VM template
The Proxmox VM template is built from `hosts/template2` and deployed with Ansible:
```bash
nix develop -c ansible-playbook -i playbooks/inventory.ini playbooks/build-and-deploy-template.yml
```
This only needs to be re-run when the base template changes.
### 3. Deploy the VM
```bash
cd terraform && tofu apply
```
### 4. Automatic bootstrap
On first boot, the VM automatically:
1. Receives its hostname and Vault credentials via cloud-init
2. Unwraps the Vault token and stores AppRole credentials
3. Runs `nixos-rebuild boot` against the flake on the master branch
4. Reboots into the host-specific configuration
5. Services fetch their secrets from Vault at startup
No manual intervention is required after `tofu apply`.
## Network
- Domain: `home.2rjus.net`
- Infrastructure subnet: `10.69.13.0/24`
- DNS: ns1/ns2 authoritative with primary-secondary AXFR
- Internal CA for TLS certificates (migrating from step-ca to OpenBao PKI)
- Centralized monitoring at monitoring01

478
TODO.md
View File

@@ -153,7 +153,9 @@ create-host \
---
### Phase 4: Secrets Management with HashiCorp Vault
### Phase 4: Secrets Management with OpenBao (Vault)
**Status:** 🚧 Phases 4a, 4b, 4c (partial), & 4d Complete
**Challenge:** Current sops-nix approach has chicken-and-egg problem with age keys
@@ -164,243 +166,361 @@ create-host \
4. User commits, pushes
5. VM can now decrypt secrets
**Selected approach:** Migrate to HashiCorp Vault for centralized secrets management
**Selected approach:** Migrate to OpenBao (Vault fork) for centralized secrets management
**Why OpenBao instead of HashiCorp Vault:**
- HashiCorp Vault switched to BSL (Business Source License), unavailable in NixOS cache
- OpenBao is the community fork maintaining the pre-BSL MPL 2.0 license
- API-compatible with Vault, uses same Terraform provider
- Maintains all Vault features we need
**Benefits:**
- Industry-standard secrets management (Vault experience transferable to work)
- Industry-standard secrets management (Vault-compatible experience)
- Eliminates manual age key distribution step
- Secrets-as-code via OpenTofu (infrastructure-as-code aligned)
- Centralized PKI management (replaces step-ca, consolidates TLS + SSH CA)
- Centralized PKI management with ACME support (ready to replace step-ca)
- Automatic secret rotation capabilities
- Audit logging for all secret access
- Audit logging for all secret access (not yet enabled)
- AppRole authentication enables automated bootstrap
**Architecture:**
**Current Architecture:**
```
vault.home.2rjus.net
├─ KV Secrets Engine (replaces sops-nix)
├─ PKI Engine (replaces step-ca for TLS)
├─ SSH CA Engine (replaces step-ca SSH CA)
└─ AppRole Auth (per-host authentication)
vault01.home.2rjus.net (10.69.13.19)
├─ KV Secrets Engine (ready to replace sops-nix)
│ ├─ secret/hosts/{hostname}/*
│ ├─ secret/services/{service}/*
└─ secret/shared/{category}/*
├─ PKI Engine (ready to replace step-ca for TLS)
│ ├─ Root CA (EC P-384, 10 year)
│ ├─ Intermediate CA (EC P-384, 5 year)
│ └─ ACME endpoint enabled
├─ SSH CA Engine (TODO: Phase 4c)
└─ AppRole Auth (per-host authentication configured)
New hosts authenticate on first boot
Fetch secrets via Vault API
[✅ Phase 4d] New hosts authenticate on first boot
[✅ Phase 4d] Fetch secrets via Vault API
No manual key distribution needed
```
**Completed:**
- ✅ Phase 4a: OpenBao server with TPM2 auto-unseal
- ✅ Phase 4b: Infrastructure-as-code (secrets, policies, AppRoles, PKI)
- ✅ Phase 4d: Bootstrap integration for automated secrets access
**Next Steps:**
- Phase 4c: Migrate from step-ca to OpenBao PKI
---
#### Phase 4a: Vault Server Setup
#### Phase 4a: Vault Server Setup ✅ COMPLETED
**Status:** ✅ Fully implemented and tested
**Completed:** 2026-02-02
**Goal:** Deploy and configure Vault server with auto-unseal
**Tasks:**
- [ ] Create `hosts/vault01/` configuration
- [ ] Basic NixOS configuration (hostname, networking, etc.)
- [ ] Vault service configuration
- [ ] Firewall rules (8200 for API, 8201 for cluster)
- [ ] Add to flake.nix and terraform
- [ ] Implement auto-unseal mechanism
- [ ] **Preferred:** TPM-based auto-unseal if hardware supports it
- [ ] Use tpm2-tools to seal/unseal Vault keys
- [ ] Systemd service to unseal on boot
- [ ] **Fallback:** Shamir secret sharing with systemd automation
- [ ] Generate 3 keys, threshold 2
- [ ] Store 2 keys on disk (encrypted), keep 1 offline
- [ ] Systemd service auto-unseals using 2 keys
- [ ] Initial Vault setup
- [ ] Initialize Vault
- [ ] Configure storage backend (integrated raft or file)
- [ ] Set up root token management
- [ ] Enable audit logging
- [ ] Deploy to infrastructure
- [ ] Add DNS entry for vault.home.2rjus.net
- [ ] Deploy VM via terraform
- [ ] Bootstrap and verify Vault is running
**Implementation:**
- Used **OpenBao** (Vault fork) instead of HashiCorp Vault due to BSL licensing concerns
- TPM2-based auto-unseal using systemd's native `LoadCredentialEncrypted`
- Self-signed bootstrap TLS certificates (avoiding circular dependency with step-ca)
- File-based storage backend at `/var/lib/openbao`
- Unix socket + TCP listener (0.0.0.0:8200) configuration
**Deliverable:** Running Vault server that auto-unseals on boot
**Tasks:**
- [x] Create `hosts/vault01/` configuration
- [x] Basic NixOS configuration (hostname: vault01, IP: 10.69.13.19/24)
- [x] Created reusable `services/vault` module
- [x] Firewall not needed (trusted network)
- [x] Already in flake.nix, deployed via terraform
- [x] Implement auto-unseal mechanism
- [x] **TPM2-based auto-unseal** (preferred option)
- [x] systemd `LoadCredentialEncrypted` with TPM2 binding
- [x] `writeShellApplication` script with proper runtime dependencies
- [x] Reads multiple unseal keys (one per line) until unsealed
- [x] Auto-unseals on service start via `ExecStartPost`
- [x] Initial Vault setup
- [x] Initialized OpenBao with Shamir secret sharing (5 keys, threshold 3)
- [x] File storage backend
- [x] Self-signed TLS certificates via LoadCredential
- [x] Deploy to infrastructure
- [x] DNS entry added for vault01.home.2rjus.net
- [x] VM deployed via terraform
- [x] Verified OpenBao running and auto-unsealing
**Changes from Original Plan:**
- Used OpenBao instead of HashiCorp Vault (licensing)
- Used systemd's native TPM2 support instead of tpm2-tools directly
- Skipped audit logging (can be enabled later)
- Used self-signed certs initially (will migrate to OpenBao PKI later)
**Deliverable:** ✅ Running OpenBao server that auto-unseals on boot using TPM2
**Documentation:**
- `/services/vault/README.md` - Service module overview
- `/docs/vault/auto-unseal.md` - Complete TPM2 auto-unseal setup guide
---
#### Phase 4b: Vault-as-Code with OpenTofu
#### Phase 4b: Vault-as-Code with OpenTofu ✅ COMPLETED
**Status:** ✅ Fully implemented and tested
**Completed:** 2026-02-02
**Goal:** Manage all Vault configuration (secrets structure, policies, roles) as code
**Implementation:**
- Complete Terraform/OpenTofu configuration in `terraform/vault/`
- Locals-based pattern (similar to `vms.tf`) for declaring secrets and policies
- Auto-generation of secrets using `random_password` provider
- Three-tier secrets path hierarchy: `hosts/`, `services/`, `shared/`
- PKI infrastructure with **Elliptic Curve certificates** (P-384 for CAs, P-256 for leaf certs)
- ACME support enabled on intermediate CA
**Tasks:**
- [ ] Set up Vault Terraform provider
- [ ] Create `terraform/vault/` directory
- [ ] Configure Vault provider (address, auth)
- [ ] Store Vault token securely (terraform.tfvars, gitignored)
- [ ] Enable and configure secrets engines
- [ ] Enable KV v2 secrets engine at `secret/`
- [ ] Define secret path structure (per-service, per-host)
- [ ] Example: `secret/monitoring/grafana`, `secret/postgres/ha1`
- [ ] Define policies as code
- [ ] Create policies for different service tiers
- [ ] Principle of least privilege (hosts only read their secrets)
- [ ] Example: monitoring-policy allows read on `secret/monitoring/*`
- [ ] Set up AppRole authentication
- [ ] Enable AppRole auth backend
- [ ] Create role per host type (monitoring, dns, database, etc.)
- [ ] Bind policies to roles
- [ ] Configure TTL and token policies
- [ ] Migrate existing secrets from sops-nix
- [ ] Create migration script/playbook
- [ ] Decrypt sops secrets and load into Vault KV
- [ ] Verify all secrets migrated successfully
- [ ] Keep sops as backup during transition
- [ ] Implement secrets-as-code patterns
- [ ] Secret values in gitignored terraform.tfvars
- [ ] Or use random_password for auto-generated secrets
- [ ] Secret structure/paths in version-controlled .tf files
- [x] Set up Vault Terraform provider
- [x] Created `terraform/vault/` directory
- [x] Configured Vault provider (uses HashiCorp provider, compatible with OpenBao)
- [x] Credentials in terraform.tfvars (gitignored)
- [x] terraform.tfvars.example for reference
- [x] Enable and configure secrets engines
- [x] KV v2 engine at `secret/`
- [x] Three-tier path structure:
- `secret/hosts/{hostname}/*` - Host-specific secrets
- `secret/services/{service}/*` - Service-wide secrets
- `secret/shared/{category}/*` - Shared secrets (SMTP, backups, etc.)
- [x] Define policies as code
- [x] Policies auto-generated from `locals.host_policies`
- [x] Per-host policies with read/list on designated paths
- [x] Principle of least privilege enforced
- [x] Set up AppRole authentication
- [x] AppRole backend enabled at `approle/`
- [x] Roles auto-generated per host from `locals.host_policies`
- [x] Token TTL: 1 hour, max 24 hours
- [x] Policies bound to roles
- [x] Implement secrets-as-code patterns
- [x] Auto-generated secrets using `random_password` provider
- [x] Manual secrets supported via variables in terraform.tfvars
- [x] Secret structure versioned in .tf files
- [x] Secret values excluded from git
- [x] Set up PKI infrastructure
- [x] Root CA (10 year TTL, EC P-384)
- [x] Intermediate CA (5 year TTL, EC P-384)
- [x] PKI role for `*.home.2rjus.net` (30 day max TTL, EC P-256)
- [x] ACME enabled on intermediate CA
- [x] Support for static certificate issuance via Terraform
- [x] CRL, OCSP, and issuing certificate URLs configured
**Example OpenTofu:**
```hcl
resource "vault_kv_secret_v2" "monitoring_grafana" {
mount = "secret"
name = "monitoring/grafana"
data_json = jsonencode({
admin_password = var.grafana_admin_password
smtp_password = var.smtp_password
})
}
**Changes from Original Plan:**
- Used Elliptic Curve instead of RSA for all certificates (better performance, smaller keys)
- Implemented PKI infrastructure in Phase 4b instead of Phase 4c (more logical grouping)
- ACME support configured immediately (ready for migration from step-ca)
- Did not migrate existing sops-nix secrets yet (deferred to gradual migration)
resource "vault_policy" "monitoring" {
name = "monitoring-policy"
policy = <<EOT
path "secret/data/monitoring/*" {
capabilities = ["read"]
}
EOT
}
**Files:**
- `terraform/vault/main.tf` - Provider configuration
- `terraform/vault/variables.tf` - Variable definitions
- `terraform/vault/approle.tf` - AppRole authentication (locals-based pattern)
- `terraform/vault/pki.tf` - PKI infrastructure with EC certificates
- `terraform/vault/secrets.tf` - KV secrets engine (auto-generation support)
- `terraform/vault/README.md` - Complete documentation and usage examples
- `terraform/vault/terraform.tfvars.example` - Example credentials
resource "vault_approle_auth_backend_role" "monitoring01" {
backend = "approle"
role_name = "monitoring01"
token_policies = ["monitoring-policy"]
}
```
**Deliverable:** ✅ All secrets, policies, AppRoles, and PKI managed as OpenTofu code in `terraform/vault/`
**Deliverable:** All secrets and policies managed as OpenTofu code in `terraform/vault/`
**Documentation:**
- `/terraform/vault/README.md` - Comprehensive guide covering:
- Setup and deployment
- AppRole usage and host access patterns
- PKI certificate issuance (ACME, static, manual)
- Secrets management patterns
- ACME configuration and troubleshooting
---
#### Phase 4c: PKI Migration (Replace step-ca)
**Goal:** Consolidate PKI infrastructure into Vault
**Status:** 🚧 Partially Complete - vault01 and test host migrated, remaining hosts pending
**Goal:** Migrate hosts from step-ca to OpenBao PKI for TLS certificates
**Note:** PKI infrastructure already set up in Phase 4b (root CA, intermediate CA, ACME support)
**Tasks:**
- [ ] Set up Vault PKI engines
- [ ] Create root CA in Vault (`pki/` mount, 10 year TTL)
- [ ] Create intermediate CA (`pki_int/` mount, 5 year TTL)
- [ ] Sign intermediate with root CA
- [ ] Configure CRL and OCSP
- [ ] Enable ACME support
- [ ] Enable ACME on intermediate CA (Vault 1.14+)
- [ ] Create PKI role for homelab domain
- [ ] Set certificate TTLs and allowed domains
- [ ] Configure SSH CA in Vault
- [x] Set up OpenBao PKI engines (completed in Phase 4b)
- [x] Root CA (`pki/` mount, 10 year TTL, EC P-384)
- [x] Intermediate CA (`pki_int/` mount, 5 year TTL, EC P-384)
- [x] Signed intermediate with root CA
- [x] Configured CRL, OCSP, and issuing certificate URLs
- [x] Enable ACME support (completed in Phase 4b, fixed in Phase 4c)
- [x] Enabled ACME on intermediate CA
- [x] Created PKI role for `*.home.2rjus.net`
- [x] Set certificate TTLs (30 day max) and allowed domains
- [x] ACME directory: `https://vault01.home.2rjus.net:8200/v1/pki_int/acme/directory`
- [x] Fixed ACME response headers (added Replay-Nonce, Link, Location to allowed_response_headers)
- [x] Configured cluster path for ACME
- [x] Download and distribute root CA certificate
- [x] Added root CA to `system/pki/root-ca.nix`
- [x] Distributed to all hosts via system imports
- [x] Test certificate issuance
- [x] Tested ACME issuance on vaulttest01 successfully
- [x] Verified certificate chain and trust
- [x] Migrate vault01's own certificate
- [x] Created `bootstrap-vault-cert` script for initial certificate issuance via bao CLI
- [x] Issued certificate with SANs (vault01.home.2rjus.net + vault.home.2rjus.net)
- [x] Updated service to read certificates from `/var/lib/acme/vault01.home.2rjus.net/`
- [x] Configured ACME for automatic renewals
- [ ] Migrate hosts from step-ca to OpenBao
- [x] Tested on vaulttest01 (non-production host)
- [ ] Standardize hostname usage across all configurations
- [ ] Use `vault.home.2rjus.net` (CNAME) consistently everywhere
- [ ] Update NixOS configurations to use CNAME instead of vault01
- [ ] Update Terraform configurations to use CNAME
- [ ] Audit and fix mixed usage of vault01.home.2rjus.net vs vault.home.2rjus.net
- [ ] Update `system/acme.nix` to use OpenBao ACME endpoint
- [ ] Change server to `https://vault.home.2rjus.net:8200/v1/pki_int/acme/directory`
- [ ] Roll out to all hosts via auto-upgrade
- [ ] Configure SSH CA in OpenBao (optional, future work)
- [ ] Enable SSH secrets engine (`ssh/` mount)
- [ ] Generate SSH signing keys
- [ ] Create roles for host and user certificates
- [ ] Configure TTLs and allowed principals
- [ ] Migrate hosts from step-ca to Vault
- [ ] Update system/acme.nix to use Vault ACME endpoint
- [ ] Change server to `https://vault.home.2rjus.net:8200/v1/pki_int/acme/directory`
- [ ] Test certificate issuance on one host
- [ ] Roll out to all hosts via auto-upgrade
- [ ] Migrate SSH CA trust
- [ ] Distribute Vault SSH CA public key to all hosts
- [ ] Update sshd_config to trust Vault CA
- [ ] Test SSH certificate authentication
- [ ] Distribute SSH CA public key to all hosts
- [ ] Update sshd_config to trust OpenBao CA
- [ ] Decommission step-ca
- [ ] Verify all services migrated
- [ ] Verify all ACME services migrated and working
- [ ] Stop step-ca service on ca host
- [ ] Archive step-ca configuration for backup
- [ ] Update documentation
**Deliverable:** All TLS and SSH certificates issued by Vault, step-ca retired
**Implementation Details (2026-02-03):**
**ACME Configuration Fix:**
The key blocker was that OpenBao's PKI mount was filtering out required ACME response headers. The solution was to add `allowed_response_headers` to the Terraform mount configuration:
```hcl
allowed_response_headers = [
"Replay-Nonce", # Required for ACME nonce generation
"Link", # Required for ACME navigation
"Location" # Required for ACME resource location
]
```
**Cluster Path Configuration:**
ACME requires the cluster path to include the full API path:
```hcl
path = "${var.vault_address}/v1/${vault_mount.pki_int.path}"
aia_path = "${var.vault_address}/v1/${vault_mount.pki_int.path}"
```
**Bootstrap Process:**
Since vault01 needed a certificate from its own PKI (chicken-and-egg problem), we created a `bootstrap-vault-cert` script that:
1. Uses the Unix socket (no TLS) to issue a certificate via `bao` CLI
2. Places it in the ACME directory structure
3. Includes both vault01.home.2rjus.net and vault.home.2rjus.net as SANs
4. After restart, ACME manages renewals automatically
**Files Modified:**
- `terraform/vault/pki.tf` - Added allowed_response_headers, cluster config, ACME config
- `services/vault/default.nix` - Updated cert paths, added bootstrap script, configured ACME
- `system/pki/root-ca.nix` - Added OpenBao root CA to trust store
- `hosts/vaulttest01/configuration.nix` - Overrode ACME server for testing
**Deliverable:** ✅ vault01 and vaulttest01 using OpenBao PKI, remaining hosts still on step-ca
---
#### Phase 4d: Bootstrap Integration
#### Phase 4d: Bootstrap Integration ✅ COMPLETED (2026-02-02)
**Goal:** New hosts automatically authenticate to Vault on first boot, no manual steps
**Tasks:**
- [ ] Update create-host tool
- [ ] Generate AppRole role_id + secret_id for new host
- [ ] Or create wrapped token for one-time bootstrap
- [ ] Add host-specific policy to Vault (via terraform)
- [ ] Store bootstrap credentials for cloud-init injection
- [ ] Update template2 for Vault authentication
- [ ] Create Vault authentication module
- [ ] Reads bootstrap credentials from cloud-init
- [ ] Authenticates to Vault, retrieves permanent AppRole credentials
- [ ] Stores role_id + secret_id locally for services to use
- [ ] Create NixOS Vault secrets module
- [ ] Replacement for sops.secrets
- [ ] Fetches secrets from Vault at nixos-rebuild/activation time
- [ ] Or runtime secret fetching for services
- [ ] Handle Vault token renewal
- [ ] Update bootstrap service
- [ ] After authenticating to Vault, fetch any bootstrap secrets
- [ ] Run nixos-rebuild with host configuration
- [ ] Services automatically fetch their secrets from Vault
- [ ] Update terraform cloud-init
- [ ] Inject Vault address and bootstrap credentials
- [ ] Pass via cloud-init user-data or write_files
- [ ] Credentials scoped to single use or short TTL
- [ ] Test complete flow
- [ ] Run create-host to generate new host config
- [ ] Deploy with terraform
- [ ] Verify host bootstraps and authenticates to Vault
- [ ] Verify services can fetch secrets
- [ ] Confirm no manual steps required
- [x] Update create-host tool
- [x] Generate wrapped token (24h TTL, single-use) for new host
- [x] Add host-specific policy to Vault (via terraform/vault/hosts-generated.tf)
- [x] Store wrapped token in terraform/vms.tf for cloud-init injection
- [x] Add `--regenerate-token` flag to regenerate only the token without overwriting config
- [x] Update template2 for Vault authentication
- [x] Reads wrapped token from cloud-init (/run/cloud-init-env)
- [x] Unwraps token to get role_id + secret_id
- [x] Stores AppRole credentials in /var/lib/vault/approle/ (persistent)
- [x] Graceful fallback if Vault unavailable during bootstrap
- [x] Create NixOS Vault secrets module (system/vault-secrets.nix)
- [x] Runtime secret fetching (services fetch on start, not at nixos-rebuild time)
- [x] Secrets cached in /var/lib/vault/cache/ for fallback when Vault unreachable
- [x] Secrets written to /run/secrets/ (tmpfs, cleared on reboot)
- [x] Fresh authentication per service start (no token renewal needed)
- [x] Optional periodic rotation with systemd timers
- [x] Critical service protection (no auto-restart for DNS, CA, Vault itself)
- [x] Create vault-fetch helper script
- [x] Standalone tool for fetching secrets from Vault
- [x] Authenticates using AppRole credentials
- [x] Writes individual files per secret key
- [x] Handles caching and fallback logic
- [x] Update bootstrap service (hosts/template2/bootstrap.nix)
- [x] Unwraps Vault token on first boot
- [x] Stores persistent AppRole credentials
- [x] Continues with nixos-rebuild
- [x] Services fetch secrets when they start
- [x] Update terraform cloud-init (terraform/cloud-init.tf)
- [x] Inject VAULT_ADDR and VAULT_WRAPPED_TOKEN via write_files
- [x] Write to /run/cloud-init-env (tmpfs, cleaned on reboot)
- [x] Fixed YAML indentation issues (write_files at top level)
- [x] Support flake_branch alongside vault credentials
- [x] Test complete flow
- [x] Created vaulttest01 test host
- [x] Verified bootstrap with Vault integration
- [x] Verified service secret fetching
- [x] Tested cache fallback when Vault unreachable
- [x] Tested wrapped token single-use (second bootstrap fails as expected)
- [x] Confirmed zero manual steps required
**Bootstrap flow:**
**Implementation Details:**
**Wrapped Token Security:**
- Single-use tokens prevent reuse if leaked
- 24h TTL limits exposure window
- Safe to commit to git (expired/used tokens useless)
- Regenerate with `create-host --hostname X --regenerate-token`
**Secret Fetching:**
- Runtime (not build-time) keeps secrets out of Nix store
- Cache fallback enables service availability when Vault down
- Fresh authentication per service start (no renewal complexity)
- Individual files per secret key for easy consumption
**Bootstrap Flow:**
```
1. terraform apply (deploys VM with cloud-init)
2. Cloud-init sets hostname + Vault bootstrap credentials
1. create-host --hostname myhost --ip 10.69.13.x/24
↓ Generates wrapped token, updates terraform
2. tofu apply (deploys VM with cloud-init)
↓ Cloud-init writes wrapped token to /run/cloud-init-env
3. nixos-bootstrap.service runs:
- Authenticates to Vault with bootstrap credentials
- Retrieves permanent AppRole credentials
- Stores locally for service use
- Runs nixos-rebuild
4. Host services fetch secrets from Vault as needed
5. Done - no manual intervention
↓ Unwraps token → gets role_id + secret_id
↓ Stores in /var/lib/vault/approle/ (persistent)
↓ Runs nixos-rebuild boot
4. Service starts → fetches secrets from Vault
↓ Uses stored AppRole credentials
↓ Caches secrets for fallback
5. Done - zero manual intervention
```
**Deliverable:** Fully automated secrets access from first boot, zero manual steps
**Files Created:**
- `scripts/vault-fetch/` - Secret fetching helper (Nix package)
- `system/vault-secrets.nix` - NixOS module for declarative Vault secrets
- `scripts/create-host/vault_helper.py` - Vault API integration
- `terraform/vault/hosts-generated.tf` - Auto-generated host policies
- `docs/vault-bootstrap-implementation.md` - Architecture documentation
- `docs/vault-bootstrap-testing.md` - Testing guide
---
**Configuration:**
- Vault address: `https://vault01.home.2rjus.net:8200` (configurable)
- All defaults remain configurable via environment variables or NixOS options
### Phase 5: DNS Automation
**Next Steps:**
- Gradually migrate existing services from sops-nix to Vault
- Add CNAME for vault.home.2rjus.net → vault01.home.2rjus.net
- Phase 4c: Migrate from step-ca to OpenBao PKI (future)
**Goal:** Automatically generate DNS entries from host configurations
**Approach:** Leverage Nix to generate zone file entries from flake host configurations
Since most hosts use static IPs defined in their NixOS configurations, we can extract this information and automatically generate A records. This keeps DNS in sync with the actual host configs.
**Tasks:**
- [ ] Add optional CNAME field to host configurations
- [ ] Add `networking.cnames = [ "alias1" "alias2" ]` or similar option
- [ ] Document in host configuration template
- [ ] Create Nix function to extract DNS records from all hosts
- [ ] Parse each host's `networking.hostName` and IP configuration
- [ ] Collect any defined CNAMEs
- [ ] Generate zone file fragment with A and CNAME records
- [ ] Integrate auto-generated records into zone files
- [ ] Keep manual entries separate (for non-flake hosts/services)
- [ ] Include generated fragment in main zone file
- [ ] Add comments showing which records are auto-generated
- [ ] Update zone file serial number automatically
- [ ] Test zone file validity after generation
- [ ] Either:
- [ ] Automatically trigger DNS server reload (Ansible)
- [ ] Or document manual step: merge to master, run upgrade on ns1/ns2
**Deliverable:** DNS A records and CNAMEs automatically generated from host configs
**Deliverable:** ✅ Fully automated secrets access from first boot, zero manual steps
---

282
docs/infrastructure.md Normal file
View File

@@ -0,0 +1,282 @@
# Homelab Infrastructure
This document describes the physical and virtual infrastructure components that support the NixOS-managed servers in this repository.
## Overview
The homelab consists of several core infrastructure components:
- **Proxmox VE** - Hypervisor hosting all NixOS VMs
- **TrueNAS** - Network storage and backup target
- **Ubiquiti EdgeRouter** - Primary router and gateway
- **Mikrotik Switch** - Core network switching
All NixOS configurations in this repository run as VMs on Proxmox and rely on these underlying infrastructure components.
## Network Topology
### Subnets
VLAN numbers are based on third octet of ip address.
TODO: VLAN naming is currently inconsistent across router/switch/Proxmox configurations. Need to standardize VLAN names and update all device configs to use consistent naming.
- `10.69.8.x` - Kubernetes (no longer in use)
- `10.69.12.x` - Core services
- `10.69.13.x` - NixOS VMs and core services
- `10.69.30.x` - Client network 1
- `10.69.31.x` - Clients network 2
- `10.69.99.x` - Management network
### Core Network Services
- **Gateway**: Web UI exposed on 10.69.10.1
- **DNS**: ns1 (10.69.13.5), ns2 (10.69.13.6)
- **Primary DNS Domain**: `home.2rjus.net`
## Hardware Components
### Proxmox Hypervisor
**Purpose**: Hosts all NixOS VMs defined in this repository
**Hardware**:
- CPU: AMD Ryzen 9 3900X 12-Core Processor
- RAM: 96GB (94Gi)
- Storage: 1TB NVMe SSD (nvme0n1)
**Management**:
- Web UI: `https://pve1.home.2rjus.net:8006`
- Cluster: Standalone
- Version: Proxmox VE 8.4.16 (kernel 6.8.12-18-pve)
**VM Provisioning**:
- Template VM: ID 9000 (built from `hosts/template2`)
- See `/terraform` directory for automated VM deployment using OpenTofu
**Storage**:
- ZFS pool: `rpool` on NVMe partition (nvme0n1p3)
- Total capacity: ~900GB (232GB used, 667GB available)
- Configuration: Single disk (no RAID)
- Scrub status: Last scrub completed successfully with 0 errors
**Networking**:
- Management interface: `vmbr0` - 10.69.12.75/24 (VLAN 12 - Core services)
- Physical interface: `enp9s0` (primary), `enp4s0` (unused)
- VM bridges:
- `vmbr0` - Main bridge (bridged to enp9s0)
- `vmbr0v8` - VLAN 8 (Kubernetes - deprecated)
- `vmbr0v13` - VLAN 13 (NixOS VMs and core services)
### TrueNAS
**Purpose**: Network storage, backup target, media storage
**Hardware**:
- Model: Custom build
- CPU: AMD Ryzen 5 5600G with Radeon Graphics
- RAM: 32GB (31.2 GiB)
- Disks:
- 2x Kingston SA400S37 240GB SSD (boot pool, mirrored)
- 2x Seagate ST16000NE000 16TB HDD (hdd-pool mirror-0)
- 2x WD WD80EFBX 8TB HDD (hdd-pool mirror-1)
- 2x Seagate ST8000VN004 8TB HDD (hdd-pool mirror-2)
- 1x NVMe 2TB (nvme-pool, no redundancy)
**Management**:
- Web UI: `https://nas.home.2rjus.net` (10.69.12.50)
- Hostname: `nas.home.2rjus.net`
- Version: TrueNAS-13.0-U6.1 (Core)
**Networking**:
- Primary interface: `mlxen0` - 10GbE (10Gbase-CX4) connected to sw1
- IP: 10.69.12.50/24 (VLAN 12 - Core services)
**ZFS Pools**:
- `boot-pool`: 206GB (mirrored SSDs) - 4% used
- Mirror of 2x Kingston 240GB SSDs
- Last scrub: No errors
- `hdd-pool`: 29.1TB total (3-way mirror, 28.4TB used, 658GB free) - 97% capacity
- mirror-0: 2x 16TB Seagate ST16000NE000
- mirror-1: 2x 8TB WD WD80EFBX
- mirror-2: 2x 8TB Seagate ST8000VN004
- Last scrub: No errors
- `nvme-pool`: 1.81TB (single NVMe, 70.4GB used, 1.74TB free) - 3% capacity
- Single NVMe drive, no redundancy
- Last scrub: No errors
**NFS Exports**:
- `/mnt/hdd-pool/media` - Media storage (exported to 10.69.0.0/16, used by Jellyfin)
- `/mnt/hdd-pool/virt/nfs-iso` - ISO storage for Proxmox
- `/mnt/hdd-pool/virt/kube-prod-pvc` - Kubernetes storage (deprecated)
**Jails**:
TrueNAS runs several FreeBSD jails for media management:
- nzbget - Usenet downloader
- restic-rest - Restic REST server for backups
- radarr - Movie management
- sonarr - TV show management
### Ubiquiti EdgeRouter
**Purpose**: Primary router, gateway, firewall, inter-VLAN routing
**Model**: EdgeRouter X 5-Port
**Hardware**:
- Serial: F09FC20E1A4C
**Management**:
- SSH: `ssh ubnt@10.69.10.1`
- Web UI: `https://10.69.10.1`
- Version: EdgeOS v2.0.9-hotfix.6 (build 5574651, 12/30/22)
**WAN Connection**:
- Interface: eth0
- Public IP: 84.213.73.123/20
- Gateway: 84.213.64.1
**Interface Layout**:
- **eth0**: WAN (public IP)
- **eth1**: 10.69.31.1/24 - Clients network 2
- **eth2**: Unused (down)
- **eth3**: 10.69.30.1/24 - Client network 1
- **eth4**: Trunk port to Mikrotik switch (carries all VLANs)
- eth4.8: 10.69.8.1/24 - K8S (deprecated)
- eth4.10: 10.69.10.1/24 - TRUSTED (management access)
- eth4.12: 10.69.12.1/24 - SERVER (Proxmox, TrueNAS, core services)
- eth4.13: 10.69.13.1/24 - SVC (NixOS VMs)
- eth4.21: 10.69.21.1/24 - CLIENTS
- eth4.22: 10.69.22.1/24 - WLAN (wireless clients)
- eth4.23: 10.69.23.1/24 - IOT
- eth4.99: 10.69.99.1/24 - MGMT (device management)
**Routing**:
- Default route: 0.0.0.0/0 via 84.213.64.1 (WAN gateway)
- Static route: 192.168.100.0/24 via eth0
- All internal VLANs directly connected
**DHCP Servers**:
Active DHCP pools on all networks:
- dhcp-8: VLAN 8 (K8S) - 91 addresses
- dhcp-12: VLAN 12 (SERVER) - 51 addresses
- dhcp-13: VLAN 13 (SVC) - 41 addresses
- dhcp-21: VLAN 21 (CLIENTS) - 141 addresses
- dhcp-22: VLAN 22 (WLAN) - 101 addresses
- dhcp-23: VLAN 23 (IOT) - 191 addresses
- dhcp-30: eth3 (Client network 1) - 101 addresses
- dhcp-31: eth1 (Clients network 2) - 21 addresses
- dhcp-mgmt: VLAN 99 (MGMT) - 51 addresses
**NAT/Firewall**:
- Masquerading on WAN interface (eth0)
### Mikrotik Switch
**Purpose**: Core Layer 2/3 switching
**Model**: MikroTik CRS326-24G-2S+ (24x 1GbE + 2x 10GbE SFP+)
**Hardware**:
- CPU: ARMv7 @ 800MHz
- RAM: 512MB
- Uptime: 21+ weeks
**Management**:
- Hostname: `sw1.home.2rjus.net`
- SSH access: `ssh admin@sw1.home.2rjus.net` (using gunter SSH key)
- Management IP: 10.69.99.2/24 (VLAN 99)
- Version: RouterOS 6.47.10 (long-term)
**VLANs**:
- VLAN 8: Kubernetes (deprecated)
- VLAN 12: SERVERS - Core services subnet
- VLAN 13: SVC - Services subnet
- VLAN 21: CLIENTS
- VLAN 22: WLAN - Wireless network
- VLAN 23: IOT
- VLAN 99: MGMT - Management network
**Port Layout** (active ports):
- **ether1**: Uplink to EdgeRouter (trunk, carries all VLANs)
- **ether11**: virt-mini1 (VLAN 12 - SERVERS)
- **ether12**: Home Assistant (VLAN 12 - SERVERS)
- **ether24**: Wireless AP (VLAN 22 - WLAN)
- **sfp-sfpplus1**: Media server/Jellyfin (VLAN 12) - 10Gbps, 7m copper DAC
- **sfp-sfpplus2**: TrueNAS (VLAN 12) - 10Gbps, 1m copper DAC
**Bridge Configuration**:
- All ports bridged to main bridge interface
- Hardware offloading enabled
- VLAN filtering enabled on bridge
## Backup & Disaster Recovery
### Backup Strategy
**NixOS VMs**:
- Declarative configurations in this git repository
- Secrets: SOPS-encrypted, backed up with repository
- State/data: Some hosts are backed up to nas host, but this should be improved and expanded to more hosts.
**Proxmox**:
- VM backups: Not currently implemented
**Critical Credentials**:
TODO: Document this
- OpenBao root token and unseal keys: _[offline secure storage location]_
- Proxmox root password: _[secure storage]_
- TrueNAS admin password: _[secure storage]_
- Router admin credentials: _[secure storage]_
### Disaster Recovery Procedures
**Total Infrastructure Loss**:
1. Restore Proxmox from installation media
2. Restore TrueNAS from installation media, import ZFS pools
3. Restore network configuration on EdgeRouter and Mikrotik
4. Rebuild NixOS VMs from this repository using Proxmox template
5. Restore stateful data from TrueNAS backups
6. Re-initialize OpenBao and restore from backup if needed
**Individual VM Loss**:
1. Deploy new VM from template using OpenTofu (`terraform/`)
2. Run `nixos-rebuild` with appropriate flake configuration
3. Restore any stateful data from backups
4. For vault01: follow re-provisioning steps in `docs/vault/auto-unseal.md`
**Network Device Failure**:
- EdgeRouter: _[config backup location, restoration procedure]_
- Mikrotik: _[config backup location, restoration procedure]_
## Future Additions
- Additional Proxmox nodes for clustering
- Backup Proxmox Backup Server
- Additional TrueNAS for replication
## Maintenance Notes
### Proxmox Updates
- Update schedule: manual
- Pre-update checklist: yolo
### TrueNAS Updates
- Update schedule: manual
### Network Device Updates
- EdgeRouter: manual
- Mikrotik: manual
## Monitoring
**Infrastructure Monitoring**:
TODO: Improve monitoring for physical hosts (proxmox, nas)
TODO: Improve monitoring for networking equipment
All NixOS VMs ship metrics to monitoring01 via node-exporter and logs via Promtail. See `/services/monitoring/` for the observability stack configuration.

View File

@@ -0,0 +1,61 @@
# DNS Automation
**Status:** Completed (2026-02-04)
**Goal:** Automatically generate DNS entries from host configurations
**Approach:** Leverage Nix to generate zone file entries from flake host configurations
Since most hosts use static IPs defined in their NixOS configurations, we can extract this information and automatically generate A records. This keeps DNS in sync with the actual host configs.
## Implementation
- [x] Add optional CNAME field to host configurations
- [x] Added `homelab.dns.cnames` option in `modules/homelab/dns.nix`
- [x] Added `homelab.dns.enable` to allow opting out (defaults to true)
- [x] Documented in CLAUDE.md
- [x] Create Nix function to extract DNS records from all hosts
- [x] Created `lib/dns-zone.nix` with extraction functions
- [x] Parses each host's `networking.hostName` and `systemd.network.networks` IP configuration
- [x] Collects CNAMEs from `homelab.dns.cnames`
- [x] Filters out VPN interfaces (wg*, tun*, tap*, vti*)
- [x] Generates complete zone file with A and CNAME records
- [x] Integrate auto-generated records into zone files
- [x] External hosts separated to `services/ns/external-hosts.nix`
- [x] Zone includes comments showing which records are auto-generated vs external
- [x] Update zone file serial number automatically
- [x] Uses `self.sourceInfo.lastModified` (git commit timestamp)
- [x] Test zone file validity after generation
- [x] NSD validates zone at build time via `nsd-checkzone`
- [x] Deploy process documented
- [x] Merge to master, run auto-upgrade on ns1/ns2
## Files Created/Modified
| File | Purpose |
|------|---------|
| `modules/homelab/dns.nix` | Defines `homelab.dns.*` options |
| `modules/homelab/default.nix` | Module import hub |
| `lib/dns-zone.nix` | Zone generation functions |
| `services/ns/external-hosts.nix` | Non-flake host records |
| `services/ns/master-authorative.nix` | Uses generated zone |
| `services/ns/secondary-authorative.nix` | Uses generated zone |
## Usage
View generated zone:
```bash
nix eval .#nixosConfigurations.ns1.config.services.nsd.zones.'"home.2rjus.net"'.data --raw
```
Add CNAMEs to a host:
```nix
homelab.dns.cnames = [ "alias1" "alias2" ];
```
Exclude a host from DNS:
```nix
homelab.dns.enable = false;
```
Add non-flake hosts: Edit `services/ns/external-hosts.nix`

View File

@@ -0,0 +1,27 @@
# NixOS Infrastructure Improvements
This document contains planned improvements to the NixOS infrastructure that are not directly part of the automated deployment pipeline.
## Planned
### Custom NixOS Options for Service and System Configuration
Currently, most service configurations in `services/` and shared system configurations in `system/` are written as plain NixOS module imports without declaring custom options. This means host-specific customization is done by directly setting upstream NixOS options or by duplicating configuration across hosts.
The `homelab.dns` module (`modules/homelab/dns.nix`) is the first example of defining custom options under a `homelab.*` namespace. This pattern should be extended to more of the repository's configuration.
**Goals:**
- Define `homelab.*` options for services and shared configuration where it makes sense, following the pattern established by `homelab.dns`
- Allow hosts to enable/configure services declaratively (e.g. `homelab.monitoring.enable`, `homelab.http-proxy.virtualHosts`) rather than importing opaque module files
- Keep options simple and focused — wrap only the parts that vary between hosts or that benefit from a clearer interface. Not everything needs a custom option.
**Candidate areas:**
- `system/` modules (e.g. auto-upgrade schedule, ACME CA URL, monitoring endpoints)
- `services/` modules where multiple hosts use the same service with different parameters
- Cross-cutting concerns that are currently implicit (e.g. which Loki endpoint promtail ships to)
## Completed
- [DNS Automation](completed/dns-automation.md) - Automatically generate DNS entries from host configurations

View File

@@ -0,0 +1,151 @@
# TrueNAS Migration Planning
## Current State
### Hardware
- CPU: AMD Ryzen 5 5600G with Radeon Graphics
- RAM: 32GB
- Network: 10GbE (mlxen0)
- Software: TrueNAS-13.0-U6.1 (Core)
### Storage Status
**hdd-pool**: 29.1TB total, **28.4TB used, 658GB free (97% capacity)** ⚠️
- mirror-0: 2x Seagate ST16000NE000 16TB HDD (16TB usable)
- mirror-1: 2x WD WD80EFBX 8TB HDD (8TB usable)
- mirror-2: 2x Seagate ST8000VN004 8TB HDD (8TB usable)
## Goal
Expand storage capacity for the main hdd-pool. Since we need to add disks anyway, also evaluating whether to upgrade or replace the entire system.
## Decisions
### Migration Approach: Option 3 - Migrate to NixOS
**Decision**: Replace TrueNAS with NixOS bare metal installation
**Rationale**:
- Aligns with existing infrastructure (16+ NixOS hosts already managed in this repo)
- Declarative configuration fits homelab philosophy
- Automatic monitoring/logging integration (Prometheus + Promtail)
- Auto-upgrades via same mechanism as other hosts
- SOPS secrets management integration
- TrueNAS-specific features (WebGUI, jails) not heavily utilized
**Service migration**:
- radarr/sonarr: Native NixOS services (`services.radarr`, `services.sonarr`)
- restic-rest: `services.restic.server`
- nzbget: NixOS service or OCI container
- NFS exports: `services.nfs.server`
### Filesystem: BTRFS RAID1
**Decision**: Migrate from ZFS to BTRFS with RAID1
**Rationale**:
- **In-kernel**: No out-of-tree module issues like ZFS
- **Flexible expansion**: Add individual disks, not required to buy pairs
- **Mixed disk sizes**: Better handling than ZFS multi-vdev approach
- **RAID level conversion**: Can convert between RAID levels in place
- Built-in checksumming, snapshots, compression (zstd)
- NixOS has good BTRFS support
**BTRFS RAID1 notes**:
- "RAID1" means 2 copies of all data
- Distributes across all available devices
- With 6+ disks, provides redundancy + capacity scaling
- RAID5/6 avoided (known issues), RAID1/10 are stable
### Hardware: Keep Existing + Add Disks
**Decision**: Retain current hardware, expand disk capacity
**Hardware to keep**:
- AMD Ryzen 5 5600G (sufficient for NAS workload)
- 32GB RAM (adequate)
- 10GbE network interface
- Chassis
**Storage architecture**:
**Bulk storage** (BTRFS RAID1 on HDDs):
- Current: 6x HDDs (2x16TB + 2x8TB + 2x8TB)
- Add: 2x new HDDs (size TBD)
- Use: Media, downloads, backups, non-critical data
- Risk tolerance: High (data mostly replaceable)
**Critical data** (small volume):
- Use 2x 240GB SSDs in mirror (BTRFS or ZFS)
- Or use 2TB NVMe for critical data
- Risk tolerance: Low (data important but small)
### Disk Purchase Decision
**Options under consideration**:
**Option A: 2x 16TB drives**
- Matches largest current drives
- Enables potential future RAID5 if desired (6x 16TB array)
- More conservative capacity increase
**Option B: 2x 20-24TB drives**
- Larger capacity headroom
- Better $/TB ratio typically
- Future-proofs better
**Initial purchase**: 2 drives (chassis has space for 2 more without modifications)
## Migration Strategy
### High-Level Plan
1. **Preparation**:
- Purchase 2x new HDDs (16TB or 20-24TB)
- Create NixOS configuration for new storage host
- Set up bare metal NixOS installation
2. **Initial BTRFS pool**:
- Install 2 new disks
- Create BTRFS filesystem in RAID1
- Mount and test NFS exports
3. **Data migration**:
- Copy data from TrueNAS ZFS pool to new BTRFS pool over 10GbE
- Verify data integrity
4. **Expand pool**:
- As old ZFS pool is emptied, wipe drives and add to BTRFS pool
- Pool grows incrementally: 2 → 4 → 6 → 8 disks
- BTRFS rebalances data across new devices
5. **Service migration**:
- Set up radarr/sonarr/nzbget/restic as NixOS services
- Update NFS client mounts on consuming hosts
6. **Cutover**:
- Point consumers to new NAS host
- Decommission TrueNAS
- Repurpose hardware or keep as spare
### Migration Advantages
- **Low risk**: New pool created independently, old data remains intact during migration
- **Incremental**: Can add old disks one at a time as space allows
- **Flexible**: BTRFS handles mixed disk sizes gracefully
- **Reversible**: Keep TrueNAS running until fully validated
## Next Steps
1. Decide on disk size (16TB vs 20-24TB)
2. Purchase disks
3. Design NixOS host configuration (`hosts/nas1/`)
4. Plan detailed migration timeline
5. Document NFS export mapping (current → new)
## Open Questions
- [ ] Final decision on disk size?
- [ ] Hostname for new NAS host? (nas1? storage1?)
- [ ] IP address allocation (keep 10.69.12.50 or new IP?)
- [ ] Timeline/maintenance window for migration?

View File

@@ -0,0 +1,560 @@
# Phase 4d: Vault Bootstrap Integration - Implementation Summary
## Overview
Phase 4d implements automatic Vault/OpenBao integration for new NixOS hosts, enabling:
- Zero-touch secret provisioning on first boot
- Automatic AppRole authentication
- Runtime secret fetching with caching
- Periodic secret rotation
**Key principle**: Existing sops-nix infrastructure remains unchanged. This is new infrastructure running in parallel.
## Architecture
### Component Diagram
```
┌─────────────────────────────────────────────────────────────┐
│ Developer Workstation │
│ │
│ create-host --hostname myhost --ip 10.69.13.x/24 │
│ │ │
│ ├─> Generate host configs (hosts/myhost/) │
│ ├─> Update flake.nix │
│ ├─> Update terraform/vms.tf │
│ ├─> Generate terraform/vault/hosts-generated.tf │
│ ├─> Apply Vault Terraform (create AppRole) │
│ └─> Generate wrapped token (24h TTL) ───┐ │
│ │ │
└───────────────────────────────────────────────┼────────────┘
┌───────────────────────────┘
│ Wrapped Token
│ (single-use, 24h expiry)
┌─────────────────────────────────────────────────────────────┐
│ Cloud-init (VM Provisioning) │
│ │
│ /etc/environment: │
│ VAULT_ADDR=https://vault01.home.2rjus.net:8200 │
│ VAULT_WRAPPED_TOKEN=hvs.CAES... │
│ VAULT_SKIP_VERIFY=1 │
└─────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────┐
│ Bootstrap Service (First Boot) │
│ │
│ 1. Read VAULT_WRAPPED_TOKEN from environment │
│ 2. POST /v1/sys/wrapping/unwrap │
│ 3. Extract role_id + secret_id │
│ 4. Store in /var/lib/vault/approle/ │
│ ├─ role-id (600 permissions) │
│ └─ secret-id (600 permissions) │
│ 5. Continue with nixos-rebuild boot │
└─────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────┐
│ Runtime (Service Starts) │
│ │
│ vault-secret-<name>.service (ExecStartPre) │
│ │ │
│ ├─> vault-fetch <secret-path> <output-dir> │
│ │ │ │
│ │ ├─> Read role_id + secret_id │
│ │ ├─> POST /v1/auth/approle/login → token │
│ │ ├─> GET /v1/secret/data/<path> → secrets │
│ │ ├─> Write /run/secrets/<name>/password │
│ │ ├─> Write /run/secrets/<name>/api_key │
│ │ └─> Cache to /var/lib/vault/cache/<name>/ │
│ │ │
│ └─> chown/chmod secret files │
│ │
│ myservice.service │
│ └─> Reads secrets from /run/secrets/<name>/ │
└─────────────────────────────────────────────────────────────┘
```
### Data Flow
1. **Provisioning Time** (Developer → Vault):
- create-host generates AppRole configuration
- Terraform creates AppRole + policy in Vault
- Vault generates wrapped token containing role_id + secret_id
- Wrapped token stored in terraform/vms.tf
2. **Bootstrap Time** (Cloud-init → VM):
- Cloud-init injects wrapped token via /etc/environment
- Bootstrap service unwraps token (single-use operation)
- Stores unwrapped credentials persistently
3. **Runtime** (Service → Vault):
- Service starts
- ExecStartPre hook calls vault-fetch
- vault-fetch authenticates using stored credentials
- Fetches secrets and caches them
- Service reads secrets from filesystem
## Implementation Details
### 1. vault-fetch Helper (`scripts/vault-fetch/`)
**Purpose**: Fetch secrets from Vault and write to filesystem
**Features**:
- Reads AppRole credentials from `/var/lib/vault/approle/`
- Authenticates to Vault (fresh token each time)
- Fetches secret from KV v2 engine
- Writes individual files per secret key
- Updates cache for fallback
- Gracefully degrades to cache if Vault unreachable
**Usage**:
```bash
vault-fetch hosts/monitoring01/grafana /run/secrets/grafana
```
**Environment Variables**:
- `VAULT_ADDR`: Vault server (default: https://vault01.home.2rjus.net:8200)
- `VAULT_SKIP_VERIFY`: Skip TLS verification (default: 1)
**Error Handling**:
- Vault unreachable → Use cache (log warning)
- Invalid credentials → Fail with clear error
- No cache + unreachable → Fail with error
### 2. NixOS Module (`system/vault-secrets.nix`)
**Purpose**: Declarative Vault secret management for NixOS services
**Configuration Options**:
```nix
vault.enable = true; # Enable Vault integration
vault.secrets.<name> = {
secretPath = "hosts/monitoring01/grafana"; # Path in Vault
outputDir = "/run/secrets/grafana"; # Where to write secrets
cacheDir = "/var/lib/vault/cache/grafana"; # Cache location
owner = "grafana"; # File owner
group = "grafana"; # File group
mode = "0400"; # Permissions
services = [ "grafana" ]; # Dependent services
restartTrigger = true; # Enable periodic rotation
restartInterval = "daily"; # Rotation schedule
};
```
**Module Behavior**:
1. **Fetch Service**: Creates `vault-secret-<name>.service`
- Runs on boot and before dependent services
- Calls vault-fetch to populate secrets
- Sets ownership and permissions
2. **Rotation Timer**: Optionally creates `vault-secret-rotate-<name>.timer`
- Scheduled restarts for secret rotation
- Automatically excluded for critical services
- Configurable interval (daily, weekly, monthly)
3. **Critical Service Protection**:
```nix
vault.criticalServices = [ "bind" "openbao" "step-ca" ];
```
Services in this list never get auto-restart timers
### 3. create-host Tool Updates
**New Functionality**:
1. **Vault Terraform Generation** (`generators.py`):
- Creates/updates `terraform/vault/hosts-generated.tf`
- Adds host policy granting access to `secret/data/hosts/<hostname>/*`
- Adds AppRole configuration
- Idempotent (safe to re-run)
2. **Wrapped Token Generation** (`vault_helper.py`):
- Applies Vault Terraform to create AppRole
- Reads role_id from Vault
- Generates secret_id
- Wraps credentials in cubbyhole token (24h TTL, single-use)
- Returns wrapped token
3. **VM Configuration Update** (`manipulators.py`):
- Adds `vault_wrapped_token` field to VM in vms.tf
- Preserves other VM settings
**New CLI Options**:
```bash
create-host --hostname myhost --ip 10.69.13.x/24
# Full workflow with Vault integration
create-host --hostname myhost --skip-vault
# Create host without Vault (legacy behavior)
create-host --hostname myhost --force
# Regenerate everything including new wrapped token
```
**Dependencies Added**:
- `hvac`: Python Vault client library
### 4. Bootstrap Service Updates
**New Behavior** (`hosts/template2/bootstrap.nix`):
```bash
# Check for wrapped token
if [ -n "$VAULT_WRAPPED_TOKEN" ]; then
# Unwrap to get credentials
curl -X POST \
-H "X-Vault-Token: $VAULT_WRAPPED_TOKEN" \
$VAULT_ADDR/v1/sys/wrapping/unwrap
# Store role_id and secret_id
mkdir -p /var/lib/vault/approle
echo "$ROLE_ID" > /var/lib/vault/approle/role-id
echo "$SECRET_ID" > /var/lib/vault/approle/secret-id
chmod 600 /var/lib/vault/approle/*
# Continue with bootstrap...
fi
```
**Error Handling**:
- Token already used → Log error, continue bootstrap
- Token expired → Log error, continue bootstrap
- Vault unreachable → Log warning, continue bootstrap
- **Never fails bootstrap** - host can still run without Vault
### 5. Cloud-init Configuration
**Updates** (`terraform/cloud-init.tf`):
```hcl
write_files:
- path: /etc/environment
content: |
VAULT_ADDR=https://vault01.home.2rjus.net:8200
VAULT_WRAPPED_TOKEN=${vault_wrapped_token}
VAULT_SKIP_VERIFY=1
```
**VM Configuration** (`terraform/vms.tf`):
```hcl
locals {
vms = {
"myhost" = {
ip = "10.69.13.x/24"
vault_wrapped_token = "hvs.CAESIBw..." # Added by create-host
}
}
}
```
### 6. Vault Terraform Structure
**Generated Hosts File** (`terraform/vault/hosts-generated.tf`):
```hcl
locals {
generated_host_policies = {
"myhost" = {
paths = [
"secret/data/hosts/myhost/*",
]
}
}
}
resource "vault_policy" "generated_host_policies" {
for_each = local.generated_host_policies
name = "host-${each.key}"
policy = <<-EOT
path "secret/data/hosts/${each.key}/*" {
capabilities = ["read", "list"]
}
EOT
}
resource "vault_approle_auth_backend_role" "generated_hosts" {
for_each = local.generated_host_policies
backend = vault_auth_backend.approle.path
role_name = each.key
token_policies = ["host-${each.key}"]
secret_id_ttl = 0 # Never expire
token_ttl = 3600 # 1 hour tokens
}
```
**Separation of Concerns**:
- `approle.tf`: Manual host configurations (ha1, monitoring01)
- `hosts-generated.tf`: Auto-generated configurations
- `secrets.tf`: Secret definitions (manual)
- `pki.tf`: PKI infrastructure
## Security Model
### Credential Distribution
**Wrapped Token Security**:
- **Single-use**: Can only be unwrapped once
- **Time-limited**: 24h TTL
- **Safe in git**: Even if leaked, expires quickly
- **Standard Vault pattern**: Built-in Vault feature
**Why wrapped tokens are secure**:
```
Developer commits wrapped token to git
Attacker finds token in git history
Attacker tries to use token
❌ Token already used (unwrapped during bootstrap)
❌ OR: Token expired (>24h old)
```
### AppRole Credentials
**Storage**:
- Location: `/var/lib/vault/approle/`
- Permissions: `600 (root:root)`
- Persistence: Survives reboots
**Security Properties**:
- `role_id`: Non-sensitive (like username)
- `secret_id`: Sensitive (like password)
- `secret_id_ttl = 0`: Never expires (simplicity vs rotation tradeoff)
- Tokens: Ephemeral (1h TTL, not cached)
**Attack Scenarios**:
1. **Attacker gets root on host**:
- Can read AppRole credentials
- Can only access that host's secrets
- Cannot access other hosts' secrets (policy restriction)
- ✅ Blast radius limited to single host
2. **Attacker intercepts wrapped token**:
- Single-use: Already consumed during bootstrap
- Time-limited: Likely expired
- ✅ Cannot be reused
3. **Vault server compromised**:
- All secrets exposed (same as any secret storage)
- ✅ No different from sops-nix master key compromise
### Secret Storage
**Runtime Secrets**:
- Location: `/run/secrets/` (tmpfs)
- Lost on reboot
- Re-fetched on service start
- ✅ Not in Nix store
- ✅ Not persisted to disk
**Cached Secrets**:
- Location: `/var/lib/vault/cache/`
- Persists across reboots
- Only used when Vault unreachable
- ✅ Enables service availability
- ⚠️ May be stale
## Failure Modes
### Wrapped Token Expired
**Symptom**: Bootstrap logs "token expired" error
**Impact**: Host boots but has no Vault credentials
**Fix**: Regenerate token and redeploy
```bash
create-host --hostname myhost --force
cd terraform && tofu apply
```
### Vault Unreachable
**Symptom**: Service logs "WARNING: Using cached secrets"
**Impact**: Service uses stale secrets (may work or fail depending on rotation)
**Fix**: Restore Vault connectivity, restart service
### No Cache Available
**Symptom**: Service fails to start with "No cache available"
**Impact**: Service unavailable until Vault restored
**Fix**: Restore Vault, restart service
### Invalid Credentials
**Symptom**: vault-fetch logs authentication failure
**Impact**: Service cannot start
**Fix**:
1. Check AppRole exists: `vault read auth/approle/role/hostname`
2. Check policy exists: `vault policy read host-hostname`
3. Regenerate credentials if needed
## Migration Path
### Current State (Phase 4d)
- ✅ sops-nix: Used by all existing services
- ✅ Vault: Available for new services
- ✅ Parallel operation: Both work simultaneously
### Future Migration
**Gradual Service Migration**:
1. **Pick a non-critical service** (e.g., test service)
2. **Add Vault secrets**:
```nix
vault.secrets.myservice = {
secretPath = "hosts/myhost/myservice";
};
```
3. **Update service to read from Vault**:
```nix
systemd.services.myservice.serviceConfig = {
EnvironmentFile = "/run/secrets/myservice/password";
};
```
4. **Remove sops-nix secret**
5. **Test thoroughly**
6. **Repeat for next service**
**Critical Services Last**:
- DNS (bind)
- Certificate Authority (step-ca)
- Vault itself (openbao)
**Eventually**:
- All services migrated to Vault
- Remove sops-nix dependency
- Clean up `/secrets/` directory
## Performance Considerations
### Bootstrap Time
**Added overhead**: ~2-5 seconds
- Token unwrap: ~1s
- Credential storage: ~1s
**Total bootstrap time**: Still <2 minutes (acceptable)
### Service Startup
**Added overhead**: ~1-3 seconds per service
- Vault authentication: ~1s
- Secret fetch: ~1s
- File operations: <1s
**Parallel vs Serial**:
- Multiple services fetch in parallel
- No cascade delays
### Cache Benefits
**When Vault unreachable**:
- Service starts in <1s (cache read)
- No Vault dependency for startup
- High availability maintained
## Testing Checklist
Complete testing workflow documented in `vault-bootstrap-testing.md`:
- [ ] Create test host with create-host
- [ ] Add test secrets to Vault
- [ ] Deploy VM and verify bootstrap
- [ ] Verify secrets fetched successfully
- [ ] Test service restart (re-fetch)
- [ ] Test Vault unreachable (cache fallback)
- [ ] Test secret rotation
- [ ] Test wrapped token expiry
- [ ] Test token reuse prevention
- [ ] Verify critical services excluded from auto-restart
## Files Changed
### Created
- `scripts/vault-fetch/vault-fetch.sh` - Secret fetching script
- `scripts/vault-fetch/default.nix` - Nix package
- `scripts/vault-fetch/README.md` - Documentation
- `system/vault-secrets.nix` - NixOS module
- `scripts/create-host/vault_helper.py` - Vault API client
- `terraform/vault/hosts-generated.tf` - Generated Terraform
- `docs/vault-bootstrap-implementation.md` - This file
- `docs/vault-bootstrap-testing.md` - Testing guide
### Modified
- `scripts/create-host/default.nix` - Add hvac dependency
- `scripts/create-host/create_host.py` - Add Vault integration
- `scripts/create-host/generators.py` - Add Vault Terraform generation
- `scripts/create-host/manipulators.py` - Add wrapped token injection
- `terraform/cloud-init.tf` - Inject Vault credentials
- `terraform/vms.tf` - Support vault_wrapped_token field
- `hosts/template2/bootstrap.nix` - Unwrap token and store credentials
- `system/default.nix` - Import vault-secrets module
- `flake.nix` - Add vault-fetch package
### Unchanged
- All existing sops-nix configuration
- All existing service configurations
- All existing host configurations
- `/secrets/` directory
## Future Enhancements
### Phase 4e+ (Not in Scope)
1. **Dynamic Secrets**
- Database credentials with rotation
- Cloud provider credentials
- SSH certificates
2. **Secret Watcher**
- Monitor Vault for secret changes
- Automatically restart services on rotation
- Faster than periodic timers
3. **PKI Integration** (Phase 4c)
- Migrate from step-ca to Vault PKI
- Automatic certificate issuance
- Short-lived certificates
4. **Audit Logging**
- Track secret access
- Alert on suspicious patterns
- Compliance reporting
5. **Multi-Environment**
- Dev/staging/prod separation
- Per-environment Vault namespaces
- Separate AppRoles per environment
## Conclusion
Phase 4d successfully implements automatic Vault integration for new NixOS hosts with:
- ✅ Zero-touch provisioning
- ✅ Secure credential distribution
- ✅ Graceful degradation
- ✅ Backward compatibility
- ✅ Production-ready error handling
The infrastructure is ready for gradual migration of existing services from sops-nix to Vault.

View File

@@ -0,0 +1,419 @@
# Phase 4d: Vault Bootstrap Integration - Testing Guide
This guide walks through testing the complete Vault bootstrap workflow implemented in Phase 4d.
## Prerequisites
Before testing, ensure:
1. **Vault server is running**: vault01 (vault01.home.2rjus.net:8200) is accessible
2. **Vault access**: You have a Vault token with admin permissions (set `BAO_TOKEN` env var)
3. **Terraform installed**: OpenTofu is available in your PATH
4. **Git repository clean**: All Phase 4d changes are committed to a branch
## Test Scenario: Create vaulttest01
### Step 1: Create Test Host Configuration
Run the create-host tool with Vault integration:
```bash
# Ensure you have Vault token
export BAO_TOKEN="your-vault-admin-token"
# Create test host
nix run .#create-host -- \
--hostname vaulttest01 \
--ip 10.69.13.150/24 \
--cpu 2 \
--memory 2048 \
--disk 20G
# If you need to regenerate (e.g., wrapped token expired):
nix run .#create-host -- \
--hostname vaulttest01 \
--ip 10.69.13.150/24 \
--force
```
**What this does:**
- Creates `hosts/vaulttest01/` configuration
- Updates `flake.nix` with new host
- Updates `terraform/vms.tf` with VM definition
- Generates `terraform/vault/hosts-generated.tf` with AppRole and policy
- Creates a wrapped token (24h TTL, single-use)
- Adds wrapped token to VM configuration
**Expected output:**
```
✓ All validations passed
✓ Created hosts/vaulttest01/default.nix
✓ Created hosts/vaulttest01/configuration.nix
✓ Updated flake.nix
✓ Updated terraform/vms.tf
Configuring Vault integration...
✓ Updated terraform/vault/hosts-generated.tf
Applying Vault Terraform configuration...
✓ Terraform applied successfully
Reading AppRole credentials for vaulttest01...
✓ Retrieved role_id
✓ Generated secret_id
Creating wrapped token (24h TTL, single-use)...
✓ Created wrapped token: hvs.CAESIBw...
⚠️ Token expires in 24 hours
⚠️ Token can only be used once
✓ Added wrapped token to terraform/vms.tf
✓ Host configuration generated successfully!
```
### Step 2: Add Test Service Configuration
Edit `hosts/vaulttest01/configuration.nix` to enable Vault and add a test service:
```nix
{ config, pkgs, lib, ... }:
{
imports = [
../../system
../../common/vm
];
# Enable Vault secrets management
vault.enable = true;
# Define a test secret
vault.secrets.test-service = {
secretPath = "hosts/vaulttest01/test-service";
restartTrigger = true;
restartInterval = "daily";
services = [ "vault-test" ];
};
# Create a test service that uses the secret
systemd.services.vault-test = {
description = "Test Vault secret fetching";
wantedBy = [ "multi-user.target" ];
after = [ "vault-secret-test-service.service" ];
serviceConfig = {
Type = "oneshot";
RemainAfterExit = true;
ExecStart = pkgs.writeShellScript "vault-test" ''
echo "=== Vault Secret Test ==="
echo "Secret path: hosts/vaulttest01/test-service"
if [ -f /run/secrets/test-service/password ]; then
echo " Password file exists"
echo "Password length: $(wc -c < /run/secrets/test-service/password)"
else
echo " Password file missing!"
exit 1
fi
if [ -d /var/lib/vault/cache/test-service ]; then
echo " Cache directory exists"
else
echo " Cache directory missing!"
exit 1
fi
echo "Test successful!"
'';
StandardOutput = "journal+console";
};
};
# Rest of configuration...
networking.hostName = "vaulttest01";
networking.domain = "home.2rjus.net";
systemd.network.networks."10-lan" = {
matchConfig.Name = "ens18";
address = [ "10.69.13.150/24" ];
gateway = [ "10.69.13.1" ];
dns = [ "10.69.13.5" "10.69.13.6" ];
domains = [ "home.2rjus.net" ];
};
system.stateVersion = "25.11";
}
```
### Step 3: Create Test Secrets in Vault
Add test secrets to Vault using Terraform:
Edit `terraform/vault/secrets.tf`:
```hcl
locals {
secrets = {
# ... existing secrets ...
# Test secret for vaulttest01
"hosts/vaulttest01/test-service" = {
auto_generate = true
password_length = 24
}
}
}
```
Apply the Vault configuration:
```bash
cd terraform/vault
tofu apply
```
**Verify the secret exists:**
```bash
export VAULT_ADDR=https://vault01.home.2rjus.net:8200
export VAULT_SKIP_VERIFY=1
vault kv get secret/hosts/vaulttest01/test-service
```
### Step 4: Deploy the VM
**Important**: Deploy within 24 hours of creating the host (wrapped token TTL)
```bash
cd terraform
tofu plan # Review changes
tofu apply # Deploy VM
```
### Step 5: Monitor Bootstrap Process
SSH into the VM and monitor the bootstrap:
```bash
# Watch bootstrap logs
ssh root@vaulttest01
journalctl -fu nixos-bootstrap.service
# Expected log output:
# Starting NixOS bootstrap for host: vaulttest01
# Network connectivity confirmed
# Unwrapping Vault token to get AppRole credentials...
# Vault credentials unwrapped and stored successfully
# Fetching and building NixOS configuration from flake...
# Successfully built configuration for vaulttest01
# Rebooting into new configuration...
```
### Step 6: Verify Vault Integration
After the VM reboots, verify the integration:
```bash
ssh root@vaulttest01
# Check AppRole credentials were stored
ls -la /var/lib/vault/approle/
# Expected: role-id and secret-id files with 600 permissions
cat /var/lib/vault/approle/role-id
# Should show a UUID
# Check vault-secret service ran successfully
systemctl status vault-secret-test-service.service
# Should be active (exited)
journalctl -u vault-secret-test-service.service
# Should show successful secret fetch:
# [vault-fetch] Authenticating to Vault at https://vault01.home.2rjus.net:8200
# [vault-fetch] Successfully authenticated to Vault
# [vault-fetch] Fetching secret from path: hosts/vaulttest01/test-service
# [vault-fetch] Writing secrets to /run/secrets/test-service
# [vault-fetch] - Wrote secret key: password
# [vault-fetch] Successfully fetched and cached secrets
# Check test service passed
systemctl status vault-test.service
journalctl -u vault-test.service
# Should show:
# === Vault Secret Test ===
# ✓ Password file exists
# ✓ Cache directory exists
# Test successful!
# Verify secret files exist
ls -la /run/secrets/test-service/
# Should show password file with 400 permissions
# Verify cache exists
ls -la /var/lib/vault/cache/test-service/
# Should show cached password file
```
## Test Scenarios
### Scenario 1: Fresh Deployment
**Expected**: All secrets fetched successfully from Vault
### Scenario 2: Service Restart
```bash
systemctl restart vault-test.service
```
**Expected**: Secrets re-fetched from Vault, service starts successfully
### Scenario 3: Vault Unreachable
```bash
# On vault01, stop Vault temporarily
ssh root@vault01
systemctl stop openbao
# On vaulttest01, restart test service
ssh root@vaulttest01
systemctl restart vault-test.service
journalctl -u vault-secret-test-service.service | tail -20
```
**Expected**:
- Warning logged: "Using cached secrets from /var/lib/vault/cache/test-service"
- Service starts successfully using cached secrets
```bash
# Restore Vault
ssh root@vault01
systemctl start openbao
```
### Scenario 4: Secret Rotation
```bash
# Update secret in Vault
vault kv put secret/hosts/vaulttest01/test-service password="new-secret-value"
# On vaulttest01, trigger rotation
ssh root@vaulttest01
systemctl restart vault-secret-test-service.service
# Verify new secret
cat /run/secrets/test-service/password
# Should show new value
```
**Expected**: New secret fetched and cached
### Scenario 5: Expired Wrapped Token
```bash
# Wait 24+ hours after create-host, then try to deploy
cd terraform
tofu apply
```
**Expected**: Bootstrap fails with message about expired token
**Fix (Option 1 - Regenerate token only):**
```bash
# Only regenerates the wrapped token, preserves all other configuration
nix run .#create-host -- --hostname vaulttest01 --regenerate-token
cd terraform
tofu apply
```
**Fix (Option 2 - Full regeneration with --force):**
```bash
# Overwrites entire host configuration (including any manual changes)
nix run .#create-host -- --hostname vaulttest01 --force
cd terraform
tofu apply
```
**Recommendation**: Use `--regenerate-token` to avoid losing manual configuration changes.
### Scenario 6: Already-Used Wrapped Token
Try to deploy the same VM twice without regenerating token.
**Expected**: Second bootstrap fails with "token already used" message
## Cleanup
After testing:
```bash
# Destroy test VM
cd terraform
tofu destroy -target=proxmox_vm_qemu.vm[\"vaulttest01\"]
# Remove test secrets from Vault
vault kv delete secret/hosts/vaulttest01/test-service
# Remove host configuration (optional)
git rm -r hosts/vaulttest01
# Edit flake.nix to remove nixosConfigurations.vaulttest01
# Edit terraform/vms.tf to remove vaulttest01
# Edit terraform/vault/hosts-generated.tf to remove vaulttest01
```
## Success Criteria Checklist
Phase 4d is considered successful when:
- [x] create-host generates Vault configuration automatically
- [x] New hosts receive AppRole credentials via cloud-init
- [x] Bootstrap stores credentials in /var/lib/vault/approle/
- [x] Services can fetch secrets using vault.secrets option
- [x] Secrets extracted to individual files in /run/secrets/
- [x] Cached secrets work when Vault is unreachable
- [x] Periodic restart timers work for secret rotation
- [x] Critical services excluded from auto-restart
- [x] Test host deploys and verifies working
- [x] sops-nix continues to work for existing services
## Troubleshooting
### Bootstrap fails with "Failed to unwrap Vault token"
**Possible causes:**
- Token already used (wrapped tokens are single-use)
- Token expired (24h TTL)
- Invalid token
- Vault unreachable
**Solution:**
```bash
# Regenerate token
nix run .#create-host -- --hostname vaulttest01 --force
cd terraform && tofu apply
```
### Secret fetch fails with authentication error
**Check:**
```bash
# Verify AppRole exists
vault read auth/approle/role/vaulttest01
# Verify policy exists
vault policy read host-vaulttest01
# Test authentication manually
ROLE_ID=$(cat /var/lib/vault/approle/role-id)
SECRET_ID=$(cat /var/lib/vault/approle/secret-id)
vault write auth/approle/login role_id="$ROLE_ID" secret_id="$SECRET_ID"
```
### Cache not working
**Check:**
```bash
# Verify cache directory exists and has files
ls -la /var/lib/vault/cache/test-service/
# Check permissions
stat /var/lib/vault/cache/test-service/password
# Should be 600 (rw-------)
```
## Next Steps
After successful testing:
1. Gradually migrate existing services from sops-nix to Vault
2. Consider implementing secret watcher for faster rotation (future enhancement)
3. Phase 4c: Migrate from step-ca to OpenBao PKI
4. Eventually deprecate and remove sops-nix

178
docs/vault/auto-unseal.md Normal file
View File

@@ -0,0 +1,178 @@
# OpenBao TPM2 Auto-Unseal Setup
This document describes the one-time setup process for enabling TPM2-based auto-unsealing on vault01.
## Overview
The auto-unseal feature uses systemd's `LoadCredentialEncrypted` with TPM2 to securely store and retrieve an unseal key. On service start, systemd automatically decrypts the credential using the VM's TPM, and the service unseals OpenBao.
## Prerequisites
- OpenBao must be initialized (`bao operator init` completed)
- You must have at least one unseal key from the initialization
- vault01 must have a TPM2 device (virtual TPM for Proxmox VMs)
## Initial Setup
Perform these steps on vault01 after deploying the service configuration:
### 1. Save Unseal Key
```bash
# Create temporary file with one of your unseal keys
echo "paste-your-unseal-key-here" > /tmp/unseal-key.txt
```
### 2. Encrypt with TPM2
```bash
# Encrypt the key using TPM2 binding
systemd-creds encrypt \
--with-key=tpm2 \
--name=unseal-key \
/tmp/unseal-key.txt \
/var/lib/openbao/unseal-key.cred
# Set proper ownership and permissions
chown openbao:openbao /var/lib/openbao/unseal-key.cred
chmod 600 /var/lib/openbao/unseal-key.cred
```
### 3. Cleanup
```bash
# Securely delete the plaintext key
shred -u /tmp/unseal-key.txt
```
### 4. Test Auto-Unseal
```bash
# Restart the service - it should auto-unseal
systemctl restart openbao
# Verify it's unsealed
bao status
# Should show: Sealed = false
```
## TPM PCR Binding
The default `--with-key=tpm2` binds the credential to PCR 7 (Secure Boot state). For stricter binding that includes firmware and boot state:
```bash
systemd-creds encrypt \
--with-key=tpm2 \
--tpm2-pcrs=0+7+14 \
--name=unseal-key \
/tmp/unseal-key.txt \
/var/lib/openbao/unseal-key.cred
```
PCR meanings:
- **PCR 0**: BIOS/UEFI firmware measurements
- **PCR 7**: Secure Boot state (UEFI variables)
- **PCR 14**: MOK (Machine Owner Key) state
**Trade-off**: Stricter PCR binding improves security but may require re-encrypting the credential after firmware updates or kernel changes.
## Re-provisioning
If you need to reprovision vault01 from scratch:
1. **Before destroying**: Back up your root token and all unseal keys (stored securely offline)
2. **After recreating the VM**:
- Initialize OpenBao: `bao operator init`
- Follow the setup steps above to encrypt a new unseal key with TPM2
3. **Restore data** (if migrating): Copy `/var/lib/openbao` from backup
## Handling System Changes
**After firmware updates, kernel updates, or boot configuration changes**, PCR values may change, causing TPM decryption to fail.
### Symptoms
- Service fails to start
- Logs show: `Failed to decrypt credentials`
- OpenBao remains sealed after reboot
### Fix
1. Unseal manually with one of your offline unseal keys:
```bash
bao operator unseal
```
2. Re-encrypt the credential with updated PCR values:
```bash
echo "your-unseal-key" > /tmp/unseal-key.txt
systemd-creds encrypt \
--with-key=tpm2 \
--name=unseal-key \
/tmp/unseal-key.txt \
/var/lib/openbao/unseal-key.cred
chown openbao:openbao /var/lib/openbao/unseal-key.cred
chmod 600 /var/lib/openbao/unseal-key.cred
shred -u /tmp/unseal-key.txt
```
3. Restart the service:
```bash
systemctl restart openbao
```
## Security Considerations
### What This Protects Against
- **Data at rest**: Vault data is encrypted and cannot be accessed without unsealing
- **VM snapshot theft**: An attacker with a VM snapshot cannot decrypt the unseal key without the TPM state
- **TPM binding**: The key can only be decrypted by the same VM with matching PCR values
### What This Does NOT Protect Against
- **Compromised host**: If an attacker gains root access to vault01 while running, they can access unsealed data
- **Boot-time attacks**: If an attacker can modify the boot process to match PCR values, they may retrieve the key
- **VM console access**: An attacker with VM console access during boot could potentially access the unsealed vault
### Recommendations
- **Keep offline backups** of root token and all unseal keys in a secure location (password manager, encrypted USB, etc.)
- **Use Shamir secret sharing**: The default 5-key threshold means even if the TPM key is compromised, an attacker needs the other keys
- **Monitor access**: Use OpenBao's audit logging to detect unauthorized access
- **Consider stricter PCR binding** (PCR 0+7+14) for production, accepting the maintenance overhead
## Troubleshooting
### Check if credential exists
```bash
ls -la /var/lib/openbao/unseal-key.cred
```
### Test credential decryption manually
```bash
# Should output your unseal key if TPM decryption works
systemd-creds decrypt /var/lib/openbao/unseal-key.cred -
```
### View service logs
```bash
journalctl -u openbao -n 50
```
### Manual unseal
```bash
bao operator unseal
# Enter one of your offline unseal keys when prompted
```
### Check TPM status
```bash
# Check if TPM2 is available
ls /dev/tpm*
# View TPM PCR values
tpm2_pcrread
```
## References
- [systemd.exec - Credentials](https://www.freedesktop.org/software/systemd/man/systemd.exec.html#Credentials)
- [systemd-creds man page](https://www.freedesktop.org/software/systemd/man/systemd-creds.html)
- [TPM2 PCR Documentation](https://uapi-group.org/specifications/specs/linux_tpm_pcr_registry/)
- [OpenBao Documentation](https://openbao.org/docs/)

40
flake.lock generated
View File

@@ -21,27 +21,6 @@
"url": "https://git.t-juice.club/torjus/alerttonotify"
}
},
"backup-helper": {
"inputs": {
"nixpkgs": [
"nixpkgs-unstable"
]
},
"locked": {
"lastModified": 1738015166,
"narHash": "sha256-573tR4aXNjILKvYnjZUM5DZZME2H6YTHJkUKs3ZehFU=",
"ref": "master",
"rev": "f9540cc065692c7ca80735e7b08399459e0ea6d6",
"revCount": 35,
"type": "git",
"url": "https://git.t-juice.club/torjus/backup-helper"
},
"original": {
"ref": "master",
"type": "git",
"url": "https://git.t-juice.club/torjus/backup-helper"
}
},
"labmon": {
"inputs": {
"nixpkgs": [
@@ -65,11 +44,11 @@
},
"nixpkgs": {
"locked": {
"lastModified": 1769598131,
"narHash": "sha256-e7VO/kGLgRMbWtpBqdWl0uFg8Y2XWFMdz0uUJvlML8o=",
"lastModified": 1770136044,
"narHash": "sha256-tlFqNG/uzz2++aAmn4v8J0vAkV3z7XngeIIB3rM3650=",
"owner": "nixos",
"repo": "nixpkgs",
"rev": "fa83fd837f3098e3e678e6cf017b2b36102c7211",
"rev": "e576e3c9cf9bad747afcddd9e34f51d18c855b4e",
"type": "github"
},
"original": {
@@ -81,11 +60,11 @@
},
"nixpkgs-unstable": {
"locked": {
"lastModified": 1769461804,
"narHash": "sha256-msG8SU5WsBUfVVa/9RPLaymvi5bI8edTavbIq3vRlhI=",
"lastModified": 1770181073,
"narHash": "sha256-ksTL7P9QC1WfZasNlaAdLOzqD8x5EPyods69YBqxSfk=",
"owner": "nixos",
"repo": "nixpkgs",
"rev": "bfc1b8a4574108ceef22f02bafcf6611380c100d",
"rev": "bf922a59c5c9998a6584645f7d0de689512e444c",
"type": "github"
},
"original": {
@@ -98,7 +77,6 @@
"root": {
"inputs": {
"alerttonotify": "alerttonotify",
"backup-helper": "backup-helper",
"labmon": "labmon",
"nixpkgs": "nixpkgs",
"nixpkgs-unstable": "nixpkgs-unstable",
@@ -112,11 +90,11 @@
]
},
"locked": {
"lastModified": 1769469829,
"narHash": "sha256-wFcr32ZqspCxk4+FvIxIL0AZktRs6DuF8oOsLt59YBU=",
"lastModified": 1770145881,
"narHash": "sha256-ktjWTq+D5MTXQcL9N6cDZXUf9kX8JBLLBLT0ZyOTSYY=",
"owner": "Mic92",
"repo": "sops-nix",
"rev": "c5eebd4eb2e3372fe12a8d70a248a6ee9dd02eff",
"rev": "17eea6f3816ba6568b8c81db8a4e6ca438b30b7c",
"type": "github"
},
"original": {

View File

@@ -9,10 +9,6 @@
url = "github:Mic92/sops-nix";
inputs.nixpkgs.follows = "nixpkgs-unstable";
};
backup-helper = {
url = "git+https://git.t-juice.club/torjus/backup-helper?ref=master";
inputs.nixpkgs.follows = "nixpkgs-unstable";
};
alerttonotify = {
url = "git+https://git.t-juice.club/torjus/alerttonotify?ref=master";
inputs.nixpkgs.follows = "nixpkgs-unstable";
@@ -29,7 +25,6 @@
nixpkgs,
nixpkgs-unstable,
sops-nix,
backup-helper,
alerttonotify,
labmon,
...
@@ -136,7 +131,6 @@
)
./hosts/nixos-test1
sops-nix.nixosModules.sops
backup-helper.nixosModules.backup-helper
];
};
ha1 = nixpkgs.lib.nixosSystem {
@@ -153,7 +147,6 @@
)
./hosts/ha1
sops-nix.nixosModules.sops
backup-helper.nixosModules.backup-helper
];
};
template1 = nixpkgs.lib.nixosSystem {
@@ -234,7 +227,6 @@
)
./hosts/monitoring01
sops-nix.nixosModules.sops
backup-helper.nixosModules.backup-helper
labmon.nixosModules.labmon
];
};
@@ -366,11 +358,28 @@
sops-nix.nixosModules.sops
];
};
vaulttest01 = nixpkgs.lib.nixosSystem {
inherit system;
specialArgs = {
inherit inputs self sops-nix;
};
modules = [
(
{ config, pkgs, ... }:
{
nixpkgs.overlays = commonOverlays;
}
)
./hosts/vaulttest01
sops-nix.nixosModules.sops
];
};
};
packages = forAllSystems (
{ pkgs }:
{
create-host = pkgs.callPackage ./scripts/create-host { };
vault-fetch = pkgs.callPackage ./scripts/vault-fetch { };
}
);
devShells = forAllSystems (
@@ -380,6 +389,7 @@
packages = with pkgs; [
ansible
opentofu
openbao
(pkgs.callPackage ./scripts/create-host { })
];
};

View File

@@ -11,6 +11,8 @@
../../common/vm
];
homelab.dns.cnames = [ "ldap" ];
nixpkgs.config.allowUnfree = true;
# Use the systemd-boot EFI boot loader.
boot.loader.grub = {

View File

@@ -57,14 +57,24 @@
# Backup service dirs
sops.secrets."backup_helper_secret" = { };
backup-helper = {
enable = true;
password-file = "/run/secrets/backup_helper_secret";
backup-dirs = [
services.restic.backups.ha1 = {
repository = "rest:http://10.69.12.52:8000/backup-nix";
passwordFile = "/run/secrets/backup_helper_secret";
paths = [
"/var/lib/hass"
"/var/lib/zigbee2mqtt"
"/var/lib/mosquitto"
];
timerConfig = {
OnCalendar = "daily";
Persistent = true;
};
pruneOpts = [
"--keep-daily 7"
"--keep-weekly 4"
"--keep-monthly 6"
"--keep-within 1d"
];
};
# Open ports in the firewall.

View File

@@ -11,6 +11,22 @@
../../common/vm
];
homelab.dns.cnames = [
"nzbget"
"radarr"
"sonarr"
"ha"
"z2m"
"grafana"
"prometheus"
"alertmanager"
"jelly"
"auth"
"lldap"
"pyroscope"
"pushgw"
];
nixpkgs.config.allowUnfree = true;
# Use the systemd-boot EFI boot loader.
boot.loader.grub = {

View File

@@ -26,7 +26,11 @@
};
};
};
# monitoring
homelab.monitoring.scrapeTargets = [{
job_name = "wireguard";
port = 9586;
}];
services.prometheus.exporters.wireguard = {
enable = true;
};

View File

@@ -57,15 +57,35 @@
services.qemuGuest.enable = true;
sops.secrets."backup_helper_secret" = { };
backup-helper = {
enable = true;
password-file = "/run/secrets/backup_helper_secret";
backup-dirs = [
"/var/lib/grafana/plugins"
services.restic.backups.grafana = {
repository = "rest:http://10.69.12.52:8000/backup-nix";
passwordFile = "/run/secrets/backup_helper_secret";
paths = [ "/var/lib/grafana/plugins" ];
timerConfig = {
OnCalendar = "daily";
Persistent = true;
};
pruneOpts = [
"--keep-daily 7"
"--keep-weekly 4"
"--keep-monthly 6"
"--keep-within 1d"
];
backup-commands = [
# "grafana.db:${pkgs.sqlite}/bin/sqlite /var/lib/grafana/data/grafana.db .dump"
"grafana.db:${pkgs.sqlite}/bin/sqlite3 /var/lib/grafana/data/grafana.db .dump"
};
services.restic.backups.grafana-db = {
repository = "rest:http://10.69.12.52:8000/backup-nix";
passwordFile = "/run/secrets/backup_helper_secret";
command = [ "${pkgs.sqlite}/bin/sqlite3" "/var/lib/grafana/data/grafana.db" ".dump" ];
timerConfig = {
OnCalendar = "daily";
Persistent = true;
};
pruneOpts = [
"--keep-daily 7"
"--keep-weekly 4"
"--keep-monthly 6"
"--keep-within 1d"
];
};

View File

@@ -11,6 +11,8 @@
../../common/vm
];
homelab.dns.cnames = [ "nix-cache" "actions1" ];
fileSystems."/nix" = {
device = "/dev/disk/by-label/nixcache";
fsType = "xfs";

View File

@@ -51,15 +51,25 @@
networking.firewall.enable = false;
# Secrets
# Backup helper
# Backup
sops.secrets."backup_helper_secret" = { };
backup-helper = {
enable = true;
password-file = "/run/secrets/backup_helper_secret";
backup-dirs = [
services.restic.backups.test = {
repository = "rest:http://10.69.12.52:8000/backup-nix";
passwordFile = "/run/secrets/backup_helper_secret";
paths = [
"/etc/machine-id"
"/etc/os-release"
];
timerConfig = {
OnCalendar = "daily";
Persistent = true;
};
pruneOpts = [
"--keep-daily 7"
"--keep-weekly 4"
"--keep-monthly 6"
"--keep-within 1d"
];
};
system.stateVersion = "23.11"; # Did you read the comment?

View File

@@ -8,6 +8,9 @@
../../system
];
# Template host - exclude from DNS zone generation
homelab.dns.enable = false;
boot.loader.grub.enable = true;
boot.loader.grub.device = "/dev/sda";

View File

@@ -22,6 +22,53 @@ let
fi
echo "Network connectivity confirmed"
# Unwrap Vault token and store AppRole credentials (if provided)
if [ -n "''${VAULT_WRAPPED_TOKEN:-}" ]; then
echo "Unwrapping Vault token to get AppRole credentials..."
VAULT_ADDR="''${VAULT_ADDR:-https://vault01.home.2rjus.net:8200}"
# Unwrap the token to get role_id and secret_id
UNWRAP_RESPONSE=$(curl -sk -X POST \
-H "X-Vault-Token: $VAULT_WRAPPED_TOKEN" \
"$VAULT_ADDR/v1/sys/wrapping/unwrap") || {
echo "WARNING: Failed to unwrap Vault token (network error)"
echo "Vault secrets will not be available, but continuing bootstrap..."
}
# Check if unwrap was successful
if [ -n "$UNWRAP_RESPONSE" ] && echo "$UNWRAP_RESPONSE" | jq -e '.data' >/dev/null 2>&1; then
ROLE_ID=$(echo "$UNWRAP_RESPONSE" | jq -r '.data.role_id')
SECRET_ID=$(echo "$UNWRAP_RESPONSE" | jq -r '.data.secret_id')
# Store credentials
mkdir -p /var/lib/vault/approle
echo "$ROLE_ID" > /var/lib/vault/approle/role-id
echo "$SECRET_ID" > /var/lib/vault/approle/secret-id
chmod 600 /var/lib/vault/approle/role-id
chmod 600 /var/lib/vault/approle/secret-id
echo "Vault credentials unwrapped and stored successfully"
else
echo "WARNING: Failed to unwrap Vault token"
if [ -n "$UNWRAP_RESPONSE" ]; then
echo "Response: $UNWRAP_RESPONSE"
fi
echo "Possible causes:"
echo " - Token already used (wrapped tokens are single-use)"
echo " - Token expired (24h TTL)"
echo " - Invalid token"
echo ""
echo "To regenerate token, run: create-host --hostname $HOSTNAME --force"
echo ""
echo "Vault secrets will not be available, but continuing bootstrap..."
fi
else
echo "No Vault wrapped token provided (VAULT_WRAPPED_TOKEN not set)"
echo "Skipping Vault credential setup"
fi
echo "Fetching and building NixOS configuration from flake..."
# Read git branch from environment, default to master
@@ -62,8 +109,8 @@ in
RemainAfterExit = true;
ExecStart = "${bootstrap-script}/bin/nixos-bootstrap";
# Read environment variables from /etc/environment (set by cloud-init)
EnvironmentFile = "-/etc/environment";
# Read environment variables from cloud-init (set by cloud-init write_files)
EnvironmentFile = "-/run/cloud-init-env";
# Logging to journald
StandardOutput = "journal+console";

View File

@@ -13,6 +13,9 @@
../../common/vm
];
# Test VM - exclude from DNS zone generation
homelab.dns.enable = false;
nixpkgs.config.allowUnfree = true;
boot.loader.grub.enable = true;
boot.loader.grub.device = "/dev/vda";

View File

@@ -14,6 +14,8 @@
../../services/vault
];
homelab.dns.cnames = [ "vault" ];
nixpkgs.config.allowUnfree = true;
boot.loader.grub.enable = true;
boot.loader.grub.device = "/dev/vda";

View File

@@ -0,0 +1,121 @@
{
config,
lib,
pkgs,
...
}:
{
imports = [
../template2/hardware-configuration.nix
../../system
../../common/vm
];
nixpkgs.config.allowUnfree = true;
boot.loader.grub.enable = true;
boot.loader.grub.device = "/dev/vda";
networking.hostName = "vaulttest01";
networking.domain = "home.2rjus.net";
networking.useNetworkd = true;
networking.useDHCP = false;
services.resolved.enable = true;
networking.nameservers = [
"10.69.13.5"
"10.69.13.6"
];
systemd.network.enable = true;
systemd.network.networks."ens18" = {
matchConfig.Name = "ens18";
address = [
"10.69.13.150/24"
];
routes = [
{ Gateway = "10.69.13.1"; }
];
linkConfig.RequiredForOnline = "routable";
};
time.timeZone = "Europe/Oslo";
nix.settings.experimental-features = [
"nix-command"
"flakes"
];
nix.settings.tarball-ttl = 0;
environment.systemPackages = with pkgs; [
vim
wget
git
];
# Open ports in the firewall.
# networking.firewall.allowedTCPPorts = [ ... ];
# networking.firewall.allowedUDPPorts = [ ... ];
# Or disable the firewall altogether.
networking.firewall.enable = false;
# Testing config
# Enable Vault secrets management
vault.enable = true;
# Define a test secret
vault.secrets.test-service = {
secretPath = "hosts/vaulttest01/test-service";
restartTrigger = true;
restartInterval = "daily";
services = [ "vault-test" ];
};
# Create a test service that uses the secret
systemd.services.vault-test = {
description = "Test Vault secret fetching";
wantedBy = [ "multi-user.target" ];
after = [ "vault-secret-test-service.service" ];
serviceConfig = {
Type = "oneshot";
RemainAfterExit = true;
ExecStart = pkgs.writeShellScript "vault-test" ''
echo "=== Vault Secret Test ==="
echo "Secret path: hosts/vaulttest01/test-service"
if [ -f /run/secrets/test-service/password ]; then
echo " Password file exists"
echo "Password length: $(wc -c < /run/secrets/test-service/password)"
else
echo " Password file missing!"
exit 1
fi
if [ -d /var/lib/vault/cache/test-service ]; then
echo " Cache directory exists"
else
echo " Cache directory missing!"
exit 1
fi
echo "Test successful!"
'';
StandardOutput = "journal+console";
};
};
# Test ACME certificate issuance from OpenBao PKI
# Override the global ACME server (from system/acme.nix) to use OpenBao instead of step-ca
security.acme.defaults.server = lib.mkForce "https://vault01.home.2rjus.net:8200/v1/pki_int/acme/directory";
# Request a certificate for this host
# Using HTTP-01 challenge with standalone listener on port 80
security.acme.certs."vaulttest01.home.2rjus.net" = {
listenHTTP = ":80";
enableDebugLogs = true;
};
system.stateVersion = "25.11"; # Did you read the comment?
}

View File

@@ -0,0 +1,5 @@
{ ... }: {
imports = [
./configuration.nix
];
}

160
lib/dns-zone.nix Normal file
View File

@@ -0,0 +1,160 @@
{ lib }:
let
# Pad string on the right to reach a fixed width
rightPad = width: str:
let
len = builtins.stringLength str;
padding = if len >= width then "" else lib.strings.replicate (width - len) " ";
in
str + padding;
# Extract IP address from CIDR notation (e.g., "10.69.13.5/24" -> "10.69.13.5")
extractIP = address:
let
parts = lib.splitString "/" address;
in
builtins.head parts;
# Check if a network interface name looks like a VPN/tunnel interface
isVpnInterface = ifaceName:
lib.hasPrefix "wg" ifaceName ||
lib.hasPrefix "tun" ifaceName ||
lib.hasPrefix "tap" ifaceName ||
lib.hasPrefix "vti" ifaceName;
# Extract DNS information from a single host configuration
# Returns null if host should not be included in DNS
extractHostDNS = name: hostConfig:
let
cfg = hostConfig.config;
# Handle cases where homelab module might not be imported
dnsConfig = (cfg.homelab or { }).dns or { enable = true; cnames = [ ]; };
hostname = cfg.networking.hostName;
networks = cfg.systemd.network.networks or { };
# Filter out VPN interfaces and find networks with static addresses
# Check matchConfig.Name instead of network unit name (which can have prefixes like "40-")
physicalNetworks = lib.filterAttrs
(netName: netCfg:
let
ifaceName = netCfg.matchConfig.Name or "";
in
!(isVpnInterface ifaceName) && (netCfg.address or [ ]) != [ ])
networks;
# Get addresses from physical networks only
networkAddresses = lib.flatten (
lib.mapAttrsToList
(netName: netCfg: netCfg.address or [ ])
physicalNetworks
);
# Get the first address, if any
firstAddress = if networkAddresses != [ ] then builtins.head networkAddresses else null;
# Check if host uses DHCP (no static address)
usesDHCP = firstAddress == null ||
lib.any
(netName: (networks.${netName}.networkConfig.DHCP or "no") != "no")
(lib.attrNames networks);
in
if !(dnsConfig.enable or true) || firstAddress == null then
null
else
{
inherit hostname;
ip = extractIP firstAddress;
cnames = dnsConfig.cnames or [ ];
};
# Generate A record line
generateARecord = hostname: ip:
"${rightPad 20 hostname}IN A ${ip}";
# Generate CNAME record line
generateCNAME = alias: target:
"${rightPad 20 alias}IN CNAME ${target}";
# Generate zone file from flake configurations and external hosts
generateZone =
{ self
, externalHosts
, serial
, domain ? "home.2rjus.net"
, ttl ? 1800
, refresh ? 3600
, retry ? 900
, expire ? 1209600
, minTtl ? 120
, nameservers ? [ "ns1" "ns2" "ns3" ]
, adminEmail ? "admin.test.2rjus.net"
}:
let
# Extract DNS info from all flake hosts
nixosConfigs = self.nixosConfigurations or { };
hostDNSList = lib.filter (x: x != null) (
lib.mapAttrsToList extractHostDNS nixosConfigs
);
# Sort hosts by IP for consistent output
sortedHosts = lib.sort (a: b: a.ip < b.ip) hostDNSList;
# Generate A records for flake hosts
flakeARecords = lib.concatMapStringsSep "\n" (host:
generateARecord host.hostname host.ip
) sortedHosts;
# Generate CNAMEs for flake hosts
flakeCNAMEs = lib.concatMapStringsSep "\n" (host:
lib.concatMapStringsSep "\n" (cname:
generateCNAME cname host.hostname
) host.cnames
) (lib.filter (h: h.cnames != [ ]) sortedHosts);
# Generate A records for external hosts
externalARecords = lib.concatStringsSep "\n" (
lib.mapAttrsToList (name: ip:
generateARecord name ip
) (externalHosts.aRecords or { })
);
# Generate CNAMEs for external hosts
externalCNAMEs = lib.concatStringsSep "\n" (
lib.mapAttrsToList (alias: target:
generateCNAME alias target
) (externalHosts.cnames or { })
);
# NS records
nsRecords = lib.concatMapStringsSep "\n" (ns:
" IN NS ${ns}.${domain}."
) nameservers;
# SOA record
soa = ''
$ORIGIN ${domain}.
$TTL ${toString ttl}
@ IN SOA ns1.${domain}. ${adminEmail}. (
${toString serial} ; serial number
${toString refresh} ; refresh
${toString retry} ; retry
${toString expire} ; expire
${toString minTtl} ; ttl
)'';
in
lib.concatStringsSep "\n\n" (lib.filter (s: s != "") [
soa
nsRecords
"; Flake-managed hosts (auto-generated)"
flakeARecords
(if flakeCNAMEs != "" then "; Flake-managed CNAMEs\n${flakeCNAMEs}" else "")
"; External hosts (not managed by this flake)"
externalARecords
(if externalCNAMEs != "" then "; External CNAMEs\n${externalCNAMEs}" else "")
""
]);
in
{
inherit extractIP extractHostDNS generateARecord generateCNAME generateZone;
}

145
lib/monitoring.nix Normal file
View File

@@ -0,0 +1,145 @@
{ lib }:
let
# Extract IP address from CIDR notation (e.g., "10.69.13.5/24" -> "10.69.13.5")
extractIP = address:
let
parts = lib.splitString "/" address;
in
builtins.head parts;
# Check if a network interface name looks like a VPN/tunnel interface
isVpnInterface = ifaceName:
lib.hasPrefix "wg" ifaceName ||
lib.hasPrefix "tun" ifaceName ||
lib.hasPrefix "tap" ifaceName ||
lib.hasPrefix "vti" ifaceName;
# Extract monitoring info from a single host configuration
# Returns null if host should not be included
extractHostMonitoring = name: hostConfig:
let
cfg = hostConfig.config;
monConfig = (cfg.homelab or { }).monitoring or { enable = true; scrapeTargets = [ ]; };
dnsConfig = (cfg.homelab or { }).dns or { enable = true; };
hostname = cfg.networking.hostName;
networks = cfg.systemd.network.networks or { };
# Filter out VPN interfaces and find networks with static addresses
physicalNetworks = lib.filterAttrs
(netName: netCfg:
let
ifaceName = netCfg.matchConfig.Name or "";
in
!(isVpnInterface ifaceName) && (netCfg.address or [ ]) != [ ])
networks;
# Get addresses from physical networks only
networkAddresses = lib.flatten (
lib.mapAttrsToList
(netName: netCfg: netCfg.address or [ ])
physicalNetworks
);
firstAddress = if networkAddresses != [ ] then builtins.head networkAddresses else null;
in
if !(monConfig.enable or true) || !(dnsConfig.enable or true) || firstAddress == null then
null
else
{
inherit hostname;
ip = extractIP firstAddress;
scrapeTargets = monConfig.scrapeTargets or [ ];
};
# Generate node-exporter targets from all flake hosts
generateNodeExporterTargets = self: externalTargets:
let
nixosConfigs = self.nixosConfigurations or { };
hostList = lib.filter (x: x != null) (
lib.mapAttrsToList extractHostMonitoring nixosConfigs
);
flakeTargets = map (host: "${host.hostname}.home.2rjus.net:9100") hostList;
in
flakeTargets ++ (externalTargets.nodeExporter or [ ]);
# Generate scrape configs from all flake hosts and external targets
generateScrapeConfigs = self: externalTargets:
let
nixosConfigs = self.nixosConfigurations or { };
hostList = lib.filter (x: x != null) (
lib.mapAttrsToList extractHostMonitoring nixosConfigs
);
# Collect all scrapeTargets from all hosts, grouped by job_name
allTargets = lib.flatten (map
(host:
map
(target: {
inherit (target) job_name port metrics_path scheme scrape_interval honor_labels;
hostname = host.hostname;
})
host.scrapeTargets
)
hostList
);
# Group targets by job_name
grouped = lib.groupBy (t: t.job_name) allTargets;
# Generate a scrape config for each job
flakeScrapeConfigs = lib.mapAttrsToList
(jobName: targets:
let
first = builtins.head targets;
targetAddrs = map
(t:
let
portStr = toString t.port;
in
"${t.hostname}.home.2rjus.net:${portStr}")
targets;
config = {
job_name = jobName;
static_configs = [{
targets = targetAddrs;
}];
}
// (lib.optionalAttrs (first.metrics_path != "/metrics") {
metrics_path = first.metrics_path;
})
// (lib.optionalAttrs (first.scheme != "http") {
scheme = first.scheme;
})
// (lib.optionalAttrs (first.scrape_interval != null) {
scrape_interval = first.scrape_interval;
})
// (lib.optionalAttrs first.honor_labels {
honor_labels = true;
});
in
config
)
grouped;
# External scrape configs
externalScrapeConfigs = map
(ext: {
job_name = ext.job_name;
static_configs = [{
targets = ext.targets;
}];
} // (lib.optionalAttrs (ext ? metrics_path) {
metrics_path = ext.metrics_path;
}) // (lib.optionalAttrs (ext ? scheme) {
scheme = ext.scheme;
}) // (lib.optionalAttrs (ext ? scrape_interval) {
scrape_interval = ext.scrape_interval;
}))
(externalTargets.scrapeConfigs or [ ]);
in
flakeScrapeConfigs ++ externalScrapeConfigs;
in
{
inherit extractHostMonitoring generateNodeExporterTargets generateScrapeConfigs;
}

View File

@@ -0,0 +1,7 @@
{ ... }:
{
imports = [
./dns.nix
./monitoring.nix
];
}

20
modules/homelab/dns.nix Normal file
View File

@@ -0,0 +1,20 @@
{ config, lib, ... }:
let
cfg = config.homelab.dns;
in
{
options.homelab.dns = {
enable = lib.mkOption {
type = lib.types.bool;
default = true;
description = "Include this host in DNS zone generation";
};
cnames = lib.mkOption {
type = lib.types.listOf lib.types.str;
default = [ ];
description = "CNAME records pointing to this host";
example = [ "web" "api" ];
};
};
}

View File

@@ -0,0 +1,50 @@
{ config, lib, ... }:
let
cfg = config.homelab.monitoring;
in
{
options.homelab.monitoring = {
enable = lib.mkOption {
type = lib.types.bool;
default = true;
description = "Include this host in Prometheus node-exporter scrape targets";
};
scrapeTargets = lib.mkOption {
type = lib.types.listOf (lib.types.submodule {
options = {
job_name = lib.mkOption {
type = lib.types.str;
description = "Prometheus scrape job name";
};
port = lib.mkOption {
type = lib.types.port;
description = "Port to scrape metrics from";
};
metrics_path = lib.mkOption {
type = lib.types.str;
default = "/metrics";
description = "HTTP path to scrape metrics from";
};
scheme = lib.mkOption {
type = lib.types.str;
default = "http";
description = "HTTP scheme (http or https)";
};
scrape_interval = lib.mkOption {
type = lib.types.nullOr lib.types.str;
default = null;
description = "Override the global scrape interval for this target";
};
honor_labels = lib.mkOption {
type = lib.types.bool;
default = false;
description = "Whether to honor labels from the scraped target";
};
};
});
default = [ ];
description = "Additional Prometheus scrape targets exposed by this host";
};
};
}

View File

@@ -1,5 +1,6 @@
"""CLI tool for generating NixOS host configurations."""
import shutil
import sys
from pathlib import Path
from typing import Optional
@@ -9,9 +10,18 @@ from rich.console import Console
from rich.panel import Panel
from rich.table import Table
from generators import generate_host_files
from manipulators import update_flake_nix, update_terraform_vms
from generators import generate_host_files, generate_vault_terraform
from manipulators import (
update_flake_nix,
update_terraform_vms,
add_wrapped_token_to_vm,
remove_from_flake_nix,
remove_from_terraform_vms,
remove_from_vault_terraform,
check_entries_exist,
)
from models import HostConfig
from vault_helper import generate_wrapped_token
from validators import (
validate_hostname_format,
validate_hostname_unique,
@@ -45,7 +55,10 @@ def main(
memory: int = typer.Option(2048, "--memory", help="Memory in MB"),
disk: str = typer.Option("20G", "--disk", help="Disk size (e.g., 20G, 50G, 100G)"),
dry_run: bool = typer.Option(False, "--dry-run", help="Preview changes without creating files"),
force: bool = typer.Option(False, "--force", help="Overwrite existing host configuration"),
force: bool = typer.Option(False, "--force", help="Overwrite existing host configuration / skip confirmation for removal"),
skip_vault: bool = typer.Option(False, "--skip-vault", help="Skip Vault configuration and token generation"),
regenerate_token: bool = typer.Option(False, "--regenerate-token", help="Only regenerate Vault wrapped token (no other changes)"),
remove: bool = typer.Option(False, "--remove", help="Remove host configuration and terraform entries"),
) -> None:
"""
Create a new NixOS host configuration.
@@ -58,6 +71,56 @@ def main(
ctx.get_help()
sys.exit(1)
# Get repository root
repo_root = get_repo_root()
# Handle removal mode
if remove:
handle_remove(hostname, repo_root, dry_run, force, ip, cpu, memory, disk, skip_vault, regenerate_token)
return
# Handle token regeneration mode
if regenerate_token:
# Validate that incompatible options aren't used
if force or dry_run or skip_vault:
console.print("[bold red]Error:[/bold red] --regenerate-token cannot be used with --force, --dry-run, or --skip-vault\n")
sys.exit(1)
if ip or cpu != 2 or memory != 2048 or disk != "20G":
console.print("[bold red]Error:[/bold red] --regenerate-token only regenerates the token. Other options (--ip, --cpu, --memory, --disk) are ignored.\n")
console.print("[yellow]Tip:[/yellow] Use without those options, or use --force to update the entire configuration.\n")
sys.exit(1)
try:
console.print(f"\n[bold blue]Regenerating Vault token for {hostname}...[/bold blue]")
# Validate hostname exists
host_dir = repo_root / "hosts" / hostname
if not host_dir.exists():
console.print(f"[bold red]Error:[/bold red] Host {hostname} does not exist")
console.print(f"Host directory not found: {host_dir}")
sys.exit(1)
# Generate new wrapped token
wrapped_token = generate_wrapped_token(hostname, repo_root)
# Update only the wrapped token in vms.tf
add_wrapped_token_to_vm(hostname, wrapped_token, repo_root)
console.print("[green]✓[/green] Regenerated and updated wrapped token in terraform/vms.tf")
console.print("\n[bold green]✓ Token regenerated successfully![/bold green]")
console.print(f"\n[yellow]⚠️[/yellow] Token expires in 24 hours")
console.print(f"[yellow]⚠️[/yellow] Deploy the VM within 24h or regenerate token again\n")
console.print("[bold cyan]Next steps:[/bold cyan]")
console.print(f" cd terraform && tofu apply")
console.print(f" # Then redeploy VM to pick up new token\n")
return
except Exception as e:
console.print(f"\n[bold red]Error regenerating token:[/bold red] {e}\n")
sys.exit(1)
try:
# Build configuration
config = HostConfig(
@@ -68,9 +131,6 @@ def main(
disk=disk,
)
# Get repository root
repo_root = get_repo_root()
# Validate configuration
console.print("\n[bold blue]Validating configuration...[/bold blue]")
@@ -116,11 +176,34 @@ def main(
update_terraform_vms(config, repo_root, force=force)
console.print("[green]✓[/green] Updated terraform/vms.tf")
# Generate Vault configuration if not skipped
if not skip_vault:
console.print("\n[bold blue]Configuring Vault integration...[/bold blue]")
try:
# Generate Vault Terraform configuration
generate_vault_terraform(hostname, repo_root)
console.print("[green]✓[/green] Updated terraform/vault/hosts-generated.tf")
# Generate wrapped token
wrapped_token = generate_wrapped_token(hostname, repo_root)
# Add wrapped token to VM configuration
add_wrapped_token_to_vm(hostname, wrapped_token, repo_root)
console.print("[green]✓[/green] Added wrapped token to terraform/vms.tf")
except Exception as e:
console.print(f"\n[yellow]⚠️ Vault configuration failed: {e}[/yellow]")
console.print("[yellow]Host configuration created without Vault integration[/yellow]")
console.print("[yellow]You can add Vault support later by re-running with --force[/yellow]\n")
else:
console.print("\n[yellow]Skipped Vault configuration (--skip-vault)[/yellow]")
# Success message
console.print("\n[bold green]✓ Host configuration generated successfully![/bold green]\n")
# Display next steps
display_next_steps(hostname)
display_next_steps(hostname, skip_vault=skip_vault)
except ValueError as e:
console.print(f"\n[bold red]Error:[/bold red] {e}\n", style="red")
@@ -130,6 +213,166 @@ def main(
sys.exit(1)
def handle_remove(
hostname: str,
repo_root: Path,
dry_run: bool,
force: bool,
ip: Optional[str],
cpu: int,
memory: int,
disk: str,
skip_vault: bool,
regenerate_token: bool,
) -> None:
"""Handle the --remove workflow."""
# Validate --remove isn't used with create options
incompatible_options = []
if ip:
incompatible_options.append("--ip")
if cpu != 2:
incompatible_options.append("--cpu")
if memory != 2048:
incompatible_options.append("--memory")
if disk != "20G":
incompatible_options.append("--disk")
if skip_vault:
incompatible_options.append("--skip-vault")
if regenerate_token:
incompatible_options.append("--regenerate-token")
if incompatible_options:
console.print(
f"[bold red]Error:[/bold red] --remove cannot be used with: {', '.join(incompatible_options)}\n"
)
sys.exit(1)
# Validate hostname exists (host directory must exist)
host_dir = repo_root / "hosts" / hostname
if not host_dir.exists():
console.print(f"[bold red]Error:[/bold red] Host {hostname} does not exist")
console.print(f"Host directory not found: {host_dir}")
sys.exit(1)
# Check what entries exist
flake_exists, terraform_exists, vault_exists = check_entries_exist(hostname, repo_root)
# Collect all files in the host directory recursively
files_in_host_dir = sorted([f for f in host_dir.rglob("*") if f.is_file()])
# Check for secrets directory
secrets_dir = repo_root / "secrets" / hostname
secrets_exist = secrets_dir.exists()
# Display summary
if dry_run:
console.print("\n[yellow][DRY RUN - No changes will be made][/yellow]\n")
console.print(f"\n[bold blue]Removing host: {hostname}[/bold blue]\n")
# Show host directory contents
console.print("[bold]Directory to be deleted (and all contents):[/bold]")
console.print(f" • hosts/{hostname}/")
for f in files_in_host_dir:
rel_path = f.relative_to(host_dir)
console.print(f" - {rel_path}")
# Show entries to be removed
console.print("\n[bold]Entries to be removed:[/bold]")
if flake_exists:
console.print(f" • flake.nix (nixosConfigurations.{hostname})")
else:
console.print(f" • flake.nix [dim](not found)[/dim]")
if terraform_exists:
console.print(f' • terraform/vms.tf (locals.vms["{hostname}"])')
else:
console.print(f" • terraform/vms.tf [dim](not found)[/dim]")
if vault_exists:
console.print(f' • terraform/vault/hosts-generated.tf (generated_host_policies["{hostname}"])')
else:
console.print(f" • terraform/vault/hosts-generated.tf [dim](not found)[/dim]")
# Warn about secrets directory
if secrets_exist:
console.print(f"\n[yellow]⚠️ Warning: secrets/{hostname}/ directory exists and will NOT be deleted[/yellow]")
console.print(f" Manually remove if no longer needed: [white]rm -rf secrets/{hostname}/[/white]")
console.print(f" Also update .sops.yaml to remove the host's age key")
# Exit if dry run
if dry_run:
console.print("\n[yellow][DRY RUN - No changes made][/yellow]\n")
return
# Prompt for confirmation unless --force
if not force:
console.print("")
confirm = typer.confirm("Proceed with removal?", default=False)
if not confirm:
console.print("\n[yellow]Removal cancelled[/yellow]\n")
sys.exit(0)
# Perform removal
console.print("\n[bold blue]Removing host configuration...[/bold blue]")
# Remove from terraform/vault/hosts-generated.tf
if vault_exists:
if remove_from_vault_terraform(hostname, repo_root):
console.print("[green]✓[/green] Removed from terraform/vault/hosts-generated.tf")
else:
console.print("[yellow]⚠[/yellow] Could not remove from terraform/vault/hosts-generated.tf")
# Remove from terraform/vms.tf
if terraform_exists:
if remove_from_terraform_vms(hostname, repo_root):
console.print("[green]✓[/green] Removed from terraform/vms.tf")
else:
console.print("[yellow]⚠[/yellow] Could not remove from terraform/vms.tf")
# Remove from flake.nix
if flake_exists:
if remove_from_flake_nix(hostname, repo_root):
console.print("[green]✓[/green] Removed from flake.nix")
else:
console.print("[yellow]⚠[/yellow] Could not remove from flake.nix")
# Delete hosts/<hostname>/ directory
shutil.rmtree(host_dir)
console.print(f"[green]✓[/green] Deleted hosts/{hostname}/")
# Success message
console.print(f"\n[bold green]✓ Host {hostname} removed successfully![/bold green]\n")
# Display next steps
display_removal_next_steps(hostname, vault_exists)
def display_removal_next_steps(hostname: str, had_vault: bool) -> None:
"""Display next steps after successful removal."""
vault_file = " terraform/vault/hosts-generated.tf" if had_vault else ""
vault_apply = ""
if had_vault:
vault_apply = f"""
3. Apply Vault changes:
[white]cd terraform/vault && tofu apply[/white]
"""
next_steps = f"""[bold cyan]Next Steps:[/bold cyan]
1. Review changes:
[white]git diff[/white]
2. If VM exists in Proxmox, destroy it first:
[white]cd terraform && tofu destroy -target='proxmox_vm_qemu.vm["{hostname}"]'[/white]
{vault_apply}
4. Commit changes:
[white]git add -u hosts/{hostname} flake.nix terraform/vms.tf{vault_file}
git commit -m "hosts: remove {hostname}"[/white]
"""
console.print(Panel(next_steps, border_style="cyan"))
def display_config_summary(config: HostConfig) -> None:
"""Display configuration summary table."""
table = Table(title="Host Configuration", show_header=False)
@@ -164,8 +407,18 @@ def display_dry_run_summary(config: HostConfig, repo_root: Path) -> None:
console.print(f"{repo_root}/terraform/vms.tf (add VM definition)")
def display_next_steps(hostname: str) -> None:
def display_next_steps(hostname: str, skip_vault: bool = False) -> None:
"""Display next steps after successful generation."""
vault_files = "" if skip_vault else " terraform/vault/hosts-generated.tf"
vault_apply = ""
if not skip_vault:
vault_apply = """
4a. Apply Vault configuration:
[white]cd terraform/vault
tofu apply[/white]
"""
next_steps = f"""[bold cyan]Next Steps:[/bold cyan]
1. Review changes:
@@ -181,14 +434,16 @@ def display_next_steps(hostname: str) -> None:
tofu plan[/white]
4. Commit changes:
[white]git add hosts/{hostname} flake.nix terraform/vms.tf
[white]git add hosts/{hostname} flake.nix terraform/vms.tf{vault_files}
git commit -m "hosts: add {hostname} configuration"[/white]
5. Deploy VM (after merging to master):
{vault_apply}
5. Deploy VM (after merging to master or within 24h of token generation):
[white]cd terraform
tofu apply[/white]
6. Bootstrap the host (see Phase 3 of deployment pipeline)
6. Host will bootstrap automatically on first boot
- Wrapped token expires in 24 hours
- If expired, re-run: create-host --hostname {hostname} --force
"""
console.print(Panel(next_steps, border_style="cyan"))

View File

@@ -19,6 +19,7 @@ python3Packages.buildPythonApplication {
typer
jinja2
rich
hvac # Python Vault/OpenBao client library
];
# Install templates to share directory

View File

@@ -86,3 +86,114 @@ def generate_host_files(config: HostConfig, repo_root: Path) -> None:
state_version=config.state_version,
)
(host_dir / "configuration.nix").write_text(config_content)
def generate_vault_terraform(hostname: str, repo_root: Path) -> None:
"""
Generate or update Vault Terraform configuration for a new host.
Creates/updates terraform/vault/hosts-generated.tf with:
- Host policy granting access to hosts/<hostname>/* secrets
- AppRole configuration for the host
- Placeholder secret entry (user adds actual secrets separately)
Args:
hostname: Hostname for the new host
repo_root: Path to repository root
"""
vault_tf_path = repo_root / "terraform" / "vault" / "hosts-generated.tf"
# Read existing file if it exists, otherwise start with empty structure
if vault_tf_path.exists():
content = vault_tf_path.read_text()
else:
# Create initial file structure
content = """# WARNING: Auto-generated by create-host tool
# Manual edits will be overwritten when create-host is run
# Generated host policies
# Each host gets access to its own secrets under hosts/<hostname>/*
locals {
generated_host_policies = {
}
# Placeholder secrets - user should add actual secrets manually or via tofu
generated_secrets = {
}
}
# Create policies for generated hosts
resource "vault_policy" "generated_host_policies" {
for_each = local.generated_host_policies
name = "host-\${each.key}"
policy = <<-EOT
# Allow host to read its own secrets
%{for path in each.value.paths~}
path "${path}" {
capabilities = ["read", "list"]
}
%{endfor~}
EOT
}
# Create AppRoles for generated hosts
resource "vault_approle_auth_backend_role" "generated_hosts" {
for_each = local.generated_host_policies
backend = vault_auth_backend.approle.path
role_name = each.key
token_policies = ["host-\${each.key}"]
secret_id_ttl = 0 # Never expire (wrapped tokens provide time limit)
token_ttl = 3600
token_max_ttl = 3600
secret_id_num_uses = 0 # Unlimited uses
}
"""
# Parse existing policies from the file
import re
policies_match = re.search(
r'generated_host_policies = \{(.*?)\n \}',
content,
re.DOTALL
)
if policies_match:
policies_content = policies_match.group(1)
else:
policies_content = ""
# Check if hostname already exists
if f'"{hostname}"' in policies_content:
# Already exists, don't duplicate
return
# Add new policy entry
new_policy = f'''
"{hostname}" = {{
paths = [
"secret/data/hosts/{hostname}/*",
]
}}'''
# Insert before the closing brace
if policies_content.strip():
# There are existing entries, add after them
new_policies_content = policies_content.rstrip() + new_policy + "\n "
else:
# First entry
new_policies_content = new_policy + "\n "
# Replace the policies map
new_content = re.sub(
r'(generated_host_policies = \{)(.*?)(\n \})',
rf'\1{new_policies_content}\3',
content,
flags=re.DOTALL
)
# Write the updated file
vault_tf_path.write_text(new_content)

View File

@@ -2,10 +2,138 @@
import re
from pathlib import Path
from typing import Tuple
from models import HostConfig
def remove_from_flake_nix(hostname: str, repo_root: Path) -> bool:
"""
Remove host entry from flake.nix nixosConfigurations.
Args:
hostname: Hostname to remove
repo_root: Path to repository root
Returns:
True if found and removed, False if not found
"""
flake_path = repo_root / "flake.nix"
content = flake_path.read_text()
# Check if hostname exists
hostname_pattern = rf"^ {re.escape(hostname)} = nixpkgs\.lib\.nixosSystem"
if not re.search(hostname_pattern, content, re.MULTILINE):
return False
# Match the entire block from "hostname = " to "};"
replace_pattern = rf"^ {re.escape(hostname)} = nixpkgs\.lib\.nixosSystem \{{.*?^ \}};\n"
new_content, count = re.subn(replace_pattern, "", content, flags=re.MULTILINE | re.DOTALL)
if count == 0:
return False
flake_path.write_text(new_content)
return True
def remove_from_terraform_vms(hostname: str, repo_root: Path) -> bool:
"""
Remove VM entry from terraform/vms.tf locals.vms map.
Args:
hostname: Hostname to remove
repo_root: Path to repository root
Returns:
True if found and removed, False if not found
"""
terraform_path = repo_root / "terraform" / "vms.tf"
content = terraform_path.read_text()
# Check if hostname exists
hostname_pattern = rf'^\s+"{re.escape(hostname)}" = \{{'
if not re.search(hostname_pattern, content, re.MULTILINE):
return False
# Match the entire block from "hostname" = { to }
replace_pattern = rf'^\s+"{re.escape(hostname)}" = \{{.*?^\s+\}}\n'
new_content, count = re.subn(replace_pattern, "", content, flags=re.MULTILINE | re.DOTALL)
if count == 0:
return False
terraform_path.write_text(new_content)
return True
def remove_from_vault_terraform(hostname: str, repo_root: Path) -> bool:
"""
Remove host policy from terraform/vault/hosts-generated.tf.
Args:
hostname: Hostname to remove
repo_root: Path to repository root
Returns:
True if found and removed, False if not found
"""
vault_tf_path = repo_root / "terraform" / "vault" / "hosts-generated.tf"
if not vault_tf_path.exists():
return False
content = vault_tf_path.read_text()
# Check if hostname exists in the policies
if f'"{hostname}"' not in content:
return False
# Match the host entry block within generated_host_policies
# Pattern matches: "hostname" = { ... } with possible trailing newlines
replace_pattern = rf'\s*"{re.escape(hostname)}" = \{{\s*paths = \[.*?\]\s*\}}\n?'
new_content, count = re.subn(replace_pattern, "", content, flags=re.DOTALL)
if count == 0:
return False
vault_tf_path.write_text(new_content)
return True
def check_entries_exist(hostname: str, repo_root: Path) -> Tuple[bool, bool, bool]:
"""
Check which entries exist for a hostname.
Args:
hostname: Hostname to check
repo_root: Path to repository root
Returns:
Tuple of (flake_exists, terraform_vms_exists, vault_exists)
"""
# Check flake.nix
flake_path = repo_root / "flake.nix"
flake_content = flake_path.read_text()
flake_pattern = rf"^ {re.escape(hostname)} = nixpkgs\.lib\.nixosSystem"
flake_exists = bool(re.search(flake_pattern, flake_content, re.MULTILINE))
# Check terraform/vms.tf
terraform_path = repo_root / "terraform" / "vms.tf"
terraform_content = terraform_path.read_text()
terraform_pattern = rf'^\s+"{re.escape(hostname)}" = \{{'
terraform_exists = bool(re.search(terraform_pattern, terraform_content, re.MULTILINE))
# Check terraform/vault/hosts-generated.tf
vault_tf_path = repo_root / "terraform" / "vault" / "hosts-generated.tf"
vault_exists = False
if vault_tf_path.exists():
vault_content = vault_tf_path.read_text()
vault_exists = f'"{hostname}"' in vault_content
return (flake_exists, terraform_exists, vault_exists)
def update_flake_nix(config: HostConfig, repo_root: Path, force: bool = False) -> None:
"""
Add or update host entry in flake.nix nixosConfigurations.
@@ -122,3 +250,63 @@ def update_terraform_vms(config: HostConfig, repo_root: Path, force: bool = Fals
)
terraform_path.write_text(new_content)
def add_wrapped_token_to_vm(hostname: str, wrapped_token: str, repo_root: Path) -> None:
"""
Add or update the vault_wrapped_token field in an existing VM entry.
Args:
hostname: Hostname of the VM
wrapped_token: The wrapped token to add
repo_root: Path to repository root
"""
terraform_path = repo_root / "terraform" / "vms.tf"
content = terraform_path.read_text()
# Find the VM entry
hostname_pattern = rf'^\s+"{re.escape(hostname)}" = \{{'
match = re.search(hostname_pattern, content, re.MULTILINE)
if not match:
raise ValueError(f"Could not find VM entry for {hostname} in terraform/vms.tf")
# Find the full VM block
block_pattern = rf'(^\s+"{re.escape(hostname)}" = \{{)(.*?)(^\s+\}})'
block_match = re.search(block_pattern, content, re.MULTILINE | re.DOTALL)
if not block_match:
raise ValueError(f"Could not parse VM block for {hostname}")
block_start = block_match.group(1)
block_content = block_match.group(2)
block_end = block_match.group(3)
# Check if vault_wrapped_token already exists
if "vault_wrapped_token" in block_content:
# Update existing token
block_content = re.sub(
r'vault_wrapped_token\s*=\s*"[^"]*"',
f'vault_wrapped_token = "{wrapped_token}"',
block_content
)
else:
# Add new token field (add before closing brace)
# Find the last field and add after it
block_content = block_content.rstrip()
if block_content and not block_content.endswith("\n"):
block_content += "\n"
block_content += f' vault_wrapped_token = "{wrapped_token}"\n'
# Reconstruct the block
new_block = block_start + block_content + block_end
# Replace in content
new_content = re.sub(
rf'^\s+"{re.escape(hostname)}" = \{{.*?^\s+\}}',
new_block,
content,
flags=re.MULTILINE | re.DOTALL
)
terraform_path.write_text(new_content)

View File

@@ -14,6 +14,7 @@ setup(
"validators",
"generators",
"manipulators",
"vault_helper",
],
include_package_data=True,
data_files=[
@@ -23,6 +24,7 @@ setup(
"typer",
"jinja2",
"rich",
"hvac",
],
entry_points={
"console_scripts": [

View File

@@ -0,0 +1,178 @@
"""Helper functions for Vault/OpenBao API interactions."""
import os
import subprocess
from pathlib import Path
from typing import Optional
import hvac
import typer
def get_vault_client(vault_addr: Optional[str] = None, vault_token: Optional[str] = None) -> hvac.Client:
"""
Get a Vault client instance.
Args:
vault_addr: Vault server address (defaults to BAO_ADDR env var or hardcoded default)
vault_token: Vault token (defaults to BAO_TOKEN env var or prompts user)
Returns:
Configured hvac.Client instance
Raises:
typer.Exit: If unable to create client or authenticate
"""
# Get Vault address
if vault_addr is None:
vault_addr = os.getenv("BAO_ADDR", "https://vault01.home.2rjus.net:8200")
# Get Vault token
if vault_token is None:
vault_token = os.getenv("BAO_TOKEN")
if not vault_token:
typer.echo("\n⚠️ Vault token required. Set BAO_TOKEN environment variable or enter it below.")
vault_token = typer.prompt("Vault token (BAO_TOKEN)", hide_input=True)
# Create client
try:
client = hvac.Client(url=vault_addr, token=vault_token, verify=False)
# Verify authentication
if not client.is_authenticated():
typer.echo(f"\n❌ Failed to authenticate to Vault at {vault_addr}", err=True)
typer.echo("Check your BAO_TOKEN and ensure Vault is accessible", err=True)
raise typer.Exit(code=1)
return client
except Exception as e:
typer.echo(f"\n❌ Error connecting to Vault: {e}", err=True)
raise typer.Exit(code=1)
def generate_wrapped_token(hostname: str, repo_root: Path) -> str:
"""
Generate a wrapped token containing AppRole credentials for a host.
This function:
1. Applies Terraform to ensure the AppRole exists
2. Reads the role_id for the host
3. Generates a secret_id
4. Wraps both credentials in a cubbyhole token (24h TTL, single-use)
Args:
hostname: The host to generate credentials for
repo_root: Path to repository root (for running terraform)
Returns:
Wrapped token string (hvs.CAES...)
Raises:
typer.Exit: If Terraform fails or Vault operations fail
"""
from rich.console import Console
console = Console()
# Get Vault client
client = get_vault_client()
# First, apply Terraform to ensure AppRole exists
console.print(f"\n[bold blue]Applying Vault Terraform configuration...[/bold blue]")
terraform_dir = repo_root / "terraform" / "vault"
try:
result = subprocess.run(
["tofu", "apply", "-auto-approve"],
cwd=terraform_dir,
capture_output=True,
text=True,
check=False,
)
if result.returncode != 0:
console.print(f"[red]❌ Terraform apply failed:[/red]")
console.print(result.stderr)
raise typer.Exit(code=1)
console.print("[green]✓[/green] Terraform applied successfully")
except FileNotFoundError:
console.print(f"[red]❌ Error: 'tofu' command not found[/red]")
console.print("Ensure OpenTofu is installed and in PATH")
raise typer.Exit(code=1)
# Read role_id
try:
console.print(f"[bold blue]Reading AppRole credentials for {hostname}...[/bold blue]")
role_id_response = client.read(f"auth/approle/role/{hostname}/role-id")
role_id = role_id_response["data"]["role_id"]
console.print(f"[green]✓[/green] Retrieved role_id")
except Exception as e:
console.print(f"[red]❌ Failed to read role_id for {hostname}:[/red] {e}")
console.print(f"\nEnsure the AppRole '{hostname}' exists in Vault")
raise typer.Exit(code=1)
# Generate secret_id
try:
secret_id_response = client.write(f"auth/approle/role/{hostname}/secret-id")
secret_id = secret_id_response["data"]["secret_id"]
console.print(f"[green]✓[/green] Generated secret_id")
except Exception as e:
console.print(f"[red]❌ Failed to generate secret_id:[/red] {e}")
raise typer.Exit(code=1)
# Wrap the credentials in a cubbyhole token
try:
console.print(f"[bold blue]Creating wrapped token (24h TTL, single-use)...[/bold blue]")
# Use the response wrapping feature to wrap our credentials
# This creates a temporary token that can only be used once to retrieve the actual credentials
wrap_response = client.write(
"sys/wrapping/wrap",
wrap_ttl="24h",
# The data we're wrapping
role_id=role_id,
secret_id=secret_id,
)
wrapped_token = wrap_response["wrap_info"]["token"]
console.print(f"[green]✓[/green] Created wrapped token: {wrapped_token[:20]}...")
console.print(f"[yellow]⚠️[/yellow] Token expires in 24 hours")
console.print(f"[yellow]⚠️[/yellow] Token can only be used once")
return wrapped_token
except Exception as e:
console.print(f"[red]❌ Failed to create wrapped token:[/red] {e}")
raise typer.Exit(code=1)
def verify_vault_setup(hostname: str) -> bool:
"""
Verify that Vault is properly configured for a host.
Checks:
- Vault is accessible
- AppRole exists for the hostname
- Can read role_id
Args:
hostname: The host to verify
Returns:
True if everything is configured correctly, False otherwise
"""
try:
client = get_vault_client()
# Try to read the role_id
client.read(f"auth/approle/role/{hostname}/role-id")
return True
except Exception:
return False

View File

@@ -0,0 +1,78 @@
# vault-fetch
A helper script for fetching secrets from OpenBao/Vault and writing them to the filesystem.
## Features
- **AppRole Authentication**: Uses role_id and secret_id from `/var/lib/vault/approle/`
- **Individual Secret Files**: Writes each secret key as a separate file for easy consumption
- **Caching**: Maintains a cache of secrets for fallback when Vault is unreachable
- **Graceful Degradation**: Falls back to cached secrets if Vault authentication fails
- **Secure Permissions**: Sets 600 permissions on all secret files
## Usage
```bash
vault-fetch <secret-path> <output-directory> [cache-directory]
```
### Examples
```bash
# Fetch Grafana admin secrets
vault-fetch hosts/monitoring01/grafana-admin /run/secrets/grafana /var/lib/vault/cache/grafana
# Use default cache location
vault-fetch hosts/monitoring01/grafana-admin /run/secrets/grafana
```
## How It Works
1. **Read Credentials**: Loads `role_id` and `secret_id` from `/var/lib/vault/approle/`
2. **Authenticate**: Calls `POST /v1/auth/approle/login` to get a Vault token
3. **Fetch Secret**: Retrieves secret from `GET /v1/secret/data/{path}`
4. **Extract Keys**: Parses JSON response and extracts individual secret keys
5. **Write Files**: Creates one file per secret key in output directory
6. **Update Cache**: Copies secrets to cache directory for fallback
7. **Set Permissions**: Ensures all files have 600 permissions (owner read/write only)
## Error Handling
If Vault is unreachable or authentication fails:
- Script logs a warning to stderr
- Falls back to cached secrets from previous successful fetch
- Exits with error code 1 if no cache is available
## Environment Variables
- `VAULT_ADDR`: Vault server address (default: `https://vault01.home.2rjus.net:8200`)
- `VAULT_SKIP_VERIFY`: Skip TLS verification (default: `1`)
## Integration with NixOS
This tool is designed to be called from systemd service `ExecStartPre` hooks via the `vault.secrets` NixOS module:
```nix
vault.secrets.grafana-admin = {
secretPath = "hosts/monitoring01/grafana-admin";
};
# Service automatically gets secrets fetched before start
systemd.services.grafana.serviceConfig = {
EnvironmentFile = "/run/secrets/grafana-admin/password";
};
```
## Requirements
- `curl`: For Vault API calls
- `jq`: For JSON parsing
- `coreutils`: For file operations
## Security Considerations
- AppRole credentials stored at `/var/lib/vault/approle/` should be root-owned with 600 permissions
- Tokens are ephemeral and not stored - fresh authentication on each fetch
- Secrets written to tmpfs (`/run/secrets/`) are lost on reboot
- Cache directory persists across reboots for service availability
- All secret files have restrictive permissions (600)

View File

@@ -0,0 +1,18 @@
{ pkgs, lib, ... }:
pkgs.writeShellApplication {
name = "vault-fetch";
runtimeInputs = with pkgs; [
curl # Vault API calls
jq # JSON parsing
coreutils # File operations
];
text = builtins.readFile ./vault-fetch.sh;
meta = with lib; {
description = "Fetch secrets from OpenBao/Vault and write to filesystem";
license = licenses.mit;
};
}

View File

@@ -0,0 +1,152 @@
#!/usr/bin/env bash
set -euo pipefail
# vault-fetch: Fetch secrets from OpenBao/Vault and write to filesystem
#
# Usage: vault-fetch <secret-path> <output-directory> [cache-directory]
#
# Example: vault-fetch hosts/monitoring01/grafana-admin /run/secrets/grafana /var/lib/vault/cache/grafana
#
# This script:
# 1. Authenticates to Vault using AppRole credentials from /var/lib/vault/approle/
# 2. Fetches secrets from the specified path
# 3. Writes each secret key as an individual file in the output directory
# 4. Updates cache for fallback when Vault is unreachable
# 5. Falls back to cache if Vault authentication fails or is unreachable
# Parse arguments
if [ $# -lt 2 ]; then
echo "Usage: vault-fetch <secret-path> <output-directory> [cache-directory]" >&2
echo "Example: vault-fetch hosts/monitoring01/grafana /run/secrets/grafana /var/lib/vault/cache/grafana" >&2
exit 1
fi
SECRET_PATH="$1"
OUTPUT_DIR="$2"
CACHE_DIR="${3:-/var/lib/vault/cache/$(basename "$OUTPUT_DIR")}"
# Vault configuration
VAULT_ADDR="${VAULT_ADDR:-https://vault01.home.2rjus.net:8200}"
VAULT_SKIP_VERIFY="${VAULT_SKIP_VERIFY:-1}"
APPROLE_DIR="/var/lib/vault/approle"
# TLS verification flag for curl
if [ "$VAULT_SKIP_VERIFY" = "1" ]; then
CURL_TLS_FLAG="-k"
else
CURL_TLS_FLAG=""
fi
# Logging helper
log() {
echo "[vault-fetch] $*" >&2
}
# Error handler
error() {
log "ERROR: $*"
exit 1
}
# Check if cache is available
has_cache() {
[ -d "$CACHE_DIR" ] && [ -n "$(ls -A "$CACHE_DIR" 2>/dev/null)" ]
}
# Use cached secrets
use_cache() {
if ! has_cache; then
error "No cache available and Vault is unreachable"
fi
log "WARNING: Using cached secrets from $CACHE_DIR"
mkdir -p "$OUTPUT_DIR"
cp -r "$CACHE_DIR"/* "$OUTPUT_DIR/"
chmod -R u=rw,go= "$OUTPUT_DIR"/*
}
# Fetch secrets from Vault
fetch_from_vault() {
# Read AppRole credentials
if [ ! -f "$APPROLE_DIR/role-id" ] || [ ! -f "$APPROLE_DIR/secret-id" ]; then
log "WARNING: AppRole credentials not found at $APPROLE_DIR"
use_cache
return
fi
ROLE_ID=$(cat "$APPROLE_DIR/role-id")
SECRET_ID=$(cat "$APPROLE_DIR/secret-id")
# Authenticate to Vault
log "Authenticating to Vault at $VAULT_ADDR"
AUTH_RESPONSE=$(curl -s $CURL_TLS_FLAG -X POST \
-d "{\"role_id\":\"$ROLE_ID\",\"secret_id\":\"$SECRET_ID\"}" \
"$VAULT_ADDR/v1/auth/approle/login" 2>&1) || {
log "WARNING: Failed to connect to Vault"
use_cache
return
}
# Check for errors in response
if echo "$AUTH_RESPONSE" | jq -e '.errors' >/dev/null 2>&1; then
ERRORS=$(echo "$AUTH_RESPONSE" | jq -r '.errors[]' 2>/dev/null || echo "Unknown error")
log "WARNING: Vault authentication failed: $ERRORS"
use_cache
return
fi
# Extract token
VAULT_TOKEN=$(echo "$AUTH_RESPONSE" | jq -r '.auth.client_token' 2>/dev/null)
if [ -z "$VAULT_TOKEN" ] || [ "$VAULT_TOKEN" = "null" ]; then
log "WARNING: Failed to extract Vault token from response"
use_cache
return
fi
log "Successfully authenticated to Vault"
# Fetch secret
log "Fetching secret from path: $SECRET_PATH"
SECRET_RESPONSE=$(curl -s $CURL_TLS_FLAG \
-H "X-Vault-Token: $VAULT_TOKEN" \
"$VAULT_ADDR/v1/secret/data/$SECRET_PATH" 2>&1) || {
log "WARNING: Failed to fetch secret from Vault"
use_cache
return
}
# Check for errors
if echo "$SECRET_RESPONSE" | jq -e '.errors' >/dev/null 2>&1; then
ERRORS=$(echo "$SECRET_RESPONSE" | jq -r '.errors[]' 2>/dev/null || echo "Unknown error")
log "WARNING: Failed to fetch secret: $ERRORS"
use_cache
return
fi
# Extract secret data
SECRET_DATA=$(echo "$SECRET_RESPONSE" | jq -r '.data.data' 2>/dev/null)
if [ -z "$SECRET_DATA" ] || [ "$SECRET_DATA" = "null" ]; then
log "WARNING: No secret data found at path $SECRET_PATH"
use_cache
return
fi
# Create output and cache directories
mkdir -p "$OUTPUT_DIR"
mkdir -p "$CACHE_DIR"
# Write each secret key to a separate file
log "Writing secrets to $OUTPUT_DIR"
echo "$SECRET_DATA" | jq -r 'to_entries[] | "\(.key)\n\(.value)"' | while read -r key; read -r value; do
echo -n "$value" > "$OUTPUT_DIR/$key"
echo -n "$value" > "$CACHE_DIR/$key"
chmod 600 "$OUTPUT_DIR/$key"
chmod 600 "$CACHE_DIR/$key"
log " - Wrote secret key: $key"
done
log "Successfully fetched and cached secrets"
}
# Main execution
fetch_from_vault

View File

@@ -1,5 +1,9 @@
{ pkgs, unstable, ... }:
{
homelab.monitoring.scrapeTargets = [{
job_name = "step-ca";
port = 9000;
}];
sops.secrets."ca_root_pw" = {
sopsFile = ../../secrets/ca/secrets.yaml;
owner = "step-ca";

View File

@@ -1,5 +1,11 @@
{ pkgs, config, ... }:
{
homelab.monitoring.scrapeTargets = [{
job_name = "home-assistant";
port = 8123;
metrics_path = "/api/prometheus";
scrape_interval = "60s";
}];
# Enable the Home Assistant service
services.home-assistant = {
enable = true;

View File

@@ -3,4 +3,9 @@
imports = [
./proxy.nix
];
homelab.monitoring.scrapeTargets = [{
job_name = "caddy";
port = 80;
}];
}

View File

@@ -1,5 +1,9 @@
{ pkgs, ... }:
{
homelab.monitoring.scrapeTargets = [{
job_name = "jellyfin";
port = 8096;
}];
services.jellyfin = {
enable = true;
};

View File

@@ -0,0 +1,12 @@
# Monitoring targets for hosts not managed by this flake
# These are manually maintained and combined with auto-generated targets
{
nodeExporter = [
"gunter.home.2rjus.net:9100"
];
scrapeConfigs = [
{ job_name = "smartctl"; targets = [ "gunter.home.2rjus.net:9633" ]; }
{ job_name = "ghettoptt"; targets = [ "gunter.home.2rjus.net:8989" ]; }
{ job_name = "restic_rest"; targets = [ "10.69.12.52:8000" ]; }
];
}

View File

@@ -1,4 +1,11 @@
{ ... }:
{ self, lib, ... }:
let
monLib = import ../../lib/monitoring.nix { inherit lib; };
externalTargets = import ./external-targets.nix;
nodeExporterTargets = monLib.generateNodeExporterTargets self externalTargets;
autoScrapeConfigs = monLib.generateScrapeConfigs self externalTargets;
in
{
services.prometheus = {
enable = true;
@@ -45,26 +52,16 @@
];
scrapeConfigs = [
# Auto-generated node-exporter targets from flake hosts + external
{
job_name = "node-exporter";
static_configs = [
{
targets = [
"ca.home.2rjus.net:9100"
"gunter.home.2rjus.net:9100"
"ha1.home.2rjus.net:9100"
"http-proxy.home.2rjus.net:9100"
"jelly01.home.2rjus.net:9100"
"monitoring01.home.2rjus.net:9100"
"nix-cache01.home.2rjus.net:9100"
"ns1.home.2rjus.net:9100"
"ns2.home.2rjus.net:9100"
"pgdb1.home.2rjus.net:9100"
"nats1.home.2rjus.net:9100"
];
targets = nodeExporterTargets;
}
];
}
# Local monitoring services (not auto-generated)
{
job_name = "prometheus";
static_configs = [
@@ -85,7 +82,7 @@
job_name = "grafana";
static_configs = [
{
targets = [ "localhost:3100" ];
targets = [ "localhost:3000" ];
}
];
}
@@ -98,13 +95,23 @@
];
}
{
job_name = "restic_rest";
job_name = "pushgateway";
honor_labels = true;
static_configs = [
{
targets = [ "10.69.12.52:8000" ];
targets = [ "localhost:9091" ];
}
];
}
{
job_name = "labmon";
static_configs = [
{
targets = [ "monitoring01.home.2rjus.net:9969" ];
}
];
}
# pve-exporter with complex relabel config
{
job_name = "pve-exporter";
static_configs = [
@@ -133,91 +140,8 @@
}
];
}
{
job_name = "caddy";
static_configs = [
{
targets = [ "http-proxy.home.2rjus.net" ];
}
];
}
{
job_name = "jellyfin";
static_configs = [
{
targets = [ "jelly01.home.2rjus.net:8096" ];
}
];
}
{
job_name = "smartctl";
static_configs = [
{
targets = [ "gunter.home.2rjus.net:9633" ];
}
];
}
{
job_name = "wireguard";
static_configs = [
{
targets = [ "http-proxy.home.2rjus.net:9586" ];
}
];
}
{
job_name = "home-assistant";
scrape_interval = "60s";
metrics_path = "/api/prometheus";
static_configs = [
{
targets = [ "ha1.home.2rjus.net:8123" ];
}
];
}
{
job_name = "ghettoptt";
static_configs = [
{
targets = [ "gunter.home.2rjus.net:8989" ];
}
];
}
{
job_name = "step-ca";
static_configs = [
{
targets = [ "ca.home.2rjus.net:9000" ];
}
];
}
{
job_name = "labmon";
static_configs = [
{
targets = [ "monitoring01.home.2rjus.net:9969" ];
}
];
}
{
job_name = "pushgateway";
honor_labels = true;
static_configs = [
{
targets = [ "localhost:9091" ];
}
];
}
{
job_name = "nix-cache_caddy";
scheme = "https";
static_configs = [
{
targets = [ "nix-cache.home.2rjus.net" ];
}
];
}
];
] ++ autoScrapeConfigs;
pushgateway = {
enable = true;
web = {

View File

@@ -57,6 +57,38 @@ groups:
annotations:
summary: "Promtail service not running on {{ $labels.instance }}"
description: "The promtail service has not been active on {{ $labels.instance }} for 5 minutes."
- alert: filesystem_filling_up
expr: predict_linear(node_filesystem_free_bytes{mountpoint="/"}[6h], 24*3600) < 0
for: 1h
labels:
severity: warning
annotations:
summary: "Filesystem predicted to fill within 24h on {{ $labels.instance }}"
description: "Based on the last 6h trend, the root filesystem on {{ $labels.instance }} is predicted to run out of space within 24 hours."
- alert: systemd_not_running
expr: node_systemd_system_running == 0
for: 5m
labels:
severity: critical
annotations:
summary: "Systemd not in running state on {{ $labels.instance }}"
description: "Systemd is not in running state on {{ $labels.instance }}. The system may be in a degraded state."
- alert: high_file_descriptors
expr: node_filefd_allocated / node_filefd_maximum > 0.8
for: 5m
labels:
severity: warning
annotations:
summary: "High file descriptor usage on {{ $labels.instance }}"
description: "More than 80% of file descriptors are in use on {{ $labels.instance }}."
- alert: host_reboot
expr: changes(node_boot_time_seconds[10m]) > 0
for: 0m
labels:
severity: info
annotations:
summary: "Host {{ $labels.instance }} has rebooted"
description: "Host {{ $labels.instance }} has rebooted."
- name: nameserver_rules
rules:
- alert: unbound_down
@@ -75,7 +107,7 @@ groups:
annotations:
summary: "NSD not running on {{ $labels.instance }}"
description: "NSD has been down on {{ $labels.instance }} more than 5 minutes."
- name: http-proxy_rules
- name: http_proxy_rules
rules:
- alert: caddy_down
expr: node_systemd_unit_state {instance="http-proxy.home.2rjus.net:9100", name = "caddy.service", state = "active"} == 0
@@ -85,6 +117,22 @@ groups:
annotations:
summary: "Caddy not running on {{ $labels.instance }}"
description: "Caddy has been down on {{ $labels.instance }} more than 5 minutes."
- alert: caddy_upstream_unhealthy
expr: caddy_reverse_proxy_upstreams_healthy == 0
for: 5m
labels:
severity: warning
annotations:
summary: "Caddy upstream unhealthy for {{ $labels.upstream }}"
description: "Caddy reverse proxy upstream {{ $labels.upstream }} is unhealthy on {{ $labels.instance }}."
- alert: caddy_high_error_rate
expr: rate(caddy_http_request_errors_total[5m]) > 1
for: 5m
labels:
severity: warning
annotations:
summary: "High HTTP error rate on {{ $labels.instance }}"
description: "Caddy is experiencing a high rate of HTTP errors on {{ $labels.instance }}."
- name: nats_rules
rules:
- alert: nats_down
@@ -97,7 +145,7 @@ groups:
description: "NATS has been down on {{ $labels.instance }} more than 5 minutes."
- name: nix_cache_rules
rules:
- alert: build-flakes_service_not_active_recently
- alert: build_flakes_service_not_active_recently
expr: count_over_time(node_systemd_unit_state{instance="nix-cache01.home.2rjus.net:9100", name="build-flakes.service", state="active"}[1h]) < 1
for: 0m
labels:
@@ -138,7 +186,7 @@ groups:
annotations:
summary: "Home assistant not running on {{ $labels.instance }}"
description: "Home assistant has been down on {{ $labels.instance }} more than 5 minutes."
- alert: zigbee2qmtt_down
- alert: zigbee2mqtt_down
expr: node_systemd_unit_state {instance = "ha1.home.2rjus.net:9100", name = "zigbee2mqtt.service", state = "active"} == 0
for: 5m
labels:
@@ -156,7 +204,7 @@ groups:
description: "Mosquitto has been down on {{ $labels.instance }} more than 5 minutes."
- name: smartctl_rules
rules:
- alert: SmartCriticalWarning
- alert: smart_critical_warning
expr: smartctl_device_critical_warning > 0
for: 0m
labels:
@@ -164,7 +212,7 @@ groups:
annotations:
summary: SMART critical warning (instance {{ $labels.instance }})
description: "Disk controller has critical warning on {{ $labels.instance }} drive {{ $labels.device }})\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: SmartMediaErrors
- alert: smart_media_errors
expr: smartctl_device_media_errors > 0
for: 0m
labels:
@@ -172,7 +220,7 @@ groups:
annotations:
summary: SMART media errors (instance {{ $labels.instance }})
description: "Disk controller detected media errors on {{ $labels.instance }} drive {{ $labels.device }})\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: SmartWearoutIndicator
- alert: smart_wearout_indicator
expr: smartctl_device_available_spare < smartctl_device_available_spare_threshold
for: 0m
labels:
@@ -180,20 +228,29 @@ groups:
annotations:
summary: SMART Wearout Indicator (instance {{ $labels.instance }})
description: "Device is wearing out on {{ $labels.instance }} drive {{ $labels.device }})\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: smartctl_high_temperature
expr: smartctl_device_temperature > 60
for: 5m
labels:
severity: warning
annotations:
summary: "Disk temperature above 60C on {{ $labels.instance }}"
description: "Disk {{ $labels.device }} on {{ $labels.instance }} has temperature {{ $value }}C."
- name: wireguard_rules
rules:
- alert: WireguardHandshake
expr: (time() - wireguard_latest_handshake_seconds{instance="http-proxy.home.2rjus.net:9586",interface="wg0",public_key="32Rb13wExcy8uI92JTnFdiOfkv0mlQ6f181WA741DHs="}) > 300
- alert: wireguard_handshake_timeout
expr: (time() - wireguard_latest_handshake_seconds{interface="wg0"}) > 300
for: 1m
labels:
severity: warning
annotations:
summary: "Wireguard handshake timeout on {{ $labels.instance }}"
description: "Wireguard handshake timeout on {{ $labels.instance }} for more than 1 minutes."
description: "Wireguard handshake timeout on {{ $labels.instance }} for peer {{ $labels.public_key }}."
- name: monitoring_rules
rules:
- alert: prometheus_not_running
expr: node_systemd_unit_state{instance="monitoring01.home.2rjus.net:9100", name="prometheus.service", state="active"} == 0
for: 5m
labels:
severity: critical
annotations:
@@ -201,6 +258,7 @@ groups:
description: "Prometheus service not running on {{ $labels.instance }}"
- alert: alertmanager_not_running
expr: node_systemd_unit_state{instance="monitoring01.home.2rjus.net:9100", name="alertmanager.service", state="active"} == 0
for: 5m
labels:
severity: critical
annotations:
@@ -208,13 +266,7 @@ groups:
description: "Alertmanager service not running on {{ $labels.instance }}"
- alert: pushgateway_not_running
expr: node_systemd_unit_state{instance="monitoring01.home.2rjus.net:9100", name="pushgateway.service", state="active"} == 0
labels:
severity: critical
annotations:
summary: "Pushgateway service not running on {{ $labels.instance }}"
description: "Pushgateway service not running on {{ $labels.instance }}"
- alert: pushgateway_not_running
expr: node_systemd_unit_state{instance="monitoring01.home.2rjus.net:9100", name="pushgateway.service", state="active"} == 0
for: 5m
labels:
severity: critical
annotations:
@@ -222,6 +274,7 @@ groups:
description: "Pushgateway service not running on {{ $labels.instance }}"
- alert: loki_not_running
expr: node_systemd_unit_state{instance="monitoring01.home.2rjus.net:9100", name="loki.service", state="active"} == 0
for: 5m
labels:
severity: critical
annotations:
@@ -229,6 +282,7 @@ groups:
description: "Loki service not running on {{ $labels.instance }}"
- alert: grafana_not_running
expr: node_systemd_unit_state{instance="monitoring01.home.2rjus.net:9100", name="grafana.service", state="active"} == 0
for: 5m
labels:
severity: warning
annotations:
@@ -236,6 +290,7 @@ groups:
description: "Grafana service not running on {{ $labels.instance }}"
- alert: tempo_not_running
expr: node_systemd_unit_state{instance="monitoring01.home.2rjus.net:9100", name="tempo.service", state="active"} == 0
for: 5m
labels:
severity: warning
annotations:
@@ -243,8 +298,53 @@ groups:
description: "Tempo service not running on {{ $labels.instance }}"
- alert: pyroscope_not_running
expr: node_systemd_unit_state{instance="monitoring01.home.2rjus.net:9100", name="podman-pyroscope.service", state="active"} == 0
for: 5m
labels:
severity: warning
annotations:
summary: "Pyroscope service not running on {{ $labels.instance }}"
description: "Pyroscope service not running on {{ $labels.instance }}"
- name: certificate_rules
rules:
- alert: certificate_expiring_soon
expr: labmon_tlsconmon_certificate_seconds_left < 86400
for: 5m
labels:
severity: warning
annotations:
summary: "TLS certificate expiring soon for {{ $labels.instance }}"
description: "TLS certificate for {{ $labels.address }} is expiring within 24 hours."
- alert: certificate_check_error
expr: labmon_tlsconmon_certificate_check_error == 1
for: 5m
labels:
severity: warning
annotations:
summary: "Error checking certificate for {{ $labels.address }}"
description: "Certificate check is failing for {{ $labels.address }} on {{ $labels.instance }}."
- alert: step_ca_certificate_expiring
expr: labmon_stepmon_certificate_seconds_left < 3600
for: 5m
labels:
severity: critical
annotations:
summary: "Step-CA certificate expiring for {{ $labels.instance }}"
description: "Step-CA certificate is expiring within 1 hour on {{ $labels.instance }}."
- name: proxmox_rules
rules:
- alert: pve_node_down
expr: pve_up{id=~"node/.*"} == 0
for: 5m
labels:
severity: critical
annotations:
summary: "Proxmox node {{ $labels.id }} is down"
description: "Proxmox node {{ $labels.id }} has been down for more than 5 minutes."
- alert: pve_guest_stopped
expr: pve_up{id=~"qemu/.*"} == 0 and pve_onboot_status == 1
for: 5m
labels:
severity: warning
annotations:
summary: "Proxmox VM {{ $labels.id }} is stopped"
description: "Proxmox VM {{ $labels.id }} ({{ $labels.name }}) has onboot=1 but is stopped."

View File

@@ -6,4 +6,10 @@
./proxy.nix
./nix.nix
];
homelab.monitoring.scrapeTargets = [{
job_name = "nix-cache_caddy";
port = 443;
scheme = "https";
}];
}

View File

@@ -0,0 +1,33 @@
# DNS records for hosts not managed by this flake
# These are manually maintained and combined with auto-generated records
{
aRecords = {
# 10
"gw" = "10.69.10.1";
# 12_CORE
"nas" = "10.69.12.50";
"nzbget-jail" = "10.69.12.51";
"restic" = "10.69.12.52";
"radarr-jail" = "10.69.12.53";
"sonarr-jail" = "10.69.12.54";
"bazarr" = "10.69.12.55";
"pve1" = "10.69.12.75";
"inc1" = "10.69.12.80";
# 22_WLAN
"unifi-ctrl" = "10.69.22.5";
# 30
"gunter" = "10.69.30.105";
# 31
"media" = "10.69.31.50";
# 99_MGMT
"sw1" = "10.69.99.2";
};
cnames = {
};
}

View File

@@ -1,4 +1,16 @@
{ ... }:
{ self, lib, ... }:
let
dnsLib = import ../../lib/dns-zone.nix { inherit lib; };
externalHosts = import ./external-hosts.nix;
# Generate zone from flake hosts + external hosts
# Use lastModified from git commit as serial number
zoneData = dnsLib.generateZone {
inherit self externalHosts;
serial = self.sourceInfo.lastModified;
domain = "home.2rjus.net";
};
in
{
sops.secrets.ns_xfer_key = {
path = "/etc/nsd/xfer.key";
@@ -26,7 +38,7 @@
"home.2rjus.net" = {
provideXFR = [ "10.69.13.6 xferkey" ];
notify = [ "10.69.13.6@8053 xferkey" ];
data = builtins.readFile ./zones-home-2rjus-net.conf;
data = zoneData;
};
};
};

View File

@@ -1,4 +1,16 @@
{ ... }:
{ self, lib, ... }:
let
dnsLib = import ../../lib/dns-zone.nix { inherit lib; };
externalHosts = import ./external-hosts.nix;
# Generate zone from flake hosts + external hosts
# Used as initial zone data before first AXFR completes
zoneData = dnsLib.generateZone {
inherit self externalHosts;
serial = self.sourceInfo.lastModified;
domain = "home.2rjus.net";
};
in
{
sops.secrets.ns_xfer_key = {
path = "/etc/nsd/xfer.key";
@@ -24,7 +36,7 @@
"home.2rjus.net" = {
allowNotify = [ "10.69.13.5 xferkey" ];
requestXFR = [ "AXFR 10.69.13.5@8053 xferkey" ];
data = builtins.readFile ./zones-home-2rjus-net.conf;
data = zoneData;
};
};
};

View File

@@ -1,96 +0,0 @@
$ORIGIN home.2rjus.net.
$TTL 1800
@ IN SOA ns1.home.2rjus.net. admin.test.2rjus.net. (
2063 ; serial number
3600 ; refresh
900 ; retry
1209600 ; expire
120 ; ttl
)
IN NS ns1.home.2rjus.net.
IN NS ns2.home.2rjus.net.
IN NS ns3.home.2rjus.net.
; 8_k8s
kube-blue1 IN A 10.69.8.150
kube-blue2 IN A 10.69.8.151
kube-blue3 IN A 10.69.8.152
kube-blue4 IN A 10.69.8.153
rook IN CNAME kube-blue4
kube-blue5 IN A 10.69.8.154
git IN CNAME kube-blue5
kube-blue6 IN A 10.69.8.155
kube-blue7 IN A 10.69.8.156
kube-blue8 IN A 10.69.8.157
kube-blue9 IN A 10.69.8.158
kube-blue10 IN A 10.69.8.159
; 10
gw IN A 10.69.10.1
; 12_CORE
virt-mini1 IN A 10.69.12.11
nas IN A 10.69.12.50
nzbget-jail IN A 10.69.12.51
restic IN A 10.69.12.52
radarr-jail IN A 10.69.12.53
sonarr-jail IN A 10.69.12.54
bazarr IN A 10.69.12.55
mpnzb IN A 10.69.12.57
pve1 IN A 10.69.12.75
inc1 IN A 10.69.12.80
inc2 IN A 10.69.12.81
media1 IN A 10.69.12.82
; 13_SVC
ns1 IN A 10.69.13.5
ns2 IN A 10.69.13.6
ns3 IN A 10.69.13.7
ns4 IN A 10.69.13.8
ha1 IN A 10.69.13.9
nixos-test1 IN A 10.69.13.10
http-proxy IN A 10.69.13.11
ca IN A 10.69.13.12
monitoring01 IN A 10.69.13.13
jelly01 IN A 10.69.13.14
nix-cache01 IN A 10.69.13.15
nix-cache IN CNAME nix-cache01
actions1 IN CNAME nix-cache01
pgdb1 IN A 10.69.13.16
nats1 IN A 10.69.13.17
auth01 IN A 10.69.13.18
; http-proxy cnames
nzbget IN CNAME http-proxy
radarr IN CNAME http-proxy
sonarr IN CNAME http-proxy
ha IN CNAME http-proxy
z2m IN CNAME http-proxy
grafana IN CNAME http-proxy
prometheus IN CNAME http-proxy
alertmanager IN CNAME http-proxy
jelly IN CNAME http-proxy
auth IN CNAME http-proxy
lldap IN CNAME http-proxy
pyroscope IN CNAME http-proxy
pushgw IN CNAME http-proxy
ldap IN CNAME auth01
; 22_WLAN
unifi-ctrl IN A 10.69.22.5
; 30
gunter IN A 10.69.30.105
; 31
media IN A 10.69.31.50
; 99_MGMT
sw1 IN A 10.69.99.2
testing IN A 10.69.33.33

38
services/vault/README.md Normal file
View File

@@ -0,0 +1,38 @@
# OpenBao Service Module
NixOS service module for OpenBao (open-source Vault fork) with TPM2-based auto-unsealing.
## Features
- TLS-enabled TCP listener on `0.0.0.0:8200`
- Unix socket listener at `/run/openbao/openbao.sock`
- File-based storage at `/var/lib/openbao`
- TPM2 auto-unseal on service start
## Configuration
The module expects:
- TLS certificate: `/var/lib/openbao/cert.pem`
- TLS private key: `/var/lib/openbao/key.pem`
- TPM2-encrypted unseal key: `/var/lib/openbao/unseal-key.cred`
Certificates are loaded via systemd `LoadCredential`, and the unseal key via `LoadCredentialEncrypted`.
## Setup
For initial setup and configuration instructions, see:
- **Auto-unseal setup**: `/docs/vault/auto-unseal.md`
- **Terraform configuration**: `/terraform/vault/README.md`
## Usage
```bash
# Check seal status
bao status
# Manually seal (for maintenance)
bao operator seal
# Service will auto-unseal on restart
systemctl restart openbao
```

View File

@@ -1,8 +1,206 @@
{ ... }:
{ pkgs, ... }:
let
unsealScript = pkgs.writeShellApplication {
name = "openbao-unseal";
runtimeInputs = with pkgs; [
openbao
coreutils
gnugrep
getent
];
text = ''
# Set environment to use Unix socket
export BAO_ADDR='unix:///run/openbao/openbao.sock'
SOCKET_PATH="/run/openbao/openbao.sock"
CREDS_DIR="''${CREDENTIALS_DIRECTORY:-}"
# Wait for socket to exist
echo "Waiting for OpenBao socket..."
for _ in {1..30}; do
if [ -S "$SOCKET_PATH" ]; then
echo "Socket exists"
break
fi
sleep 1
done
# Wait for OpenBao to accept connections
echo "Waiting for OpenBao to be ready..."
for _ in {1..30}; do
output=$(timeout 2 bao status 2>&1 || true)
if echo "$output" | grep -q "Sealed.*false"; then
# Already unsealed
echo "OpenBao is already unsealed"
exit 0
elif echo "$output" | grep -qE "(Sealed|Initialized)"; then
# Got a valid response, OpenBao is ready (sealed)
echo "OpenBao is ready"
break
fi
sleep 1
done
# Check if already unsealed
if output=$(timeout 2 bao status 2>&1 || true); then
if echo "$output" | grep -q "Sealed.*false"; then
echo "OpenBao is already unsealed"
exit 0
fi
fi
# Unseal using the TPM-decrypted keys (one per line)
if [ -n "$CREDS_DIR" ] && [ -f "$CREDS_DIR/unseal-key" ]; then
echo "Unsealing OpenBao..."
while IFS= read -r key; do
# Skip empty lines
[ -z "$key" ] && continue
echo "Applying unseal key..."
bao operator unseal "$key"
# Check if unsealed after each key
if output=$(timeout 2 bao status 2>&1 || true); then
if echo "$output" | grep -q "Sealed.*false"; then
echo "OpenBao unsealed successfully"
exit 0
fi
fi
done < "$CREDS_DIR/unseal-key"
echo "WARNING: Applied all keys but OpenBao is still sealed"
exit 0
else
echo "WARNING: Unseal key credential not found, OpenBao remains sealed"
exit 0
fi
'';
};
bootstrapCertScript = pkgs.writeShellApplication {
name = "bootstrap-vault-cert";
runtimeInputs = with pkgs; [
openbao
jq
openssl
coreutils
];
text = ''
# Bootstrap vault01 with a proper certificate from its own PKI
# This solves the chicken-and-egg problem where ACME clients can't trust
# vault01's self-signed certificate.
echo "=== Bootstrapping vault01 certificate ==="
# Use Unix socket to avoid TLS issues
export BAO_ADDR='unix:///run/openbao/openbao.sock'
# ACME certificate directory
CERT_DIR="/var/lib/acme/vault01.home.2rjus.net"
# Issue certificate for vault01 with vault as SAN
echo "Issuing certificate for vault01.home.2rjus.net (with SAN: vault.home.2rjus.net)..."
OUTPUT=$(bao write -format=json pki_int/issue/homelab \
common_name="vault01.home.2rjus.net" \
alt_names="vault.home.2rjus.net" \
ttl="720h")
# Create ACME directory structure
echo "Creating ACME certificate directory..."
mkdir -p "$CERT_DIR"
# Extract certificate components to temp files
echo "$OUTPUT" | jq -r '.data.certificate' > /tmp/vault01-cert.pem
echo "$OUTPUT" | jq -r '.data.private_key' > /tmp/vault01-key.pem
echo "$OUTPUT" | jq -r '.data.issuing_ca' > /tmp/vault01-ca.pem
# Create fullchain (cert + CA)
cat /tmp/vault01-cert.pem /tmp/vault01-ca.pem > /tmp/vault01-fullchain.pem
# Backup old certificates if they exist
if [ -f "$CERT_DIR/fullchain.pem" ]; then
echo "Backing up old certificate..."
cp "$CERT_DIR/fullchain.pem" "$CERT_DIR/fullchain.pem.backup"
cp "$CERT_DIR/key.pem" "$CERT_DIR/key.pem.backup"
fi
# Install new certificates
echo "Installing new certificate..."
mv /tmp/vault01-fullchain.pem "$CERT_DIR/fullchain.pem"
mv /tmp/vault01-cert.pem "$CERT_DIR/cert.pem"
mv /tmp/vault01-ca.pem "$CERT_DIR/chain.pem"
mv /tmp/vault01-key.pem "$CERT_DIR/key.pem"
# Set proper ownership and permissions (ACME-style)
chown -R acme:acme "$CERT_DIR"
chmod 750 "$CERT_DIR"
chmod 640 "$CERT_DIR"/*.pem
echo "Certificate installed successfully!"
echo ""
echo "Certificate details:"
openssl x509 -in "$CERT_DIR/cert.pem" -noout -subject -issuer -dates
echo ""
echo "Subject Alternative Names:"
openssl x509 -in "$CERT_DIR/cert.pem" -noout -ext subjectAltName
echo ""
echo "Now restart openbao service:"
echo " systemctl restart openbao"
echo ""
echo "After restart, verify ACME endpoint is accessible:"
echo " curl https://vault01.home.2rjus.net:8200/v1/pki_int/acme/directory"
echo ""
echo "Once working, ACME will automatically manage certificate renewals."
'';
};
in
{
services.vault = {
# Make bootstrap script available as a command
environment.systemPackages = [ bootstrapCertScript ];
services.openbao = {
enable = true;
storageBackend = "file";
settings = {
ui = true;
storage.file.path = "/var/lib/openbao";
listener.default = {
type = "tcp";
address = "0.0.0.0:8200";
tls_cert_file = "/run/credentials/openbao.service/cert.pem";
tls_key_file = "/run/credentials/openbao.service/key.pem";
};
listener.socket = {
type = "unix";
address = "/run/openbao/openbao.sock";
};
};
};
systemd.services.openbao.serviceConfig = {
LoadCredential = [
"key.pem:/var/lib/acme/vault01.home.2rjus.net/key.pem"
"cert.pem:/var/lib/acme/vault01.home.2rjus.net/fullchain.pem"
];
# TPM2-encrypted unseal key (created manually, see setup instructions)
LoadCredentialEncrypted = [
"unseal-key:/var/lib/openbao/unseal-key.cred"
];
# Auto-unseal on service start
ExecStartPost = "${unsealScript}/bin/openbao-unseal";
# Add openbao user to acme group to read certificates
SupplementaryGroups = [ "acme" ];
};
# ACME certificate management
# Bootstrapped with bootstrap-vault-cert, now managed by ACME
security.acme.certs."vault01.home.2rjus.net" = {
server = "https://vault01.home.2rjus.net:8200/v1/pki_int/acme/directory";
listenHTTP = ":80";
reloadServices = [ "openbao" ];
extraDomainNames = [ "vault.home.2rjus.net" ];
};
}

View File

@@ -7,8 +7,11 @@
./packages.nix
./nix.nix
./root-user.nix
./root-ca.nix
./pki/root-ca.nix
./sops.nix
./sshd.nix
./vault-secrets.nix
../modules/homelab
];
}

View File

@@ -4,6 +4,7 @@
certificateFiles = [
"${pkgs.cacert}/etc/ssl/certs/ca-bundle.crt"
./root-ca.crt
./vault-root-ca.crt
];
};
}

View File

@@ -0,0 +1,14 @@
-----BEGIN CERTIFICATE-----
MIICIjCCAaigAwIBAgIUQ/Bd/4kNvkPjQjgGLUMynIVzGeAwCgYIKoZIzj0EAwMw
QDELMAkGA1UEBhMCTk8xEDAOBgNVBAoTB0hvbWVsYWIxHzAdBgNVBAMTFmhvbWUu
MnJqdXMubmV0IFJvb3QgQ0EwHhcNMjYwMjAxMjIxODA5WhcNMzYwMTMwMjIxODM5
WjBAMQswCQYDVQQGEwJOTzEQMA4GA1UEChMHSG9tZWxhYjEfMB0GA1UEAxMWaG9t
ZS4ycmp1cy5uZXQgUm9vdCBDQTB2MBAGByqGSM49AgEGBSuBBAAiA2IABH8xhIOl
Nd1Yb1OFhgIJQZM+OkwoFenOQiKfuQ4oPMxaF+fnXdKc77qPDVRjeDy61oGS38X3
CjPOZAzS9kjo7FmVbzdqlYK7ut/OylF+8MJkCT8mFO1xvuzIXhufnyAD4aNjMGEw
DgYDVR0PAQH/BAQDAgEGMA8GA1UdEwEB/wQFMAMBAf8wHQYDVR0OBBYEFEimBeAg
3JVeF4BqdC9hMZ8MYKw2MB8GA1UdIwQYMBaAFEimBeAg3JVeF4BqdC9hMZ8MYKw2
MAoGCCqGSM49BAMDA2gAMGUCMQCvhRElHBra/XyT93SKcG6ZzIG+K+DH3J5jm6Xr
zaGj2VtdhBRVmEKaUcjU7htgSxcCMA9qHKYFcUH72W7By763M6sy8OOiGQNDSERY
VgnNv9rLCvCef1C8G2bYh/sKGZTPGQ==
-----END CERTIFICATE-----

223
system/vault-secrets.nix Normal file
View File

@@ -0,0 +1,223 @@
{ config, lib, pkgs, ... }:
with lib;
let
cfg = config.vault;
# Import vault-fetch package
vault-fetch = pkgs.callPackage ../scripts/vault-fetch { };
# Secret configuration type
secretType = types.submodule ({ name, config, ... }: {
options = {
secretPath = mkOption {
type = types.str;
description = ''
Path to the secret in Vault (without /v1/secret/data/ prefix).
Example: "hosts/monitoring01/grafana-admin"
'';
};
outputDir = mkOption {
type = types.str;
default = "/run/secrets/${name}";
description = ''
Directory where secret files will be written.
Each key in the secret becomes a separate file.
'';
};
cacheDir = mkOption {
type = types.str;
default = "/var/lib/vault/cache/${name}";
description = ''
Directory for caching secrets when Vault is unreachable.
'';
};
owner = mkOption {
type = types.str;
default = "root";
description = "Owner of the secret files";
};
group = mkOption {
type = types.str;
default = "root";
description = "Group of the secret files";
};
mode = mkOption {
type = types.str;
default = "0400";
description = "Permissions mode for secret files";
};
restartTrigger = mkOption {
type = types.bool;
default = false;
description = ''
Whether to create a systemd timer that periodically restarts
services using this secret to rotate credentials.
'';
};
restartInterval = mkOption {
type = types.str;
default = "weekly";
description = ''
How often to restart services for secret rotation.
Uses systemd.time format (e.g., "daily", "weekly", "monthly").
Only applies if restartTrigger is true.
'';
};
services = mkOption {
type = types.listOf types.str;
default = [];
description = ''
List of systemd service names that depend on this secret.
Used for periodic restart if restartTrigger is enabled.
'';
};
};
});
in
{
options.vault = {
enable = mkEnableOption "Vault secrets management" // {
default = false;
};
secrets = mkOption {
type = types.attrsOf secretType;
default = {};
description = ''
Secrets to fetch from Vault.
Each attribute name becomes a secret identifier.
'';
example = literalExpression ''
{
grafana-admin = {
secretPath = "hosts/monitoring01/grafana-admin";
owner = "grafana";
group = "grafana";
restartTrigger = true;
restartInterval = "daily";
services = [ "grafana" ];
};
}
'';
};
criticalServices = mkOption {
type = types.listOf types.str;
default = [ "bind" "openbao" "step-ca" ];
description = ''
Services that should never get auto-restart timers for secret rotation.
These are critical infrastructure services where automatic restarts
could cause cascading failures.
'';
};
vaultAddress = mkOption {
type = types.str;
default = "https://vault01.home.2rjus.net:8200";
description = "Vault server address";
};
skipTlsVerify = mkOption {
type = types.bool;
default = true;
description = "Skip TLS certificate verification (useful for self-signed certs)";
};
};
config = mkIf (cfg.enable && cfg.secrets != {}) {
# Create systemd services for fetching secrets and rotation
systemd.services =
# Fetch services
(mapAttrs' (name: secretCfg: nameValuePair "vault-secret-${name}" {
description = "Fetch Vault secret: ${name}";
before = map (svc: "${svc}.service") secretCfg.services;
wantedBy = [ "multi-user.target" ];
# Ensure vault-fetch is available
path = [ vault-fetch ];
serviceConfig = {
Type = "oneshot";
RemainAfterExit = true;
# Fetch the secret
ExecStart = pkgs.writeShellScript "fetch-${name}" ''
set -euo pipefail
# Set Vault environment variables
export VAULT_ADDR="${cfg.vaultAddress}"
export VAULT_SKIP_VERIFY="${if cfg.skipTlsVerify then "1" else "0"}"
# Fetch secret using vault-fetch
${vault-fetch}/bin/vault-fetch \
"${secretCfg.secretPath}" \
"${secretCfg.outputDir}" \
"${secretCfg.cacheDir}"
# Set ownership and permissions
chown -R ${secretCfg.owner}:${secretCfg.group} "${secretCfg.outputDir}"
chmod ${secretCfg.mode} "${secretCfg.outputDir}"/*
'';
# Logging
StandardOutput = "journal";
StandardError = "journal";
};
}) cfg.secrets)
//
# Rotation services
(mapAttrs' (name: secretCfg: nameValuePair "vault-secret-rotate-${name}"
(mkIf (secretCfg.restartTrigger && secretCfg.services != [] &&
!any (svc: elem svc cfg.criticalServices) secretCfg.services) {
description = "Rotate Vault secret and restart services: ${name}";
serviceConfig = {
Type = "oneshot";
};
script = ''
# Restart the secret fetch service
systemctl restart vault-secret-${name}.service
# Restart all dependent services
${concatMapStringsSep "\n" (svc: "systemctl restart ${svc}.service") secretCfg.services}
'';
})
) cfg.secrets);
# Create systemd timers for periodic secret rotation (if enabled)
systemd.timers = mapAttrs' (name: secretCfg: nameValuePair "vault-secret-rotate-${name}"
(mkIf (secretCfg.restartTrigger && secretCfg.services != [] &&
!any (svc: elem svc cfg.criticalServices) secretCfg.services) {
description = "Rotate Vault secret and restart services: ${name}";
wantedBy = [ "timers.target" ];
timerConfig = {
OnCalendar = secretCfg.restartInterval;
Persistent = true;
RandomizedDelaySec = "1h";
};
})
) cfg.secrets;
# Ensure runtime and cache directories exist
systemd.tmpfiles.rules =
[ "d /run/secrets 0755 root root -" ] ++
[ "d /var/lib/vault/cache 0700 root root -" ] ++
flatten (mapAttrsToList (name: secretCfg: [
"d ${secretCfg.outputDir} 0755 root root -"
"d ${secretCfg.cacheDir} 0700 root root -"
]) cfg.secrets);
};
}

View File

@@ -10,23 +10,30 @@ resource "proxmox_cloud_init_disk" "ci" {
pve_node = each.value.target_node
storage = "local" # Cloud-init disks must be on storage that supports ISO/snippets
# User data includes SSH keys and optionally NIXOS_FLAKE_BRANCH
# User data includes SSH keys and optionally NIXOS_FLAKE_BRANCH and Vault credentials
user_data = <<-EOT
#cloud-config
ssh_authorized_keys:
- ${each.value.ssh_public_key}
${each.value.flake_branch != null ? <<-BRANCH
${each.value.flake_branch != null || each.value.vault_wrapped_token != null ? <<-FILES
write_files:
- path: /etc/environment
- path: /run/cloud-init-env
content: |
%{~if each.value.flake_branch != null~}
NIXOS_FLAKE_BRANCH=${each.value.flake_branch}
append: true
BRANCH
%{~endif~}
%{~if each.value.vault_wrapped_token != null~}
VAULT_ADDR=https://vault01.home.2rjus.net:8200
VAULT_WRAPPED_TOKEN=${each.value.vault_wrapped_token}
VAULT_SKIP_VERIFY=1
%{~endif~}
permissions: '0600'
FILES
: ""}
EOT
# Network configuration - static IP or DHCP
network_config = each.value.ip != null ? yamlencode({
# Network configuration - static IP or DHCP
network_config = each.value.ip != null ? yamlencode({
version = 1
config = [{
type = "physical"
@@ -48,11 +55,11 @@ resource "proxmox_cloud_init_disk" "ci" {
type = "dhcp"
}]
}]
})
})
# Instance metadata
meta_data = yamlencode({
# Instance metadata
meta_data = yamlencode({
instance_id = sha1(each.key)
local-hostname = each.key
})
})
}

View File

@@ -33,7 +33,7 @@ variable "default_target_node" {
variable "default_template_name" {
description = "Default template VM name to clone from"
type = string
default = "nixos-25.11.20260128.fa83fd8"
default = "nixos-25.11.20260131.41e216c"
}
variable "default_ssh_public_key" {

37
terraform/vault/.terraform.lock.hcl generated Normal file
View File

@@ -0,0 +1,37 @@
# This file is maintained automatically by "tofu init".
# Manual edits may be lost in future updates.
provider "registry.opentofu.org/hashicorp/random" {
version = "3.8.1"
constraints = "~> 3.6"
hashes = [
"h1:EHn3jsqOKhWjbg0X+psk0Ww96yz3N7ASqEKKuFvDFwo=",
"zh:25c458c7c676f15705e872202dad7dcd0982e4a48e7ea1800afa5fc64e77f4c8",
"zh:2edeaf6f1b20435b2f81855ad98a2e70956d473be9e52a5fdf57ccd0098ba476",
"zh:44becb9d5f75d55e36dfed0c5beabaf4c92e0a2bc61a3814d698271c646d48e7",
"zh:7699032612c3b16cc69928add8973de47b10ce81b1141f30644a0e8a895b5cd3",
"zh:86d07aa98d17703de9fbf402c89590dc1e01dbe5671dd6bc5e487eb8fe87eee0",
"zh:8c411c77b8390a49a8a1bc9f176529e6b32369dd33a723606c8533e5ca4d68c1",
"zh:a5ecc8255a612652a56b28149994985e2c4dc046e5d34d416d47fa7767f5c28f",
"zh:aea3fe1a5669b932eda9c5c72e5f327db8da707fe514aaca0d0ef60cb24892f9",
"zh:f56e26e6977f755d7ae56fa6320af96ecf4bb09580d47cb481efbf27f1c5afff",
]
}
provider "registry.opentofu.org/hashicorp/vault" {
version = "4.8.0"
constraints = "~> 4.0"
hashes = [
"h1:SQkjClJDo6SETUnq912GO8BdEExhU1ko8IG2mr4X/2A=",
"zh:0c07ef884c03083b08a54c2cf782f3ff7e124b05e7a4438a0b90a86e60c8d080",
"zh:13dcf2ed494c79e893b447249716d96b665616a868ffaf8f2c5abef07c7eee6f",
"zh:6f15a29fae3a6178e5904e3c95ba22b20f362d8ee491da816048c89f30e6b2de",
"zh:94b92a4bf7a2d250d9698a021f1ab60d1957d01b5bab81f7d9c00c2d6a9b3747",
"zh:a9e207540ef12cd2402e37b3b7567e08de14061a0a2635fd2f4fd09e0a3382aa",
"zh:b41667938ba541e8492036415b3f51fbd1758e456f6d5f0b63e26f4ad5728b21",
"zh:df0b73aff5f4b51e08fc0c273db7f677994db29a81deda66d91acfcfe3f1a370",
"zh:df904b217dc79b71a8b5f5f3ab2e52316d0f890810383721349cc10a72f7265b",
"zh:f0e0b3e6782e0126c40f05cf87ec80978c7291d90f52d7741300b5de1d9c01ba",
"zh:f8e599718b0ea22658eaa3e590671d3873aa723e7ce7d00daf3460ab41d3af14",
]
}

280
terraform/vault/README.md Normal file
View File

@@ -0,0 +1,280 @@
# OpenBao Terraform Configuration
This directory contains Terraform/OpenTofu configuration for managing OpenBao (Vault) infrastructure as code.
## Overview
Manages the following OpenBao resources:
- **AppRole Authentication**: For host-based authentication
- **PKI Infrastructure**: Root CA + Intermediate CA for TLS certificates
- **KV Secrets Engine**: Key-value secret storage (v2)
- **Policies**: Access control policies
## Setup
1. **Copy the example tfvars file:**
```bash
cp terraform.tfvars.example terraform.tfvars
```
2. **Edit `terraform.tfvars` with your OpenBao credentials:**
```hcl
vault_address = "https://vault01.home.2rjus.net:8200"
vault_token = "hvs.your-root-token-here"
vault_skip_tls_verify = true
```
3. **Initialize Terraform:**
```bash
tofu init
```
4. **Review the plan:**
```bash
tofu plan
```
5. **Apply the configuration:**
```bash
tofu apply
```
## Files
- `main.tf` - Provider configuration
- `variables.tf` - Variable definitions
- `approle.tf` - AppRole authentication backend and roles
- `pki.tf` - PKI engines (root CA and intermediate CA)
- `secrets.tf` - KV secrets engine and test secrets
- `terraform.tfvars` - Credentials (gitignored)
- `terraform.tfvars.example` - Example configuration
## Resources Created
### AppRole Authentication
- AppRole backend at `approle/`
- Host-based roles and policies (defined in `locals.host_policies`)
### PKI Infrastructure
- Root CA at `pki/` (10 year TTL)
- Intermediate CA at `pki_int/` (5 year TTL)
- Role `homelab` for issuing certificates to `*.home.2rjus.net`
- Certificate max TTL: 30 days
### Secrets
- KV v2 engine at `secret/`
- Secrets and policies defined in `locals.secrets` and `locals.host_policies`
## Usage Examples
### Adding a New Host
1. **Define the host policy in `approle.tf`:**
```hcl
locals {
host_policies = {
"monitoring01" = {
paths = [
"secret/data/hosts/monitoring01/*",
"secret/data/services/prometheus/*",
]
}
}
}
```
2. **Add secrets in `secrets.tf`:**
```hcl
locals {
secrets = {
"hosts/monitoring01/grafana-admin" = {
auto_generate = true
password_length = 32
}
}
}
```
3. **Apply changes:**
```bash
tofu apply
```
4. **Get AppRole credentials:**
```bash
# Get role_id
bao read auth/approle/role/monitoring01/role-id
# Generate secret_id
bao write -f auth/approle/role/monitoring01/secret-id
```
### Issue Certificates from PKI
**Method 1: ACME (Recommended for automated services)**
First, enable ACME support:
```bash
bao write pki_int/config/acme enabled=true
```
ACME directory endpoint:
```
https://vault01.home.2rjus.net:8200/v1/pki_int/acme/directory
```
Use with ACME clients (lego, certbot, cert-manager, etc.):
```bash
# Example with lego
lego --email admin@home.2rjus.net \
--dns manual \
--server https://vault01.home.2rjus.net:8200/v1/pki_int/acme/directory \
--accept-tos \
run -d test.home.2rjus.net
```
**Method 2: Static certificates via Terraform**
Define in `pki.tf`:
```hcl
locals {
static_certificates = {
"monitoring" = {
common_name = "monitoring.home.2rjus.net"
alt_names = ["grafana.home.2rjus.net", "prometheus.home.2rjus.net"]
ttl = "720h"
}
}
}
```
Terraform will auto-issue and auto-renew these certificates.
**Method 3: Manual CLI issuance**
```bash
# Issue certificate for a host
bao write pki_int/issue/homelab \
common_name="test.home.2rjus.net" \
ttl="720h"
```
### Read a secret
```bash
# Authenticate with AppRole first
bao write auth/approle/login \
role_id="..." \
secret_id="..."
# Read the test secret
bao kv get secret/test/example
```
## Managing Secrets
Secrets are defined in the `locals.secrets` block in `secrets.tf` using a declarative pattern:
### Auto-Generated Secrets (Recommended)
Most secrets can be auto-generated using the `random_password` provider:
```hcl
locals {
secrets = {
"hosts/monitoring01/grafana-admin" = {
auto_generate = true
password_length = 32
}
}
}
```
### Manual Secrets
For secrets that must have specific values (external services, etc.):
```hcl
# In variables.tf
variable "smtp_password" {
type = string
sensitive = true
}
# In secrets.tf locals block
locals {
secrets = {
"shared/smtp/credentials" = {
auto_generate = false
data = {
username = "notifications@2rjus.net"
password = var.smtp_password
server = "smtp.gmail.com"
}
}
}
}
# In terraform.tfvars
smtp_password = "super-secret-password"
```
### Path Structure
Secrets follow a three-tier hierarchy:
- `hosts/{hostname}/*` - Host-specific secrets
- `services/{service}/*` - Service-wide secrets (any host running the service)
- `shared/{category}/*` - Shared secrets (SMTP, backup, etc.)
## Security Notes
- `terraform.tfvars` is gitignored to prevent credential leakage
- Root token should be stored securely (consider using a limited admin token instead)
- `skip_tls_verify = true` is acceptable for self-signed certs in homelab
- AppRole secret_ids can be scoped to specific CIDR ranges for additional security
## Initial Setup Steps
After deploying this configuration, perform these one-time setup tasks:
### 1. Enable ACME
```bash
export BAO_ADDR='https://vault01.home.2rjus.net:8200'
export BAO_TOKEN='your-root-token'
export BAO_SKIP_VERIFY=1
# Configure cluster path (required for ACME)
bao write pki_int/config/cluster path=https://vault01.home.2rjus.net:8200/v1/pki_int
# Enable ACME on intermediate CA
bao write pki_int/config/acme enabled=true
# Verify ACME is enabled
curl -k https://vault01.home.2rjus.net:8200/v1/pki_int/acme/directory
```
### 2. Download Root CA Certificate
For trusting the internal CA on clients:
```bash
# Download root CA certificate
bao read -field=certificate pki/cert/ca > homelab-root-ca.crt
# Install on NixOS hosts (add to system/default.nix or similar)
security.pki.certificateFiles = [ ./homelab-root-ca.crt ];
```
### 3. Test Certificate Issuance
```bash
# Manual test
bao write pki_int/issue/homelab common_name="test.home.2rjus.net" ttl="24h"
```
## Next Steps
1. Replace step-ca ACME endpoint with OpenBao in `system/acme.nix`
2. Add more AppRoles for different host types
3. Migrate existing sops-nix secrets to OpenBao KV
4. Set up SSH CA for host and user certificates
5. Configure auto-unseal for vault01

View File

@@ -0,0 +1,74 @@
# Enable AppRole auth backend
resource "vault_auth_backend" "approle" {
type = "approle"
path = "approle"
}
# Define host access policies
locals {
host_policies = {
# Example: monitoring01 host
# "monitoring01" = {
# paths = [
# "secret/data/hosts/monitoring01/*",
# "secret/data/services/prometheus/*",
# "secret/data/services/grafana/*",
# "secret/data/shared/smtp/*"
# ]
# }
# Example: ha1 host
# "ha1" = {
# paths = [
# "secret/data/hosts/ha1/*",
# "secret/data/shared/mqtt/*"
# ]
# }
# TODO: actually use this policy
"ha1" = {
paths = [
"secret/data/hosts/ha1/*",
]
}
# TODO: actually use this policy
"monitoring01" = {
paths = [
"secret/data/hosts/monitoring01/*",
]
}
}
}
# Generate policies for each host
resource "vault_policy" "host_policies" {
for_each = local.host_policies
name = "${each.key}-policy"
policy = <<EOT
%{~for path in each.value.paths~}
path "${path}" {
capabilities = ["read", "list"]
}
%{~endfor~}
EOT
}
# Generate AppRoles for each host
resource "vault_approle_auth_backend_role" "hosts" {
for_each = local.host_policies
backend = vault_auth_backend.approle.path
role_name = each.key
token_policies = ["${each.key}-policy"]
# Token configuration
token_ttl = 3600 # 1 hour
token_max_ttl = 86400 # 24 hours
# Security settings
bind_secret_id = true
secret_id_ttl = 0 # Never expire (we'll rotate manually)
}

View File

@@ -0,0 +1,48 @@
# WARNING: Auto-generated by create-host tool
# Manual edits will be overwritten when create-host is run
# Generated host policies
# Each host gets access to its own secrets under hosts/<hostname>/*
locals {
generated_host_policies = {
"vaulttest01" = {
paths = [
"secret/data/hosts/vaulttest01/*",
]
}
}
# Placeholder secrets - user should add actual secrets manually or via tofu
generated_secrets = {
}
}
# Create policies for generated hosts
resource "vault_policy" "generated_host_policies" {
for_each = local.generated_host_policies
name = "host-${each.key}"
policy = <<-EOT
# Allow host to read its own secrets
%{for path in each.value.paths~}
path "${path}" {
capabilities = ["read", "list"]
}
%{endfor~}
EOT
}
# Create AppRoles for generated hosts
resource "vault_approle_auth_backend_role" "generated_hosts" {
for_each = local.generated_host_policies
backend = vault_auth_backend.approle.path
role_name = each.key
token_policies = ["host-${each.key}"]
secret_id_ttl = 0 # Never expire (wrapped tokens provide time limit)
token_ttl = 3600
token_max_ttl = 3600
secret_id_num_uses = 0 # Unlimited uses
}

19
terraform/vault/main.tf Normal file
View File

@@ -0,0 +1,19 @@
terraform {
required_version = ">= 1.0"
required_providers {
vault = {
source = "hashicorp/vault"
version = "~> 4.0"
}
random = {
source = "hashicorp/random"
version = "~> 3.6"
}
}
}
provider "vault" {
address = var.vault_address
token = var.vault_token
skip_tls_verify = var.vault_skip_tls_verify
}

224
terraform/vault/pki.tf Normal file
View File

@@ -0,0 +1,224 @@
# ============================================================================
# PKI Infrastructure Configuration
# ============================================================================
#
# This file configures a two-tier PKI hierarchy:
# - Root CA (pki/) - 10 year validity, EC P-384, kept offline (internal to Vault)
# - Intermediate CA (pki_int/) - 5 year validity, EC P-384, used for issuing certificates
# - Leaf certificates - Default to EC P-256 for optimal performance
#
# Key Type Choices:
# - Root/Intermediate: EC P-384 (secp384r1) for long-term security
# - Leaf certificates: EC P-256 (secp256r1) for performance and compatibility
# - EC provides smaller keys, faster operations, and lower CPU usage vs RSA
#
# Certificate Issuance Methods:
#
# 1. ACME (Automated Certificate Management Environment)
# - Services fetch certificates automatically using ACME protocol
# - ACME directory: https://vault01.home.2rjus.net:8200/v1/pki_int/acme/directory
# - Enable ACME: bao write pki_int/config/acme enabled=true
# - Compatible with cert-manager, lego, certbot, etc.
#
# 2. Direct Issuance (Non-ACME)
# - Certificates defined in locals.static_certificates
# - Terraform manages lifecycle (issuance, renewal)
# - Useful for services without ACME support
# - Certificates auto-renew 7 days before expiry
#
# 3. Manual Issuance (CLI)
# - bao write pki_int/issue/homelab common_name="service.home.2rjus.net"
# - Useful for one-off certificates or testing
#
# ============================================================================
# Root CA
resource "vault_mount" "pki_root" {
path = "pki"
type = "pki"
description = "Root CA"
default_lease_ttl_seconds = 315360000 # 10 years
max_lease_ttl_seconds = 315360000 # 10 years
}
resource "vault_pki_secret_backend_root_cert" "root" {
backend = vault_mount.pki_root.path
type = "internal"
common_name = "home.2rjus.net Root CA"
ttl = "315360000" # 10 years
format = "pem"
private_key_format = "der"
key_type = "ec"
key_bits = 384 # P-384 curve (NIST P-384, secp384r1)
exclude_cn_from_sans = true
organization = "Homelab"
country = "NO"
}
# Intermediate CA
resource "vault_mount" "pki_int" {
path = "pki_int"
type = "pki"
description = "Intermediate CA"
default_lease_ttl_seconds = 157680000 # 5 years
max_lease_ttl_seconds = 157680000 # 5 years
# Required for ACME support - allow ACME-specific response headers
allowed_response_headers = [
"Replay-Nonce",
"Link",
"Location"
]
}
resource "vault_pki_secret_backend_intermediate_cert_request" "intermediate" {
backend = vault_mount.pki_int.path
type = "internal"
common_name = "home.2rjus.net Intermediate CA"
key_type = "ec"
key_bits = 384 # P-384 curve (NIST P-384, secp384r1)
organization = "Homelab"
country = "NO"
}
resource "vault_pki_secret_backend_root_sign_intermediate" "intermediate" {
backend = vault_mount.pki_root.path
csr = vault_pki_secret_backend_intermediate_cert_request.intermediate.csr
common_name = "Homelab Intermediate CA"
ttl = "157680000" # 5 years
exclude_cn_from_sans = true
organization = "Homelab"
country = "NO"
}
resource "vault_pki_secret_backend_intermediate_set_signed" "intermediate" {
backend = vault_mount.pki_int.path
certificate = vault_pki_secret_backend_root_sign_intermediate.intermediate.certificate
}
# PKI Role for issuing certificates via ACME and direct issuance
resource "vault_pki_secret_backend_role" "homelab" {
backend = vault_mount.pki_int.path
name = "homelab"
allowed_domains = ["home.2rjus.net"]
allow_subdomains = true
max_ttl = 2592000 # 30 days
ttl = 2592000 # 30 days default
# Key configuration - EC (Elliptic Curve) by default
key_type = "ec"
key_bits = 256 # P-256 curve (NIST P-256, secp256r1)
# ACME-friendly settings
allow_ip_sans = true # Allow IP addresses in SANs
allow_localhost = false # Disable localhost
allow_bare_domains = false # Require subdomain or FQDN
allow_glob_domains = false # Don't allow glob patterns in domain names
# Server authentication
server_flag = true
client_flag = false
code_signing_flag = false
email_protection_flag = false
# Key usage (appropriate for EC certificates)
key_usage = [
"DigitalSignature",
"KeyAgreement",
]
ext_key_usage = ["ServerAuth"]
# Certificate properties
require_cn = false # ACME doesn't always use CN
}
# Configure CRL and issuing URLs
resource "vault_pki_secret_backend_config_urls" "config_urls" {
backend = vault_mount.pki_int.path
issuing_certificates = [
"${var.vault_address}/v1/pki_int/ca"
]
crl_distribution_points = [
"${var.vault_address}/v1/pki_int/crl"
]
ocsp_servers = [
"${var.vault_address}/v1/pki_int/ocsp"
]
}
# Configure cluster path (required for ACME)
resource "vault_pki_secret_backend_config_cluster" "cluster" {
backend = vault_mount.pki_int.path
path = "${var.vault_address}/v1/${vault_mount.pki_int.path}"
aia_path = "${var.vault_address}/v1/${vault_mount.pki_int.path}"
}
# Enable ACME support
resource "vault_generic_endpoint" "acme_config" {
depends_on = [
vault_pki_secret_backend_config_cluster.cluster,
vault_pki_secret_backend_role.homelab
]
path = "${vault_mount.pki_int.path}/config/acme"
ignore_absent_fields = true
disable_read = true
disable_delete = true
data_json = jsonencode({
enabled = true
allowed_issuers = ["*"]
allowed_roles = ["*"]
default_directory_policy = "sign-verbatim"
})
}
# ============================================================================
# Direct Certificate Issuance (Non-ACME)
# ============================================================================
# Define static certificates to be issued directly (not via ACME)
# Useful for services that don't support ACME or need long-lived certificates
locals {
static_certificates = {
# Example: Issue a certificate for a specific service
# "vault" = {
# common_name = "vault01.home.2rjus.net"
# alt_names = ["vault01.home.2rjus.net"]
# ip_sans = ["10.69.13.19"]
# ttl = "8760h" # 1 year
# }
}
}
# Issue static certificates
resource "vault_pki_secret_backend_cert" "static_certs" {
for_each = local.static_certificates
backend = vault_mount.pki_int.path
name = vault_pki_secret_backend_role.homelab.name
common_name = each.value.common_name
alt_names = lookup(each.value, "alt_names", [])
ip_sans = lookup(each.value, "ip_sans", [])
ttl = lookup(each.value, "ttl", "720h") # 30 days default
auto_renew = true
min_seconds_remaining = 604800 # Renew 7 days before expiry
}
# Output static certificate data for use in configurations
output "static_certificates" {
description = "Static certificates issued by Vault PKI"
value = {
for k, v in vault_pki_secret_backend_cert.static_certs : k => {
common_name = v.common_name
serial = v.serial_number
expiration = v.expiration
issuing_ca = v.issuing_ca
certificate = v.certificate
private_key = v.private_key
}
}
sensitive = true
}

View File

@@ -0,0 +1,80 @@
# Enable KV v2 secrets engine
resource "vault_mount" "kv" {
path = "secret"
type = "kv"
options = { version = "2" }
description = "KV Version 2 secret store"
}
# Define all secrets with auto-generation support
locals {
secrets = {
# Example host-specific secrets
# "hosts/monitoring01/grafana-admin" = {
# auto_generate = true
# password_length = 32
# }
# "hosts/ha1/mqtt-password" = {
# auto_generate = true
# password_length = 24
# }
# Example service secrets
# "services/prometheus/remote-write" = {
# auto_generate = true
# password_length = 40
# }
# Example shared secrets with manual values
# "shared/smtp/credentials" = {
# auto_generate = false
# data = {
# username = "notifications@2rjus.net"
# password = var.smtp_password # Define in variables.tf and set in terraform.tfvars
# server = "smtp.gmail.com"
# }
# }
# TODO: actually use the secret
"hosts/monitoring01/grafana-admin" = {
auto_generate = true
password_length = 32
}
# TODO: actually use the secret
"hosts/ha1/mqtt-password" = {
auto_generate = true
password_length = 24
}
# TODO: Remove after testing
"hosts/vaulttest01/test-service" = {
auto_generate = true
password_length = 32
}
}
}
# Auto-generate passwords for secrets with auto_generate = true
resource "random_password" "auto_secrets" {
for_each = {
for k, v in local.secrets : k => v
if lookup(v, "auto_generate", false)
}
length = each.value.password_length
special = true
}
# Create all secrets in Vault
resource "vault_kv_secret_v2" "secrets" {
for_each = local.secrets
mount = vault_mount.kv.path
name = each.key
data_json = jsonencode(
lookup(each.value, "auto_generate", false)
? { password = random_password.auto_secrets[each.key].result }
: each.value.data
)
}

View File

@@ -0,0 +1,6 @@
# Copy this file to terraform.tfvars and fill in your values
# terraform.tfvars is gitignored to keep credentials safe
vault_address = "https://vault01.home.2rjus.net:8200"
vault_token = "hvs.XXXXXXXXXXXXXXXXXXXX"
vault_skip_tls_verify = true

View File

@@ -0,0 +1,26 @@
variable "vault_address" {
description = "OpenBao server address"
type = string
default = "https://vault01.home.2rjus.net:8200"
}
variable "vault_token" {
description = "OpenBao root or admin token"
type = string
sensitive = true
}
variable "vault_skip_tls_verify" {
description = "Skip TLS verification (for self-signed certs)"
type = bool
default = true
}
# Example variables for manual secrets
# Uncomment and add to terraform.tfvars as needed
# variable "smtp_password" {
# description = "SMTP password for notifications"
# type = string
# sensitive = true
# }

View File

@@ -43,6 +43,15 @@ locals {
cpu_cores = 2
memory = 2048
disk_size = "20G"
flake_branch = "vault-setup" # Bootstrap from this branch instead of master
}
"vaulttest01" = {
ip = "10.69.13.150/24"
cpu_cores = 2
memory = 2048
disk_size = "20G"
flake_branch = "pki-migration"
vault_wrapped_token = "s.UCpQCOp7cOKDdtGGBvfRWwAt"
}
}
@@ -65,6 +74,8 @@ locals {
gateway = lookup(vm, "gateway", var.default_gateway)
# Branch configuration for bootstrap (optional, uses master if not set)
flake_branch = lookup(vm, "flake_branch", null)
# Vault configuration (optional, for automatic secret provisioning)
vault_wrapped_token = lookup(vm, "vault_wrapped_token", null)
}
}
}
@@ -118,6 +129,11 @@ resource "proxmox_vm_qemu" "vm" {
}
}
# TPM device
tpm_state {
storage = each.value.storage
}
# Start on boot
start_at_node_boot = true
@@ -132,4 +148,12 @@ resource "proxmox_vm_qemu" "vm" {
source = "/dev/urandom"
period = 1000
}
# Lifecycle configuration
lifecycle {
ignore_changes = [
clone, # Template name can change without recreating VMs
startup_shutdown, # Proxmox sets defaults (-1) that we don't need to manage
]
}
}