Phase 4d: Vault Bootstrap Integration - Implementation Summary
Overview
Phase 4d implements automatic Vault/OpenBao integration for new NixOS hosts, enabling:
- Zero-touch secret provisioning on first boot
- Automatic AppRole authentication
- Runtime secret fetching with caching
- Periodic secret rotation
Key principle: Existing sops-nix infrastructure remains unchanged. This is new infrastructure running in parallel.
Architecture
Component Diagram
┌─────────────────────────────────────────────────────────────┐
│                    Developer Workstation                    │
│                                                             │
│  create-host --hostname myhost --ip 10.69.13.x/24           │
│       │                                                     │
│       ├─> Generate host configs (hosts/myhost/)             │
│       ├─> Update flake.nix                                  │
│       ├─> Update terraform/vms.tf                           │
│       ├─> Generate terraform/vault/hosts-generated.tf       │
│       ├─> Apply Vault Terraform (create AppRole)            │
│       └─> Generate wrapped token (24h TTL) ───┐             │
│                                               │             │
└───────────────────────────────────────────────┼─────────────┘
                                                │
                    ┌───────────────────────────┘
                    │ Wrapped Token
                    │ (single-use, 24h expiry)
                    ↓
┌─────────────────────────────────────────────────────────────┐
│                Cloud-init (VM Provisioning)                 │
│                                                             │
│  /etc/environment:                                          │
│    VAULT_ADDR=https://vault01.home.2rjus.net:8200           │
│    VAULT_WRAPPED_TOKEN=hvs.CAES...                          │
│    VAULT_SKIP_VERIFY=1                                      │
└─────────────────────────────────────────────────────────────┘
                    │
                    ↓
┌─────────────────────────────────────────────────────────────┐
│               Bootstrap Service (First Boot)                │
│                                                             │
│  1. Read VAULT_WRAPPED_TOKEN from environment               │
│  2. POST /v1/sys/wrapping/unwrap                            │
│  3. Extract role_id + secret_id                             │
│  4. Store in /var/lib/vault/approle/                        │
│     ├─ role-id (600 permissions)                            │
│     └─ secret-id (600 permissions)                          │
│  5. Continue with nixos-rebuild boot                        │
└─────────────────────────────────────────────────────────────┘
                    │
                    ↓
┌─────────────────────────────────────────────────────────────┐
│                  Runtime (Service Starts)                   │
│                                                             │
│  vault-secret-<name>.service (ExecStartPre)                 │
│    │                                                        │
│    ├─> vault-fetch <secret-path> <output-dir>               │
│    │     │                                                  │
│    │     ├─> Read role_id + secret_id                       │
│    │     ├─> POST /v1/auth/approle/login → token            │
│    │     ├─> GET /v1/secret/data/<path> → secrets           │
│    │     ├─> Write /run/secrets/<name>/password             │
│    │     ├─> Write /run/secrets/<name>/api_key              │
│    │     └─> Cache to /var/lib/vault/cache/<name>/          │
│    │                                                        │
│    └─> chown/chmod secret files                             │
│                                                             │
│  myservice.service                                          │
│    └─> Reads secrets from /run/secrets/<name>/              │
└─────────────────────────────────────────────────────────────┘
Data Flow
- Provisioning Time (Developer → Vault):
  - create-host generates AppRole configuration
  - Terraform creates AppRole + policy in Vault
  - Vault generates wrapped token containing role_id + secret_id
  - Wrapped token stored in terraform/vms.tf
- Bootstrap Time (Cloud-init → VM):
  - Cloud-init injects wrapped token via /etc/environment
  - Bootstrap service unwraps token (single-use operation)
  - Stores unwrapped credentials persistently
- Runtime (Service → Vault):
  - Service starts
  - ExecStartPre hook calls vault-fetch
  - vault-fetch authenticates using stored credentials
  - Fetches secrets and caches them
  - Service reads secrets from filesystem
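The two runtime HTTP calls can be sketched as follows. This is a minimal illustration and not the vault-fetch implementation (which is a shell script). It assumes the standard Vault HTTP API shapes: AppRole login returns the token under auth.client_token, and a KV v2 read nests the key/value pairs under data.data. The function names are hypothetical.

```python
import json
import urllib.request

VAULT_ADDR = "https://vault01.home.2rjus.net:8200"  # default address used in this document


def build_login_request(role_id: str, secret_id: str, addr: str = VAULT_ADDR) -> urllib.request.Request:
    """Build the AppRole login call: POST /v1/auth/approle/login."""
    body = json.dumps({"role_id": role_id, "secret_id": secret_id}).encode()
    return urllib.request.Request(
        f"{addr}/v1/auth/approle/login",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def build_read_request(token: str, secret_path: str, addr: str = VAULT_ADDR) -> urllib.request.Request:
    """Build the KV v2 read call: GET /v1/secret/data/<path>."""
    return urllib.request.Request(
        f"{addr}/v1/secret/data/{secret_path}",
        headers={"X-Vault-Token": token},
        method="GET",
    )


def extract_token(login_response: dict) -> str:
    """Pull the short-lived client token out of a login response."""
    return login_response["auth"]["client_token"]


def extract_secrets(read_response: dict) -> dict:
    """Pull the key/value pairs out of a KV v2 read response."""
    return read_response["data"]["data"]
```

Each key returned by extract_secrets would then become one file under the output directory, for example /run/secrets/grafana/password.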
Implementation Details
1. vault-fetch Helper (scripts/vault-fetch/)
Purpose: Fetch secrets from Vault and write to filesystem
Features:
- Reads AppRole credentials from /var/lib/vault/approle/
- Authenticates to Vault (fresh token each time)
- Fetches secret from KV v2 engine
- Writes individual files per secret key
- Updates cache for fallback
- Gracefully degrades to cache if Vault unreachable
Usage:
vault-fetch hosts/monitoring01/grafana /run/secrets/grafana
Environment Variables:
- VAULT_ADDR: Vault server (default: https://vault01.home.2rjus.net:8200)
- VAULT_SKIP_VERIFY: Skip TLS verification (default: 1)
Error Handling:
- Vault unreachable → Use cache (log warning)
- Invalid credentials → Fail with clear error
- No cache + unreachable → Fail with error
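The three error-handling cases above can be sketched as a small fallback routine. This is an illustrative Python model, not the vault-fetch script itself. VaultUnreachable, write_secret_files and fetch_with_cache are hypothetical names, and the fetch step is passed in as a callable so the logic can be exercised without a Vault server.

```python
import shutil
from pathlib import Path
from typing import Callable, Dict


class VaultUnreachable(Exception):
    """Hypothetical error raised by the fetch step when Vault cannot be reached."""


def write_secret_files(secrets: Dict[str, str], target: Path) -> None:
    """Write one file per secret key, readable by the owner only."""
    target.mkdir(parents=True, exist_ok=True)
    for key, value in secrets.items():
        path = target / key
        path.write_text(value)
        path.chmod(0o600)


def fetch_with_cache(fetch: Callable[[], Dict[str, str]], output_dir: Path, cache_dir: Path) -> str:
    """Return "vault" on a fresh fetch or "cache" when falling back."""
    try:
        secrets = fetch()  # any other error (e.g. invalid credentials) propagates
    except VaultUnreachable:
        if not cache_dir.is_dir() or not any(cache_dir.iterdir()):
            raise RuntimeError("No cache available") from None
        print("WARNING: Using cached secrets")
        output_dir.mkdir(parents=True, exist_ok=True)
        for cached in cache_dir.iterdir():
            shutil.copy2(cached, output_dir / cached.name)
        return "cache"
    write_secret_files(secrets, output_dir)
    write_secret_files(secrets, cache_dir)  # refresh the cache for later fallback
    return "vault"
```

Only the unreachable case falls back to the cache. An authentication failure is left to surface as an error, which matches the "fail with clear error" behavior listed above.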
2. NixOS Module (system/vault-secrets.nix)
Purpose: Declarative Vault secret management for NixOS services
Configuration Options:
vault.enable = true; # Enable Vault integration
vault.secrets.<name> = {
secretPath = "hosts/monitoring01/grafana"; # Path in Vault
outputDir = "/run/secrets/grafana"; # Where to write secrets
cacheDir = "/var/lib/vault/cache/grafana"; # Cache location
owner = "grafana"; # File owner
group = "grafana"; # File group
mode = "0400"; # Permissions
services = [ "grafana" ]; # Dependent services
restartTrigger = true; # Enable periodic rotation
restartInterval = "daily"; # Rotation schedule
};
Module Behavior:
- Fetch Service: Creates vault-secret-<name>.service
  - Runs on boot and before dependent services
  - Calls vault-fetch to populate secrets
  - Sets ownership and permissions
- Rotation Timer: Optionally creates vault-secret-rotate-<name>.timer
  - Scheduled restarts for secret rotation
  - Automatically excluded for critical services
  - Configurable interval (daily, weekly, monthly)
- Critical Service Protection:
  vault.criticalServices = [ "bind" "openbao" "step-ca" ];
  Services in this list never get auto-restart timers
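The module behavior above boils down to a small decision per secret, sketched here in Python rather than Nix. This models the documented rules and is not the module source. It assumes a rotation timer is skipped as soon as any dependent service is in the critical list, and planned_units is a hypothetical name.

```python
from typing import Dict, List

CRITICAL_SERVICES = ["bind", "openbao", "step-ca"]  # vault.criticalServices from this document


def planned_units(name: str, secret: Dict, critical: List[str] = CRITICAL_SERVICES) -> List[str]:
    """List the systemd units that would be generated for one vault.secrets entry."""
    units = [f"vault-secret-{name}.service"]  # the fetch service always exists
    wants_rotation = secret.get("restartTrigger", False)
    # Assumption: one critical dependent service is enough to suppress the timer.
    touches_critical = any(svc in critical for svc in secret.get("services", []))
    if wants_rotation and not touches_critical:
        units.append(f"vault-secret-rotate-{name}.timer")
    return units
```

For example, a grafana secret with restartTrigger enabled yields both units, while a secret whose services list contains bind yields only the fetch service.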
3. create-host Tool Updates
New Functionality:
- Vault Terraform Generation (generators.py):
  - Creates/updates terraform/vault/hosts-generated.tf
  - Adds host policy granting access to secret/data/hosts/<hostname>/*
  - Adds AppRole configuration
  - Idempotent (safe to re-run)
- Wrapped Token Generation (vault_helper.py):
  - Applies Vault Terraform to create AppRole
  - Reads role_id from Vault
  - Generates secret_id
  - Wraps credentials in cubbyhole token (24h TTL, single-use)
  - Returns wrapped token
- VM Configuration Update (manipulators.py):
  - Adds vault_wrapped_token field to VM in vms.tf
  - Preserves other VM settings
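One way to get the idempotent behavior described for the Terraform generation is to treat the host list as a mapping and always render it in sorted order, so re-running for an existing host produces identical output. The sketch below illustrates that idea only. It is not the code in generators.py, and add_host and render_locals are hypothetical names.

```python
from typing import Dict


def add_host(policies: Dict[str, dict], hostname: str) -> Dict[str, dict]:
    """Return a copy of the mapping with the host present exactly once."""
    updated = dict(policies)
    updated[hostname] = {"paths": [f"secret/data/hosts/{hostname}/*"]}
    return updated


def render_locals(policies: Dict[str, dict]) -> str:
    """Render the generated_host_policies locals block with hosts in sorted order."""
    lines = ["locals {", "  generated_host_policies = {"]
    for host in sorted(policies):
        lines.append(f'    "{host}" = {{')
        lines.append("      paths = [")
        for path in policies[host]["paths"]:
            lines.append(f'        "{path}",')
        lines.append("      ]")
        lines.append("    }")
    lines += ["  }", "}"]
    return "\n".join(lines) + "\n"
```

Because the output depends only on the set of hosts, adding the same host twice renders the same file, which keeps re-runs from producing spurious diffs.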
New CLI Options:
create-host --hostname myhost --ip 10.69.13.x/24
# Full workflow with Vault integration
create-host --hostname myhost --skip-vault
# Create host without Vault (legacy behavior)
create-host --hostname myhost --force
# Regenerate everything including new wrapped token
Dependencies Added:
hvac: Python Vault client library
4. Bootstrap Service Updates
New Behavior (hosts/template2/bootstrap.nix):
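# Check for wrapped token
if [ -n "$VAULT_WRAPPED_TOKEN" ]; then
  # Unwrap to get credentials; the wrapped token is sent as the request token
  RESPONSE=$(curl -sS -X POST \
    -H "X-Vault-Token: $VAULT_WRAPPED_TOKEN" \
    "$VAULT_ADDR/v1/sys/wrapping/unwrap")
  # Extract role_id and secret_id (assumes jq is available and that the
  # unwrapped payload exposes both fields under .data)
  ROLE_ID=$(echo "$RESPONSE" | jq -r '.data.role_id')
  SECRET_ID=$(echo "$RESPONSE" | jq -r '.data.secret_id')
  # Store role_id and secret_id
  mkdir -p /var/lib/vault/approle
  echo "$ROLE_ID" > /var/lib/vault/approle/role-id
  echo "$SECRET_ID" > /var/lib/vault/approle/secret-id
  chmod 600 /var/lib/vault/approle/*
  # Continue with bootstrap...
fi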
Error Handling:
- Token already used → Log error, continue bootstrap
- Token expired → Log error, continue bootstrap
- Vault unreachable → Log warning, continue bootstrap
- Never fails bootstrap - host can still run without Vault
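The "never fails bootstrap" rule can be modeled as a function that reports success or failure instead of raising. The Python sketch below illustrates the error handling only; the real logic lives in the bootstrap shell script. It assumes the unwrap response carries role_id and secret_id under data, and the function names are hypothetical.

```python
import json
import os
from pathlib import Path
from typing import Optional, Tuple


def parse_unwrap_response(body: str) -> Optional[Tuple[str, str]]:
    """Return (role_id, secret_id), or None if the response is unusable."""
    try:
        data = json.loads(body)["data"]
        return data["role_id"], data["secret_id"]
    except (ValueError, KeyError, TypeError):
        return None  # covers error responses, e.g. a used or expired token


def store_credentials(role_id: str, secret_id: str, approle_dir: Path) -> None:
    """Write both credentials with 0600 permissions from the moment of creation."""
    approle_dir.mkdir(parents=True, exist_ok=True)
    for filename, value in (("role-id", role_id), ("secret-id", secret_id)):
        fd = os.open(approle_dir / filename, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
        with os.fdopen(fd, "w") as handle:
            handle.write(value + "\n")


def bootstrap_vault(body: Optional[str], approle_dir: Path) -> bool:
    """Store credentials if possible; never raise, so bootstrap can continue."""
    creds = parse_unwrap_response(body) if body is not None else None
    if creds is None:
        print("vault: no usable credentials, continuing bootstrap without Vault")
        return False
    store_credentials(creds[0], creds[1], approle_dir)
    return True
```

Passing None for the body stands in for the Vault-unreachable case, so all three failure modes listed above end in the same place: a log line and a host that still boots.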
5. Cloud-init Configuration
Updates (terraform/cloud-init.tf):
write_files:
- path: /etc/environment
content: |
VAULT_ADDR=https://vault01.home.2rjus.net:8200
VAULT_WRAPPED_TOKEN=${vault_wrapped_token}
VAULT_SKIP_VERIFY=1
VM Configuration (terraform/vms.tf):
locals {
vms = {
"myhost" = {
ip = "10.69.13.x/24"
vault_wrapped_token = "hvs.CAESIBw..." # Added by create-host
}
}
}
6. Vault Terraform Structure
Generated Hosts File (terraform/vault/hosts-generated.tf):
locals {
generated_host_policies = {
"myhost" = {
paths = [
"secret/data/hosts/myhost/*",
]
}
}
}
resource "vault_policy" "generated_host_policies" {
for_each = local.generated_host_policies
name = "host-${each.key}"
policy = <<-EOT
path "secret/data/hosts/${each.key}/*" {
capabilities = ["read", "list"]
}
EOT
}
resource "vault_approle_auth_backend_role" "generated_hosts" {
for_each = local.generated_host_policies
backend = vault_auth_backend.approle.path
role_name = each.key
token_policies = ["host-${each.key}"]
secret_id_ttl = 0 # Never expire
token_ttl = 3600 # 1 hour tokens
}
Separation of Concerns:
- approle.tf: Manual host configurations (ha1, monitoring01)
- hosts-generated.tf: Auto-generated configurations
- secrets.tf: Secret definitions (manual)
- pki.tf: PKI infrastructure
Security Model
Credential Distribution
Wrapped Token Security:
- Single-use: Can only be unwrapped once
- Time-limited: 24h TTL
- Safe in git: A leaked token is useless once it has been unwrapped or its 24h TTL has passed
- Standard Vault pattern: Built-in Vault feature
Why wrapped tokens are secure:
Developer commits wrapped token to git
↓
Attacker finds token in git history
↓
Attacker tries to use token
↓
❌ Token already used (unwrapped during bootstrap)
↓
❌ OR: Token expired (>24h old)
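The two rejection paths in the flow above can be captured in a toy model. This is a simplified illustration of the single-use and TTL properties, not Vault's implementation. The WrappedToken class and its error messages are hypothetical.

```python
from typing import Dict, Optional

TTL_SECONDS = 24 * 60 * 60  # 24h TTL from this document


class WrappedToken:
    """Toy model of a response-wrapped token: single-use and time-limited."""

    def __init__(self, payload: Dict[str, str], created_at: float, ttl: float = TTL_SECONDS):
        self._payload: Optional[Dict[str, str]] = dict(payload)
        self._created_at = created_at
        self._ttl = ttl

    def unwrap(self, now: float) -> Dict[str, str]:
        """Return the payload once; reject expired or already-used tokens."""
        if now - self._created_at > self._ttl:
            raise PermissionError("token expired")
        if self._payload is None:
            raise PermissionError("token already used")
        payload, self._payload = self._payload, None  # consume on first use
        return payload
```

In this model, a host that unwraps one minute after creation gets the credentials, and anyone replaying the same token afterwards is rejected. The model also shows the remaining window: a token that has not yet been unwrapped is usable by whoever presents it first within the TTL.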
AppRole Credentials
Storage:
- Location: /var/lib/vault/approle/
- Permissions: 600 (root:root)
- Persistence: Survives reboots
Security Properties:
- role_id: Non-sensitive (like username)
- secret_id: Sensitive (like password)
- secret_id_ttl = 0: Never expires (simplicity vs rotation tradeoff)
- Tokens: Ephemeral (1h TTL, not cached)
Attack Scenarios:
- Attacker gets root on host:
  - Can read AppRole credentials
  - Can only access that host's secrets
  - Cannot access other hosts' secrets (policy restriction)
  - ✅ Blast radius limited to single host
- Attacker intercepts wrapped token:
  - Single-use: Already consumed during bootstrap
  - Time-limited: Likely expired
  - ✅ Cannot be reused
- Vault server compromised:
  - All secrets exposed (same as any secret storage)
  - ✅ No different from sops-nix master key compromise
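The single-host blast radius follows from the per-host policy path secret/data/hosts/<hostname>/*, where the trailing * acts as a prefix match. The Python sketch below models that check for illustration; Vault evaluates policies server-side, and can_read is a hypothetical helper.

```python
def host_policy_prefix(hostname: str) -> str:
    """Prefix granted by the generated policy path secret/data/hosts/<hostname>/*."""
    return f"secret/data/hosts/{hostname}/"


def can_read(hostname: str, path: str) -> bool:
    """True if the host's policy covers the requested secret path."""
    return path.startswith(host_policy_prefix(hostname))
```

Because the prefix ends with a slash, credentials stolen from myhost cover secret/data/hosts/myhost/grafana but not secret/data/hosts/monitoring01/grafana, and not a similarly named host such as myhost2.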
Secret Storage
Runtime Secrets:
- Location: /run/secrets/ (tmpfs)
- Lost on reboot
- Re-fetched on service start
- ✅ Not in Nix store
- ✅ Not persisted to disk
Cached Secrets:
- Location: /var/lib/vault/cache/
- Persists across reboots
- Only used when Vault unreachable
- ✅ Enables service availability
- ⚠️ May be stale
Failure Modes
Wrapped Token Expired
Symptom: Bootstrap logs "token expired" error
Impact: Host boots but has no Vault credentials
Fix: Regenerate token and redeploy
create-host --hostname myhost --force
cd terraform && tofu apply
Vault Unreachable
Symptom: Service logs "WARNING: Using cached secrets"
Impact: Service uses stale secrets (may work or fail depending on rotation)
Fix: Restore Vault connectivity, restart service
No Cache Available
Symptom: Service fails to start with "No cache available"
Impact: Service unavailable until Vault restored
Fix: Restore Vault, restart service
Invalid Credentials
Symptom: vault-fetch logs authentication failure
Impact: Service cannot start
Fix:
- Check AppRole exists: vault read auth/approle/role/hostname
- Check policy exists: vault policy read host-hostname
- Regenerate credentials if needed
Migration Path
Current State (Phase 4d)
- ✅ sops-nix: Used by all existing services
- ✅ Vault: Available for new services
- ✅ Parallel operation: Both work simultaneously
Future Migration
Gradual Service Migration:
- Pick a non-critical service (e.g., test service)
- Add Vault secrets:
  vault.secrets.myservice = { secretPath = "hosts/myhost/myservice"; };
- Update service to read from Vault:
  systemd.services.myservice.serviceConfig = { EnvironmentFile = "/run/secrets/myservice/password"; };
- Remove sops-nix secret
- Test thoroughly
- Repeat for next service
Critical Services Last:
- DNS (bind)
- Certificate Authority (step-ca)
- Vault itself (openbao)
Eventually:
- All services migrated to Vault
- Remove sops-nix dependency
- Clean up /secrets/ directory
Performance Considerations
Bootstrap Time
Added overhead: ~2-5 seconds
- Token unwrap: ~1s
- Credential storage: ~1s
Total bootstrap time: Still <2 minutes (acceptable)
Service Startup
Added overhead: ~1-3 seconds per service
- Vault authentication: ~1s
- Secret fetch: ~1s
- File operations: <1s
Parallel vs Serial:
- Multiple services fetch in parallel
- No cascade delays
Cache Benefits
When Vault unreachable:
- Service starts in <1s (cache read)
- No Vault dependency for startup
- High availability maintained
Testing Checklist
Complete testing workflow documented in vault-bootstrap-testing.md:
- Create test host with create-host
- Add test secrets to Vault
- Deploy VM and verify bootstrap
- Verify secrets fetched successfully
- Test service restart (re-fetch)
- Test Vault unreachable (cache fallback)
- Test secret rotation
- Test wrapped token expiry
- Test token reuse prevention
- Verify critical services excluded from auto-restart
Files Changed
Created
- scripts/vault-fetch/vault-fetch.sh - Secret fetching script
- scripts/vault-fetch/default.nix - Nix package
- scripts/vault-fetch/README.md - Documentation
- system/vault-secrets.nix - NixOS module
- scripts/create-host/vault_helper.py - Vault API client
- terraform/vault/hosts-generated.tf - Generated Terraform
- docs/vault-bootstrap-implementation.md - This file
- docs/vault-bootstrap-testing.md - Testing guide
Modified
- scripts/create-host/default.nix - Add hvac dependency
- scripts/create-host/create_host.py - Add Vault integration
- scripts/create-host/generators.py - Add Vault Terraform generation
- scripts/create-host/manipulators.py - Add wrapped token injection
- terraform/cloud-init.tf - Inject Vault credentials
- terraform/vms.tf - Support vault_wrapped_token field
- hosts/template2/bootstrap.nix - Unwrap token and store credentials
- system/default.nix - Import vault-secrets module
- flake.nix - Add vault-fetch package
Unchanged
- All existing sops-nix configuration
- All existing service configurations
- All existing host configurations
- /secrets/ directory
Future Enhancements
Phase 4e+ (Not in Scope)
- Dynamic Secrets
  - Database credentials with rotation
  - Cloud provider credentials
  - SSH certificates
- Secret Watcher
  - Monitor Vault for secret changes
  - Automatically restart services on rotation
  - Faster than periodic timers
- PKI Integration (Phase 4c)
  - Migrate from step-ca to Vault PKI
  - Automatic certificate issuance
  - Short-lived certificates
- Audit Logging
  - Track secret access
  - Alert on suspicious patterns
  - Compliance reporting
- Multi-Environment
  - Dev/staging/prod separation
  - Per-environment Vault namespaces
  - Separate AppRoles per environment
Conclusion
Phase 4d successfully implements automatic Vault integration for new NixOS hosts with:
- ✅ Zero-touch provisioning
- ✅ Secure credential distribution
- ✅ Graceful degradation
- ✅ Backward compatibility
- ✅ Production-ready error handling
The infrastructure is ready for gradual migration of existing services from sops-nix to Vault.