nixos-servers/docs/vault-bootstrap-testing.md
Torjus Håkestad f5904738b0
vault: implement bootstrap integration
2026-02-03 01:09:43 +01:00


Phase 4d: Vault Bootstrap Integration - Testing Guide

This guide walks through testing the complete Vault bootstrap workflow implemented in Phase 4d.

Prerequisites

Before testing, ensure:

  1. Vault server is running: vault01 (vault01.home.2rjus.net:8200) is accessible
  2. Vault access: You have a Vault token with admin permissions (set BAO_TOKEN env var)
  3. OpenTofu installed: the tofu command is available in your PATH
  4. Git repository clean: All Phase 4d changes are committed to a branch
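The prerequisites above can be checked in one go before starting. Below is a minimal bash sketch. The helper names are hypothetical. The reachability probe assumes vault01 serves the standard /v1/sys/health endpoint, which by default answers 200 only when the server is initialized, unsealed and active.

```bash
# Preflight checks for the Vault bootstrap test (hypothetical helpers).

# Fail if the named environment variable is unset or empty.
require_env() {
  local name=$1
  if [ -z "${!name:-}" ]; then
    echo "missing environment variable: $name" >&2
    return 1
  fi
}

# Fail if the named command is not on PATH.
require_cmd() {
  if ! command -v "$1" >/dev/null 2>&1; then
    echo "missing command: $1" >&2
    return 1
  fi
}

# Fail if not in a git repo, or if there are staged, unstaged or untracked changes.
require_clean_git() {
  local status
  status=$(git status --porcelain) || return 1
  if [ -n "$status" ]; then
    echo "git working tree is not clean" >&2
    return 1
  fi
}

# Probe the health endpoint; curl -f turns HTTP errors (e.g. 503 when sealed) into failures.
# Add -k if your workstation does not trust vault01's certificate.
vault_reachable() {
  curl -sf -o /dev/null "${VAULT_ADDR:-https://vault01.home.2rjus.net:8200}/v1/sys/health"
}

# Run all checks, stopping at the first failure.
preflight() {
  require_env BAO_TOKEN &&
    require_cmd tofu &&
    require_cmd nix &&
    require_clean_git &&
    vault_reachable
}
```

Run preflight && echo OK from the repository root before Step 1.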

Test Scenario: Create vaulttest01

Step 1: Create Test Host Configuration

Run the create-host tool with Vault integration:

# Export a Vault admin token (see Prerequisites)
export BAO_TOKEN="your-vault-admin-token"

# Create test host
nix run .#create-host -- \
  --hostname vaulttest01 \
  --ip 10.69.13.150/24 \
  --cpu 2 \
  --memory 2048 \
  --disk 20G

# To regenerate the whole host configuration (overwrites manual edits;
# to replace only an expired wrapped token, use --regenerate-token):
nix run .#create-host -- \
  --hostname vaulttest01 \
  --ip 10.69.13.150/24 \
  --force

What this does:

  • Creates hosts/vaulttest01/ configuration
  • Updates flake.nix with new host
  • Updates terraform/vms.tf with VM definition
  • Generates terraform/vault/hosts-generated.tf with the host's AppRole and policy, then applies it to Vault
  • Creates a wrapped token (24h TTL, single-use)
  • Adds wrapped token to VM configuration
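The wrapping bullets above roughly correspond to the Vault CLI calls below. This is a hedged reconstruction for orientation, not create-host's actual code. It assumes the AppRole is named after the host, as in the vault read auth/approle/role/vaulttest01 troubleshooting command. It also assumes role_id and secret_id are wrapped together with sys/wrapping/wrap. wrap_approle_creds is a hypothetical name.

```bash
# Hypothetical reconstruction of the wrapping step; requires an admin token.
# Prints a single-use response-wrapping token valid for 24 hours.
# Note: the IDs are passed as arguments, so they are briefly visible in the process list.
wrap_approle_creds() {
  local host=$1 role_id secret_id
  # Read the role's static role_id.
  role_id=$(vault read -field=role_id "auth/approle/role/${host}/role-id") || return 1
  # Generate a fresh secret_id (-f: write with no data).
  secret_id=$(vault write -f -field=secret_id "auth/approle/role/${host}/secret-id") || return 1
  # Wrap both values into one wrapping token with a 24h TTL.
  vault write -wrap-ttl=24h -field=wrapping_token sys/wrapping/wrap \
    role_id="$role_id" secret_id="$secret_id"
}
```

Run it as wrap_approle_creds vaulttest01. Unwrapping the printed token (vault unwrap <token>) returns role_id and secret_id exactly once.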

Expected output:

✓ All validations passed
✓ Created hosts/vaulttest01/default.nix
✓ Created hosts/vaulttest01/configuration.nix
✓ Updated flake.nix
✓ Updated terraform/vms.tf

Configuring Vault integration...
✓ Updated terraform/vault/hosts-generated.tf
Applying Vault Terraform configuration...
✓ Terraform applied successfully
Reading AppRole credentials for vaulttest01...
✓ Retrieved role_id
✓ Generated secret_id
Creating wrapped token (24h TTL, single-use)...
✓ Created wrapped token: hvs.CAESIBw...
⚠️  Token expires in 24 hours
⚠️  Token can only be used once
✓ Added wrapped token to terraform/vms.tf

✓ Host configuration generated successfully!

Step 2: Add Test Service Configuration

Edit hosts/vaulttest01/configuration.nix to enable Vault and add a test service:

{ config, pkgs, lib, ... }:
{
  imports = [
    ../../system
    ../../common/vm
  ];

  # Enable Vault secrets management
  vault.enable = true;

  # Define a test secret
  vault.secrets.test-service = {
    secretPath = "hosts/vaulttest01/test-service";
    restartTrigger = true;
    restartInterval = "daily";
    services = [ "vault-test" ];
  };

  # Create a test service that uses the secret
  systemd.services.vault-test = {
    description = "Test Vault secret fetching";
    wantedBy = [ "multi-user.target" ];
    after = [ "vault-secret-test-service.service" ];

    serviceConfig = {
      Type = "oneshot";
      RemainAfterExit = true;

      ExecStart = pkgs.writeShellScript "vault-test" ''
        echo "=== Vault Secret Test ==="
        echo "Secret path: hosts/vaulttest01/test-service"

        if [ -f /run/secrets/test-service/password ]; then
          echo "✓ Password file exists"
          echo "Password length: $(wc -c < /run/secrets/test-service/password)"
        else
          echo "✗ Password file missing!"
          exit 1
        fi

        if [ -d /var/lib/vault/cache/test-service ]; then
          echo "✓ Cache directory exists"
        else
          echo "✗ Cache directory missing!"
          exit 1
        fi

        echo "Test successful!"
      '';

      StandardOutput = "journal+console";
    };
  };

  # Rest of configuration...
  networking.hostName = "vaulttest01";
  networking.domain = "home.2rjus.net";

  systemd.network.networks."10-lan" = {
    matchConfig.Name = "ens18";
    address = [ "10.69.13.150/24" ];
    gateway = [ "10.69.13.1" ];
    dns = [ "10.69.13.5" "10.69.13.6" ];
    domains = [ "home.2rjus.net" ];
  };

  system.stateVersion = "25.11";
}

Step 3: Create Test Secrets in Vault

Add the test secret to Vault through Terraform by editing terraform/vault/secrets.tf:

locals {
  secrets = {
    # ... existing secrets ...

    # Test secret for vaulttest01
    "hosts/vaulttest01/test-service" = {
      auto_generate = true
      password_length = 24
    }
  }
}

Apply the Vault configuration:

cd terraform/vault
tofu apply

Verify the secret exists:

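export VAULT_ADDR=https://vault01.home.2rjus.net:8200
export VAULT_SKIP_VERIFY=1         # skips TLS certificate verification
export VAULT_TOKEN="$BAO_TOKEN"    # the vault CLI reads VAULT_TOKEN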

vault kv get secret/hosts/vaulttest01/test-service
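To check the generated value rather than just its presence, vault kv get -field prints a single key. Below is a small sketch that asserts the password field exists and has the configured password_length. It assumes auto_generate produces exactly that many characters. check_secret_length is a hypothetical name.

```bash
# Assert that a KV secret has a field of the expected length (never prints the value).
# Usage: check_secret_length <path> <field> <expected_length>
check_secret_length() {
  local value
  value=$(vault kv get -field="$2" "secret/$1") || return 1
  if [ "${#value}" -ne "$3" ]; then
    echo "secret/$1 field $2 has length ${#value}, expected $3" >&2
    return 1
  fi
}

# Example: check_secret_length hosts/vaulttest01/test-service password 24
```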

Step 4: Deploy the VM

Important: Deploy within 24 hours of running create-host. The wrapped token expires after 24h and can be used only once.

cd terraform
tofu plan   # Review changes
tofu apply  # Deploy VM
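An expired or already-used token only shows up as a bootstrap failure inside the VM, so it can be worth checking the token before tofu apply. Below is a minimal sketch. Vault's sys/wrapping/lookup endpoint returns a wrapping token's metadata without consuming it, and it fails once the token has expired or been unwrapped. The sketch assumes VAULT_ADDR is exported as in Step 3 and that you pass the token string copied from terraform/vms.tf. check_wrapped_token is a hypothetical helper.

```bash
# Check that a response-wrapping token can still be unwrapped, without consuming it.
# Usage: check_wrapped_token <token>
check_wrapped_token() {
  local token=$1 ttl
  # creation_ttl is the token's original TTL in seconds.
  if ! ttl=$(vault write -field=creation_ttl sys/wrapping/lookup token="$token" 2>/dev/null); then
    echo "wrapped token is expired, already used, or invalid; regenerate it" >&2
    return 1
  fi
  echo "wrapped token is still valid (created with TTL ${ttl}s)"
}
```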

Step 5: Monitor Bootstrap Process

SSH into the VM and monitor the bootstrap:

# Watch bootstrap logs
ssh root@vaulttest01
journalctl -fu nixos-bootstrap.service

# Expected log output:
# Starting NixOS bootstrap for host: vaulttest01
# Network connectivity confirmed
# Unwrapping Vault token to get AppRole credentials...
# Vault credentials unwrapped and stored successfully
# Fetching and building NixOS configuration from flake...
# Successfully built configuration for vaulttest01
# Rebooting into new configuration...
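You can also check the expected lines mechanically instead of by eye. The sketch below uses a hypothetical expect_lines helper that reads a log on stdin and reports every expected substring that is missing. The VM reboots at the end, so the bootstrap run is in the previous boot's journal. Reading it with journalctl -b -1 assumes the journal is persistent on the VM.

```bash
# Read a log from stdin; fail and list any expected substrings that are missing.
# Usage: <log command> | expect_lines "line 1" "line 2" ...
expect_lines() {
  local log want missing=0
  log=$(cat)
  for want in "$@"; do
    # -F: fixed-string match; -q: quiet; -- guards patterns starting with a dash.
    if ! grep -qF -- "$want" <<<"$log"; then
      echo "missing: $want" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Example (on vaulttest01 after the reboot):
# journalctl -b -1 -u nixos-bootstrap.service | expect_lines \
#   "Vault credentials unwrapped and stored successfully" \
#   "Successfully built configuration for vaulttest01"
```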

Step 6: Verify Vault Integration

After the VM reboots, verify the integration:

ssh root@vaulttest01

# Check AppRole credentials were stored
ls -la /var/lib/vault/approle/
# Expected: role-id and secret-id files with 600 permissions

cat /var/lib/vault/approle/role-id
# Should show a UUID

# Check vault-secret service ran successfully
systemctl status vault-secret-test-service.service
# Should be active (exited)

journalctl -u vault-secret-test-service.service
# Should show successful secret fetch:
# [vault-fetch] Authenticating to Vault at https://vault01.home.2rjus.net:8200
# [vault-fetch] Successfully authenticated to Vault
# [vault-fetch] Fetching secret from path: hosts/vaulttest01/test-service
# [vault-fetch] Writing secrets to /run/secrets/test-service
# [vault-fetch]   - Wrote secret key: password
# [vault-fetch] Successfully fetched and cached secrets

# Check test service passed
systemctl status vault-test.service
journalctl -u vault-test.service
# Should show:
# === Vault Secret Test ===
# ✓ Password file exists
# ✓ Cache directory exists
# Test successful!

# Verify secret files exist
ls -la /run/secrets/test-service/
# Should show password file with 400 permissions

# Verify cache exists
ls -la /var/lib/vault/cache/test-service/
# Should show cached password file
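The permission checks in this step can be scripted. The sketch below uses GNU stat -c %a, which is part of coreutils on NixOS. The expected modes are the ones stated above: 600 for the AppRole files and 400 under /run/secrets. check_mode is a hypothetical helper.

```bash
# Fail if any path is missing or its octal permission bits differ from the expected mode.
# Usage: check_mode <expected_mode> <path>...
check_mode() {
  local want=$1 path mode rc=0
  shift
  for path in "$@"; do
    if ! mode=$(stat -c %a -- "$path" 2>/dev/null); then
      echo "missing: $path" >&2
      rc=1
    elif [ "$mode" != "$want" ]; then
      echo "$path: mode $mode, expected $want" >&2
      rc=1
    fi
  done
  return "$rc"
}

# Example (on vaulttest01):
# check_mode 600 /var/lib/vault/approle/role-id /var/lib/vault/approle/secret-id &&
#   check_mode 400 /run/secrets/test-service/password &&
#   systemctl is-active --quiet vault-test.service && echo "all checks passed"
```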

Test Scenarios

Scenario 1: Fresh Deployment

This is the walkthrough in Steps 1 to 6.

Expected: All secrets are fetched successfully from Vault

Scenario 2: Service Restart

systemctl restart vault-test.service

Expected: Secrets re-fetched from Vault, service starts successfully

Scenario 3: Vault Unreachable

# On vault01, stop Vault temporarily
ssh root@vault01
systemctl stop openbao

# On vaulttest01, re-run the secret fetch and the test service
ssh root@vaulttest01
systemctl restart vault-secret-test-service.service vault-test.service
journalctl -u vault-secret-test-service.service | tail -20

Expected:

  • Warning logged: "Using cached secrets from /var/lib/vault/cache/test-service"
  • Service starts successfully using cached secrets

# Restore Vault
ssh root@vault01
systemctl start openbao
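This scenario exercises a fetch-with-fallback pattern: publish fresh values when Vault answers, otherwise reuse the last cached copy. The sketch below illustrates that pattern only. It is not the module's vault-fetch script. The function name, the pluggable fetch command and the directory handling are assumptions. Fetching into a temporary directory first means a fetch that fails halfway never overwrites the working files or the cache.

```bash
# Fetch-with-fallback sketch (hypothetical, not the module's implementation).
# Usage: fetch_or_cache <output_dir> <cache_dir> <fetch_cmd> [args...]
# The fetch command gets a temporary directory appended as its last argument.
fetch_or_cache() {
  local out=$1 cache=$2 tmp
  shift 2
  tmp=$(mktemp -d) || return 1
  if "$@" "$tmp"; then
    # Fresh values: publish them and refresh the cache.
    mkdir -p "$out" "$cache"
    cp -a "$tmp"/. "$out"/
    cp -a "$tmp"/. "$cache"/
  elif [ -n "$(ls -A "$cache" 2>/dev/null)" ]; then
    # Fetch failed but a cache exists: fall back to it.
    echo "fetch failed, using cached secrets from $cache" >&2
    mkdir -p "$out"
    cp -a "$cache"/. "$out"/
  else
    echo "fetch failed and no cache at $cache" >&2
    rm -rf "$tmp"
    return 1
  fi
  rm -rf "$tmp"
}
```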

Scenario 4: Secret Rotation

# Update secret in Vault
vault kv put secret/hosts/vaulttest01/test-service password="new-secret-value"

# On vaulttest01, trigger rotation
ssh root@vaulttest01
systemctl restart vault-secret-test-service.service

# Verify new secret
cat /run/secrets/test-service/password
# Should show new value

Expected: New secret fetched and cached
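To confirm that the host now holds exactly the value stored in Vault, rather than just a changed file, you can compare the two from your workstation. Below is a minimal sketch. It assumes the admin environment from Step 3 and root SSH access, as used throughout this guide. secret_matches_host is a hypothetical helper, and it never prints the secret itself.

```bash
# Compare a field in Vault with the corresponding file on a host.
# Usage: secret_matches_host <host> <kv_path> <field> <remote_file>
secret_matches_host() {
  local host=$1 path=$2 field=$3 file=$4 want have
  want=$(vault kv get -field="$field" "secret/$path") || return 1
  have=$(ssh "root@$host" cat "$file") || return 1
  # $(...) strips trailing newlines on both sides, so only the values are compared.
  if [ "$want" != "$have" ]; then
    echo "mismatch between secret/$path ($field) and $host:$file" >&2
    return 1
  fi
}

# Example:
# secret_matches_host vaulttest01 hosts/vaulttest01/test-service password \
#   /run/secrets/test-service/password
```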

Scenario 5: Expired Wrapped Token

# Wait 24+ hours after create-host, then try to deploy
cd terraform
tofu apply

Expected: Bootstrap fails with a message about the expired token

Fix (Option 1 - Regenerate token only):

# Only regenerates the wrapped token, preserves all other configuration
nix run .#create-host -- --hostname vaulttest01 --regenerate-token
cd terraform
tofu apply

Fix (Option 2 - Full regeneration with --force):

# Overwrites entire host configuration (including any manual changes)
nix run .#create-host -- --hostname vaulttest01 --force
cd terraform
tofu apply

Recommendation: Use --regenerate-token to avoid losing manual configuration changes.

Scenario 6: Already-Used Wrapped Token

Deploy the same VM a second time (for example, destroy it and run tofu apply again) without regenerating the token.

Expected: The second bootstrap fails with a "token already used" message

Cleanup

After testing:

# Destroy test VM
cd terraform
tofu destroy -target='proxmox_vm_qemu.vm["vaulttest01"]'

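# Remove test secrets from Vault (also delete the entry from
# terraform/vault/secrets.tf, or a later tofu apply may recreate it).
# On a KV v2 mount this is a soft delete; "vault kv metadata delete"
# removes all versions.
vault kv delete secret/hosts/vaulttest01/test-service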

# Remove host configuration (optional)
git rm -r hosts/vaulttest01
# Edit flake.nix to remove nixosConfigurations.vaulttest01
# Edit terraform/vms.tf to remove vaulttest01
# Edit terraform/vault/hosts-generated.tf to remove vaulttest01, then run
# tofu apply in terraform/vault to remove its AppRole and policy

Success Criteria Checklist

Phase 4d is considered successful when:

  • create-host generates Vault configuration automatically
  • New hosts receive their AppRole credentials (as a single-use wrapped token) via cloud-init
  • Bootstrap stores credentials in /var/lib/vault/approle/
  • Services can fetch secrets using vault.secrets option
  • Secrets extracted to individual files in /run/secrets/
  • Cached secrets work when Vault is unreachable
  • Periodic restart timers work for secret rotation
  • Critical services excluded from auto-restart
  • Test host deploys and passes the Step 6 verification
  • sops-nix continues to work for existing services

Troubleshooting

Bootstrap fails with "Failed to unwrap Vault token"

Possible causes:

  • Token already used (wrapped tokens are single-use)
  • Token expired (24h TTL)
  • Invalid token
  • Vault unreachable

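Solution: if the token was already used or has expired, regenerate it. If Vault is unreachable, restore connectivity first.

# Regenerate only the wrapped token (preserves manual configuration changes)
nix run .#create-host -- --hostname vaulttest01 --regenerate-token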
cd terraform && tofu apply

Secret fetch fails with authentication error

Check:

# On your workstation (admin token): verify the AppRole exists
vault read auth/approle/role/vaulttest01

# Verify the policy exists
vault policy read host-vaulttest01

# On vaulttest01: test authentication manually
export VAULT_ADDR=https://vault01.home.2rjus.net:8200
ROLE_ID=$(cat /var/lib/vault/approle/role-id)
SECRET_ID=$(cat /var/lib/vault/approle/secret-id)
vault write auth/approle/login role_id="$ROLE_ID" secret_id="$SECRET_ID"
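You can make the same login against the HTTP API, which is useful when the vault CLI is not installed on the host. The sketch below assumes curl is available on vaulttest01. It also assumes both IDs are UUIDs, so they need no JSON escaping. POST /v1/auth/approle/login with role_id and secret_id is Vault's AppRole login endpoint. approle_login is a hypothetical name.

```bash
# Log in with AppRole over HTTP and print Vault's raw JSON response.
# Assumes role-id and secret-id are UUIDs, so they need no JSON escaping.
# Usage: approle_login <vault_addr> <role_id_file> <secret_id_file>
approle_login() {
  local addr=$1 role_id secret_id
  role_id=$(cat "$2") || return 1
  secret_id=$(cat "$3") || return 1
  # printf is a shell builtin and the body goes to curl on stdin (--data @-),
  # so the secret_id does not appear in any process's argument list.
  printf '{"role_id":"%s","secret_id":"%s"}' "$role_id" "$secret_id" |
    curl -sS --data @- "$addr/v1/auth/approle/login"
}
```

Run it as approle_login https://vault01.home.2rjus.net:8200 /var/lib/vault/approle/role-id /var/lib/vault/approle/secret-id. A successful response contains auth.client_token. A failed login returns an errors array instead.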

Cache not working

Check:

# Verify cache directory exists and has files
ls -la /var/lib/vault/cache/test-service/

# Check permissions
stat /var/lib/vault/cache/test-service/password
# Should be 600 (rw-------)
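To check every cached file at once instead of running one stat per file, find can list anything that is not exactly mode 600. Below is a sketch with a hypothetical check_cache_dir helper.

```bash
# Fail if the cache directory is missing or empty, or if any file in it is not mode 600.
# Usage: check_cache_dir <dir>
check_cache_dir() {
  local dir=$1 bad
  if [ -z "$(ls -A "$dir" 2>/dev/null)" ]; then
    echo "cache directory $dir is missing or empty" >&2
    return 1
  fi
  # -perm 600 (no - or / prefix) matches the exact mode; ! inverts the test.
  bad=$(find "$dir" -type f ! -perm 600)
  if [ -n "$bad" ]; then
    printf 'wrong permissions:\n%s\n' "$bad" >&2
    return 1
  fi
}

# Example: check_cache_dir /var/lib/vault/cache/test-service
```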

Next Steps

After successful testing:

  1. Gradually migrate existing services from sops-nix to Vault
  2. Consider implementing a secret watcher for faster rotation (future enhancement)
  3. Phase 4c: Migrate from step-ca to OpenBao PKI
  4. Eventually deprecate and remove sops-nix