- Update homelab-deploy input to get metrics support
- Enable metrics endpoint on port 9972
- Add scrape target for prometheus auto-discovery
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Compare VictoriaMetrics and Thanos as options for extending
metrics retention beyond 30 days while managing disk usage.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add homelab.deploy.enable option (requires vault.enable)
- Create shared homelab-deploy Vault policy for all hosts
- Enable homelab.deploy on all vault-enabled hosts
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add system/homelab-deploy.nix module that automatically enables the
listener on all hosts with vault.enable=true. Uses homelab.host.tier
and homelab.host.role for NATS subject subscriptions.
- Add homelab-deploy access to all host AppRole policies
- Remove manual listener config from vaulttest01 (now handled by system module)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Update homelab-deploy to include bugfix. Add CLI to devShell for
easier testing and deployment operations.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Ensure homelab-deploy-listener waits for the NKey secret to be
fetched from Vault before starting.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add homelab-deploy flake input and NixOS module for message-based
deployments across the fleet. Configure DEPLOY account in NATS with
tiered access control (listener, test-deployer, admin-deployer).
Enable listener on vaulttest01 as initial test host.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add a shared `homelab.host` module that provides host metadata for
multiple consumers:
- tier: deployment tier (test/prod) for future homelab-deploy service
- priority: alerting priority (high/low) for Prometheus label filtering
- role: primary role of the host (dns, database, monitoring, etc.)
- labels: free-form labels for additional metadata
Host configurations updated with appropriate values:
- ns1, ns2: role=dns with dns_role labels
- nix-cache01: priority=low, role=build-host
- vault01: role=vault
- jump: role=bastion
- template, template2, testvm01, vaulttest01: tier=test, priority=low
The module is now imported via commonModules in flake.nix, making it
available to all hosts including minimal configurations like template2.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
MCP exposes two tools:
- deploy: test-tier only, always available
- deploy_admin: all tiers, requires --enable-admin flag
Three security layers: CLI flag, NATS authz, Claude Code permissions.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Support deploying to all hosts in a tier or all hosts with a role:
- deploy.<tier>.all - broadcast to all hosts in tier
- deploy.<tier>.role.<role> - broadcast to hosts with matching role
MCP can deploy to all test hosts at once, admin can deploy to any group.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add plan for NATS-based deployment service (homelab-deploy) that enables
on-demand NixOS configuration updates via messaging. Features tiered
permissions (test/prod) enforced at NATS layer.
Update prometheus-scrape-target-labels plan to share the homelab.host
module for host metadata (tier, priority, role, labels) - single source
of truth for both deployment tiers and prometheus labels.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Increase duration from 5m to 10m and demote severity from critical to
warning. Brief degraded states during nixos-rebuild are normal and were
causing false positive alerts.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Merge Prometheus metrics and Loki logs into a unified troubleshooting
skill. Adds LogQL query patterns, label reference, and common service
units for log searching.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Reference guide for exploring Prometheus metrics when troubleshooting
homelab issues, including the new nixos_flake_info metrics.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Update nixos-exporter to 0.2.3
- Set system.configurationRevision for all hosts so the exporter
can report the flake's git revision
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add nixos-exporter prometheus exporter to track NixOS generation metrics
and flake revision status across all hosts.
Changes:
- Add nixos-exporter flake input
- Add commonModules list in flake.nix for modules shared by all hosts
- Enable nixos-exporter in system/monitoring/metrics.nix
- Configure Prometheus to scrape nixos-exporter on all hosts
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
RemainAfterExit=true kept the service in "active" state, which
prevented OnUnitActiveSec from scheduling new triggers since there
was no new "activation" event. Removing it allows the service to
properly go inactive, enabling the timer to reschedule correctly.
Also fix ExecStart to use lib.getExe for proper path resolution
with writeShellApplication.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add proposed dns_role label to distinguish primary/secondary DNS
resolvers. This addresses the unbound_low_cache_hit_ratio alert
firing on ns2, which has a cold cache due to low traffic.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Backups to the shared restic repository were all scheduled at exactly
midnight, causing lock conflicts. Adding RandomizedDelaySec spreads
them out over a 2-hour window to prevent simultaneous access.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The homeassistant override key should match the entity type in the
MQTT discovery topic path. For battery sensors, the topic is
homeassistant/sensor/<device>/battery/config, so the key should be
"battery" not "sensor_battery".
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Displays FQDN and flake commit hash with timestamp on login.
Templates can override with their own MOTD via mkDefault.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Convert remaining writeShellScript usages to writeShellApplication for
shellcheck validation and strict bash options.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Adds a helper script deployed to all hosts for testing feature branches.
Usage: nixos-rebuild-test <action> <branch>
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Instead of creating a long-lived Vault token in Terraform (which gets
invalidated when Terraform recreates it), monitoring01 now uses its
existing AppRole credentials to fetch a fresh token for Prometheus.
Changes:
- Add prometheus-metrics policy to monitoring01's AppRole
- Remove vault_token.prometheus_metrics resource from Terraform
- Remove openbao-token KV secret from Terraform
- Add systemd service to fetch AppRole token on boot
- Add systemd timer to refresh token every 30 minutes
This ensures Prometheus always has a valid token without depending on
Terraform state or manual intervention.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Remove auth01 host configuration and associated services in preparation
for new auth stack with different provisioning system.
Removed:
- hosts/auth01/ - host configuration
- services/authelia/ - authelia service module
- services/lldap/ - lldap service module
- secrets/auth01/ - sops secrets
- Reverse proxy entries for auth and lldap
- Monitoring alert rules for authelia and lldap
- SOPS configuration for auth01
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Document TrueNAS CORE LDAP integration approach (NFS-only) and
future NixOS NAS migration path with native Kanidm PAM/NSS.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Evaluate options for replacing LLDAP+Authelia with a unified auth solution.
Recommends Kanidm for its native NixOS PAM/NSS integration and built-in OIDC.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Updated plan with:
- Full device inventory from ha1
- Backup verification details
- Branch and commit references
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
WSDCGQ12LM sensors report battery: 0 due to firmware quirk. Override
battery calculation using voltage via homeassistant value_template.
Also adds zigbee_sensor_stale alert for detecting dead sensors regardless
of battery reporting accuracy (1 hour threshold).
Device configuration moved from external devices.yaml to inline NixOS
config for declarative management.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
NATS HTTP monitoring endpoint serves JSON, not Prometheus format.
Use the prometheus-nats-exporter which queries the NATS endpoint
and exposes proper Prometheus metrics.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add exporters and scrape targets for services lacking monitoring:
- PostgreSQL: postgres-exporter on pgdb1
- Authelia: native telemetry metrics on auth01
- Unbound: unbound-exporter with remote-control on ns1/ns2
- NATS: HTTP monitoring endpoint on nats1
- OpenBao: telemetry config and Prometheus scrape with token auth
- Systemd: systemd-exporter on all hosts for per-service metrics
Add alert rules for postgres, auth (authelia + lldap), jellyfin,
vault (openbao), plus extend existing nats and unbound rules.
Add Terraform config for Prometheus metrics policy and token. The
token is created via vault_token resource and stored in KV, so no
manual token creation is needed.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Document Loki log query labels and patterns, and Prometheus job names
with example queries for the lab-monitoring MCP server.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Update CLAUDE.md and README.md to reflect that secrets are now managed
by OpenBao, with sops only remaining for ca. Update migration plans
with sops cleanup checklist and auth01 decommission.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The read-based loop split multiline values on newlines, causing only
the first line to be written. Use jq -j to write each key's value
directly to files, preserving multiline content.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Remove backup_helper_secret variable and switch shared/backup/password
to auto_generate. New password will be added alongside existing restic
repository key.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Replace sops-nix secrets with OpenBao vault secrets across all hosts.
Hardcode root password hash, add extractKey option to vault-secrets
module, update Terraform with secrets/policies for all hosts, and
create AppRole provisioning playbook.
Hosts migrated: ha1, monitoring01, ns1, ns2, http-proxy, nix-cache01
Wave 1 hosts (nats1, jelly01, pgdb1) get AppRole policies only.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
nix-cache01 regularly hits high CPU during nix builds, causing flappy
alerts. Keep the 15m threshold for all other hosts.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The step-ca serving certificate is auto-renewed with a 24h lifetime,
so it always triggers the general < 86400s threshold. Exclude it and
add a dedicated step_ca_serving_cert_expiring alert at < 1h instead.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Move nix-cache_caddy back to a manual config in prometheus.nix using the
service CNAME (nix-cache.home.2rjus.net) instead of the hostname. The
auto-generated target used nix-cache01.home.2rjus.net which doesn't
match the TLS certificate SAN.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add homelab.monitoring NixOS options (enable, scrapeTargets) following
the same pattern as homelab.dns. Prometheus scrape configs are now
auto-generated from flake host configurations and external targets,
replacing hardcoded target lists.
Also cleans up alert rules: snake_case naming, fix zigbee2mqtt typo,
remove duplicate pushgateway alert, add for clauses to monitoring_rules,
remove hardcoded WireGuard public key, and add new alerts for
certificates, proxmox, caddy, smartctl temperature, filesystem
prediction, systemd state, file descriptors, and host reboots.
Fixes grafana scrape target port from 3100 to 3000.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add /modules/ and /lib/ to directory structure
- Document homelab.dns options and zone auto-generation
- Update "Adding a New Host" workflow (no manual zone editing)
- Expand DNS Architecture section with auto-generation details
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Replace static zone file with dynamically generated records:
- Add homelab.dns module with enable/cnames options
- Extract IPs from systemd.network configs (filters VPN interfaces)
- Use git commit timestamp as zone serial number
- Move external hosts to separate external-hosts.nix
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Rename nixos-options to nixpkgs-options and add new nixpkgs-packages
server for package search functionality. Update CLAUDE.md to document
both MCP servers and their available tools.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Replace custom backup-helper flake input with NixOS native
services.restic.backups module for ha1, monitoring01, and nixos-test1.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Fix bug where new hosts were added outside of nixosConfigurations block
instead of inside it.
Issues fixed:
1. Pattern was looking for "packages =" but actual text is "packages = forAllSystems"
2. Replacement was putting new entry AFTER closing brace instead of BEFORE
3. testvm01 was at top-level flake output instead of in nixosConfigurations
Changes:
- Update pattern to match "packages = forAllSystems"
- Put new entry BEFORE the closing brace of nixosConfigurations
- Move testvm01 to correct location inside nixosConfigurations block
Result: nix flake show now correctly shows testvm01 as NixOS configuration
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Fix error "500 can't upload to storage type 'zfspool'" by using "local"
storage pool for cloud-init disks instead of the VM's storage pool.
Cloud-init disks require storage that supports ISO/snippet content types,
which zfspool does not. The "local" storage pool (directory-based) supports
this content type.
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Fix OpenTofu error where static IP and DHCP branches had different object
structures in the subnets array. Move conditional to network_config level
so both branches return complete, consistent yamlencode() results.
Error was: "The true and false result expressions must have consistent types"
Solution: Make network_config itself conditional rather than the subnets
array, ensuring both branches return the same type (string from yamlencode).
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Remove mention of .generated/ directory and clarify that cloud-init.tf
manages all cloud-init disks, not just branch-specific ones.
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Replace SSH upload approach with native proxmox_cloud_init_disk resource
for cleaner, more maintainable cloud-init management.
Changes:
- Use proxmox_cloud_init_disk for all VMs (not just branch-specific ones)
- Include SSH keys, network config, and metadata in cloud-init disk
- Conditionally include NIXOS_FLAKE_BRANCH for VMs with flake_branch set
- Replace ide2 cloudinit disk with cdrom reference to cloud-init disk
- Remove built-in cloud-init parameters (ciuser, sshkeys, etc.)
- Remove cicustom parameter (no longer needed)
- Remove proxmox_host variable (no SSH uploads required)
- Remove .gitignore entry for .generated/ directory
Benefits:
- No SSH access to Proxmox required
- All cloud-init config managed in Terraform
- Consistent approach for all VMs
- Cleaner state management
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Implement dual improvements to enable efficient testing of pipeline changes
without polluting master branch:
1. Add --force flag to create-host script
- Skip hostname/IP uniqueness validation
- Overwrite existing host configurations
- Update entries in flake.nix and terraform/vms.tf (no duplicates)
- Useful for iterating on configurations during testing
2. Add branch support to bootstrap mechanism
- Bootstrap service reads NIXOS_FLAKE_BRANCH environment variable
- Defaults to master if not set
- Uses branch in git URL via ?ref= parameter
- Service loads environment from /etc/environment
3. Add cloud-init disk support for branch configuration
- VMs can specify flake_branch field in terraform/vms.tf
- Automatically generates cloud-init snippet setting NIXOS_FLAKE_BRANCH
- Uploads snippet to Proxmox via SSH
- Production VMs omit flake_branch and use master
4. Update documentation
- Document --force flag usage in create-host README
- Add branch testing examples in terraform README
- Update TODO.md with testing workflow
- Add .generated/ to gitignore
Testing workflow: Create feature branch, set flake_branch in VM definition,
deploy with terraform, iterate with --force flag, clean up before merging.
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Add filesystem configuration matching Proxmox image builder output
to allow template2 to build with both `nixos-rebuild build` and
`nixos-rebuild build-image --image-variant proxmox`.
Filesystem specs discovered from running VM:
- ext4 filesystem with label "nixos"
- x-systemd.growfs option for automatic partition growth
- No swap partition
Using lib.mkDefault ensures these definitions work for normal builds
while allowing the Proxmox image builder to override when needed.
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Add systemd service that automatically bootstraps freshly deployed VMs
with their host-specific NixOS configuration from the flake repository.
Changes:
- hosts/template2/bootstrap.nix: New systemd oneshot service that:
- Runs after cloud-init completes (ensures hostname is set)
- Reads hostname from hostnamectl (set by cloud-init from Terraform)
- Checks network connectivity via HTTPS (curl)
- Runs nixos-rebuild boot with flake URL
- Reboots on success, fails gracefully with clear errors on failure
- hosts/template2/configuration.nix: Configure cloud-init datasource
- Changed from NoCloud to ConfigDrive (used by Proxmox)
- Allows cloud-init to receive config from Proxmox
- hosts/template2/default.nix: Import bootstrap.nix module
- terraform/vms.tf: Add cloud-init disk to VMs
- Configure disks.ide.ide2.cloudinit block
- Removed invalid cloudinit_cdrom_storage parameter
- Enables Proxmox to inject cloud-init configuration
- TODO.md: Mark Phase 3 as completed
This eliminates the manual nixos-rebuild step from the deployment workflow.
VMs now automatically pull and apply their configuration on first boot.
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Implements Phase 2 of the automated deployment pipeline.
This commit adds a Python CLI tool that automates the creation of NixOS host
configurations, eliminating manual boilerplate and reducing errors.
Features:
- Python CLI using typer framework with rich terminal UI
- Comprehensive validation (hostname format/uniqueness, IP subnet/uniqueness)
- Jinja2 templates for NixOS configurations
- Automatic updates to flake.nix and terraform/vms.tf
- Support for both static IP and DHCP configurations
- Dry-run mode for safe previews
- Packaged as Nix derivation and added to devShell
Usage:
create-host --hostname myhost --ip 10.69.13.50/24
The tool generates:
- hosts/<hostname>/default.nix
- hosts/<hostname>/configuration.nix
- Updates flake.nix with new nixosConfigurations entry
- Updates terraform/vms.tf with new VM definition
All generated configurations include full system imports (monitoring, SOPS,
autoupgrade, etc.) and are validated with nix flake check and tofu validate.
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Phase 1 is now fully implemented with parameterized multi-VM deployments
via OpenTofu. Updated status, tasks, and added implementation details.
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Implements Phase 1 of the OpenTofu deployment plan:
- Replace single-VM configuration with locals-based for_each pattern
- Support multiple VMs in single deployment
- Automatic DHCP vs static IP detection
- Configurable defaults with per-VM overrides
- Dynamic outputs for VM IPs and specifications
New files:
- outputs.tf: Dynamic outputs for deployed VMs
- vms.tf: VM definitions using locals.vms map
Updated files:
- variables.tf: Added default variables for VM configuration
- README.md: Comprehensive documentation and examples
Removed files:
- vm.tf: Replaced by new vms.tf (archived as vm.tf.old, then removed)
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Document multi-phase plan for automating NixOS host creation, deployment, and configuration on Proxmox including OpenTofu parameterization, config generation, bootstrap mechanism, secrets management, and Nix-based DNS automation.
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Add automated workflow for building and deploying NixOS VMs on Proxmox including template2 host configuration, Ansible playbook for image building/deployment, and OpenTofu configuration for VM provisioning with cloud-init.
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
description: Reference guide for exploring Prometheus metrics and Loki logs when troubleshooting homelab issues. Use when investigating system state, deployments, service health, or searching logs.
---
# Observability Troubleshooting Guide
Quick reference for exploring Prometheus metrics and Loki logs to troubleshoot homelab issues.
## Available Tools
Use the `lab-monitoring` MCP server tools:
**Metrics:**
-`search_metrics` - Find metrics by name substring
-`get_metric_metadata` - Get type/help for a specific metric
-`query` - Execute PromQL queries
-`list_targets` - Check scrape target health
-`list_alerts` / `get_alert` - View active alerts
**Logs:**
-`query_logs` - Execute LogQL queries against Loki
-`list_labels` - List available log labels
-`list_label_values` - List values for a specific label
description: Create a planning document for a future homelab project. Use when the user wants to document ideas for future work without implementing immediately.
argument-hint: [topic or feature to plan]
---
# Quick Plan Generator
Create a planning document for a future homelab infrastructure project. Plans are for documenting ideas and approaches that will be implemented later, not immediately.
## Input
The user provides: $ARGUMENTS
## Process
1.**Understand the topic**: Research the codebase to understand:
- Current state of related systems
- Existing patterns and conventions
- Relevant NixOS options or packages
- Any constraints or dependencies
2.**Evaluate options**: If there are multiple approaches, research and compare them with pros/cons.
3.**Draft the plan**: Create a markdown document following the structure below.
4.**Save the plan**: Write to `docs/plans/<topic-slug>.md` using a kebab-case filename derived from the topic.
## Plan Structure
Use these sections as appropriate (not all plans need every section):
```markdown
# Title
## Overview/Goal
Brief description of what this plan addresses and why.
## Current State
What exists today that's relevant to this plan.
## Options Evaluated (if multiple approaches)
For each option:
- **Option Name**
- **Pros:** bullet points
- **Cons:** bullet points
- **Verdict:** brief assessment
Or use a comparison table for structured evaluation.
## Recommendation/Decision
What approach is recommended and why. Include rationale.
## Implementation Steps
Numbered phases or steps. Be specific but not overly detailed.
Can use sub-sections for major phases.
## Open Questions
Things still to be determined. Use checkbox format:
- [ ] Question 1?
- [ ] Question 2?
## Notes (optional)
Additional context, caveats, or references.
```
## Style Guidelines
- **Concise**: Use bullet points, avoid verbose paragraphs
- **Technical but accessible**: Include NixOS config snippets when relevant
- **Future-oriented**: These are plans, not specifications
- **Acknowledge uncertainty**: Use "Open Questions" for unresolved decisions
- **Reference existing patterns**: Mention how this fits with existing infrastructure
- **Tables for comparisons**: Use markdown tables when comparing options
- **Practical focus**: Emphasize what needs to happen, not theory
## Examples of Good Plans
Reference these existing plans for style guidance:
-`docs/plans/auth-system-replacement.md` - Good option evaluation with table
-`docs/plans/truenas-migration.md` - Good decision documentation with rationale
-`docs/plans/remote-access.md` - Good multi-option comparison
-`docs/plans/prometheus-scrape-target-labels.md` - Good implementation detail level
## After Creating the Plan
1. Tell the user the plan was saved to `docs/plans/<filename>.md`
2. Summarize the key points
3. Ask if they want any adjustments before committing
This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
## Repository Overview
This is a Nix Flake-based NixOS configuration repository for managing a homelab infrastructure consisting of 16 server configurations. The repository uses a modular architecture with shared system configurations, reusable service modules, and per-host customization.
## Common Commands
### Building Configurations
```bash
# List all available configurations
nix flake show
# Build a specific host configuration locally (without deploying)
**Important:** Do NOT pipe `nix build` commands to other commands like `tail` or `head`. Piping can hide errors and make builds appear successful when they actually failed. Always run `nix build` without piping to see the full output.
```bash
# BAD - hides errors
nix build .#create-host 2>&1| tail -20
# GOOD - shows all output and errors
nix build .#create-host
```
### Deployment
Do not automatically deploy changes. Deployments are usually done by updating the master branch, and then triggering the auto update on the specific host.
### Testing Feature Branches on Hosts
All hosts have the `nixos-rebuild-test` helper script for testing feature branches before merging:
```bash
# On the target host, test a feature branch
nixos-rebuild-test boot <branch-name>
nixos-rebuild-test switch <branch-name>
# Additional arguments are passed through to nixos-rebuild
nixos-rebuild-test boot my-feature --show-trace
```
When working on a feature branch that requires testing on a live host, suggest using this command instead of the full flake URL syntax.
### Flake Management
```bash
# Check flake for errors
nix flake check
```
Do not run `nix flake update`. Should only be done manually by user.
### Development Environment
```bash
# Enter development shell (provides ansible, python3)
nix develop
```
### Secrets Management
Secrets are managed by OpenBao (Vault) using AppRole authentication. Most hosts use the
`vault.secrets` option defined in `system/vault-secrets.nix` to fetch secrets at boot.
Terraform manages the secrets and AppRole policies in `terraform/vault/`.
Legacy sops-nix is still present but only actively used by the `ca` host. Do not edit any
`.sops.yaml` or any file within `secrets/`. Ask the user to modify if necessary.
### Git Workflow
**Important:** Never commit directly to `master` unless the user explicitly asks for it. Always create a feature branch for changes.
**Important:** Never amend commits to `master` unless the user explicitly asks for it. Amending rewrites history and causes issues for deployed configurations.
When starting a new plan or task, the first step should typically be to create and checkout a new branch with an appropriate name (e.g., `git checkout -b dns-automation` or `git checkout -b fix-nginx-config`).
### Plan Management
When creating plans for large features, follow this workflow:
1. When implementation begins, save a copy of the plan to `docs/plans/` (e.g., `docs/plans/feature-name.md`)
2. Once the feature is fully implemented, move the plan to `docs/plans/completed/`
### Git Commit Messages
Commit messages should follow the format: `topic: short description`
Examples:
-`flake: add opentofu to devshell`
-`template2: add proxmox image configuration`
-`terraform: add VM deployment configuration`
### Clipboard
To copy text to the clipboard, pipe to `wl-copy` (Wayland):
```bash
echo"text"| wl-copy
```
### NixOS Options and Packages Lookup
Two MCP servers are available for searching NixOS options and packages:
- **nixpkgs-options** - Search and lookup NixOS configuration option documentation
- **nixpkgs-packages** - Search and lookup Nix packages from nixpkgs
**Session Setup:** At the start of each session, index the nixpkgs revision from `flake.lock` to ensure documentation matches the project's nixpkgs version:
1. Read `flake.lock` and find the `nixpkgs` node's `rev` field
2. Call `index_revision` with that git hash (both servers share the same index)
**Options Tools (nixpkgs-options):**
-`search_options` - Search for options by name or description (e.g., query "nginx" or "postgresql")
-`get_option` - Get full details for a specific option (e.g., `services.loki.configuration`)
-`get_file` - Fetch the source file from nixpkgs that declares an option
**Package Tools (nixpkgs-packages):**
-`search_packages` - Search for packages by name or description (e.g., query "nginx" or "python")
-`get_package` - Get full details for a specific package by attribute path (e.g., `firefox`, `python312Packages.requests`)
-`get_file` - Fetch the source file from nixpkgs that defines a package
This ensures documentation matches the exact nixpkgs version (currently NixOS 25.11) used by this flake.
### Lab Monitoring Log Queries
The **lab-monitoring** MCP server can query logs from Loki. All hosts ship systemd journal logs via Promtail.
**Loki Label Reference:**
-`host` - Hostname (e.g., `ns1`, `ns2`, `monitoring01`, `ha1`). Use this label, not `hostname`.
-`systemd_unit` - Systemd unit name (e.g., `nsd.service`, `prometheus.service`, `nixos-upgrade.service`)
-`job` - Either `systemd-journal` (most logs) or `varlog` (file-based logs like caddy access logs)
-`filename` - For `varlog` job, the log file path (e.g., `/var/log/caddy/nix-cache.log`)
Journal log entries are JSON-formatted with the actual log message in the `MESSAGE` field. Other useful fields include `PRIORITY` and `SYSLOG_IDENTIFIER`.
OpenTofu outputs the VM's IP address after deployment for easy SSH access.
#### Template Rebuilding and Terraform State
When the Proxmox template is rebuilt (via `build-and-deploy-template.yml`), the template name may change. This would normally cause Terraform to want to recreate all existing VMs, but that's unnecessary since VMs are independent once cloned.
**Solution**: The `terraform/vms.tf` file includes a lifecycle rule to ignore certain attributes that don't need management:
```hcl
lifecycle {
ignore_changes=[
clone, # Template name can change without recreating VMs
startup_shutdown, # Proxmox sets defaults (-1) that we don't need to manage
]
}
```
This means:
- **clone**: Existing VMs are not affected by template name changes; only new VMs use the updated template
- **startup_shutdown**: Proxmox sets default startup order/delay values (-1) that Terraform would otherwise try to remove
- You can safely update `default_template_name` in `terraform/variables.tf` without recreating VMs
-`tofu plan` won't show spurious changes for Proxmox-managed defaults
**When rebuilding the template:**
1. Run `nix develop -c ansible-playbook -i playbooks/inventory.ini playbooks/build-and-deploy-template.yml`
2. Update `default_template_name` in `terraform/variables.tf` if the name changed
3. Run `tofu plan` - should show no VM recreations (only template name in state)
4. Run `tofu apply` - updates state without touching existing VMs
5. New VMs created after this point will use the new template
### Adding a New Host
1. Create `/hosts/<hostname>/` directory
2. Copy structure from `template1` or similar host
3. Add host entry to `flake.nix` nixosConfigurations
4. Configure networking in `configuration.nix` (static IP via `systemd.network.networks`, DNS servers)
5. (Optional) Add `homelab.dns.cnames` if the host needs CNAME aliases
6. Add `vault.enable = true;` to the host configuration
7. Add AppRole policy in `terraform/vault/approle.tf` and any secrets in `secrets.tf`
13. Deploy by running `nixos-rebuild boot --flake URL#<hostname>` on the host.
14. Run auto-upgrade on DNS servers (ns1, ns2) to pick up the new host's DNS entry
**Note:** DNS A records and Prometheus node-exporter scrape targets are auto-generated from the host's `systemd.network.networks` static IP configuration. No manual zone file or Prometheus config editing is required.
### Important Patterns
**Overlay usage**: Access unstable packages via `pkgs.unstable.<package>` (defined in flake.nix overlay-unstable)
**Service composition**: Services in `/services/` are designed to be imported by multiple hosts. Keep them modular and reusable.
**Hardware configuration reuse**: Multiple hosts share `/hosts/template/hardware-configuration.nix` for VM instances.
**State version**: All hosts use stateVersion `"23.11"` - do not change this on existing hosts.
**Firewall**: Disabled on most hosts (trusted network). Enable selectively in host configuration if needed.
**Shell scripts**: Use `pkgs.writeShellApplication` instead of `pkgs.writeShellScript` or `pkgs.writeShellScriptBin` for creating shell scripts. `writeShellApplication` provides automatic shellcheck validation, sets strict bash options (`set -euo pipefail`), and allows declaring `runtimeInputs` for dependencies. When referencing the executable path (e.g., in `ExecStart`), use `lib.getExe myScript` to get the proper `bin/` path.
### Monitoring Stack
All hosts ship metrics and logs to `monitoring01`:
- **Metrics**: Prometheus scrapes node-exporter from all hosts
- **Logs**: Promtail ships logs to Loki on monitoring01
- **Access**: Grafana at monitoring01 for visualization
- **Tracing**: Tempo for distributed tracing
- **Profiling**: Pyroscope for continuous profiling
**Scrape Target Auto-Generation:**
Prometheus scrape targets are automatically generated from host configurations, following the same pattern as DNS zone generation:
- **Node-exporter**: All flake hosts with static IPs are automatically added as node-exporter targets
- **Service targets**: Defined via `homelab.monitoring.scrapeTargets` in service modules
- **External targets**: Non-flake hosts defined in `/services/monitoring/external-targets.nix`
- **Library**: `lib/monitoring.nix` provides `generateNodeExporterTargets` and `generateScrapeConfigs`
Host monitoring options (`homelab.monitoring.*`):
-`enable` (default: `true`) - Include host in Prometheus node-exporter scrape targets
-`scrapeTargets` (default: `[]`) - Additional scrape targets exposed by this host (job_name, port, metrics_path, scheme, scrape_interval, honor_labels)
Service modules declare their scrape targets directly (e.g., `services/ca/default.nix` declares step-ca on port 9000). The Prometheus config on monitoring01 auto-generates scrape configs from all hosts.
To add monitoring targets for non-NixOS hosts, edit `/services/monitoring/external-targets.nix`.
### DNS Architecture
-`ns1` (10.69.13.5) - Primary authoritative DNS + resolver
-`ns2` (10.69.13.6) - Secondary authoritative DNS (AXFR from ns1)
- All hosts point to ns1/ns2 for DNS resolution
**Zone Auto-Generation:**
DNS zone entries are automatically generated from host configurations:
- **Flake-managed hosts**: A records extracted from `systemd.network.networks` static IPs
- **CNAMEs**: Defined via `homelab.dns.cnames` option in host configs
- **External hosts**: Non-flake hosts defined in `/services/ns/external-hosts.nix`
NixOS Flake-based configuration repository for a homelab infrastructure. All hosts run NixOS 25.11 and are managed declaratively through this single repository.
## Configurations in use
## Hosts
* ha1
* ns1
* ns2
* template1
| Host | Role |
|------|------|
| `ns1`, `ns2` | Primary/secondary authoritative DNS |
| `ca` | Internal Certificate Authority |
| `ha1` | Home Assistant + Zigbee2MQTT + Mosquitto |
**Automatic DNS zone generation** - A records are derived from each host's static IP configuration. CNAME aliases are defined via `homelab.dns.cnames`. No manual zone file editing required.
**OpenBao (Vault) secrets** - Hosts authenticate via AppRole and fetch secrets at boot. Secrets and policies are managed as code in `terraform/vault/`. Legacy SOPS remains only for the `ca` host.
**Daily auto-upgrades** - All hosts pull from the master branch and automatically rebuild and reboot on a randomized schedule.
**Shared base configuration** - Every host automatically gets SSH, monitoring (node-exporter + Promtail), internal ACME certificates, and Nix binary cache access via the `system/` modules.
**Proxmox VM provisioning** - Build VM templates with Ansible and deploy VMs with OpenTofu from `terraform/`.
**OpenBao (Vault) secrets** - Centralized secrets management with AppRole authentication, PKI infrastructure, and automated bootstrap. Managed as code in `terraform/vault/`.
## Usage
```bash
# Enter dev shell (provides ansible, opentofu, openbao, create-host)
- [x] Tested ACME issuance on vaulttest01 successfully
- [x] Verified certificate chain and trust
- [x] Migrate vault01's own certificate
- [x] Created `bootstrap-vault-cert` script for initial certificate issuance via bao CLI
- [x] Issued certificate with SANs (vault01.home.2rjus.net + vault.home.2rjus.net)
- [x] Updated service to read certificates from `/var/lib/acme/vault01.home.2rjus.net/`
- [x] Configured ACME for automatic renewals
- [ ] Migrate hosts from step-ca to OpenBao
- [x] Tested on vaulttest01 (non-production host)
- [ ] Standardize hostname usage across all configurations
- [ ] Use `vault.home.2rjus.net` (CNAME) consistently everywhere
- [ ] Update NixOS configurations to use CNAME instead of vault01
- [ ] Update Terraform configurations to use CNAME
- [ ] Audit and fix mixed usage of vault01.home.2rjus.net vs vault.home.2rjus.net
- [ ] Update `system/acme.nix` to use OpenBao ACME endpoint
- [ ] Change server to `https://vault.home.2rjus.net:8200/v1/pki_int/acme/directory`
- [ ] Roll out to all hosts via auto-upgrade
- [ ] Configure SSH CA in OpenBao (optional, future work)
- [ ] Enable SSH secrets engine (`ssh/` mount)
- [ ] Generate SSH signing keys
- [ ] Create roles for host and user certificates
- [ ] Configure TTLs and allowed principals
- [ ] Distribute SSH CA public key to all hosts
- [ ] Update sshd_config to trust OpenBao CA
- [ ] Decommission step-ca
- [ ] Verify all ACME services migrated and working
- [ ] Stop step-ca service on ca host
- [ ] Archive step-ca configuration for backup
- [ ] Update documentation
**Implementation Details (2026-02-03):**
**ACME Configuration Fix:**
The key blocker was that OpenBao's PKI mount was filtering out required ACME response headers. The solution was to add `allowed_response_headers` to the Terraform mount configuration:
```hcl
allowed_response_headers=[
"Replay-Nonce", # Required for ACME nonce generation
"Link", # Required for ACME navigation
"Location" # Required for ACME resource location
]
```
**Cluster Path Configuration:**
ACME requires the cluster path to include the full API path:
This document describes the physical and virtual infrastructure components that support the NixOS-managed servers in this repository.
## Overview
The homelab consists of several core infrastructure components:
- **Proxmox VE** - Hypervisor hosting all NixOS VMs
- **TrueNAS** - Network storage and backup target
- **Ubiquiti EdgeRouter** - Primary router and gateway
- **Mikrotik Switch** - Core network switching
All NixOS configurations in this repository run as VMs on Proxmox and rely on these underlying infrastructure components.
## Network Topology
### Subnets
VLAN numbers are based on third octet of ip address.
TODO: VLAN naming is currently inconsistent across router/switch/Proxmox configurations. Need to standardize VLAN names and update all device configs to use consistent naming.
-`10.69.8.x` - Kubernetes (no longer in use)
-`10.69.12.x` - Core services
-`10.69.13.x` - NixOS VMs and core services
-`10.69.30.x` - Client network 1
-`10.69.31.x` - Clients network 2
-`10.69.99.x` - Management network
### Core Network Services
- **Gateway**: Web UI exposed on 10.69.10.1
- **DNS**: ns1 (10.69.13.5), ns2 (10.69.13.6)
- **Primary DNS Domain**: `home.2rjus.net`
## Hardware Components
### Proxmox Hypervisor
**Purpose**: Hosts all NixOS VMs defined in this repository
**Hardware**:
- CPU: AMD Ryzen 9 3900X 12-Core Processor
- RAM: 96GB (94Gi)
- Storage: 1TB NVMe SSD (nvme0n1)
**Management**:
- Web UI: `https://pve1.home.2rjus.net:8006`
- Cluster: Standalone
- Version: Proxmox VE 8.4.16 (kernel 6.8.12-18-pve)
**VM Provisioning**:
- Template VM: ID 9000 (built from `hosts/template2`)
- See `/terraform` directory for automated VM deployment using OpenTofu
**Storage**:
- ZFS pool: `rpool` on NVMe partition (nvme0n1p3)
- Total capacity: ~900GB (232GB used, 667GB available)
- Configuration: Single disk (no RAID)
- Scrub status: Last scrub completed successfully with 0 errors
TODO: Improve monitoring for physical hosts (proxmox, nas)
TODO: Improve monitoring for networking equipment
All NixOS VMs ship metrics to monitoring01 via node-exporter and logs via Promtail. See `/services/monitoring/` for the observability stack configuration.
Replace the current auth01 setup (LLDAP + Authelia) with a modern, unified authentication solution. The current setup is not in active use, making this a good time to evaluate alternatives.
## Goals
1.**Central user database** - Manage users across all homelab hosts from a single source
2.**Linux PAM/NSS integration** - Users can SSH into hosts using central credentials
3.**UID/GID consistency** - Proper POSIX attributes for NAS share permissions
4.**OIDC provider** - Single sign-on for homelab web services (Grafana, etc.)
## Options Evaluated
### OpenLDAP (raw)
- **NixOS Support:** Good (`services.openldap` with `declarativeContents`)
- **Pros:** Most widely supported, very flexible
- **Cons:** LDIF format is painful, schema management is complex, no built-in OIDC, requires SSSD on each client
**Goal:** Automatically generate DNS entries from host configurations
**Approach:** Leverage Nix to generate zone file entries from flake host configurations
Since most hosts use static IPs defined in their NixOS configurations, we can extract this information and automatically generate A records. This keeps DNS in sync with the actual host configs.
## Implementation
- [x] Add optional CNAME field to host configurations
- [x] Added `homelab.dns.cnames` option in `modules/homelab/dns.nix`
- [x] Added `homelab.dns.enable` to allow opting out (defaults to true)
- [x] Documented in CLAUDE.md
- [x] Create Nix function to extract DNS records from all hosts
- [x] Created `lib/dns-zone.nix` with extraction functions
- [x] Parses each host's `networking.hostName` and `systemd.network.networks` IP configuration
- [x] Collects CNAMEs from `homelab.dns.cnames`
- [x] Filters out VPN interfaces (wg*, tun*, tap*, vti*)
- [x] Generates complete zone file with A and CNAME records
- [x] Integrate auto-generated records into zone files
- [x] External hosts separated to `services/ns/external-hosts.nix`
- [x] Zone includes comments showing which records are auto-generated vs external
- [x] Update zone file serial number automatically
Audit of services running in the homelab that lack monitoring coverage, either missing Prometheus scrape targets, alerting rules, or both.
## Services with No Monitoring
### PostgreSQL (`pgdb1`)
- **Current state:** No scrape targets, no alert rules
- **Risk:** A database outage would go completely unnoticed by Prometheus
- **Recommendation:** Enable `services.prometheus.exporters.postgres` (available in nixpkgs). This exposes connection counts, query throughput, replication lag, table/index stats, and more. Add alerts for at least `postgres_down` (systemd unit state) and connection pool exhaustion.
### Authelia (`auth01`)
- **Current state:** No scrape targets, no alert rules
- **Risk:** The authentication gateway being down blocks access to all proxied services
- **Recommendation:** Authelia exposes Prometheus metrics natively at `/metrics`. Add a scrape target and at minimum an `authelia_down` systemd unit state alert.
### LLDAP (`auth01`)
- **Current state:** No scrape targets, no alert rules
- **Risk:** LLDAP is a dependency of Authelia -- if LDAP is down, authentication breaks even if Authelia is running
- **Recommendation:** Add an `lldap_down` systemd unit state alert. LLDAP does not expose Prometheus metrics natively, so systemd unit monitoring via node-exporter may be sufficient.
### Vault / OpenBao (`vault01`)
- **Current state:** No scrape targets, no alert rules
- **Risk:** Secrets management service failures go undetected
- **Recommendation:** OpenBao supports Prometheus telemetry output natively. Add a scrape target for the telemetry endpoint and alerts for `vault_down` (systemd unit) and seal status.
### Gitea Actions Runner
- **Current state:** No scrape targets, no alert rules
- **Risk:** CI/CD failures go undetected
- **Recommendation:** Add at minimum a systemd unit state alert. The runner itself has limited metrics exposure.
## Services with Partial Monitoring
### Jellyfin (`jelly01`)
- **Current state:** Has scrape targets (port 8096), metrics are being collected, but zero alert rules
- **Metrics available:** 184 metrics, all .NET runtime / ASP.NET Core level. No Jellyfin-specific metrics (active streams, library size, transcoding sessions). Key useful metrics:
-`microsoft_aspnetcore_hosting_failed_requests` - rate of HTTP errors
- **Recommendation:** Add a `jellyfin_down` alert using either `up{job="jellyfin"} == 0` or systemd unit state. Consider alerting on sustained `failed_requests` rate increase.
### NATS (`nats1`)
- **Current state:** Has a `nats_down` alert (systemd unit state via node-exporter), but no NATS-specific metrics
- **Metrics available:** NATS has a built-in `/metrics` endpoint exposing connection counts, message throughput, JetStream consumer lag, and more
- **Recommendation:** Add a scrape target for the NATS metrics endpoint. Consider alerts for connection count spikes, slow consumers, and JetStream storage usage.
### DNS - Unbound (`ns1`, `ns2`)
- **Current state:** Has `unbound_down` alert (systemd unit state), but no DNS query metrics
- **Available in nixpkgs:** `services.prometheus.exporters.unbound.enable` (package: `prometheus-unbound-exporter` v0.5.0). Exposes query counts, cache hit ratios, response types (SERVFAIL, NXDOMAIN), upstream latency.
- **Recommendation:** Enable the unbound exporter on ns1/ns2. Add alerts for cache hit ratio drops and SERVFAIL rate spikes.
### DNS - NSD (`ns1`, `ns2`)
- **Current state:** Has `nsd_down` alert (systemd unit state), no NSD-specific metrics
- **Available in nixpkgs:** Nothing. No exporter package or NixOS module. Community `nsd_exporter` exists but is not packaged.
- **Recommendation:** The existing systemd unit alert is likely sufficient. NSD is a simple authoritative-only server with limited operational metrics. Not worth packaging a custom exporter for now.
## Existing Monitoring (for reference)
These services have adequate alerting and/or scrape targets:
No per-service CPU, memory, or IO metrics are collected. The existing node-exporter systemd collector only provides unit state (active/inactive/failed), socket stats, and timer triggers. While systemd tracks per-unit resource usage via cgroups internally (visible in `systemctl status` and `systemd-cgtop`), this data is not exported to Prometheus.
### Available Solution
The `prometheus-systemd-exporter` package (v0.7.0) is available in nixpkgs with a ready-made NixOS module:
This exporter reads cgroup data and exposes per-unit metrics including:
- CPU seconds consumed per service
- Memory usage per service
- Task/process counts per service
- Restart counts
- IO usage
### Recommendation
Enable on all hosts via the shared `system/` config (same pattern as node-exporter). Add a corresponding scrape job on monitoring01. This would give visibility into resource consumption per service across the fleet, useful for capacity planning and diagnosing noisy-neighbor issues on shared hosts.
## Suggested Priority
1.**PostgreSQL** - Critical infrastructure, easy to add with existing nixpkgs module
2.**Authelia + LLDAP** - Auth outage affects all proxied services
3.**Unbound exporter** - Ready-to-go NixOS module, just needs enabling
4.**Jellyfin alerts** - Metrics already collected, just needs alert rules
5.**NATS metrics** - Built-in endpoint, just needs a scrape target
Three Aqara Zigbee temperature sensors report `battery: 0` in their MQTT payload, making the `hass_sensor_battery_percent` Prometheus metric useless for battery monitoring on these devices.
Affected sensors:
- **Temp Living Room** (`0x54ef441000a54d3c`) — WSDCGQ12LM
### Solution 1: Calculate battery from voltage in Zigbee2MQTT (Implemented)
Override the Home Assistant battery entity's `value_template` in Zigbee2MQTT device configuration to calculate battery percentage from voltage.
**Formula:**`(voltage - 2100) / 9` (maps 2100-3000mV to 0-100%)
**Changes in `services/home-assistant/default.nix`:**
- Device configuration moved from external `devices.yaml` to inline NixOS config
- Three affected sensors have `homeassistant.sensor_battery.value_template` override
- All 12 devices now declaratively managed
**Expected battery values based on current voltages:**
| Sensor | Voltage | Expected Battery |
|--------|---------|------------------|
| Temp Living Room | 2710 mV | ~68% |
| Temp Office | 2658 mV | ~62% |
| temp_server | 2765 mV | ~74% |
### Solution 2: Alert on sensor staleness (Implemented)
Added Prometheus alert `zigbee_sensor_stale` in `services/monitoring/rules.yml` that fires when a Zigbee temperature sensor hasn't updated in over 1 hour. This provides defense-in-depth for detecting dead sensors regardless of battery reporting accuracy.
-`/var/lib/vault` — Contains AppRole credentials; not backed up (can be re-provisioned via Ansible)
-`/var/lib/sops-nix` — Legacy; ha1 uses Vault now
## Post-Deployment Steps
After deploying to ha1:
1. Restart zigbee2mqtt service (automatic on NixOS rebuild)
2. In Home Assistant, the battery entities may need to be re-discovered:
- Go to Settings → Devices & Services → MQTT
- The new `value_template` should take effect after entity re-discovery
- If not, try disabling and re-enabling the battery entities
## Notes
- Device configuration is now declarative in NixOS. Future device additions via Zigbee2MQTT frontend will need to be added to the NixOS config to persist.
- The `devices.yaml` file on ha1 will be overwritten on service start but can be removed after confirming the new config works.
- The NixOS zigbee2mqtt module defaults to `devices = "devices.yaml"` but our explicit inline config overrides this.
Build a Prometheus exporter for metrics specific to our homelab infrastructure. Unlike the generic nixos-exporter, this covers services and patterns unique to our environment.
## Current State
### Existing Exporters
- **node-exporter** (all hosts): System metrics
- **systemd-exporter** (all hosts): Service restart counts, IP accounting
- **labmon** (monitoring01): TLS certificate monitoring, step-ca health
Current Prometheus configuration retains metrics for 30 days (`retentionTime = "30d"`). Extending retention further raises disk usage concerns on the homelab hypervisor with limited local storage.
Prometheus does not support downsampling - it stores all data at full resolution until the retention period expires, then deletes it entirely.
## Current Configuration
Location: `services/monitoring/prometheus.nix`
- **Retention**: 30 days
- **Scrape interval**: 15s
- **Features**: Alertmanager, Pushgateway, auto-generated scrape configs from flake hosts
- **Storage**: Local disk on monitoring01
## Options Evaluated
### Option 1: VictoriaMetrics
VictoriaMetrics is a Prometheus-compatible TSDB with significantly better compression (5-10x smaller storage footprint).
**NixOS Options Available:**
-`services.victoriametrics.enable`
-`services.victoriametrics.prometheusConfig` - accepts Prometheus scrape config format
-`services.victoriametrics.retentionPeriod` - e.g., "6m" for 6 months
-`services.vmagent` - dedicated scraping agent
-`services.vmalert` - alerting rules evaluation
**Pros:**
- Simple migration - single service replacement
- Same PromQL query language - Grafana dashboards work unchanged
- Same scrape config format - existing auto-generated configs work as-is
- 5-10x better compression means 30 days of Prometheus data could become 180+ days
- Lightweight, single binary
**Cons:**
- No automatic downsampling (relies on compression alone)
- Alerting requires switching to vmalert instead of Prometheus alertmanager integration
- Would need to migrate existing data or start fresh
**Migration Steps:**
1. Replace `services.prometheus` with `services.victoriametrics`
2. Move scrape configs to `prometheusConfig`
3. Set up `services.vmalert` for alerting rules
4. Update Grafana datasource to VictoriaMetrics port (8428)
5. Keep Alertmanager for notification routing
### Option 2: Thanos
Thanos extends Prometheus with long-term storage and automatic downsampling by uploading data to object storage.
**NixOS Options Available:**
-`services.thanos.sidecar` - uploads Prometheus blocks to object storage
-`services.thanos.compact` - compacts and downsamples data
-`services.thanos.query` - unified query gateway
-`services.thanos.query-frontend` - query caching and parallelization
-`services.thanos.downsample` - dedicated downsampling service
**Downsampling Behavior:**
- Raw resolution kept for configurable period (default: indefinite)
- 5-minute resolution created after 40 hours
- 1-hour resolution created after 10 days
**Retention Configuration (in compactor):**
```nix
services.thanos.compact={
retention.resolution-raw="30d";# Keep raw for 30 days
retention.resolution-5m="180d";# Keep 5m samples for 6 months
retention.resolution-1h="2y";# Keep 1h samples for 2 years
};
```
**Pros:**
- True downsampling - older data uses progressively less storage
- Keep metrics for years with minimal storage impact
- Prometheus continues running unchanged
- Existing Alertmanager integration preserved
**Cons:**
- Requires object storage (MinIO, S3, or local filesystem)
- Multiple services to manage (sidecar, compactor, query)
- More complex architecture
- Additional infrastructure (MinIO) may be needed
**Required Components:**
1. Thanos Sidecar (runs alongside Prometheus)
2. Object storage (MinIO or local filesystem)
3. Thanos Compactor (handles downsampling)
4. Thanos Query (provides unified query endpoint)
**Migration Steps:**
1. Deploy object storage (MinIO or configure filesystem backend)
2. Add Thanos sidecar pointing to Prometheus data directory
3. Add Thanos compactor with retention policies
4. Add Thanos query gateway
5. Update Grafana datasource to Thanos Query port (10902)
| Grafana changes | Change port only | Change port only |
| Alerting changes | Need vmalert | Keep existing |
## Recommendation
**Start with VictoriaMetrics** for simplicity. The compression alone may provide 6+ months of retention in the same disk space currently used for 30 days.
If multi-year retention with true downsampling becomes necessary, Thanos can be evaluated later. However, it requires deploying object storage infrastructure (MinIO) which adds operational complexity.
Create a message-based deployment system that allows triggering NixOS configuration updates on-demand, rather than waiting for the daily auto-upgrade timer. This enables faster iteration when testing changes and immediate fleet-wide deployments.
## Goals
1.**On-demand deployment** - Trigger config updates immediately via NATS message
2.**Targeted deployment** - Deploy to specific hosts or all hosts
3.**Branch/revision support** - Test feature branches before merging to master
4.**MCP integration** - Allow Claude Code to trigger deployments during development
## Current State
- **Auto-upgrade**: All hosts run `nixos-upgrade.service` daily, pulling from master
- **Manual testing**: `nixos-rebuild-test <action> <branch>` helper exists on all hosts
- **NATS**: Running on nats1 with JetStream enabled, using NKey authentication
- **Accounts**: ADMIN (system) and HOMELAB (user workloads with JetStream)
| Deploy to all test hosts | `deploy.test.all` | ✅ | ✅ |
| Deploy to all prod hosts | `deploy.prod.all` | ❌ | ✅ |
| Deploy to all DNS servers | `deploy.prod.role.dns` | ❌ | ✅ |
All NKeys stored in Vault - MCP gets limited credentials, admin CLI gets full-access credentials.
### Host Metadata
Rather than defining `tier` in the listener config, use a central `homelab.host` module that provides host metadata for multiple consumers. This aligns with the approach proposed in `docs/plans/prometheus-scrape-target-labels.md`.
**Status:** The `homelab.host` module is implemented in `modules/homelab/host.nix`.
Hosts can be filtered by tier using `config.homelab.host.tier`.
**Module definition (in `modules/homelab/host.nix`):**
- **Audit logging**: Log all deployment requests with source identity
- **Network isolation**: NATS only accessible from internal network
## Decisions
All open questions have been resolved. See Notes section for decision rationale.
## Notes
- The existing `nixos-rebuild-test` helper provides a good reference for the rebuild logic
- Uses NATS request/reply pattern for immediate validation feedback and completion status
- Consider using NATS headers for metadata (request ID, timestamp)
- **Timeout decision**: Metrics show no-change upgrades complete in 5-55 seconds. A 10-minute default provides ample headroom for actual updates with package downloads. Per-host override available for hosts with known longer build times.
- **Rollback**: Not needed as a separate feature - deploy an older commit hash to effectively rollback.
- **Offline hosts**: No message persistence - if host is offline, deploy fails. Daily auto-upgrade is the safety net. Avoids complexity of JetStream deduplication (host coming online and applying 10 queued updates instead of just the latest).
- **Deploy history**: Use existing Loki - listener logs deployments to journald, queryable via Loki. No need for separate JetStream persistence.
- **Naming**: `homelab-deploy` - ties it to the infrastructure rather than implementation details.
This document contains planned improvements to the NixOS infrastructure that are not directly part of the automated deployment pipeline.
## Planned
### Custom NixOS Options for Service and System Configuration
Currently, most service configurations in `services/` and shared system configurations in `system/` are written as plain NixOS module imports without declaring custom options. This means host-specific customization is done by directly setting upstream NixOS options or by duplicating configuration across hosts.
The `homelab.dns` module (`modules/homelab/dns.nix`) is the first example of defining custom options under a `homelab.*` namespace. This pattern should be extended to more of the repository's configuration.
**Goals:**
- Define `homelab.*` options for services and shared configuration where it makes sense, following the pattern established by `homelab.dns`
- Allow hosts to enable/configure services declaratively (e.g. `homelab.monitoring.enable`, `homelab.http-proxy.virtualHosts`) rather than importing opaque module files
- Keep options simple and focused — wrap only the parts that vary between hosts or that benefit from a clearer interface. Not everything needs a custom option.
**Candidate areas:**
-`system/` modules (e.g. auto-upgrade schedule, ACME CA URL, monitoring endpoints)
-`services/` modules where multiple hosts use the same service with different parameters
- Cross-cutting concerns that are currently implicit (e.g. which Loki endpoint promtail ships to)
## Completed
- [DNS Automation](completed/dns-automation.md) - Automatically generate DNS entries from host configurations
Add support for custom per-host labels on Prometheus scrape targets, enabling alert rules to reference host metadata (priority, role) instead of hardcoding instance names.
**Related:** This plan shares the `homelab.host` module with `docs/plans/nats-deploy-service.md`, which uses the same metadata for deployment tier assignment.
## Motivation
Some hosts have workloads that make generic alert thresholds inappropriate. For example, `nix-cache01` regularly hits high CPU during builds, requiring a longer `for` duration on `high_cpu_load`. Currently this is handled by excluding specific instance names in PromQL expressions, which is brittle and doesn't scale.
With per-host labels, alert rules can use semantic filters like `{priority!="low"}` instead of `{instance!="nix-cache01.home.2rjus.net:9100"}`.
## Proposed Labels
### `priority`
Indicates alerting importance. Hosts with `priority = "low"` can have relaxed thresholds or longer durations in alert rules.
Values: `"high"` (default), `"low"`
### `role`
Describes the function of the host. Useful for grouping in dashboards and targeting role-specific alert rules.
Values: free-form string, e.g. `"dns"`, `"build-host"`, `"database"`, `"monitoring"`
**Note on multiple roles:** Prometheus labels are strictly string values, not lists. For hosts that serve multiple roles there are a few options:
- **Separate boolean labels:** `role_build_host = "true"`, `role_cache_server = "true"` -- flexible but verbose, and requires updating the module when new roles are added.
- **Delimited string:** `role = "build-host,cache-server"` -- works with regex matchers (`{role=~".*build-host.*"}`), but regex matching is less clean and more error-prone.
- **Pick a primary role:** `role = "build-host"` -- simplest, and probably sufficient since most hosts have one primary role.
Recommendation: start with a single primary role string. If multi-role matching becomes a real need, switch to separate boolean labels.
### `dns_role`
For DNS servers specifically, distinguish between primary and secondary resolvers. The secondary resolver (ns2) receives very little traffic and has a cold cache, making generic cache hit ratio alerts inappropriate.
Values: `"primary"`, `"secondary"`
Example use case: The `unbound_low_cache_hit_ratio` alert fires on ns2 because its cache hit ratio (~62%) is lower than ns1 (~90%). This is expected behavior since ns2 gets ~100x less traffic. With a `dns_role` label, the alert can either exclude secondaries or use different thresholds:
This implementation uses a shared `homelab.host` module that provides host metadata for multiple consumers (Prometheus labels, deployment tiers, etc.). See also `docs/plans/nats-deploy-service.md` which uses the same module for deployment tier assignment.
### 1. Create `homelab.host` module
**Status:** Step 1 (Create `homelab.host` module) is complete. The module is in
`modules/homelab/host.nix` with tier, priority, role, and labels options.
Create `modules/homelab/host.nix` with shared host metadata options:
This requires grouping hosts by their label attrset and producing one `static_configs` entry per unique label combination. Hosts with default values (priority=high, no role, no labels) get grouped together with no extra labels (preserving current behavior).
Change the node-exporter scrape config to use the new structured output:
```nix
# Before:
static_configs=[{targets=nodeExporterTargets;}];
# After:
static_configs=nodeExporterTargets;
```
### 4. Set metadata on hosts
Example in `hosts/nix-cache01/configuration.nix`:
```nix
homelab.host={
tier="test";# can be deployed by MCP (used by homelab-deploy)
priority="low";# relaxed alerting thresholds
role="build-host";
};
```
Example in `hosts/ns1/configuration.nix`:
```nix
homelab.host={
tier="prod";
priority="high";
role="dns";
labels.dns_role="primary";
};
```
### 5. Update alert rules
After implementing labels, review and update `services/monitoring/rules.yml`:
- Replace instance-name exclusions with label-based filters (e.g. `{priority!="low"}` instead of `{instance!="nix-cache01.home.2rjus.net:9100"}`).
- Consider whether any other rules should differentiate by priority or role.
Specifically, the `high_cpu_load` rule currently has a nix-cache01 exclusion that should be replaced with a `priority`-based filter.
### 6. Consider labels for `generateScrapeConfigs` (service targets)
The same label propagation could be applied to service-level scrape targets. This is optional and can be deferred -- service targets are more specialized and less likely to need generic label-based filtering.
Enable remote access to some or all homelab services from outside the internal network, without exposing anything directly to the internet.
## Current State
- All services are only accessible from the internal 10.69.13.x network
- Exception: jelly01 has a WireGuard link to an external VPS
- No services are directly exposed to the public internet
## Constraints
- Nothing should be directly accessible from the outside
- Must use VPN or overlay network (no port forwarding of services)
- Self-hosted solutions preferred over managed services
## Options
### 1. WireGuard Gateway (Internal Router)
A dedicated NixOS host on the internal network with a WireGuard tunnel out to the VPS. The VPS becomes the public entry point, and the gateway routes traffic to internal services. Firewall rules on the gateway control which services are reachable.
**Pros:**
- Simple, well-understood technology
- Already running WireGuard for jelly01
- Full control over routing and firewall rules
- Excellent NixOS module support
- No extra dependencies
**Cons:**
- Hub-and-spoke topology (all traffic goes through VPS)
- Manual peer management
- Adding a new client device means editing configs on both VPS and gateway
### 2. WireGuard Mesh (No Relay)
Each client device connects directly to a WireGuard endpoint. Could be on the VPS which forwards to the homelab, or if there is a routable IP at home, directly to an internal host.
**Pros:**
- Simple and fast
- No extra software
**Cons:**
- Manual key and endpoint management for every peer
- Doesn't scale well
- If behind CGNAT, still needs the VPS as intermediary
### 3. Headscale (Self-Hosted Tailscale)
Run a Headscale control server (on the VPS or internally) and install the Tailscale client on homelab hosts and personal devices. Gets the Tailscale mesh networking UX without depending on Tailscale's infrastructure.
**Pros:**
- Mesh topology - devices communicate directly via NAT traversal (DERP relay as fallback)
- Easy to add/remove devices
- ACL support for granular access control
- MagicDNS for service discovery
- Good NixOS support for both headscale server and tailscale client
- Subnet routing lets you expose the entire 10.69.13.x network or specific hosts without installing tailscale on every host
**Cons:**
- More moving parts than plain WireGuard
- Headscale is a third-party reimplementation, can lag behind Tailscale features
- Need to run and maintain the control server
### 4. Tailscale (Managed)
Same as Headscale but using Tailscale's hosted control plane.
**Pros:**
- Zero infrastructure to manage on the control plane side
- Polished UX, well-maintained clients
- Free tier covers personal use
**Cons:**
- Dependency on Tailscale's service
- Less aligned with self-hosting preference
- Coordination metadata goes through their servers (data plane is still peer-to-peer)
### 5. Netbird (Self-Hosted)
Open-source alternative to Tailscale with a self-hostable management server. WireGuard-based, supports ACLs and NAT traversal.
**Pros:**
- Fully self-hostable
- Web UI for management
- ACL and peer grouping support
**Cons:**
- Heavier to self-host (needs multiple components: management server, signal server, TURN relay)
- Less mature NixOS module support compared to Tailscale/Headscale
### 6. Nebula (by Defined Networking)
Certificate-based mesh VPN. Each node gets a certificate from a CA you control. No central coordination server needed at runtime.
**Pros:**
- No always-on control plane
- Certificate-based identity
- Lightweight
**Cons:**
- Less convenient for ad-hoc device addition (need to issue certs)
- NAT traversal less mature than Tailscale's
- Smaller community/ecosystem
## Key Decision Points
- **Static public IP vs CGNAT?** Determines whether clients can connect directly to home network or need VPS relay.
- **Number of client devices?** If just phone and laptop, plain WireGuard via VPS is fine. More devices favors Headscale.
- **Per-service vs per-network access?** Gateway with firewall rules gives per-service control. Headscale ACLs can also do this. Plain WireGuard gives network-level access with gateway firewall for finer control.
- **Subnet routing vs per-host agents?** With Headscale/Tailscale, can either install client on every host, or use a single subnet router that advertises the 10.69.13.x range. The latter is closer to the gateway approach and avoids touching every host.
## Leading Candidates
Based on existing WireGuard experience, self-hosting preference, and NixOS stack:
1.**Headscale with a subnet router** - Best balance of convenience and self-hosting
2.**WireGuard gateway via VPS** - Simplest, most transparent, builds on existing setup
This document describes the one-time setup process for enabling TPM2-based auto-unsealing on vault01.
## Overview
The auto-unseal feature uses systemd's `LoadCredentialEncrypted` with TPM2 to securely store and retrieve an unseal key. On service start, systemd automatically decrypts the credential using the VM's TPM, and the service unseals OpenBao.
## Prerequisites
- OpenBao must be initialized (`bao operator init` completed)
- You must have at least one unseal key from the initialization
- vault01 must have a TPM2 device (virtual TPM for Proxmox VMs)
## Initial Setup
Perform these steps on vault01 after deploying the service configuration:
### 1. Save Unseal Key
```bash
# Create temporary file with one of your unseal keys
console.print("[bold red]Error:[/bold red] --regenerate-token cannot be used with --force, --dry-run, or --skip-vault\n")
sys.exit(1)
ifiporcpu!=2ormemory!=2048ordisk!="20G":
console.print("[bold red]Error:[/bold red] --regenerate-token only regenerates the token. Other options (--ip, --cpu, --memory, --disk) are ignored.\n")
console.print("[yellow]Tip:[/yellow] Use without those options, or use --force to update the entire configuration.\n")
sys.exit(1)
try:
console.print(f"\n[bold blue]Regenerating Vault token for {hostname}...[/bold blue]")
# Validate hostname exists
host_dir=repo_root/"hosts"/hostname
ifnothost_dir.exists():
console.print(f"[bold red]Error:[/bold red] Host {hostname} does not exist")
console.print(f"Host directory not found: {host_dir}")
log "WARNING: No secret data found at path $SECRET_PATH"
use_cache
return
fi
# Create output and cache directories
mkdir -p "$OUTPUT_DIR"
mkdir -p "$CACHE_DIR"
# Write each secret key to a separate file
log "Writing secrets to $OUTPUT_DIR"
for key in $(echo"$SECRET_DATA"| jq -r 'keys[]');do
echo"$SECRET_DATA"| jq -j --arg k "$key"'.[$k]' > "$OUTPUT_DIR/$key"
echo"$SECRET_DATA"| jq -j --arg k "$key"'.[$k]' > "$CACHE_DIR/$key"
chmod 600"$OUTPUT_DIR/$key"
chmod 600"$CACHE_DIR/$key"
log " - Wrote secret key: $key"
done
log "Successfully fetched and cached secrets"
}
# Main execution
fetch_from_vault
Some files were not shown because too many files have changed in this diff
Show More
Reference in New Issue
Block a user
Blocking a user prevents them from interacting with repositories, such as opening or commenting on pull requests or issues. Learn more about blocking a user.