# Compare commits

41 commits in range `4e8ecb8a99...pipe-to-lo`

- 78eb04205f
- 19cb61ebbc
- 9ed09c9a9c
- b31c64f1b9
- 54b6e37420
- b845a8bb8b
- bfbf0cea68
- 3abe5e83a7
- 67c27555f3
- 1674b6a844
- 311be282b6
- 11cbb64097
- e2dd21c994
- 463342133e
- de36b9d016
- 3f1d966919
- 7fcc043a4d
- 70ec5f8109
- c2ec34cab9
- 8fbf1224fa
- 8959829f77
- 93dbb45802
- 538c2ad097
- d99c82c74c
- ca0e3fd629
- 732e9b8c22
- 3a14ffd6b5
- f9a3961457
- 003d4ccf03
- 735b8a9ee3
- 94feae82a0
- 3f94f7ee95
- b7e398c9a7
- 8ec2a083bd
- ec4ac1477e
- e937c68965
- 98e808cd6c
- ba9f47f914
- 1066e81ba8
- f0950b33de
- bf199bd7c6
## .claude/agents/auditor.md (new file, 180 lines)
---
name: auditor
description: Analyzes audit logs to investigate user activity, command execution, and suspicious behavior on hosts. Can be used standalone for security reviews or called by other agents for behavioral context.
tools: Read, Grep, Glob
mcpServers:
- lab-monitoring
---

You are a security auditor for a NixOS homelab infrastructure. Your task is to analyze audit logs and reconstruct user activity on hosts.

## Input

You may receive:
- A host or list of hosts to investigate
- A time window (e.g., "last hour", "today", "between 14:00 and 15:00")
- Optional context: specific events to look for, a user to focus on, or suspicious activity to investigate
- Optional context from a parent investigation (e.g., "a service stopped at 14:32, what happened around that time?")

## Audit Log Structure

Logs are shipped to Loki via Promtail. Audit events use these labels:
- `host` - hostname
- `systemd_unit` - typically `auditd.service` for audit logs
- `job` - typically `systemd-journal`

Audit log entries contain structured data:
- `EXECVE` - command execution with full arguments
- `USER_LOGIN` / `USER_LOGOUT` - session start/end
- `USER_CMD` - sudo command execution
- `CRED_ACQ` / `CRED_DISP` - credential acquisition/disposal
- `SERVICE_START` / `SERVICE_STOP` - systemd service events

## Investigation Techniques

### 1. SSH Session Activity

Find SSH logins and session activity:
```logql
{host="<hostname>", systemd_unit="sshd.service"}
```

Look for:
- Accepted/failed authentication
- Sessions opened/closed
- Unusual source IPs or users
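
For instance, failed and successful authentication can be pulled apart with substring filters (a sketch; `Failed password` and `Accepted` are the standard OpenSSH log messages, so adjust them if the sshd configuration differs):

```logql
{host="<hostname>", systemd_unit="sshd.service"} |= "Failed password"   # brute-force / guessing attempts
{host="<hostname>", systemd_unit="sshd.service"} |= "Accepted"          # successful logins with source IPs
```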

### 2. Command Execution

Query executed commands (filter out noise):
```logql
{host="<hostname>"} |= "EXECVE" != "PATH item" != "PROCTITLE" != "SYSCALL" != "BPF"
```

Further filtering:
- Exclude systemd noise: `!= "systemd" != "/nix/store"`
- Focus on specific commands: `|= "rm" |= "-rf"`
- Focus on a specific user: `|= "uid=1000"`
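
Stacked together, a typical narrow starting point looks like this (a sketch; `uid=1000` is a placeholder for the user under investigation):

```logql
{host="<hostname>"} |= "EXECVE" != "PATH item" != "PROCTITLE" != "SYSCALL" != "BPF" != "systemd" != "/nix/store" |= "uid=1000"
```

Drop the trailing `|= "uid=1000"` filter to see commands from all users.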

### 3. Sudo Activity

Check for privilege escalation:
```logql
{host="<hostname>"} |= "sudo" |= "COMMAND"
```

Or via audit:
```logql
{host="<hostname>"} |= "USER_CMD"
```

### 4. Service Manipulation

Check if services were manually stopped or started:
```logql
{host="<hostname>"} |= "EXECVE" |= "systemctl"
```

### 5. File Operations

Look for file modifications (if auditd rules are configured):
```logql
{host="<hostname>"} |= "EXECVE" |= "vim"
{host="<hostname>"} |= "EXECVE" |= "nano"
{host="<hostname>"} |= "EXECVE" |= "rm"
```

## Query Guidelines

**Start narrow, expand if needed:**
- Begin with `limit: 20-30`
- Use tight time windows: `start: "15m"` or `start: "30m"`
- Add filters progressively

**Avoid:**
- Querying all audit logs without an EXECVE filter (extremely verbose)
- Large time ranges without specific filters
- Limits over 50 without tight filters

**Time-bounded queries:**
When investigating around a specific event:
```logql
{host="<hostname>"} |= "EXECVE" != "systemd"
```
With `start: "2026-02-08T14:30:00Z"` and `end: "2026-02-08T14:35:00Z"`.

## Suspicious Patterns to Watch For

1. **Unusual login times** - Activity outside normal hours
2. **Failed authentication** - Brute-force attempts
3. **Privilege escalation** - Unexpected sudo usage
4. **Reconnaissance commands** - `whoami`, `id`, `uname`, `cat /etc/passwd`
5. **Data exfiltration indicators** - `curl`, `wget`, `scp`, `rsync` to external destinations
6. **Persistence mechanisms** - Cron modifications, systemd service creation
7. **Log tampering** - Commands targeting log files
8. **Lateral movement** - SSH to other internal hosts
9. **Service manipulation** - Stopping security services, disabling firewalls
10. **Cleanup activity** - Deleting bash history, clearing logs
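
Several of these patterns can be swept for in one pass with a regex line filter over EXECVE records (a sketch; the alternation list is illustrative, not exhaustive - extend it with the patterns above as needed):

```logql
{host="<hostname>"} |= "EXECVE" |~ "whoami|uname|/etc/passwd|history -c|shred"
```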

## Output Format

### For Standalone Security Reviews

```
## Activity Summary

**Host:** <hostname>
**Time Period:** <start> to <end>
**Sessions Found:** <count>

## User Sessions

### Session 1: <user> from <source_ip>
- **Login:** HH:MM:SSZ
- **Logout:** HH:MM:SSZ (or ongoing)
- **Commands executed:**
  - HH:MM:SSZ - <command>
  - HH:MM:SSZ - <command>

## Suspicious Activity

[If any patterns from the watch list were detected]
- **Finding:** <description>
- **Evidence:** <log entries>
- **Risk Level:** Low / Medium / High

## Summary

[Overall assessment: normal activity, concerning patterns, or clear malicious activity]
```

### When Called by Another Agent

Provide a focused response addressing the specific question:

```
## Audit Findings

**Query:** <what was asked>
**Time Window:** <investigated period>

## Relevant Activity

[Chronological list of relevant events]
- HH:MM:SSZ - <event>
- HH:MM:SSZ - <event>

## Assessment

[Direct answer to the question with supporting evidence]
```

## Guidelines

- Reconstruct timelines chronologically
- Correlate events (login → commands → logout)
- Note gaps or missing data
- Distinguish between automated (systemd, cron) and interactive activity
- Consider the host's role and tier when assessing severity
- When called by another agent, focus on answering their specific question
- Don't speculate without evidence - state what the logs do and do not show

## .claude/agents/investigate-alarm.md (new file, 211 lines)
---
name: investigate-alarm
description: Investigates a single system alarm by querying Prometheus metrics and Loki logs, analyzing configuration files for affected hosts/services, and providing root cause analysis.
tools: Read, Grep, Glob
mcpServers:
- lab-monitoring
- git-explorer
---

You are an alarm investigation specialist for a NixOS homelab infrastructure. Your task is to analyze a single alarm and determine its root cause.

## Input

You will receive information about an alarm, which may include:
- Alert name and severity
- Affected host or service
- Alert expression/threshold
- Current value or status
- When it started firing

## Investigation Process

### 1. Understand the Alert Context

Start by understanding what the alert is measuring:
- Use `get_alert` if you have a fingerprint, or `list_alerts` to find matching alerts
- Use `get_metric_metadata` to understand the metric being monitored
- Use `search_metrics` to find related metrics

### 2. Query Current State

Gather evidence about the current system state:
- Use `query` to check the current metric values and related metrics
- Use `list_targets` to verify the host/service is being scraped successfully
- Look for correlated metrics that might explain the issue
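
For host-level alerts, a quick first check is whether the target is being scraped at all (the `hostname` label is the one this setup attaches to scrape targets):

```promql
up{hostname="<hostname>"}   # 0 = scrape failing, 1 = target healthy
```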

### 3. Check Service Logs

Search for relevant log entries using `query_logs`. Focus on service-specific logs and errors.

**Query strategies (start narrow, expand if needed):**
- Start with `limit: 20-30`, increase only if needed
- Use tight time windows: `start: "15m"` or `start: "30m"` initially
- Filter to specific services: `{host="<hostname>", systemd_unit="<service>.service"}`
- Search for errors: `{host="<hostname>"} |= "error"` or `|= "failed"`

**Common patterns:**
- Service logs: `{host="<hostname>", systemd_unit="<service>.service"}`
- All errors on a host: `{host="<hostname>"} |= "error"`
- Journal for a unit: `{host="<hostname>", systemd_unit="nginx.service"} |= "failed"`

**Avoid:**
- Using `start: "1h"` with no filters on busy hosts
- Limits over 50 without specific filters

### 4. Investigate User Activity

For any analysis of user activity, **always spawn the `auditor` agent**. Do not query audit logs (EXECVE, USER_LOGIN, etc.) directly - delegate this to the auditor.

**Always call the auditor when:**
- A service stopped unexpectedly (it may have been manually stopped)
- A process was killed or a config was changed
- You need to know who was logged in around the time of an incident
- You need to understand what commands led to the current state
- The cause isn't obvious from service logs alone

**Do NOT try to query audit logs yourself.** The auditor is specialized for:
- Parsing EXECVE records and reconstructing command lines
- Correlating SSH sessions with commands executed
- Identifying suspicious patterns
- Filtering out systemd/nix-store noise

**Example prompt for the auditor:**
```
Investigate user activity on <hostname> between <start_time> and <end_time>.
Context: The prometheus-node-exporter service stopped at 14:32.
Determine if it was manually stopped and by whom.
```

Incorporate the auditor's findings into your timeline and root cause analysis.

### 5. Check Configuration (if relevant)

If the alert relates to a NixOS-managed service:
- Check the host configuration in `/hosts/<hostname>/`
- Check service modules in `/services/<service>/`
- Look for thresholds, resource limits, or misconfigurations
- Check `homelab.host` options for tier/priority/role metadata

### 6. Check for Configuration Drift

Use the git-explorer MCP server to compare the host's deployed configuration against the current master branch. This helps identify:
- Hosts running outdated configurations
- Recent changes that might have caused the issue
- Whether a fix has already been committed but not deployed

**Step 1: Get the deployed revision from Prometheus**
```promql
nixos_flake_info{hostname="<hostname>"}
```
The `current_rev` label contains the deployed git commit hash.

**Step 2: Check if the host is behind master**
```
resolve_ref("master")              # Get current master commit
is_ancestor(deployed, master)      # Check if host is behind
```

**Step 3: See what commits are missing**
```
commits_between(deployed, master)  # List commits not yet deployed
```

**Step 4: Check which files changed**
```
get_diff_files(deployed, master)   # Files modified since deployment
```
Look for files in `hosts/<hostname>/`, `services/<relevant-service>/`, or `system/` that affect this host.

**Step 5: View configuration at the deployed revision**
```
get_file_at_commit(deployed, "services/<service>/default.nix")
```
Compare against the current file to understand the differences.

**Step 6: Find when something changed**
```
search_commits("<service-name>")   # Find commits mentioning the service
get_commit_info(<hash>)            # Get full details of a specific change
```

**Example workflow for a service-related alert:**
1. Query `nixos_flake_info{hostname="monitoring01"}` → `current_rev: 8959829`
2. `resolve_ref("master")` → `4633421`
3. `is_ancestor("8959829", "4633421")` → Yes, host is behind
4. `commits_between("8959829", "4633421")` → 7 commits missing
5. `get_diff_files("8959829", "4633421")` → Check if relevant service files changed
6. If a fix was committed after the deployed rev, recommend deployment

### 7. Consider Common Causes

For infrastructure alerts, common causes include:
- **Manual intervention**: Service manually stopped/restarted (call the auditor to confirm)
- **Configuration drift**: Host running an outdated config, fix already in master
- **Disk space**: Nix store growth, logs, temp files
- **Memory pressure**: Service memory leaks, insufficient limits
- **CPU**: Runaway processes, build jobs
- **Network**: DNS issues, connectivity problems
- **Service restarts**: Failed upgrades, configuration errors
- **Scrape failures**: Service down, firewall issues, port changes

**Note:** If a service stopped unexpectedly and service logs don't show a crash or error, it was likely manual intervention - call the auditor to investigate.
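
When the systemd-exporter job covers the affected host, its unit-state metric can confirm when the service left the active state (a sketch; verify the exact metric name with `search_metrics`, since it varies between exporter versions):

```promql
systemd_unit_state{name="<service>.service", state="active"}   # 1 while active, 0 after the stop
```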

## Output Format

Provide a concise report with one of two outcomes:

### If Root Cause Identified:

```
## Root Cause
[1-2 sentence summary of the root cause]

## Timeline
[Chronological sequence of relevant events leading to the alert]
- HH:MM:SSZ - [Event description]
- HH:MM:SSZ - [Event description]
- HH:MM:SSZ - [Alert fired]

### Timeline sources
- HH:MM:SSZ - [Source for this event: which metric or log stream]
- HH:MM:SSZ - [Source for this event: which metric or log stream]
- HH:MM:SSZ - [Alert fired]

## Evidence
- [Specific metric values or log entries that support the conclusion]
- [Configuration details if relevant]

## Recommended Actions
1. [Specific remediation step]
2. [Follow-up actions if any]
```

### If Root Cause Unclear:

```
## Investigation Summary
[What was checked and what was found]

## Possible Causes
- [Hypothesis 1 with supporting/contradicting evidence]
- [Hypothesis 2 with supporting/contradicting evidence]

## Additional Information Needed
- [Specific data, logs, or access that would help]
- [Suggested queries or checks for the operator]
```

## Guidelines

- Be concise and actionable
- Reference specific metric names and values as evidence
- Include log snippets when they're informative
- Don't speculate without evidence
- If the alert is a false positive or expected behavior, explain why
- Consider the host's tier (test vs prod) when assessing severity
- Build a timeline from log timestamps and metrics to show the sequence of events
- **Query logs incrementally**: start with narrow filters and small limits, expand only if needed
- **Always delegate to the auditor agent** for any user activity analysis - never query EXECVE or audit logs directly
@@ -32,7 +32,7 @@ Use the `lab-monitoring` MCP server tools:

Available labels for log queries:
- `host` - Hostname (e.g., `ns1`, `monitoring01`, `ha1`)
- `systemd_unit` - Systemd unit name (e.g., `nsd.service`, `nixos-upgrade.service`)
- `job` - Either `systemd-journal` (most logs) or `varlog` (file-based logs)
- `job` - Either `systemd-journal` (most logs), `varlog` (file-based logs), or `bootstrap` (VM bootstrap logs)
- `filename` - For the `varlog` job, the log file path
- `hostname` - Alternative to `host` for some streams

@@ -102,6 +102,36 @@ Useful systemd units for troubleshooting:

- `sshd.service` - SSH daemon
- `nix-gc.service` - Nix garbage collection

### Bootstrap Logs

VMs provisioned from template2 send bootstrap progress directly to Loki via curl (before Promtail is available). These logs use `job="bootstrap"` with additional labels:

- `host` - Target hostname
- `branch` - Git branch being deployed
- `stage` - Bootstrap stage (see table below)

**Bootstrap stages:**

| Stage | Message | Meaning |
|-------|---------|---------|
| `starting` | Bootstrap starting for \<host\> (branch: \<branch\>) | Bootstrap service has started |
| `network_ok` | Network connectivity confirmed | Can reach git server |
| `vault_ok` | Vault credentials unwrapped and stored | AppRole credentials provisioned |
| `vault_skip` | No Vault token provided - skipping credential setup | No wrapped token was provided |
| `vault_warn` | Failed to unwrap Vault token - continuing without secrets | Token unwrap failed (expired/used) |
| `building` | Starting nixos-rebuild boot | NixOS build starting |
| `success` | Build successful - rebooting into new configuration | Build complete, rebooting |
| `failed` | nixos-rebuild failed - manual intervention required | Build failed |

**Bootstrap queries:**

```logql
{job="bootstrap"}                              # All bootstrap logs
{job="bootstrap", host="myhost"}               # Specific host
{job="bootstrap", stage="failed"}              # All failures
{job="bootstrap", stage=~"building|success"}   # Track build progress
```

### Extracting JSON Fields

Parse JSON and filter on fields:
@@ -175,15 +205,39 @@ Disk space (root filesystem):

```promql
node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}
```

### Service-Specific Metrics
### Prometheus Jobs

Common job names:
- `node-exporter` - System metrics (all hosts)
- `nixos-exporter` - NixOS version/generation metrics
- `caddy` - Reverse proxy metrics
- `prometheus` / `loki` / `grafana` - Monitoring stack
- `home-assistant` - Home automation
- `step-ca` - Internal CA

All available Prometheus job names:

**System exporters (on all/most hosts):**
- `node-exporter` - System metrics (CPU, memory, disk, network)
- `nixos-exporter` - NixOS flake revision and generation info
- `systemd-exporter` - Systemd unit status metrics
- `homelab-deploy` - Deployment listener metrics

**Service-specific exporters:**
- `caddy` - Reverse proxy metrics (http-proxy)
- `nix-cache_caddy` - Nix binary cache metrics
- `home-assistant` - Home automation metrics (ha1)
- `jellyfin` - Media server metrics (jelly01)
- `kanidm` - Authentication server metrics (kanidm01)
- `nats` - NATS messaging metrics (nats1)
- `openbao` - Secrets management metrics (vault01)
- `unbound` - DNS resolver metrics (ns1, ns2)
- `wireguard` - VPN tunnel metrics (http-proxy)

**Monitoring stack (localhost on monitoring01):**
- `prometheus` - Prometheus self-metrics
- `loki` - Loki self-metrics
- `grafana` - Grafana self-metrics
- `alertmanager` - Alertmanager metrics
- `pushgateway` - Push-based metrics gateway

**External/infrastructure:**
- `pve-exporter` - Proxmox hypervisor metrics
- `smartctl` - Disk SMART health (gunter)
- `restic_rest` - Backup server metrics
- `ghettoptt` - PTT service metrics (gunter)

### Target Labels

@@ -237,6 +291,7 @@ Current host labels:

| ns2 | `role=dns`, `dns_role=secondary` |
| nix-cache01 | `role=build-host` |
| vault01 | `role=vault` |
| kanidm01 | `role=auth`, `tier=test` |
| testvm01/02/03 | `tier=test` |

---

@@ -265,6 +320,17 @@ Current host labels:

3. Check service logs for startup issues
4. Check service metrics are being scraped

### Monitor VM Bootstrap

When provisioning new VMs, track bootstrap progress:

1. Watch bootstrap logs: `{job="bootstrap", host="<hostname>"}`
2. Check for failures: `{job="bootstrap", host="<hostname>", stage="failed"}`
3. After success, verify the host appears in metrics: `up{hostname="<hostname>"}`
4. Check logs are flowing: `{host="<hostname>"}`

See [docs/host-creation.md](../../../docs/host-creation.md) for the full host creation pipeline.

### Debug SSH/Access Issues
@@ -33,6 +33,13 @@

```json
      "--nats-url", "nats://nats1.home.2rjus.net:4222",
      "--nkey-file", "/home/torjus/.config/homelab-deploy/test-deployer.nkey"
    ]
  },
  "git-explorer": {
    "command": "nix",
    "args": ["run", "git+https://git.t-juice.club/torjus/labmcp#git-explorer", "--", "serve"],
    "env": {
      "GIT_REPO_PATH": "/home/torjus/git/nixos-servers"
    }
  }
  }
}
```

## CLAUDE.md (103 lines changed)
@@ -35,6 +35,10 @@ nix build .#create-host

Do not automatically deploy changes. Deployments are usually done by updating the master branch and then triggering the auto-update on the specific host.

### SSH Commands

Do not run SSH commands directly. If a command needs to be run on a remote host, provide the command to the user and ask them to run it manually.

### Testing Feature Branches on Hosts

All hosts have the `nixos-rebuild-test` helper script for testing feature branches before merging:

@@ -152,82 +156,16 @@ Two MCP servers are available for searching NixOS options and packages:

This ensures documentation matches the exact nixpkgs version (currently NixOS 25.11) used by this flake.

### Lab Monitoring Log Queries
### Lab Monitoring

The **lab-monitoring** MCP server can query logs from Loki. All hosts ship systemd journal logs via Promtail.
The **lab-monitoring** MCP server provides access to Prometheus metrics and Loki logs. Use the `/observability` skill for detailed reference on:

**Loki Label Reference:**

- Available Prometheus jobs and exporters
- Loki labels and LogQL query syntax
- Bootstrap log monitoring for new VMs
- Common troubleshooting workflows

- `host` - Hostname (e.g., `ns1`, `ns2`, `monitoring01`, `ha1`). Use this label, not `hostname`.
- `systemd_unit` - Systemd unit name (e.g., `nsd.service`, `prometheus.service`, `nixos-upgrade.service`)
- `job` - Either `systemd-journal` (most logs), `varlog` (file-based logs), or `bootstrap` (VM bootstrap logs)
- `filename` - For the `varlog` job, the log file path (e.g., `/var/log/caddy/nix-cache.log`)

Journal log entries are JSON-formatted with the actual log message in the `MESSAGE` field. Other useful fields include `PRIORITY` and `SYSLOG_IDENTIFIER`.

**Bootstrap Logs:**

VMs provisioned from template2 send bootstrap progress directly to Loki via curl (before Promtail is available). These logs use `job="bootstrap"` with additional labels:

- `host` - Target hostname
- `branch` - Git branch being deployed
- `stage` - Bootstrap stage: `starting`, `network_ok`, `vault_ok`/`vault_skip`/`vault_warn`, `building`, `success`, `failed`

Query bootstrap status:
```
{job="bootstrap"}                              # All bootstrap logs
{job="bootstrap", host="testvm01"}             # Specific host
{job="bootstrap", stage="failed"}              # All failures
{job="bootstrap", stage=~"building|success"}   # Track build progress
```

**Example LogQL queries:**
```
# Logs from a specific service on a host
{host="ns2", systemd_unit="nsd.service"}

# Substring match on log content
{host="ns1", systemd_unit="nsd.service"} |= "error"

# File-based logs (e.g., caddy access logs)
{job="varlog", hostname="nix-cache01"}
```

Default lookback is 1 hour. Use the `start` parameter with relative durations (e.g., `24h`, `168h`) for older logs.

### Lab Monitoring Prometheus Queries

The **lab-monitoring** MCP server can query Prometheus metrics via PromQL. The `instance` label uses the FQDN format `<host>.home.2rjus.net:<port>`.

**Prometheus Job Names:**

- `node-exporter` - System metrics from all hosts (CPU, memory, disk, network)
- `caddy` - Reverse proxy metrics (http-proxy)
- `nix-cache_caddy` - Nix binary cache metrics
- `home-assistant` - Home automation metrics
- `jellyfin` - Media server metrics
- `loki` / `prometheus` / `grafana` - Monitoring stack self-metrics
- `pve-exporter` - Proxmox hypervisor metrics
- `smartctl` - Disk SMART health (gunter)
- `wireguard` - VPN metrics (http-proxy)
- `pushgateway` - Push-based metrics (e.g., backup results)
- `restic_rest` - Backup server metrics
- `ghettoptt` / `alertmanager` - Other service metrics

**Example PromQL queries:**
```
# Check all targets are up
up

# CPU usage for a specific host
rate(node_cpu_seconds_total{instance=~"ns1.*", mode!="idle"}[5m])

# Memory usage across all hosts
node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes

# Disk space
node_filesystem_avail_bytes{mountpoint="/"}
```
The skill contains up-to-date information about all scrape targets, host labels, and example queries.

### Deploying to Test Hosts

@@ -496,20 +434,11 @@ This means:

### Adding a New Host

1. Create a `/hosts/<hostname>/` directory
2. Copy the structure from `template1` or a similar host
3. Add a host entry to `flake.nix` nixosConfigurations
4. Configure networking in `configuration.nix` (static IP via `systemd.network.networks`, DNS servers)
5. (Optional) Add `homelab.dns.cnames` if the host needs CNAME aliases
6. Add `vault.enable = true;` to the host configuration
7. Add an AppRole policy in `terraform/vault/approle.tf` and any secrets in `secrets.tf`
8. Run `tofu apply` in `terraform/vault/`
9. User clones the template host
10. User runs `prepare-host.sh` on the new host
11. Provision AppRole credentials: `nix develop -c ansible-playbook playbooks/provision-approle.yml -e hostname=<host>`
12. Commit changes and merge to master
13. Deploy by running `nixos-rebuild boot --flake URL#<hostname>` on the host
14. Run auto-upgrade on the DNS servers (ns1, ns2) to pick up the new host's DNS entry

See [docs/host-creation.md](docs/host-creation.md) for the complete host creation pipeline, including:
- Using the `create-host` script to generate host configurations
- Deploying VMs and secrets with OpenTofu
- Monitoring the bootstrap process via Loki
- Verification and troubleshooting steps

**Note:** DNS A records and Prometheus node-exporter scrape targets are auto-generated from the host's `systemd.network.networks` static IP configuration. No manual zone file or Prometheus config editing is required.

@@ -13,7 +13,6 @@ NixOS Flake-based configuration repository for a homelab infrastructure. All hos

| `monitoring01` | Prometheus, Grafana, Loki, Tempo, Pyroscope |
| `jelly01` | Jellyfin media server |
| `nix-cache01` | Nix binary cache |
| `pgdb1` | PostgreSQL |
| `nats1` | NATS messaging |
| `vault01` | OpenBao (Vault) secrets management |
| `template1`, `template2` | VM templates for cloning new hosts |

## common/ssh-audit.nix (new file, 21 lines)
```nix
# SSH session command auditing
#
# Logs all commands executed by users who logged in interactively (SSH).
# System services and nix builds are excluded via auid filter.
#
# Logs are sent to journald and forwarded to Loki via promtail.
# Query with: {host="<hostname>"} |= "EXECVE"
{
  # Enable Linux audit subsystem
  security.audit.enable = true;
  security.auditd.enable = true;

  # Log execve syscalls only from interactive login sessions
  # auid!=4294967295 means "audit login uid is set" (excludes system services, nix builds)
  security.audit.rules = [
    "-a exit,always -F arch=b64 -S execve -F auid!=4294967295"
  ];

  # Forward audit logs to journald (so promtail ships them to Loki)
  services.journald.audit = true;
}
```

## docs/host-creation.md (new file, 217 lines)
|
||||
# Host Creation Pipeline
|
||||
|
||||
This document describes the process for creating new hosts in the homelab infrastructure.
|
||||
|
||||
## Overview
|
||||
|
||||
We use the `create-host` script to create new hosts, which generates default configurations from a template. We then use OpenTofu to deploy both secrets and VMs. The VMs boot using a template image (built from `hosts/template2`), which starts a bootstrap process. This bootstrap process applies the host's NixOS configuration and then reboots into the new config.
|
||||
|
||||
## Prerequisites
|
||||
|
||||
All tools are available in the devshell: `create-host`, `bao` (OpenBao CLI), `tofu`.
|
||||
|
||||
```bash
|
||||
nix develop
|
||||
```
|
||||
|
||||
## Steps
|
||||
|
||||
Steps marked with **USER** must be performed by the user due to credential requirements.
|
||||
|
||||
1. **USER**: Run `create-host --hostname <name> --ip <ip/prefix>`
2. Edit the auto-generated configurations in `hosts/<hostname>/` to import whatever modules are needed for its purpose
3. Add any secrets needed to `terraform/vault/`
4. Edit the VM specs in `terraform/vms.tf` if needed. To deploy from a branch other than master, add `flake_branch = "<branch>"` to the VM definition
5. Push configuration to master (or the branch specified by `flake_branch`)
6. **USER**: Apply terraform:

   ```bash
   nix develop -c tofu -chdir=terraform/vault apply
   nix develop -c tofu -chdir=terraform apply
   ```

7. Once terraform completes, a VM boots in Proxmox using the template image
8. The VM runs the `nixos-bootstrap` service, which applies the host config and reboots
9. After reboot, the host should be operational
10. Trigger auto-upgrade on `ns1` and `ns2` to propagate DNS records for the new host
11. Trigger auto-upgrade on `monitoring01` to add the host to Prometheus scrape targets

## Tier Specification

New hosts should set `homelab.host.tier` in their configuration:

```nix
homelab.host.tier = "test"; # or "prod"
```

- **test** - Test-tier hosts can receive remote deployments via the `homelab-deploy` MCP server and have different credential access. Use for staging/testing.
- **prod** - Production hosts. Deployments require direct access or the CLI with appropriate credentials.
## Observability

During the bootstrap process, status updates are sent to Loki. Query bootstrap logs with:

```
{job="bootstrap", host="<hostname>"}
```

### Bootstrap Stages

The bootstrap process reports these stages via the `stage` label:

| Stage | Message | Meaning |
|-------|---------|---------|
| `starting` | Bootstrap starting for \<host\> (branch: \<branch\>) | Bootstrap service has started |
| `network_ok` | Network connectivity confirmed | Can reach git server |
| `vault_ok` | Vault credentials unwrapped and stored | AppRole credentials provisioned |
| `vault_skip` | No Vault token provided - skipping credential setup | No wrapped token was provided |
| `vault_warn` | Failed to unwrap Vault token - continuing without secrets | Token unwrap failed (expired/used) |
| `building` | Starting nixos-rebuild boot | NixOS build starting |
| `success` | Build successful - rebooting into new configuration | Build complete, rebooting |
| `failed` | nixos-rebuild failed - manual intervention required | Build failed |

### Useful Queries

```
# All bootstrap activity for a host
{job="bootstrap", host="myhost"}

# Track all failures
{job="bootstrap", stage="failed"}

# Monitor builds in progress
{job="bootstrap", stage=~"building|success"}
```

Once the VM reboots with its full configuration, it will start publishing metrics to Prometheus and logs to Loki via Promtail.
## Verification

1. Check bootstrap completed successfully:

   ```
   {job="bootstrap", host="<hostname>", stage="success"}
   ```

2. Verify the host is up and reporting metrics:

   ```promql
   up{instance=~"<hostname>.*"}
   ```

3. Verify the correct flake revision is deployed:

   ```promql
   nixos_flake_info{instance=~"<hostname>.*"}
   ```

4. Check logs are flowing:

   ```
   {host="<hostname>"}
   ```

5. Confirm expected services are running and producing logs
## Troubleshooting

### Bootstrap Failed

#### Common Issues
* The VM fails to complete the initial nixos-rebuild. This usually happens when packages are missing from our local nix-cache and must be compiled from scratch, which can exhaust the VM's resources.

#### Troubleshooting

1. Check bootstrap logs in Loki - if they never progress past `building`, the rebuild likely consumed all resources:

   ```
   {job="bootstrap", host="<hostname>"}
   ```

2. **USER**: SSH into the host and check the bootstrap service:

   ```bash
   ssh root@<hostname>
   journalctl -u nixos-bootstrap.service
   ```

3. If the build failed due to resource constraints, increase VM specs in `terraform/vms.tf` and redeploy, or manually run the rebuild:

   ```bash
   nixos-rebuild boot --flake git+https://git.t-juice.club/torjus/nixos-servers.git#<hostname>
   ```

4. If the host config doesn't exist in the flake, ensure step 5 was completed (config pushed to the correct branch).
### Vault Credentials Not Working
Usually caused by running the `create-host` script without proper credentials, or by a wrapped token that has expired or already been used.

#### Troubleshooting

1. Check if credentials exist on the host:

   ```bash
   ssh root@<hostname>
   ls -la /var/lib/vault/approle/
   ```

2. Check bootstrap logs for vault-related stages:

   ```
   {job="bootstrap", host="<hostname>", stage=~"vault.*"}
   ```

3. **USER**: Regenerate and provision credentials manually:

   ```bash
   nix develop -c ansible-playbook playbooks/provision-approle.yml -e hostname=<hostname>
   ```
### Host Not Appearing in DNS
Usually caused by the commit adding the new host not yet being deployed to ns1/ns2.

#### Troubleshooting

1. Verify the host config has a static IP configured in `systemd.network.networks`

2. Check that `homelab.dns.enable` is not set to `false`

3. **USER**: Trigger auto-upgrade on DNS servers:

   ```bash
   ssh root@ns1 systemctl start nixos-upgrade.service
   ssh root@ns2 systemctl start nixos-upgrade.service
   ```

4. Verify DNS resolution after upgrade completes:

   ```bash
   dig @ns1.home.2rjus.net <hostname>.home.2rjus.net
   ```
### Host Not Being Scraped by Prometheus
Usually caused by the commit adding the new host not yet being deployed to the monitoring host.

#### Troubleshooting

1. Check that `homelab.monitoring.enable` is not set to `false`

2. **USER**: Trigger auto-upgrade on monitoring01:

   ```bash
   ssh root@monitoring01 systemctl start nixos-upgrade.service
   ```

3. Verify the target appears in Prometheus:

   ```promql
   up{instance=~"<hostname>.*"}
   ```

4. If the target is down, check that node-exporter is running on the host:

   ```bash
   ssh root@<hostname> systemctl status prometheus-node-exporter.service
   ```
## Related Files

| Path | Description |
|------|-------------|
| `scripts/create-host/` | The `create-host` script that generates host configurations |
| `hosts/template2/` | Template VM configuration (base image for new VMs) |
| `hosts/template2/bootstrap.nix` | Bootstrap service that applies NixOS config on first boot |
| `terraform/vms.tf` | VM definitions (specs, IPs, branch overrides) |
| `terraform/cloud-init.tf` | Cloud-init configuration (passes hostname, branch, vault token) |
| `terraform/vault/approle.tf` | AppRole policies for each host |
| `terraform/vault/secrets.tf` | Secret definitions in Vault |
| `terraform/vault/hosts-generated.tf` | Auto-generated wrapped tokens for VM bootstrap |
| `playbooks/provision-approle.yml` | Ansible playbook for manual credential provisioning |
| `flake.nix` | Flake with all host configurations (add new hosts here) |
@@ -2,7 +2,7 @@

## Overview

Replace the current auth01 setup (LLDAP + Authelia) with a modern, unified authentication solution. The current setup is not in active use, making this a good time to evaluate alternatives.
Deploy a modern, unified authentication solution for the homelab. Provides central user management, SSO for web services, and consistent UID/GID mapping for NAS permissions.

## Goals

@@ -11,66 +11,9 @@ Replace the current auth01 setup (LLDAP + Authelia) with a modern, unified authe
3. **UID/GID consistency** - Proper POSIX attributes for NAS share permissions
4. **OIDC provider** - Single sign-on for homelab web services (Grafana, etc.)

## Options Evaluated
## Solution: Kanidm

### OpenLDAP (raw)

- **NixOS Support:** Good (`services.openldap` with `declarativeContents`)
- **Pros:** Most widely supported, very flexible
- **Cons:** LDIF format is painful, schema management is complex, no built-in OIDC, requires SSSD on each client
- **Verdict:** Doesn't address LDAP complexity concerns

### LLDAP + Authelia (current)

- **NixOS Support:** Both have good modules
- **Pros:** Already configured, lightweight, nice web UIs
- **Cons:** Two services to manage, limited POSIX attribute support in LLDAP, requires SSSD on every client host
- **Verdict:** Workable but has friction for NAS/UID goals

### FreeIPA

- **NixOS Support:** None
- **Pros:** Full enterprise solution (LDAP + Kerberos + DNS + CA)
- **Cons:** Extremely heavy, wants to own DNS, designed for Red Hat ecosystems, massive overkill for homelab
- **Verdict:** Overkill, no NixOS support

### Keycloak

- **NixOS Support:** None
- **Pros:** Good OIDC/SAML, nice UI
- **Cons:** Primarily an identity broker not a user directory, poor POSIX support, heavy (Java)
- **Verdict:** Wrong tool for Linux user management

### Authentik

- **NixOS Support:** None (would need Docker)
- **Pros:** All-in-one with LDAP outpost and OIDC, modern UI
- **Cons:** Heavy stack (Python + PostgreSQL + Redis), LDAP is a separate component
- **Verdict:** Would work but requires Docker and is heavy

### Kanidm

- **NixOS Support:** Excellent - first-class module with PAM/NSS integration
- **Pros:**
  - Native PAM/NSS module (no SSSD needed)
  - Built-in OIDC provider
  - Optional LDAP interface for legacy services
  - Declarative provisioning via NixOS (users, groups, OAuth2 clients)
  - Modern, written in Rust
  - Single service handles everything
- **Cons:** Newer project, smaller community than LDAP
- **Verdict:** Best fit for requirements

### Pocket-ID

- **NixOS Support:** Unknown
- **Pros:** Very lightweight, passkey-first
- **Cons:** No LDAP, no PAM/NSS integration - purely OIDC for web apps
- **Verdict:** Doesn't solve Linux user management goal

## Recommendation: Kanidm

Kanidm is the recommended solution for the following reasons:
Kanidm was chosen for the following reasons:

| Requirement | Kanidm Support |
|-------------|----------------|
@@ -82,42 +25,10 @@ Kanidm is the recommended solution for the following reasons:
| Simplicity | Modern API, LDAP optional |
| NixOS integration | First-class |

### Key NixOS Features
### Configuration Files

**Server configuration:**
```nix
services.kanidm.enableServer = true;
services.kanidm.serverSettings = {
  domain = "home.2rjus.net";
  origin = "https://auth.home.2rjus.net";
  ldapbindaddress = "0.0.0.0:636"; # Optional LDAP interface
};
```

**Declarative user provisioning:**
```nix
services.kanidm.provision.enable = true;
services.kanidm.provision.persons.torjus = {
  displayName = "Torjus";
  groups = [ "admins" "nas-users" ];
};
```

**Declarative OAuth2 clients:**
```nix
services.kanidm.provision.systems.oauth2.grafana = {
  displayName = "Grafana";
  originUrl = "https://grafana.home.2rjus.net/login/generic_oauth";
  originLanding = "https://grafana.home.2rjus.net";
};
```

**Client host configuration (add to system/):**
```nix
services.kanidm.enableClient = true;
services.kanidm.enablePam = true;
services.kanidm.clientSettings.uri = "https://auth.home.2rjus.net";
```
- **Host configuration:** `hosts/kanidm01/`
- **Service module:** `services/kanidm/default.nix`
## NAS Integration

@@ -148,42 +59,103 @@ This future migration path is a strong argument for Kanidm over LDAP-only soluti

## Implementation Steps

1. **Create Kanidm service module** in `services/kanidm/`
   - Server configuration
   - TLS via internal ACME
   - Vault secrets for admin passwords
1. **Create kanidm01 host and service module** ✅
   - Host: `kanidm01.home.2rjus.net` (10.69.13.23, test tier)
   - Service module: `services/kanidm/`
   - TLS via internal ACME (`auth.home.2rjus.net`)
   - Vault integration for idm_admin password
   - LDAPS on port 636

2. **Configure declarative provisioning**
   - Define initial users and groups
   - Set up POSIX attributes (UID/GID ranges)
2. **Configure provisioning** ✅
   - Groups provisioned declaratively: `admins`, `users`, `ssh-users`
   - Users managed imperatively via CLI (allows setting POSIX passwords in one step)
   - POSIX attributes enabled (UID/GID range 65,536-69,999)

3. **Add OIDC clients** for homelab services
   - Grafana
   - Other services as needed

4. **Create client module** in `system/` for PAM/NSS
   - Enable on all hosts that need central auth
   - Configure trusted CA

5. **Test NAS integration**
3. **Test NAS integration** (in progress)
   - ✅ LDAP interface verified working
   - Configure TrueNAS LDAP client to connect to Kanidm
   - Verify UID/GID mapping works with NFS shares

6. **Migrate auth01**
   - Remove LLDAP and Authelia services
   - Deploy Kanidm
   - Update DNS CNAMEs if needed
4. **Add OIDC clients** for homelab services
   - Grafana
   - Other services as needed

7. **Documentation**
   - User management procedures
   - Adding new OAuth2 clients
   - Troubleshooting PAM/NSS issues
5. **Create client module** in `system/` for PAM/NSS ✅
   - Module: `system/kanidm-client.nix`
   - `homelab.kanidm.enable = true` enables PAM/NSS
   - Short usernames (not SPN format)
   - Home directory symlinks via `home_alias`
   - Enabled on test tier: testvm01, testvm02, testvm03

## Open Questions
6. **Documentation** ✅
   - `docs/user-management.md` - CLI workflows, troubleshooting
   - User/group creation procedures verified working

- What UID/GID range should be reserved for Kanidm-managed users?
- Which hosts should have PAM/NSS enabled initially?
- What OAuth2 clients are needed at launch?
## Progress
### Completed (2026-02-08)

**Kanidm server deployed on kanidm01 (test tier):**
- Host: `kanidm01.home.2rjus.net` (10.69.13.23)
- WebUI: `https://auth.home.2rjus.net`
- LDAPS: port 636
- Valid certificate from internal CA

**Configuration:**
- Kanidm 1.8 with secret provisioning support
- Daily backups at 22:00 (7 versions retained)
- Vault integration for idm_admin password
- Prometheus monitoring scrape target configured

**Provisioned entities:**
- Groups: `admins`, `users`, `ssh-users` (declarative)
- Users managed via CLI (imperative)

**Verified working:**
- WebUI login with idm_admin
- LDAP bind and search with POSIX-enabled user
- LDAPS with valid internal CA certificate

### Completed (2026-02-08) - PAM/NSS Client

**Client module deployed (`system/kanidm-client.nix`):**
- `homelab.kanidm.enable = true` enables PAM/NSS integration
- Connects to auth.home.2rjus.net
- Short usernames (`torjus` instead of `torjus@home.2rjus.net`)
- Home directory symlinks (`/home/torjus` → UUID-based dir)
- Login restricted to `ssh-users` group

**Enabled on test tier:**
- testvm01, testvm02, testvm03

**Verified working:**
- User/group resolution via `getent`
- SSH login with Kanidm unix passwords
- Home directory creation with symlinks
- Imperative user/group creation via CLI

**Documentation:**
- `docs/user-management.md` with full CLI workflows
- Password requirements (min 10 chars)
- Troubleshooting guide (nscd, cache invalidation)
### UID/GID Range (Resolved)

**Range: 65,536 - 69,999** (manually allocated)

- Users: 65,536 - 67,999 (up to ~2500 users)
- Groups: 68,000 - 69,999 (up to ~2000 groups)

Rationale:
- Starts at Kanidm's recommended minimum (65,536)
- Well above NixOS system users (typically <1000)
- Avoids Podman/container issues with very high GIDs
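
Since users are managed imperatively, creating one inside this range could look roughly like the following. This is a sketch only: the account name, UID, and token flow are illustrative, and the exact subcommands and flags should be checked against `kanidm --help` for the deployed version.

```bash
# Hypothetical example user "alice"; run as an admin session (--name idm_admin)
kanidm person create alice "Alice Example" --name idm_admin
kanidm person posix set alice --gidnumber 65537 --name idm_admin
kanidm group add-members ssh-users alice --name idm_admin
kanidm person credential create-reset-token alice --name idm_admin
```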

### Next Steps

1. Enable PAM/NSS on production hosts (after test tier validation)
2. Configure TrueNAS LDAP client for NAS integration testing
3. Add OAuth2 clients (Grafana first)

## References

107 docs/plans/completed/ns1-recreation.md Normal file
@@ -0,0 +1,107 @@
# ns1 Recreation Plan

## Overview

Recreate ns1 using the OpenTofu workflow after the existing VM entered emergency mode due to incorrect hardware-configuration.nix (hardcoded UUIDs that don't match actual disk layout).

## Current ns1 Configuration to Preserve

- **IP:** 10.69.13.5/24
- **Gateway:** 10.69.13.1
- **Role:** Primary DNS (authoritative + resolver)
- **Services:**
  - `../../services/ns/master-authorative.nix`
  - `../../services/ns/resolver.nix`
- **Metadata:**
  - `homelab.host.role = "dns"`
  - `homelab.host.labels.dns_role = "primary"`
- **Vault:** enabled
- **Deploy:** enabled

## Execution Steps

### Phase 1: Remove Old Configuration

```bash
nix develop -c create-host --remove --hostname ns1 --force
```

This removes:
- `hosts/ns1/` directory
- Entry from `flake.nix`
- Any terraform entries (none exist currently)

### Phase 2: Create New Configuration

```bash
nix develop -c create-host --hostname ns1 --ip 10.69.13.5/24
```

This creates:
- `hosts/ns1/` with template2-based configuration
- Entry in `flake.nix`
- Entry in `terraform/vms.tf`
- Vault wrapped token for bootstrap

### Phase 3: Customize Configuration

After create-host, manually update `hosts/ns1/configuration.nix` to add:

1. DNS service imports:

   ```nix
   ../../services/ns/master-authorative.nix
   ../../services/ns/resolver.nix
   ```

2. Host metadata:

   ```nix
   homelab.host = {
     tier = "prod";
     role = "dns";
     labels.dns_role = "primary";
   };
   ```

3. Disable resolved (conflicts with Unbound):

   ```nix
   services.resolved.enable = false;
   ```

### Phase 4: Commit Changes

```bash
git add -A
git commit -m "ns1: recreate with OpenTofu workflow

Old VM had incorrect hardware-configuration.nix with hardcoded UUIDs
that didn't match actual disk layout, causing boot failure.

Recreated using template2-based configuration for OpenTofu provisioning."
```

### Phase 5: Infrastructure

1. Delete old ns1 VM in Proxmox (it's broken anyway)
2. Run `nix develop -c tofu -chdir=terraform apply`
3. Wait for bootstrap to complete
4. Verify ns1 is functional:
   - DNS resolution working
   - Zone transfer to ns2 working
   - All exporters responding

### Phase 6: Finalize

- Push to master
- Move this plan to `docs/plans/completed/`

## Rollback

If the new VM fails:
1. ns2 is still operational as secondary DNS
2. Can recreate with different settings if needed

## Notes

- ns2 will continue serving DNS during the migration
- Zone data is generated from flake, so no data loss
- The old VM's disk can be kept briefly in Proxmox as backup if desired
@@ -9,24 +9,23 @@ hosts are decommissioned or deferred.

## Current State

Hosts already managed by OpenTofu: `vault01`, `testvm01`, `vaulttest01`
Hosts already managed by OpenTofu: `vault01`, `testvm01`, `testvm02`, `testvm03`, `ns2`, `ns1`

Hosts to migrate:

| Host | Category | Notes |
|------|----------|-------|
| ns1 | Stateless | Primary DNS, recreate |
| ns2 | Stateless | Secondary DNS, recreate |
| ~~ns1~~ | ~~Stateless~~ | ✓ Complete |
| nix-cache01 | Stateless | Binary cache, recreate |
| http-proxy | Stateless | Reverse proxy, recreate |
| nats1 | Stateless | Messaging, recreate |
| auth01 | Decommission | No longer in use |
| ha1 | Stateful | Home Assistant + Zigbee2MQTT + Mosquitto |
| monitoring01 | Stateful | Prometheus, Grafana, Loki |
| jelly01 | Stateful | Jellyfin metadata, watch history, config |
| pgdb1 | Stateful | PostgreSQL databases |
| jump | Decommission | No longer needed |
| ca | Deferred | Pending Phase 4c PKI migration to OpenBao |
| pgdb1 | Decommission | Only used by Open WebUI on gunter, migrating to local postgres |
| ~~jump~~ | ~~Decommission~~ | ✓ Complete |
| ~~auth01~~ | ~~Decommission~~ | ✓ Complete |
| ~~ca~~ | ~~Deferred~~ | ✓ Complete |
## Phase 1: Backup Preparation

@@ -46,39 +45,19 @@ No backup currently exists. Add a restic backup job for `/var/lib/jellyfin/` whi
Media files are on the NAS (`nas.home.2rjus.net:/mnt/hdd-pool/media`) and do not need backup.
The cache directory (`/var/cache/jellyfin/`) does not need backup — it regenerates.
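
A minimal sketch of such a job with the NixOS restic module (the job name, repository URL, password file path, and schedule are assumptions, not taken from the repo):

```nix
# Hypothetical job definition for jelly01; adjust names and paths to the
# repo's conventions before use.
services.restic.backups.jellyfin = {
  paths = [ "/var/lib/jellyfin" ];
  repository = "rest:https://restic.home.2rjus.net/jelly01"; # placeholder
  passwordFile = "/run/secrets/restic-password";             # placeholder
  timerConfig.OnCalendar = "22:00";
  pruneOpts = [ "--keep-last 7" ];
};
```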

### 1c. Add PostgreSQL Backup to pgdb1

No backup currently exists. Add a restic backup job with a `pg_dumpall` pre-hook to capture
all databases and roles. The dump should be piped through restic's stdin backup (similar to
the Grafana DB dump pattern on monitoring01).
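
The stdin pattern could look roughly like this (the repository URL and password file path are placeholders; in practice this would run from a systemd unit or a restic pre-hook):

```bash
# Dump every database and role, streaming straight into restic (no temp file).
# Repository and password-file paths are illustrative.
pg_dumpall -U postgres \
  | restic -r rest:https://restic.home.2rjus.net/pgdb1 \
      --password-file /run/secrets/restic-password \
      backup --stdin --stdin-filename pgdb1-dumpall.sql
```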

### 1d. Verify Existing ha1 Backup
### 1c. Verify Existing ha1 Backup

ha1 already backs up `/var/lib/hass`, `/var/lib/zigbee2mqtt`, `/var/lib/mosquitto`. Verify
these backups are current and restorable before proceeding with migration.

### 1e. Verify All Backups
### 1d. Verify All Backups

After adding/expanding backup jobs:
1. Trigger a manual backup run on each host
2. Verify backup integrity with `restic check`
3. Test a restore to a temporary location to confirm data is recoverable

## Phase 2: Declare pgdb1 Databases in Nix

Before migrating pgdb1, audit the manually-created databases and users on the running
instance, then declare them in the Nix configuration using `ensureDatabases` and
`ensureUsers`. This makes the PostgreSQL setup reproducible on the new host.

Steps:
1. SSH to pgdb1, run `\l` and `\du` in psql to list databases and roles
2. Add `ensureDatabases` and `ensureUsers` to `services/postgres/postgres.nix`
3. Document any non-default PostgreSQL settings or extensions per database

After reprovisioning, the databases will be created by NixOS, and data restored from the
`pg_dumpall` backup.
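
The declaration would look something like this (the database and user names are illustrative, not the result of the audit):

```nix
# Illustrative names only - the real list comes from auditing the live instance
services.postgresql = {
  enable = true;
  ensureDatabases = [ "openwebui" ];
  ensureUsers = [
    {
      name = "openwebui";
      ensureDBOwnership = true; # grants ownership of the same-named database
    }
  ];
};
```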

## Phase 3: Stateless Host Migration
## Phase 2: Stateless Host Migration

These hosts have no meaningful state and can be recreated fresh. For each host:

@@ -95,13 +74,14 @@ Migrate stateless hosts in an order that minimizes disruption:

1. **nix-cache01** — low risk, no downstream dependencies during migration
2. **nats1** — low risk, verify no persistent JetStream streams first
4. **http-proxy** — brief disruption to proxied services, migrate during low-traffic window
5. **ns1, ns2** — migrate one at a time, verify DNS resolution between each
3. **http-proxy** — brief disruption to proxied services, migrate during low-traffic window
4. ~~**ns1** — ns2 already migrated, verify AXFR works after ns1 migration~~ ✓ Complete

For ns1/ns2: migrate ns2 first (secondary), verify AXFR works, then migrate ns1. All hosts
use both ns1 and ns2 as resolvers, so one being down briefly is tolerable.
~~For ns1/ns2: migrate ns2 first (secondary), verify AXFR works, then migrate ns1.~~ Both ns1
and ns2 migration complete. Zone transfer (AXFR) verified working between ns1 (primary) and
ns2 (secondary).

## Phase 4: Stateful Host Migration
## Phase 3: Stateful Host Migration

For each stateful host, the procedure is:

@@ -114,17 +94,7 @@ For each stateful host, the procedure is:
7. Start services and verify functionality
8. Decommission the old VM

### 4a. pgdb1

1. Run final `pg_dumpall` backup via restic
2. Stop PostgreSQL on the old host
3. Provision new pgdb1 via OpenTofu
4. After bootstrap, NixOS creates the declared databases/users
5. Restore data with `pg_restore` or `psql < dumpall.sql`
6. Verify database connectivity from gunter (`10.69.30.105`)
7. Decommission old VM

### 4b. monitoring01
### 3a. monitoring01

1. Run final Grafana backup
2. Provision new monitoring01 via OpenTofu
@@ -134,7 +104,7 @@ For each stateful host, the procedure is:
6. Verify all scrape targets are being collected
7. Decommission old VM

### 4c. jelly01
### 3b. jelly01

1. Run final Jellyfin backup
2. Provision new jelly01 via OpenTofu
@@ -143,7 +113,7 @@ For each stateful host, the procedure is:
5. Start Jellyfin, verify watch history and library metadata are present
6. Decommission old VM

### 4d. ha1
### 3c. ha1

1. Verify latest restic backup is current
2. Stop Home Assistant, Zigbee2MQTT, and Mosquitto on old host
@@ -167,47 +137,69 @@ OpenTofu/Proxmox. Verify the USB device ID on the hypervisor and add the appropr
`usb` block to the VM definition in `terraform/vms.tf`. The USB device must be passed
through before starting Zigbee2MQTT on the new host.
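
Assuming a Telmate-style `proxmox_vm_qemu` resource, that `usb` block could look roughly like this. The resource name, device ID, and attribute names are placeholders; check the provider actually used in `terraform/vms.tf` and `lsusb` output on the hypervisor:

```hcl
# Hypothetical fragment - 1a86:7523 is a common Zigbee USB serial chip ID,
# shown only as a placeholder for the real vendor:product pair.
resource "proxmox_vm_qemu" "ha1" {
  # ... existing VM settings ...

  usb {
    host = "1a86:7523" # vendor:product ID from lsusb on the hypervisor
    usb3 = false
  }
}
```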

## Phase 5: Decommission jump and auth01 Hosts
## Phase 4: Decommission Hosts

### jump
1. Verify nothing depends on the jump host (no SSH proxy configs pointing to it, etc.)
2. Remove host configuration from `hosts/jump/`
3. Remove from `flake.nix`
4. Remove any secrets in `secrets/jump/`
5. Remove from `.sops.yaml`
### jump ✓ COMPLETE

~~1. Verify nothing depends on the jump host (no SSH proxy configs pointing to it, etc.)~~
~~2. Remove host configuration from `hosts/jump/`~~
~~3. Remove from `flake.nix`~~
~~4. Remove any secrets in `secrets/jump/`~~
~~5. Remove from `.sops.yaml`~~
~~6. Destroy the VM in Proxmox~~
~~7. Commit cleanup~~

Host was already removed from flake.nix and VM destroyed. Configuration cleaned up in ba9f47f.

### auth01 ✓ COMPLETE

~~1. Remove host configuration from `hosts/auth01/`~~
~~2. Remove from `flake.nix`~~
~~3. Remove any secrets in `secrets/auth01/`~~
~~4. Remove from `.sops.yaml`~~
~~5. Remove `services/authelia/` and `services/lldap/` (only used by auth01)~~
~~6. Destroy the VM in Proxmox~~
~~7. Commit cleanup~~

Host configuration, services, and VM already removed.

### pgdb1 (in progress)

Only consumer was Open WebUI on gunter, which has been migrated to use local PostgreSQL.

1. ~~Verify Open WebUI on gunter is using local PostgreSQL (not pgdb1)~~ ✓
2. ~~Remove host configuration from `hosts/pgdb1/`~~ ✓
3. ~~Remove `services/postgres/` (only used by pgdb1)~~ ✓
4. ~~Remove from `flake.nix`~~ ✓
5. ~~Remove Vault AppRole from `terraform/vault/approle.tf`~~ ✓
6. Destroy the VM in Proxmox
7. Commit cleanup
7. ~~Commit cleanup~~ ✓

### auth01
1. Remove host configuration from `hosts/auth01/`
2. Remove from `flake.nix`
3. Remove any secrets in `secrets/auth01/`
4. Remove from `.sops.yaml`
5. Remove `services/authelia/` and `services/lldap/` (only used by auth01)
6. Destroy the VM in Proxmox
7. Commit cleanup
See `docs/plans/pgdb1-decommission.md` for detailed plan.

## Phase 6: Decommission ca Host (Deferred)
## Phase 5: Decommission ca Host ✓ COMPLETE

Deferred until Phase 4c (PKI migration to OpenBao) is complete. Once all hosts use the
~~Deferred until Phase 4c (PKI migration to OpenBao) is complete. Once all hosts use the
OpenBao ACME endpoint for certificates, the step-ca host can be decommissioned following
the same cleanup steps as the jump host.
the same cleanup steps as the jump host.~~

## Phase 7: Remove sops-nix
PKI migration to OpenBao complete. Host configuration, `services/ca/`, and VM removed.

Once `ca` is decommissioned (Phase 6), `sops-nix` is no longer used by any host. Remove
all remnants:
- `sops-nix` input from `flake.nix` and `flake.lock`
- `sops-nix.nixosModules.sops` from all host module lists in `flake.nix`
- `inherit sops-nix` from all specialArgs in `flake.nix`
- `system/sops.nix` and its import in `system/default.nix`
- `.sops.yaml`
- `secrets/` directory
- All `sops.secrets.*` declarations in `services/ca/`, `services/authelia/`, `services/lldap/`
- Template scripts that generate age keys for sops (`hosts/template/scripts.nix`,
  `hosts/template2/scripts.nix`)
## Phase 6: Remove sops-nix ✓ COMPLETE

See `docs/plans/completed/sops-to-openbao-migration.md` for full context.
~~Once `ca` is decommissioned (Phase 6), `sops-nix` is no longer used by any host. Remove
all remnants:~~
~~- `sops-nix` input from `flake.nix` and `flake.lock`~~
~~- `sops-nix.nixosModules.sops` from all host module lists in `flake.nix`~~
~~- `inherit sops-nix` from all specialArgs in `flake.nix`~~
~~- `system/sops.nix` and its import in `system/default.nix`~~
~~- `.sops.yaml`~~
~~- `secrets/` directory~~
~~- All `sops.secrets.*` declarations in `services/ca/`, `services/authelia/`, `services/lldap/`~~
~~- Template scripts that generate age keys for sops (`hosts/template/scripts.nix`,
  `hosts/template2/scripts.nix`)~~

All sops-nix remnants removed. See `docs/plans/completed/sops-to-openbao-migration.md` for context.

## Notes

@@ -216,7 +208,7 @@ See `docs/plans/completed/sops-to-openbao-migration.md` for full context.
- The old VMs use IPs that the new VMs need, so the old VM must be shut down before
  the new one is provisioned (or use a temporary IP and swap after verification)
- Stateful migrations should be done during low-usage windows
- After all migrations are complete, the only hosts not in OpenTofu will be ca (deferred)
- After all migrations are complete, all decommissioned hosts (jump, auth01, ca) have been removed
- Since many hosts are being recreated, this is a good opportunity to establish consistent
  hostname naming conventions before provisioning the new VMs. Current naming is inconsistent
  (e.g. `ns1` vs `nix-cache01`, `ha1` vs `auth01`, `pgdb1` vs `http-proxy`). Decide on a
116 docs/plans/memory-issues-follow-up.md Normal file
@@ -0,0 +1,116 @@
# Memory Issues Follow-up

Tracking the zram change to verify it resolves OOM issues during nixos-upgrade on low-memory hosts.

## Background

On 2026-02-08, ns2 (2GB RAM) experienced an OOM kill during nixos-upgrade. The Nix evaluation process consumed ~1.6GB before being killed by the kernel. ns1 (manually increased to 4GB) succeeded with the same upgrade.

Root cause: 2GB RAM is insufficient for Nix flake evaluation without swap.

## Fix Applied

**Commit:** `1674b6a` - system: enable zram swap for all hosts

**Merged:** 2026-02-08 ~12:15 UTC

**Change:** Added `zramSwap.enable = true` to `system/zram.nix`, providing ~2GB compressed swap on all hosts.
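The change itself is small; a sketch of what `system/zram.nix` plausibly contains, using the standard NixOS `zramSwap` module (the sizing option is an assumption inferred from the ~2GB figure above, not taken from the commit):

```nix
{ ... }:
{
  zramSwap = {
    enable = true;
    # Hypothetical sizing: the NixOS default is memoryPercent = 50; getting
    # ~2GB of swap on a 2GB host implies something closer to 100.
    memoryPercent = 100;
  };
}
```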
## Timeline

| Time (UTC) | Event |
|------------|-------|
| 05:00:46 | ns2 nixos-upgrade OOM killed |
| 05:01:47 | `nixos_upgrade_failed` alert fired |
| 12:15 | zram commit merged to master |
| 12:19 | ns2 rebooted with zram enabled |
| 12:20 | ns1 rebooted (memory reduced to 2GB via tofu) |

## Hosts Affected

All 2GB VMs that run nixos-upgrade:
- ns1, ns2 (DNS)
- vault01
- testvm01, testvm02, testvm03
- kanidm01

## Metrics to Monitor

Check these in Grafana or via PromQL to verify the fix:

### Swap availability (should be ~2GB after upgrade)
```promql
node_memory_SwapTotal_bytes / 1024 / 1024
```

### Swap usage during upgrades
```promql
(node_memory_SwapTotal_bytes - node_memory_SwapFree_bytes) / 1024 / 1024
```
### Zswap compressed bytes (active compression)
```promql
node_memory_Zswap_bytes / 1024 / 1024
```

Note: `Zswap` in `/proc/meminfo` tracks the kernel's zswap feature, which is separate from zram; with zram swap this metric may stay at 0. Per-device zram compression stats live in `/sys/block/zram0/mm_stat` and are not exported by node-exporter by default.
### Upgrade failures (should be 0)
```promql
node_systemd_unit_state{name="nixos-upgrade.service", state="failed"}
```

### Memory available during upgrades
```promql
node_memory_MemAvailable_bytes / 1024 / 1024
```

## Verification Steps

After a few days (allow auto-upgrades to run on all hosts):

1. Check all hosts have swap enabled:
```promql
node_memory_SwapTotal_bytes > 0
```

2. Check for any upgrade failures since the fix:
```promql
count_over_time(ALERTS{alertname="nixos_upgrade_failed"}[7d])
```

3. Review if any hosts used swap during upgrades (check historical graphs)
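Outside Prometheus, the same numbers can be spot-checked directly on a host from `/proc/meminfo` (a minimal sketch; run it over ssh per host):

```shell
# Print swap totals in MiB straight from the kernel; on a zram host this
# should show roughly the ~2GB the plan expects after the fix.
awk '/^SwapTotal|^SwapFree/ {printf "%s %d MiB\n", $1, $2/1024}' /proc/meminfo
```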
## Success Criteria

- No `nixos_upgrade_failed` alerts due to OOM after 2026-02-08
- All hosts show ~2GB swap available
- Upgrades complete successfully on 2GB VMs

## Fallback Options

If zram is insufficient:

1. **Increase VM memory** - Update `terraform/vms.tf` to 4GB for affected hosts
2. **Enable memory ballooning** - Configure VMs with dynamic memory allocation (see below)
3. **Use remote builds** - Configure `nix.buildMachines` to offload evaluation
4. **Reduce flake size** - Split configurations to reduce evaluation memory

### Memory Ballooning

Proxmox supports memory ballooning, which allows VMs to dynamically grow/shrink memory allocation based on demand. The balloon driver inside the guest communicates with the hypervisor to release or reclaim memory pages.

Configuration in `terraform/vms.tf`:
```hcl
memory  = 4096 # maximum memory
balloon = 2048 # minimum memory (shrinks to this when idle)
```

Pros:
- VMs get memory on-demand without reboots
- Better host memory utilization
- Solves upgrade OOM without permanently allocating 4GB

Cons:
- Requires QEMU guest agent running in guest
- Guest can experience memory pressure if host is overcommitted

Ballooning and zram are complementary: ballooning provides headroom from the host, zram provides overflow within the guest.
219 docs/plans/monitoring-migration-victoriametrics.md Normal file
@@ -0,0 +1,219 @@
# Monitoring Stack Migration to VictoriaMetrics

## Overview

Migrate from Prometheus to VictoriaMetrics on a new host (monitoring02) to gain better compression
and longer retention. Run in parallel with monitoring01 until validated, then switch over using
a `monitoring` CNAME for a seamless transition.

## Current State

**monitoring01** (10.69.13.13):
- 4 CPU cores, 4GB RAM, 33GB disk
- Prometheus with 30-day retention (15s scrape interval)
- Alertmanager (routes to alerttonotify webhook)
- Grafana (dashboards, datasources)
- Loki (log aggregation from all hosts via Promtail)
- Tempo (distributed tracing)
- Pyroscope (continuous profiling)

**Hardcoded References to monitoring01:**
- `system/monitoring/logs.nix` - Promtail sends logs to `http://monitoring01.home.2rjus.net:3100`
- `hosts/template2/bootstrap.nix` - Bootstrap logs to Loki (keep as-is until decommission)
- `services/http-proxy/proxy.nix` - Caddy proxies Prometheus, Alertmanager, Grafana, Pyroscope, Pushgateway

**Auto-generated:**
- Prometheus scrape targets (from `lib/monitoring.nix` + `homelab.monitoring.scrapeTargets`)
- Node-exporter targets (from all hosts with static IPs)

## Decision: VictoriaMetrics

Per `docs/plans/long-term-metrics-storage.md`, VictoriaMetrics is the recommended starting point:
- Single binary replacement for Prometheus
- 5-10x better compression (30 days could become 180+ days in same space)
- Same PromQL query language (Grafana dashboards work unchanged)
- Same scrape config format (existing auto-generated configs work)

If multi-year retention with downsampling becomes necessary later, Thanos can be evaluated.

## Architecture
```
                ┌─────────────────┐
                │  monitoring02   │
                │ VictoriaMetrics │
                │ + Grafana       │
monitoring      │ + Loki          │
CNAME ──────────│ + Tempo         │
                │ + Pyroscope     │
                │ + Alertmanager  │
                │   (vmalert)     │
                └─────────────────┘
                         ▲
                         │ scrapes
         ┌───────────────┼───────────────┐
         │               │               │
    ┌────┴────┐     ┌────┴─────┐    ┌────┴─────┐
    │   ns1   │     │   ha1    │    │   ...    │
    │  :9100  │     │  :9100   │    │  :9100   │
    └─────────┘     └──────────┘    └──────────┘
```
## Implementation Plan

### Phase 1: Create monitoring02 Host

Use the `create-host` script, which handles flake.nix and terraform/vms.tf automatically.

1. **Run create-host**: `nix develop -c create-host monitoring02 10.69.13.24`
2. **Update VM resources** in `terraform/vms.tf`:
   - 4 cores (same as monitoring01)
   - 8GB RAM (double, for VictoriaMetrics headroom)
   - 100GB disk (for 3+ months retention with compression)
3. **Update host configuration**: Import monitoring services
4. **Create Vault AppRole**: Add to `terraform/vault/approle.tf`

### Phase 2: Set Up VictoriaMetrics Stack

Create a new service module at `services/monitoring/victoriametrics/` for testing alongside the existing
Prometheus config. Once validated, this can replace the Prometheus module.

1. **VictoriaMetrics** (port 8428):
   - `services.victoriametrics.enable = true`
   - `services.victoriametrics.retentionPeriod = "3m"` (3 months, increase later based on disk usage)
   - Migrate scrape configs via `prometheusConfig`
   - Use native push support (replaces Pushgateway)

2. **vmalert** for alerting rules:
   - `services.vmalert.enable = true`
   - Point to VictoriaMetrics for metrics evaluation
   - Keep rules in a separate `rules.yml` file (same format as Prometheus)
   - No receiver configured during parallel operation (prevents duplicate alerts)

3. **Alertmanager** (port 9093):
   - Keep existing configuration (alerttonotify webhook routing)
   - Only enable receiver after cutover from monitoring01

4. **Loki** (port 3100):
   - Same configuration as current

5. **Grafana** (port 3000):
   - Define dashboards declaratively via NixOS options (not imported from monitoring01)
   - Reference existing dashboards on monitoring01 for content inspiration
   - Configure VictoriaMetrics datasource (port 8428)
   - Configure Loki datasource

6. **Tempo** (ports 3200, 3201):
   - Same configuration

7. **Pyroscope** (port 4040):
   - Same Docker-based deployment

### Phase 3: Parallel Operation

Run both monitoring01 and monitoring02 simultaneously:

1. **Dual scraping**: Both hosts scrape the same targets
   - Validates VictoriaMetrics is collecting data correctly

2. **Dual log shipping**: Configure Promtail to send logs to both Loki instances
   - Add a second client in `system/monitoring/logs.nix` pointing to monitoring02

3. **Validate dashboards**: Access Grafana on monitoring02, verify dashboards work

4. **Validate alerts**: Verify vmalert evaluates rules correctly (no receiver = no notifications)

5. **Compare resource usage**: Monitor disk/memory consumption between hosts
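The dual log shipping in step 2 can be expressed as a second entry in the Promtail client list (a sketch against the NixOS `services.promtail` module; the monitoring02 URL is an assumption following the plan's naming scheme):

```nix
services.promtail.configuration.clients = [
  # Existing Loki on monitoring01
  { url = "http://monitoring01.home.2rjus.net:3100/loki/api/v1/push"; }
  # Second client during the parallel phase; removed again in Phase 5
  { url = "http://monitoring02.home.2rjus.net:3100/loki/api/v1/push"; }
];
```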
### Phase 4: Add monitoring CNAME

Add a CNAME to monitoring02 once validated:

```nix
# hosts/monitoring02/configuration.nix
homelab.dns.cnames = [ "monitoring" ];
```

This creates `monitoring.home.2rjus.net` pointing to monitoring02.

### Phase 5: Update References

Update hardcoded references to use the CNAME:

1. **system/monitoring/logs.nix**:
   - Remove dual-shipping, point only to `http://monitoring.home.2rjus.net:3100`

2. **services/http-proxy/proxy.nix**: Update reverse proxy backends:
   - prometheus.home.2rjus.net -> monitoring.home.2rjus.net:8428
   - alertmanager.home.2rjus.net -> monitoring.home.2rjus.net:9093
   - grafana.home.2rjus.net -> monitoring.home.2rjus.net:3000
   - pyroscope.home.2rjus.net -> monitoring.home.2rjus.net:4040

Note: `hosts/template2/bootstrap.nix` stays pointed at monitoring01 until decommission.

### Phase 6: Enable Alerting

Once ready to cut over:
1. Enable the Alertmanager receiver on monitoring02
2. Verify test alerts route correctly

### Phase 7: Cutover and Decommission

1. **Stop monitoring01**: Prevent duplicate alerts during transition
2. **Update bootstrap.nix**: Point to `monitoring.home.2rjus.net`
3. **Verify all targets scraped**: Check VictoriaMetrics UI
4. **Verify logs flowing**: Check Loki on monitoring02
5. **Decommission monitoring01**:
   - Remove from flake.nix
   - Remove host configuration
   - Destroy VM in Proxmox
   - Remove from terraform state

## Open Questions

- [ ] What disk size for monitoring02? 100GB should allow 3+ months with VictoriaMetrics compression
- [ ] Which dashboards to recreate declaratively? (Review monitoring01 Grafana for the current set)

## VictoriaMetrics Service Configuration

Example NixOS configuration for monitoring02:

```nix
# VictoriaMetrics replaces Prometheus
services.victoriametrics = {
  enable = true;
  retentionPeriod = "3m"; # 3 months, increase based on disk usage
  prometheusConfig = {
    global.scrape_interval = "15s";
    scrape_configs = [
      # Auto-generated node-exporter targets
      # Service-specific scrape targets
      # External targets
    ];
  };
};

# vmalert for alerting rules (no receiver during parallel operation)
services.vmalert = {
  enable = true;
  datasource.url = "http://localhost:8428";
  # notifier.alertmanager.url = "http://localhost:9093"; # Enable after cutover
  rule = [ ./rules.yml ];
};
```

## Rollback Plan

If issues arise after cutover:
1. Move the `monitoring` CNAME back to monitoring01
2. Restart monitoring01 services
3. Revert Promtail config to point only to monitoring01
4. Revert http-proxy backends

## Notes

- VictoriaMetrics uses port 8428 vs Prometheus 9090
- PromQL compatibility is excellent
- VictoriaMetrics native push replaces Pushgateway (remove from http-proxy if not needed)
- monitoring02 deployed via OpenTofu using the `create-host` script
- Grafana dashboards defined declaratively via NixOS, not imported from monitoring01 state
212 docs/plans/nix-cache-reprovision.md Normal file
@@ -0,0 +1,212 @@
# Nix Cache Host Reprovision

## Overview

Reprovision `nix-cache01` using the OpenTofu workflow, and improve the build/cache system with:
1. NATS-based remote build triggering (replacing the current bash script)
2. Safer flake update workflow that validates builds before pushing to master

## Current State

### Host Configuration
- `nix-cache01` at 10.69.13.15 serves the binary cache via Harmonia
- Runs Gitea Actions runner for CI workflows
- Has `homelab.deploy.enable = true` (already supports NATS-based deployment)
- Uses a dedicated XFS volume at `/nix` for cache storage

### Current Build System (`services/nix-cache/build-flakes.sh`)
- Runs every 30 minutes via systemd timer
- Clones/pulls two repos: `nixos-servers` and `nixos` (gunter)
- Builds all hosts with `nixos-rebuild build` (no blacklist despite docs mentioning it)
- Pushes success/failure metrics to pushgateway
- Simple, but has no filtering, no parallelism, and no remote triggering

### Current Flake Update Workflow (`.github/workflows/flake-update.yaml`)
- Runs daily at midnight via cron
- Runs `nix flake update --commit-lock-file`
- Pushes directly to master
- No build validation — can push broken inputs

## Improvement 1: NATS-Based Remote Build Triggering

### Design

Extend the existing `homelab-deploy` tool to support a "build" command that triggers builds on the cache host. This reuses the NATS infrastructure already in place.

| Approach | Pros | Cons |
|----------|------|------|
| Extend homelab-deploy | Reuses existing NATS auth, NKey handling, CLI | Adds scope to existing tool |
| New nix-cache-tool | Clean separation | Duplicate NATS boilerplate, new credentials |
| Gitea Actions webhook | No custom tooling | Less flexible, tied to Gitea |

**Recommendation:** Extend `homelab-deploy` with a build subcommand. The tool already has NATS client code, authentication handling, and a listener module in NixOS.

### Implementation

1. Add a new message type to homelab-deploy: `build.<host>` subject
2. Listener on nix-cache01 subscribes to the `build.>` wildcard
3. On message receipt, builds the specified host and returns success/failure
4. CLI command: `homelab-deploy build <hostname>` or `homelab-deploy build --all`

### Benefits
- Trigger a rebuild for a specific host to ensure it's cached
- Could be called from CI after merging PRs
- Reuses existing NATS infrastructure and auth
- Progress/status could stream back via NATS reply

## Improvement 2: Smarter Flake Update Workflow

### Current Problems
1. Updates can push breaking changes to master
2. No visibility into what broke when it does
3. Hosts that auto-update can pull broken configs

### Proposed Workflow
```
┌─────────────────────────────────────────────────────────────────┐
│                      Flake Update Workflow                      │
├─────────────────────────────────────────────────────────────────┤
│  1. nix flake update (on feature branch)                        │
│  2. Build ALL hosts locally                                     │
│  3. If all pass → fast-forward merge to master                  │
│  4. If any fail → create PR with failure logs attached          │
└─────────────────────────────────────────────────────────────────┘
```
### Implementation Options

| Option | Description | Pros | Cons |
|--------|-------------|------|------|
| **A: Self-hosted runner** | Build on nix-cache01 | Fast (local cache), simple | Ties up cache host during build |
| **B: Gitea Actions only** | Use container runner | Clean separation | Slow (no cache), resource limits |
| **C: Hybrid** | Trigger builds on nix-cache01 via NATS from Actions | Best of both | More complex |

**Recommendation:** Option A with nix-cache01 as the runner. The host is already running the Gitea Actions runner and has the cache. Building all ~16 hosts is disk I/O heavy but feasible on dedicated hardware.

### Workflow Steps

1. Workflow runs on schedule (daily or weekly)
2. Creates branch `flake-update/YYYY-MM-DD`
3. Runs `nix flake update --commit-lock-file`
4. Builds each host: `nix build .#nixosConfigurations.<host>.config.system.build.toplevel`
5. If all succeed:
   - Fast-forward merge to master
   - Delete feature branch
6. If any fail:
   - Create PR from the update branch
   - Attach build logs as a PR comment
   - Label PR with `needs-review` or `build-failure`
   - Do NOT merge automatically

### Workflow File Changes
```yaml
# New: .github/workflows/flake-update-safe.yaml
name: Safe flake update
on:
  schedule:
    - cron: "0 2 * * 0" # Weekly on Sunday at 2 AM
  workflow_dispatch: # Manual trigger

jobs:
  update-and-validate:
    runs-on: homelab # Use self-hosted runner on nix-cache01
    steps:
      - uses: actions/checkout@v4
        with:
          ref: master
          fetch-depth: 0 # Need full history for merge

      - name: Create update branch
        run: |
          BRANCH="flake-update/$(date +%Y-%m-%d)"
          git checkout -b "$BRANCH"
          # Persist for later steps; shell variables don't survive across steps
          echo "BRANCH=$BRANCH" >> "$GITHUB_ENV"

      - name: Update flake
        run: nix flake update --commit-lock-file

      - name: Build all hosts
        id: build
        run: |
          set -o pipefail # otherwise `tee` masks nix build failures
          FAILED=""
          for host in $(nix flake show --json | jq -r '.nixosConfigurations | keys[]'); do
            echo "Building $host..."
            if ! nix build ".#nixosConfigurations.$host.config.system.build.toplevel" 2>&1 | tee "build-$host.log"; then
              FAILED="$FAILED $host"
            fi
          done
          echo "failed=$FAILED" >> "$GITHUB_OUTPUT"

      - name: Merge to master (if all pass)
        if: steps.build.outputs.failed == ''
        run: |
          git checkout master
          git merge --ff-only "$BRANCH"
          git push origin master
          git push origin --delete "$BRANCH"

      - name: Create PR (if any fail)
        if: steps.build.outputs.failed != ''
        run: |
          git push origin "$BRANCH"
          # Create PR via Gitea API with build logs
          # ... (PR creation with log attachment)
```
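One subtlety in the build step: `nix build ... | tee` only reports the build's failure if `pipefail` is set; otherwise the pipeline returns `tee`'s exit status and every build looks successful. A self-contained demonstration (`fake_build` is a stand-in for a failing `nix build`):

```shell
#!/usr/bin/env bash
# Without pipefail, the pipeline's status is tee's (0), masking the failure.
fake_build() { return 1; }

set +o pipefail
fake_build 2>&1 | tee /dev/null
echo "without pipefail: exit=$?"

set -o pipefail
fake_build 2>&1 | tee /dev/null
echo "with pipefail: exit=$?"
```

Running it prints `without pipefail: exit=0` followed by `with pipefail: exit=1`, which is why the build step should start with `set -o pipefail`.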
## Migration Steps

### Phase 1: Reprovision Host via OpenTofu

1. Add `nix-cache01` to `terraform/vms.tf`:
```hcl
"nix-cache01" = {
  ip        = "10.69.13.15/24"
  cpu_cores = 4
  memory    = 8192
  disk_size = "100G" # Larger for nix store
}
```

2. Shut down the existing nix-cache01 VM
3. Run `tofu apply` to provision the new VM
4. Verify bootstrap completes and the cache is serving

**Note:** The cache will be cold after reprovision. Run initial builds to populate.

### Phase 2: Add Build Triggering to homelab-deploy

1. Add a `build` command to the homelab-deploy CLI
2. Add a listener handler in the NixOS module for `build.*` subjects
3. Update nix-cache01 config to enable the build listener
4. Test with `homelab-deploy build testvm01`

### Phase 3: Implement Safe Flake Update Workflow

1. Create `.github/workflows/flake-update-safe.yaml`
2. Disable or remove the old `flake-update.yaml`
3. Test manually with `workflow_dispatch`
4. Monitor the first automated run

### Phase 4: Remove Old Build Script

1. After the new workflow is stable, remove:
   - `services/nix-cache/build-flakes.nix`
   - `services/nix-cache/build-flakes.sh`
2. The new workflow handles scheduled builds

## Open Questions

- [ ] What runner labels should the self-hosted runner use for the update workflow?
- [ ] Should we build hosts in parallel (faster) or sequentially (easier to debug)?
- [ ] How long to keep flake-update PRs open before auto-closing stale ones?
- [ ] Should successful updates trigger a NATS notification to rebuild all hosts?
- [ ] What to do about `gunter` (external nixos repo) - include in validation?
- [ ] Disk size for new nix-cache01 - is 100G enough for cache + builds?

## Notes

- The existing `homelab.deploy.enable = true` on nix-cache01 means it already has NATS connectivity
- The Harmonia service and cache signing key will work the same after reprovision
- Actions runner token is in Vault, will be provisioned automatically
- Consider adding a `homelab.host.role = "build-host"` label for monitoring/filtering
113 docs/plans/pgdb1-decommission.md Normal file
@@ -0,0 +1,113 @@
# pgdb1 Decommissioning Plan

## Overview

Decommission the pgdb1 PostgreSQL server. The only consumer was Open WebUI on gunter, which has been migrated to use a local PostgreSQL instance.

## Pre-flight Verification

Before proceeding, verify that gunter is no longer using pgdb1:

1. Check Open WebUI on gunter is configured for local PostgreSQL (not 10.69.13.16)
2. Optionally: Check pgdb1 for recent connection activity:
```bash
ssh pgdb1 'sudo -u postgres psql -c "SELECT * FROM pg_stat_activity WHERE datname IS NOT NULL;"'
```
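The check above always shows at least one row: the `psql` session running the query itself. A slightly narrower variant (a sketch; `pg_backend_pid()` excludes the checking session):

```sql
SELECT datname, usename, client_addr, state
FROM pg_stat_activity
WHERE datname IS NOT NULL
  AND pid <> pg_backend_pid();
```

Zero rows here means nothing else is connected.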
## Files to Remove

### Host Configuration
- `hosts/pgdb1/default.nix`
- `hosts/pgdb1/configuration.nix`
- `hosts/pgdb1/hardware-configuration.nix`
- `hosts/pgdb1/` (directory)

### Service Module
- `services/postgres/postgres.nix`
- `services/postgres/default.nix`
- `services/postgres/` (directory)

Note: This service module is only used by pgdb1, so it can be removed entirely.

### Flake Entry
Remove from `flake.nix` (lines 131-138):
```nix
pgdb1 = nixpkgs.lib.nixosSystem {
  inherit system;
  specialArgs = {
    inherit inputs self;
  };
  modules = commonModules ++ [
    ./hosts/pgdb1
  ];
};
```

### Vault AppRole
Remove from `terraform/vault/approle.tf` (lines 69-73):
```hcl
"pgdb1" = {
  paths = [
    "secret/data/hosts/pgdb1/*",
  ]
}
```

### Monitoring Rules
Remove the `postgres_down` alert from `services/monitoring/rules.yml` (lines 359-365):
```yaml
- name: postgres_rules
  rules:
    - alert: postgres_down
      expr: node_systemd_unit_state{instance="pgdb1.home.2rjus.net:9100", name="postgresql.service", state="active"} == 0
      for: 5m
      labels:
        severity: critical
```
### Utility Scripts
Delete `rebuild-all.sh` entirely (obsolete script).

## Execution Steps

### Phase 1: Verification
- [ ] Confirm Open WebUI on gunter uses local PostgreSQL
- [ ] Verify no active connections to pgdb1

### Phase 2: Code Cleanup
- [ ] Create feature branch: `git checkout -b decommission-pgdb1`
- [ ] Remove `hosts/pgdb1/` directory
- [ ] Remove `services/postgres/` directory
- [ ] Remove pgdb1 entry from `flake.nix`
- [ ] Remove postgres alert from `services/monitoring/rules.yml`
- [ ] Delete `rebuild-all.sh` (obsolete)
- [ ] Run `nix flake check` to verify no broken references
- [ ] Commit changes

### Phase 3: Terraform Cleanup
- [ ] Remove pgdb1 from `terraform/vault/approle.tf`
- [ ] Run `tofu plan` in `terraform/vault/` to preview changes
- [ ] Run `tofu apply` to remove the AppRole
- [ ] Commit terraform changes

### Phase 4: Infrastructure Cleanup
- [ ] Shut down pgdb1 VM in Proxmox
- [ ] Delete the VM from Proxmox
- [ ] (Optional) Remove any DNS entries if not auto-generated

### Phase 5: Finalize
- [ ] Merge feature branch to master
- [ ] Trigger auto-upgrade on DNS servers (ns1, ns2) to remove the DNS entry
- [ ] Move this plan to `docs/plans/completed/`

## Rollback

If issues arise after decommissioning:
1. The VM can be recreated from template using the git history
2. Database data would need to be restored from backup (if any exists)

## Notes

- pgdb1 IP: 10.69.13.16
- The postgres service allowed connections from gunter (10.69.30.105)
- No restic backup was configured for this host
224 docs/plans/security-hardening.md Normal file
@@ -0,0 +1,224 @@
# Security Hardening Plan

## Overview

Address security gaps identified in an infrastructure review. Focus areas: SSH hardening, network security, logging improvements, and secrets management.

## Current State

- SSH allows password auth and unrestricted root login (`system/sshd.nix`)
- Firewall disabled on all hosts (`networking.firewall.enable = false`)
- Promtail ships logs over HTTP to Loki
- Loki has no authentication (`auth_enabled = false`)
- AppRole secret-IDs never expire (`secret_id_ttl = 0`)
- Vault TLS verification disabled by default (`skipTlsVerify = true`)
- Audit logging exists (`common/ssh-audit.nix`) but is not applied globally
- Alert rules focus on availability, no security event detection

## Priority Matrix

| Issue | Severity | Effort | Priority |
|-------|----------|--------|----------|
| SSH password auth | High | Low | **P1** |
| Firewall disabled | High | Medium | **P1** |
| Promtail HTTP (no TLS) | High | Medium | **P2** |
| No security alerting | Medium | Low | **P2** |
| Audit logging not global | Low | Low | **P2** |
| Loki no auth | Medium | Medium | **P3** |
| Secret-ID TTL | Medium | Medium | **P3** |
| Vault skipTlsVerify | Medium | Low | **P3** |

## Phase 1: Quick Wins (P1)

### 1.1 SSH Hardening

Edit `system/sshd.nix`:

```nix
services.openssh = {
  enable = true;
  settings = {
    PermitRootLogin = "prohibit-password"; # Key-only root login
    PasswordAuthentication = false;
    KbdInteractiveAuthentication = false;
  };
};
```
**Prerequisite:** Verify all hosts have SSH keys deployed for root.

### 1.2 Enable Firewall

Create `system/firewall.nix` with a default-deny policy:

```nix
{ ... }: {
  networking.firewall.enable = true;

  # Use openssh's built-in firewall integration
  services.openssh.openFirewall = true;
}
```

**Useful firewall options:**

| Option | Description |
|--------|-------------|
| `networking.firewall.trustedInterfaces` | Accept all traffic from these interfaces (e.g., `[ "lo" ]`) |
| `networking.firewall.interfaces.<name>.allowedTCPPorts` | Per-interface port rules |
| `networking.firewall.extraInputRules` | Custom nftables rules (for complex filtering) |

**Network range restrictions:** Consider restricting SSH to the infrastructure subnet (`10.69.13.0/24`) using `extraInputRules` for defense in depth. However, this adds complexity and may not be necessary given the trusted network model.

#### Per-Interface Rules (http-proxy WireGuard)

The `http-proxy` host has a WireGuard interface (`wg0`) that may need different rules than the LAN interface. Use `networking.firewall.interfaces` to apply per-interface policies:

```nix
# Example: http-proxy with different rules per interface
networking.firewall = {
  enable = true;

  # Default: only SSH (via openFirewall)
  allowedTCPPorts = [ ];

  # LAN interface: allow HTTP/HTTPS
  interfaces.ens18 = {
    allowedTCPPorts = [ 80 443 ];
  };

  # WireGuard interface: restrict to specific services or trust fully
  interfaces.wg0 = {
    allowedTCPPorts = [ 80 443 ];
    # Or use trustedInterfaces = [ "wg0" ] if fully trusted
  };
};
```

**TODO:** Investigate current WireGuard usage on http-proxy to determine appropriate rules.

Then per-host, open the required ports:

| Host | Additional Ports |
|------|------------------|
| ns1/ns2 | 53 (TCP/UDP) |
| vault01 | 8200 |
| monitoring01 | 3100, 9090, 3000, 9093 |
| http-proxy | 80, 443 |
| nats1 | 4222 |
| ha1 | 1883, 8123 |
| jelly01 | 8096 |
| nix-cache01 | 5000 |
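Each row of the table translates directly into firewall options in that host's configuration; for example for ns1/ns2 (a sketch):

```nix
# hosts/ns1/configuration.nix (and ns2): DNS needs both TCP and UDP 53
networking.firewall = {
  allowedTCPPorts = [ 53 ];
  allowedUDPPorts = [ 53 ];
};
```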
|
||||
|
||||
## Phase 2: Logging & Detection (P2)
|
||||
|
||||
### 2.1 Enable TLS for Promtail → Loki
|
||||
|
||||
Update `system/monitoring/logs.nix`:
|
||||
|
||||
```nix
|
||||
clients = [{
|
||||
url = "https://monitoring01.home.2rjus.net:3100/loki/api/v1/push";
|
||||
tls_config = {
|
||||
ca_file = "/etc/ssl/certs/homelab-root-ca.pem";
|
||||
};
|
||||
}];
|
||||
```
|
||||
|
||||
Requires:
|
||||
- Configure Loki with TLS certificate (use internal ACME)
|
||||
- Ensure all hosts trust root CA (already done via `system/pki/root-ca.nix`)

### 2.2 Security Alert Rules

Add to `services/monitoring/rules.yml`:

```yaml
- name: security_rules
  rules:
    - alert: ssh_auth_failures
      expr: increase(node_logind_sessions_total[5m]) > 20
      for: 0m
      labels:
        severity: warning
      annotations:
        summary: "Unusual login activity on {{ $labels.instance }}"

    - alert: vault_secret_fetch_failure
      expr: increase(vault_secret_failures[5m]) > 5
      for: 0m
      labels:
        severity: warning
      annotations:
        summary: "Vault secret fetch failures on {{ $labels.instance }}"
```

Also add Loki-based alerts for:
- Failed SSH attempts: `{job="systemd-journal"} |= "Failed password"`
- sudo usage: `{job="systemd-journal"} |= "sudo"`
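
The two LogQL queries above can be promoted to Loki ruler alerts rather than ad-hoc queries; a sketch, assuming the Loki ruler is enabled on monitoring01 and that journal logs carry a `host` label (the group name and thresholds are placeholders to tune):

```yaml
groups:
  - name: security_logs
    rules:
      - alert: ssh_failed_passwords
        # More than 10 failed passwords on one host within 5 minutes
        expr: sum by (host) (count_over_time({job="systemd-journal"} |= "Failed password" [5m])) > 10
        labels:
          severity: warning
        annotations:
          summary: "SSH password failures on {{ $labels.host }}"
      - alert: sudo_usage
        # Any sudo invocation - noisy, may warrant a higher threshold
        expr: sum by (host) (count_over_time({job="systemd-journal"} |= "sudo" [5m])) > 0
        labels:
          severity: info
        annotations:
          summary: "sudo invoked on {{ $labels.host }}"
```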

### 2.3 Global Audit Logging

Add the `./common/ssh-audit.nix` import to `system/default.nix`:

```nix
imports = [
  # ... existing imports
  ../common/ssh-audit.nix
];
```

## Phase 3: Defense in Depth (P3)

### 3.1 Loki Authentication

Options:
1. **Basic auth via reverse proxy** - Put Loki behind Caddy with auth
2. **Loki multi-tenancy** - Enable `auth_enabled = true` and use tenant IDs
3. **Network isolation** - Bind Loki only to localhost, expose via authenticated proxy

Recommendation: Option 1 (reverse proxy) is simplest for a homelab.
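
On monitoring01, option 1 could look roughly like this; a sketch, assuming Caddy fronts Loki on a hypothetical `loki.home.2rjus.net` vhost and Loki is rebound to localhost (the user name and bcrypt hash are placeholders - generate a real hash with `caddy hash-password`):

```nix
# Sketch only: authenticated reverse proxy in front of Loki.
services.caddy = {
  enable = true;
  virtualHosts."loki.home.2rjus.net".extraConfig = ''
    basic_auth {
      # placeholder bcrypt hash - replace with `caddy hash-password` output
      promtail $2a$14$REPLACE_ME
    }
    reverse_proxy 127.0.0.1:3100
  '';
};
```

Promtail clients would then need matching `basic_auth` credentials in their Loki client config.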

### 3.2 AppRole Secret Rotation

Update `terraform/vault/approle.tf`:

```hcl
secret_id_ttl = 2592000 # 30 days
```

Document the manual rotation procedure, or implement automated rotation via the existing `restartTrigger` mechanism in `vault-secrets.nix`.

### 3.3 Enable Vault TLS Verification

Change the default in `system/vault-secrets.nix`:

```nix
skipTlsVerify = mkOption {
  type = types.bool;
  default = false; # Changed from true
};
```

**Prerequisite:** Verify all hosts trust the internal CA that signed the Vault certificate.

## Implementation Order

1. **Test on test-tier first** - Deploy phases 1-2 to testvm01/02/03
2. **Validate SSH access** - Ensure key-based login works before disabling passwords
3. **Document firewall ports** - Create a per-host port reference before enabling the firewall
4. **Phase prod rollout** - Deploy to prod hosts one at a time, verifying each

## Open Questions

- [ ] Do all hosts have SSH keys configured for root access?
- [ ] Should firewall rules be per-host, or use a central definition with roles?
- [ ] Should Loki authentication use the existing Kanidm setup?

**Resolved:** Password-based SSH access for recovery is not required - most hosts have console access through Proxmox or physical access, which provides an out-of-band recovery path if SSH keys fail.

## Notes

- Firewall changes are the highest risk - test thoroughly on the test tier
- SSH hardening must not lock out access - verify keys first
- Consider creating a "break glass" procedure for emergency access if keys fail
docs/user-management.md (new file, 267 lines)
@@ -0,0 +1,267 @@
# User Management with Kanidm

Central authentication for the homelab using Kanidm.

## Overview

- **Server**: kanidm01.home.2rjus.net (auth.home.2rjus.net)
- **WebUI**: https://auth.home.2rjus.net
- **LDAPS**: port 636

## CLI Setup

The `kanidm` CLI is available in the devshell:

```bash
nix develop

# Login as idm_admin
kanidm login --name idm_admin --url https://auth.home.2rjus.net
```

## User Management

POSIX users are managed imperatively via the `kanidm` CLI. This allows setting
all attributes (including the UNIX password) in one workflow.

### Creating a POSIX User

```bash
# Create the person
kanidm person create <username> "<Display Name>"

# Add to groups
kanidm group add-members ssh-users <username>

# Enable POSIX (UID is auto-assigned)
kanidm person posix set <username>

# Set UNIX password (required for SSH login, min 10 characters)
kanidm person posix set-password <username>

# Optionally set login shell
kanidm person posix set <username> --shell /bin/zsh
```

### Example: Full User Creation

```bash
kanidm person create testuser "Test User"
kanidm group add-members ssh-users testuser
kanidm person posix set testuser
kanidm person posix set-password testuser
kanidm person get testuser
```

After creation, verify on a client host:

```bash
getent passwd testuser
ssh testuser@testvm01.home.2rjus.net
```

### Viewing User Details

```bash
kanidm person get <username>
```

### Removing a User

```bash
kanidm person delete <username>
```

## Group Management

Groups for POSIX access are also managed via the CLI.

### Creating a POSIX Group

```bash
# Create the group
kanidm group create <group-name>

# Enable POSIX with a specific GID
kanidm group posix set <group-name> --gidnumber <gid>
```

### Adding Members

```bash
kanidm group add-members <group-name> <username>
```

### Viewing Group Details

```bash
kanidm group get <group-name>
kanidm group list-members <group-name>
```

### Example: Full Group Creation

```bash
kanidm group create testgroup
kanidm group posix set testgroup --gidnumber 68010
kanidm group add-members testgroup testuser
kanidm group get testgroup
```

After creation, verify on a client host:

```bash
getent group testgroup
```

### Current Groups

| Group | GID | Purpose |
|-------|-----|---------|
| ssh-users | 68000 | SSH login access |
| admins | 68001 | Administrative access |
| users | 68002 | General users |

### UID/GID Allocation

Kanidm auto-assigns UIDs/GIDs from its configured range. For manually assigned GIDs:

| Range | Purpose |
|-------|---------|
| 65,536+ | Users (auto-assigned) |
| 68,000 - 68,999 | Groups (manually assigned) |
## PAM/NSS Client Configuration

Enable central authentication on a host:

```nix
homelab.kanidm.enable = true;
```

This configures:
- `services.kanidm.enablePam = true`
- Client connection to auth.home.2rjus.net
- Login authorization for the `ssh-users` group
- Short usernames (`torjus` instead of `torjus@home.2rjus.net`)
- Home directory symlinks (`/home/torjus` → UUID-based directory)

### Enabled Hosts

- testvm01, testvm02, testvm03 (test tier)

### Options

```nix
homelab.kanidm = {
  enable = true;
  server = "https://auth.home.2rjus.net"; # default
  allowedLoginGroups = [ "ssh-users" ]; # default
};
```

### Home Directories

Home directories use UUID-based paths for stability (so renaming a user doesn't
require moving their home directory). Symlinks provide convenient access:

```
/home/torjus -> /home/e4f4c56c-4aee-4c20-846f-90cb69807733
```

The symlinks are created by `kanidm-unixd-tasks` on first login.

## Testing

### Verify NSS Resolution

```bash
# Check user resolution
getent passwd <username>

# Check group resolution
getent group <group-name>
```

### Test SSH Login

```bash
ssh <username>@<hostname>.home.2rjus.net
```

## Troubleshooting

### "PAM user mismatch" error

SSH fails with "fatal: PAM user mismatch" in the logs. This happens when Kanidm returns
usernames in SPN format (`torjus@home.2rjus.net`) but SSH expects short names (`torjus`).

**Solution**: Configure `uid_attr_map = "name"` in unixSettings (already set in our module).

Check the current format:

```bash
getent passwd torjus
# Should show: torjus:x:65536:...
# NOT: torjus@home.2rjus.net:x:65536:...
```

### User resolves but SSH fails immediately

The user's login group (e.g., `ssh-users`) likely doesn't have POSIX enabled:

```bash
# Check if group has POSIX
getent group ssh-users

# If empty, enable POSIX on the server
kanidm group posix set ssh-users --gidnumber 68000
```

### User doesn't resolve via getent

1. Check the kanidm-unixd service is running:
```bash
systemctl status kanidm-unixd
```

2. Check unixd can reach the server:
```bash
kanidm-unix status
# Should show: system: online, Kanidm: online
```

3. Check the client can reach the server:
```bash
curl -s https://auth.home.2rjus.net/status
```

4. Check the user has POSIX enabled on the server:
```bash
kanidm person get <username>
```

5. Restart nscd to clear a stale cache:
```bash
systemctl restart nscd
```

6. Invalidate the kanidm cache:
```bash
kanidm-unix cache-invalidate
```

### Changes not taking effect after deployment

NixOS uses nsncd (a Rust reimplementation of nscd) for NSS caching. After deploying
kanidm-unixd config changes, you may need to restart both services:

```bash
systemctl restart kanidm-unixd
systemctl restart nscd
```

### Test PAM authentication directly

Use the kanidm-unix CLI to test PAM auth without SSH:

```bash
kanidm-unix auth-test --name <username>
```
flake.nix (37 lines changed)
@@ -65,15 +65,6 @@
    in
    {
      nixosConfigurations = {
        ns1 = nixpkgs.lib.nixosSystem {
          inherit system;
          specialArgs = {
            inherit inputs self;
          };
          modules = commonModules ++ [
            ./hosts/ns1
          ];
        };
        ha1 = nixpkgs.lib.nixosSystem {
          inherit system;
          specialArgs = {
@@ -128,15 +119,6 @@
            ./hosts/nix-cache01
          ];
        };
        pgdb1 = nixpkgs.lib.nixosSystem {
          inherit system;
          specialArgs = {
            inherit inputs self;
          };
          modules = commonModules ++ [
            ./hosts/pgdb1
          ];
        };
        nats1 = nixpkgs.lib.nixosSystem {
          inherit system;
          specialArgs = {
@@ -191,6 +173,24 @@
            ./hosts/ns2
          ];
        };
        ns1 = nixpkgs.lib.nixosSystem {
          inherit system;
          specialArgs = {
            inherit inputs self;
          };
          modules = commonModules ++ [
            ./hosts/ns1
          ];
        };
        kanidm01 = nixpkgs.lib.nixosSystem {
          inherit system;
          specialArgs = {
            inherit inputs self;
          };
          modules = commonModules ++ [
            ./hosts/kanidm01
          ];
        };
      };
      packages = forAllSystems (
        { pkgs }:
@@ -207,6 +207,7 @@
          pkgs.ansible
          pkgs.opentofu
          pkgs.openbao
          pkgs.kanidm_1_8
          (pkgs.callPackage ./scripts/create-host { })
          homelab-deploy.packages.${pkgs.system}.default
        ];
@@ -64,9 +64,5 @@
  vault.enable = true;
  homelab.deploy.enable = true;

  zramSwap = {
    enable = true;
  };

  system.stateVersion = "23.11"; # Did you read the comment?
}

@@ -1,56 +0,0 @@
{ config, lib, pkgs, ... }:

{
  imports =
    [
      ./hardware-configuration.nix
      ../../system
    ];

  nixpkgs.config.allowUnfree = true;

  homelab.host.role = "bastion";

  # Use the systemd-boot EFI boot loader.
  boot.loader.grub.enable = true;
  boot.loader.grub.device = "/dev/sda";

  networking.hostName = "jump";
  networking.domain = "home.2rjus.net";
  networking.useNetworkd = true;
  networking.useDHCP = false;
  services.resolved.enable = false;
  networking.nameservers = [
    "10.69.13.5"
    "10.69.13.6"
  ];

  systemd.network.enable = true;
  systemd.network.networks."ens18" = {
    matchConfig.Name = "ens18";
    address = [
      "10.69.13.10/24"
    ];
    routes = [
      { Gateway = "10.69.13.1"; }
    ];
    linkConfig.RequiredForOnline = "routable";
  };
  time.timeZone = "Europe/Oslo";

  nix.settings.experimental-features = [ "nix-command" "flakes" ];
  environment.systemPackages = with pkgs; [
    vim
    wget
    git
  ];

  # Open ports in the firewall.
  # networking.firewall.allowedTCPPorts = [ ... ];
  # networking.firewall.allowedUDPPorts = [ ... ];
  # Or disable the firewall altogether.
  networking.firewall.enable = false;

  system.stateVersion = "23.11"; # Did you read the comment?
}
@@ -1,36 +0,0 @@
{ config, lib, pkgs, modulesPath, ... }:

{
  imports =
    [
      (modulesPath + "/profiles/qemu-guest.nix")
    ];

  boot.initrd.availableKernelModules = [ "ata_piix" "uhci_hcd" "virtio_pci" "virtio_scsi" "sd_mod" "sr_mod" ];
  boot.initrd.kernelModules = [ ];
  # boot.kernelModules = [ ];
  # boot.extraModulePackages = [ ];

  fileSystems."/" =
    {
      device = "/dev/disk/by-uuid/6889aba9-61ed-4687-ab10-e5cf4017ac8d";
      fsType = "xfs";
    };

  fileSystems."/boot" =
    {
      device = "/dev/disk/by-uuid/BC07-3B7A";
      fsType = "vfat";
    };

  swapDevices =
    [{ device = "/dev/disk/by-uuid/64e5757b-6625-4dd2-aa2a-66ca93444d23"; }];

  # Enables DHCP on each ethernet and wireless interface. In case of scripted networking
  # (the default) this is the recommended approach. When using systemd-networkd it's
  # still possible to use this option, but it's recommended to use it in conjunction
  # with explicit per-interface declarations with `networking.interfaces.<interface>.useDHCP`.
  # networking.interfaces.ens18.useDHCP = lib.mkDefault true;

  nixpkgs.hostPlatform = lib.mkDefault "x86_64-linux";
}
@@ -1,25 +1,39 @@
{
  config,
  lib,
  pkgs,
  ...
}:

{
  imports = [
    ./hardware-configuration.nix
    ../template2/hardware-configuration.nix

    ../../system
    ../../common/vm
    ../../services/kanidm
  ];

  nixpkgs.config.allowUnfree = true;
  # Use the systemd-boot EFI boot loader.
  boot.loader.grub = {
    enable = true;
    device = "/dev/sda";
    configurationLimit = 3;
  # Host metadata
  homelab.host = {
    tier = "test";
    role = "auth";
  };

  networking.hostName = "pgdb1";
  # DNS CNAME for auth.home.2rjus.net
  homelab.dns.cnames = [ "auth" ];

  # Enable Vault integration
  vault.enable = true;

  # Enable remote deployment via NATS
  homelab.deploy.enable = true;

  nixpkgs.config.allowUnfree = true;
  boot.loader.grub.enable = true;
  boot.loader.grub.device = "/dev/vda";

  networking.hostName = "kanidm01";
  networking.domain = "home.2rjus.net";
  networking.useNetworkd = true;
  networking.useDHCP = false;
@@ -33,7 +47,7 @@
  systemd.network.networks."ens18" = {
    matchConfig.Name = "ens18";
    address = [
      "10.69.13.16/24"
      "10.69.13.23/24"
    ];
    routes = [
      { Gateway = "10.69.13.1"; }
@@ -59,8 +73,5 @@
  # Or disable the firewall altogether.
  networking.firewall.enable = false;

  vault.enable = true;
  homelab.deploy.enable = true;

  system.stateVersion = "23.11"; # Did you read the comment?
}
  system.stateVersion = "25.11"; # Did you read the comment?
}
@@ -2,4 +2,4 @@
  imports = [
    ./configuration.nix
  ];
}
}
@@ -4,6 +4,5 @@
    ./configuration.nix
    ../../services/nix-cache
    ../../services/actions-runner
    ./zram.nix
  ];
}

@@ -1,6 +0,0 @@
{ ... }:
{
  zramSwap = {
    enable = true;
  };
}
@@ -7,23 +7,38 @@

{
  imports = [
    ./hardware-configuration.nix
    ../template2/hardware-configuration.nix

    ../../system
    ../../common/vm

    # DNS services
    ../../services/ns/master-authorative.nix
    ../../services/ns/resolver.nix
    ../../common/vm
  ];

  # Host metadata
  homelab.host = {
    tier = "prod";
    role = "dns";
    labels.dns_role = "primary";
  };

  # Enable Vault integration
  vault.enable = true;

  # Enable remote deployment via NATS
  homelab.deploy.enable = true;

  nixpkgs.config.allowUnfree = true;
  # Use the systemd-boot EFI boot loader.
  boot.loader.grub.enable = true;
  boot.loader.grub.device = "/dev/sda";
  boot.loader.grub.device = "/dev/vda";

  networking.hostName = "ns1";
  networking.domain = "home.2rjus.net";
  networking.useNetworkd = true;
  networking.useDHCP = false;
  # Disable resolved - conflicts with Unbound resolver
  services.resolved.enable = false;
  networking.nameservers = [
    "10.69.13.5"
@@ -47,14 +62,6 @@
    "nix-command"
    "flakes"
  ];
  vault.enable = true;
  homelab.deploy.enable = true;

  homelab.host = {
    role = "dns";
    labels.dns_role = "primary";
  };

  nix.settings.tarball-ttl = 0;
  environment.systemPackages = with pkgs; [
    vim
@@ -68,5 +75,5 @@
  # Or disable the firewall altogether.
  networking.firewall.enable = false;

  system.stateVersion = "23.11"; # Did you read the comment?
}
  system.stateVersion = "25.11"; # Did you read the comment?
}
@@ -2,4 +2,4 @@
  imports = [
    ./configuration.nix
  ];
}
}
@@ -1,36 +0,0 @@
{ config, lib, pkgs, modulesPath, ... }:

{
  imports =
    [
      (modulesPath + "/profiles/qemu-guest.nix")
    ];

  boot.initrd.availableKernelModules = [ "ata_piix" "uhci_hcd" "virtio_pci" "virtio_scsi" "sd_mod" "sr_mod" ];
  boot.initrd.kernelModules = [ ];
  # boot.kernelModules = [ ];
  # boot.extraModulePackages = [ ];

  fileSystems."/" =
    {
      device = "/dev/disk/by-uuid/6889aba9-61ed-4687-ab10-e5cf4017ac8d";
      fsType = "xfs";
    };

  fileSystems."/boot" =
    {
      device = "/dev/disk/by-uuid/BC07-3B7A";
      fsType = "vfat";
    };

  swapDevices =
    [{ device = "/dev/disk/by-uuid/64e5757b-6625-4dd2-aa2a-66ca93444d23"; }];

  # Enables DHCP on each ethernet and wireless interface. In case of scripted networking
  # (the default) this is the recommended approach. When using systemd-networkd it's
  # still possible to use this option, but it's recommended to use it in conjunction
  # with explicit per-interface declarations with `networking.interfaces.<interface>.useDHCP`.
  # networking.interfaces.ens18.useDHCP = lib.mkDefault true;

  nixpkgs.hostPlatform = lib.mkDefault "x86_64-linux";
}
@@ -1,7 +0,0 @@
{ ... }:
{
  imports = [
    ./configuration.nix
    ../../services/postgres
  ];
}
@@ -1,42 +0,0 @@
{
  config,
  lib,
  pkgs,
  modulesPath,
  ...
}:

{
  imports = [
    (modulesPath + "/profiles/qemu-guest.nix")
  ];
  boot.initrd.availableKernelModules = [
    "ata_piix"
    "uhci_hcd"
    "virtio_pci"
    "virtio_scsi"
    "sd_mod"
    "sr_mod"
  ];
  boot.initrd.kernelModules = [ "dm-snapshot" ];
  boot.kernelModules = [
    "ptp_kvm"
  ];
  boot.extraModulePackages = [ ];

  fileSystems."/" = {
    device = "/dev/disk/by-label/root";
    fsType = "xfs";
  };

  swapDevices = [ { device = "/dev/disk/by-label/swap"; } ];

  # Enables DHCP on each ethernet and wireless interface. In case of scripted networking
  # (the default) this is the recommended approach. When using systemd-networkd it's
  # still possible to use this option, but it's recommended to use it in conjunction
  # with explicit per-interface declarations with `networking.interfaces.<interface>.useDHCP`.
  networking.useDHCP = lib.mkDefault true;
  # networking.interfaces.ens18.useDHCP = lib.mkDefault true;

  nixpkgs.hostPlatform = lib.mkDefault "x86_64-linux";
}
@@ -58,6 +58,14 @@
    "flakes"
  ];
  nix.settings.tarball-ttl = 0;
  nix.settings.substituters = [
    "https://nix-cache.home.2rjus.net"
    "https://cache.nixos.org"
  ];
  nix.settings.trusted-public-keys = [
    "nix-cache.home.2rjus.net-1:2kowZOG6pvhoK4AHVO3alBlvcghH20wchzoR0V86UWI="
    "cache.nixos.org-1:6NCHdD59X431o0gWypbMrAURkbJ16ZPMQFGspcDShjY="
  ];
  environment.systemPackages = with pkgs; [
    age
    vim
@@ -71,5 +79,8 @@
  # Or disable the firewall altogether.
  networking.firewall.enable = false;

  # Compressed swap in RAM - prevents OOM during bootstrap nixos-rebuild
  zramSwap.enable = true;

  system.stateVersion = "25.11";
}
@@ -11,6 +11,7 @@

    ../../system
    ../../common/vm
    ../../common/ssh-audit.nix
  ];

  # Host metadata (adjust as needed)
@@ -24,6 +25,9 @@
  # Enable remote deployment via NATS
  homelab.deploy.enable = true;

  # Enable Kanidm PAM/NSS for central authentication
  homelab.kanidm.enable = true;

  nixpkgs.config.allowUnfree = true;
  boot.loader.grub.enable = true;
  boot.loader.grub.device = "/dev/vda";

@@ -11,6 +11,7 @@

    ../../system
    ../../common/vm
    ../../common/ssh-audit.nix
  ];

  # Host metadata (adjust as needed)
@@ -24,6 +25,9 @@
  # Enable remote deployment via NATS
  homelab.deploy.enable = true;

  # Enable Kanidm PAM/NSS for central authentication
  homelab.kanidm.enable = true;

  nixpkgs.config.allowUnfree = true;
  boot.loader.grub.enable = true;
  boot.loader.grub.device = "/dev/vda";

@@ -11,6 +11,7 @@

    ../../system
    ../../common/vm
    ../../common/ssh-audit.nix
  ];

  # Host metadata (adjust as needed)
@@ -24,6 +25,9 @@
  # Enable remote deployment via NATS
  homelab.deploy.enable = true;

  # Enable Kanidm PAM/NSS for central authentication
  homelab.kanidm.enable = true;

  nixpkgs.config.allowUnfree = true;
  boot.loader.grub.enable = true;
  boot.loader.grub.device = "/dev/vda";

@@ -1,19 +0,0 @@
#!/usr/bin/env bash
set -euo pipefail

# array of hosts
HOSTS=(
  "ns1"
  "ns2"
  "ha1"
  "http-proxy"
  "jelly01"
  "monitoring01"
  "nix-cache01"
  "pgdb1"
)

for host in "${HOSTS[@]}"; do
  echo "Rebuilding $host"
  nixos-rebuild boot --flake .#${host} --target-host root@${host}
done
services/kanidm/default.nix (new file, 65 lines)
@@ -0,0 +1,65 @@
{ config, lib, pkgs, ... }:
{
  services.kanidm = {
    package = pkgs.kanidmWithSecretProvisioning_1_8;
    enableServer = true;
    serverSettings = {
      domain = "home.2rjus.net";
      origin = "https://auth.home.2rjus.net";
      bindaddress = "0.0.0.0:443";
      ldapbindaddress = "0.0.0.0:636";
      tls_chain = "/var/lib/acme/auth.home.2rjus.net/fullchain.pem";
      tls_key = "/var/lib/acme/auth.home.2rjus.net/key.pem";
      online_backup = {
        path = "/var/lib/kanidm/backups";
        schedule = "00 22 * * *";
        versions = 7;
      };
    };

    # Provision base groups only - users are managed via CLI
    # See docs/user-management.md for details
    provision = {
      enable = true;
      idmAdminPasswordFile = config.vault.secrets.kanidm-idm-admin.outputDir;

      groups = {
        admins = { };
        users = { };
        ssh-users = { };
      };

      # Regular users (persons) are managed imperatively via kanidm CLI
    };
  };

  # Grant kanidm access to ACME certificates
  users.users.kanidm.extraGroups = [ "acme" ];

  # ACME certificate from internal CA
  # Include both the CNAME (auth) and A record (kanidm01) for Prometheus scraping
  security.acme.certs."auth.home.2rjus.net" = {
    listenHTTP = ":80";
    reloadServices = [ "kanidm" ];
    extraDomainNames = [ "${config.networking.hostName}.home.2rjus.net" ];
  };

  # Vault secret for idm_admin password (used for provisioning)
  vault.secrets.kanidm-idm-admin = {
    secretPath = "kanidm/idm-admin-password";
    extractKey = "password";
    services = [ "kanidm" ];
    owner = "kanidm";
    group = "kanidm";
  };

  # Note: Kanidm does not expose Prometheus metrics
  # If metrics support is added in the future, uncomment:
  # homelab.monitoring.scrapeTargets = [
  #   {
  #     job_name = "kanidm";
  #     port = 443;
  #     scheme = "https";
  #   }
  # ];
}
@@ -356,32 +356,6 @@ groups:
      annotations:
        summary: "Proxmox VM {{ $labels.id }} is stopped"
        description: "Proxmox VM {{ $labels.id }} ({{ $labels.name }}) has onboot=1 but is stopped."
  - name: postgres_rules
    rules:
      - alert: postgres_down
        expr: node_systemd_unit_state{instance="pgdb1.home.2rjus.net:9100", name="postgresql.service", state="active"} == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "PostgreSQL not running on {{ $labels.instance }}"
          description: "PostgreSQL has been down on {{ $labels.instance }} more than 5 minutes."
      - alert: postgres_exporter_down
        expr: up{job="postgres"} == 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "PostgreSQL exporter down on {{ $labels.instance }}"
          description: "Cannot scrape PostgreSQL metrics from {{ $labels.instance }}."
      - alert: postgres_high_connections
        expr: pg_stat_activity_count / pg_settings_max_connections > 0.8
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "PostgreSQL connection pool near exhaustion on {{ $labels.instance }}"
          description: "PostgreSQL is using over 80% of max_connections on {{ $labels.instance }}."
  - name: jellyfin_rules
    rules:
      - alert: jellyfin_down

@@ -45,7 +45,11 @@
  };
  stub-zone = {
    name = "home.2rjus.net";
    stub-addr = "127.0.0.1@8053";
    stub-addr = [
      "127.0.0.1@8053" # Local NSD
      "10.69.13.5@8053" # ns1
      "10.69.13.6@8053" # ns2
    ];
  };
  forward-zone = {
    name = ".";

@@ -1,6 +0,0 @@
{ ... }:
{
  imports = [
    ./postgres.nix
  ];
}
@@ -1,23 +0,0 @@
{ pkgs, ... }:
{
  homelab.monitoring.scrapeTargets = [{
    job_name = "postgres";
    port = 9187;
  }];

  services.prometheus.exporters.postgres = {
    enable = true;
    runAsLocalSuperUser = true; # Use peer auth as postgres user
  };

  services.postgresql = {
    enable = true;
    enableJIT = true;
    enableTCPIP = true;
    extensions = ps: with ps; [ pgvector ];
    authentication = ''
      # Allow access to everything from gunter
      host all all 10.69.30.105/32 scram-sha-256
    '';
  };
}
@@ -4,13 +4,16 @@
    ./acme.nix
    ./autoupgrade.nix
    ./homelab-deploy.nix
    ./kanidm-client.nix
    ./monitoring
    ./motd.nix
    ./packages.nix
    ./nix.nix
    ./pipe-to-loki.nix
    ./root-user.nix
    ./pki/root-ca.nix
    ./sshd.nix
    ./vault-secrets.nix
    ./zram.nix
  ];
}
system/kanidm-client.nix (new file, 42 lines)
@@ -0,0 +1,42 @@
{ lib, config, pkgs, ... }:
let
  cfg = config.homelab.kanidm;
in
{
  options.homelab.kanidm = {
    enable = lib.mkEnableOption "Kanidm PAM/NSS client for central authentication";

    server = lib.mkOption {
      type = lib.types.str;
      default = "https://auth.home.2rjus.net";
      description = "URI of the Kanidm server";
    };

    allowedLoginGroups = lib.mkOption {
      type = lib.types.listOf lib.types.str;
      default = [ "ssh-users" ];
      description = "Groups allowed to log in via PAM";
    };
  };

  config = lib.mkIf cfg.enable {
    services.kanidm = {
      package = pkgs.kanidm_1_8;
      enablePam = true;

      clientSettings = {
        uri = cfg.server;
      };

      unixSettings = {
        pam_allowed_login_groups = cfg.allowedLoginGroups;
        # Use short names (torjus) instead of SPN format (torjus@home.2rjus.net)
        # This prevents "PAM user mismatch" errors with SSH
        uid_attr_map = "name";
        gid_attr_map = "name";
        # Create symlink /home/torjus -> /home/torjus@home.2rjus.net
        home_alias = "name";
      };
    };
  };
}
140
system/pipe-to-loki.nix
Normal file
140
system/pipe-to-loki.nix
Normal file
@@ -0,0 +1,140 @@
{
  config,
  pkgs,
  lib,
  ...
}:
let
  pipe-to-loki = pkgs.writeShellApplication {
    name = "pipe-to-loki";
    runtimeInputs = with pkgs; [
      curl
      jq
      util-linux
      coreutils
    ];
    text = ''
      set -euo pipefail

      LOKI_URL="http://monitoring01.home.2rjus.net:3100/loki/api/v1/push"
      HOSTNAME=$(hostname)
      SESSION_ID=""
      RECORD_MODE=false

      usage() {
        echo "Usage: pipe-to-loki [--id ID] [--record]"
        echo ""
        echo "Send command output or interactive sessions to Loki."
        echo ""
        echo "Options:"
        echo "  --id ID    Set custom session ID (default: auto-generated)"
        echo "  --record   Start interactive recording session"
        echo ""
        echo "Examples:"
        echo "  command | pipe-to-loki           # Pipe command output"
        echo "  command | pipe-to-loki --id foo  # Pipe with custom ID"
        echo "  pipe-to-loki --record            # Start recording session"
        exit 1
      }

      generate_id() {
        local random_chars
        random_chars=$(head -c 2 /dev/urandom | od -An -tx1 | tr -d ' \n')
        echo "''${HOSTNAME}-$(date +%s)-''${random_chars}"
      }

      send_to_loki() {
        local content="$1"
        local type="$2"
        local timestamp_ns
        timestamp_ns=$(date +%s%N)

        local payload
        payload=$(jq -n \
          --arg job "pipe-to-loki" \
          --arg host "$HOSTNAME" \
          --arg type "$type" \
          --arg id "$SESSION_ID" \
          --arg ts "$timestamp_ns" \
          --arg content "$content" \
          '{
            streams: [{
              stream: {
                job: $job,
                host: $host,
                type: $type,
                id: $id
              },
              values: [[$ts, $content]]
            }]
          }')

        if curl -s -X POST "$LOKI_URL" \
          -H "Content-Type: application/json" \
          -d "$payload" > /dev/null; then
          return 0
        else
          echo "Error: Failed to send to Loki" >&2
          return 1
        fi
      }

      # Parse arguments
      while [[ $# -gt 0 ]]; do
        case $1 in
          --id)
            SESSION_ID="$2"
            shift 2
            ;;
          --record)
            RECORD_MODE=true
            shift
            ;;
          --help|-h)
            usage
            ;;
          *)
            echo "Unknown option: $1" >&2
            usage
            ;;
        esac
      done

      # Generate ID if not provided
      if [[ -z "$SESSION_ID" ]]; then
        SESSION_ID=$(generate_id)
      fi

      if $RECORD_MODE; then
        # Session recording mode
        SCRIPT_FILE=$(mktemp)
        trap 'rm -f "$SCRIPT_FILE"' EXIT

        echo "Recording session $SESSION_ID... (exit to send)"

        # Use script to record the session
        script -q "$SCRIPT_FILE"

        # Read the transcript and send to Loki
        content=$(cat "$SCRIPT_FILE")
        if send_to_loki "$content" "session"; then
          echo "Session $SESSION_ID sent to Loki"
        fi
      else
        # Pipe mode - read from stdin
        if [[ -t 0 ]]; then
          echo "Error: No input provided. Pipe a command or use --record for interactive mode." >&2
          exit 1
        fi

        content=$(cat)
        if send_to_loki "$content" "command"; then
          echo "Sent to Loki with id: $SESSION_ID"
        fi
      fi
    '';
  };
in
{
  environment.systemPackages = [ pipe-to-loki ];
}
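The payload that `send_to_loki` builds follows the Loki push API shape: one stream with label set `{job, host, type, id}` and a single `[timestamp_ns, line]` value. A minimal sketch of that construction, runnable locally without a Loki instance (the host name, session ID, and content here are made-up placeholders, not values from the repo):

```shell
# Build the same JSON shape send_to_loki posts to /loki/api/v1/push,
# using fixed placeholder labels instead of the real host/session values.
payload=$(jq -n \
  --arg job "pipe-to-loki" \
  --arg host "demo-host" \
  --arg type "command" \
  --arg id "demo-host-1700000000-ab12" \
  --arg ts "$(date +%s%N)" \
  --arg content "hello from uptime" \
  '{
    streams: [{
      stream: { job: $job, host: $host, type: $type, id: $id },
      values: [[$ts, $content]]
    }]
  }')

# Sanity-check the shape the way Loki expects it: labels in .stream,
# the log line as the second element of the values pair.
echo "$payload" | jq -e '.streams[0].stream.job == "pipe-to-loki"' > /dev/null
echo "$payload" | jq -r '.streams[0].values[0][1]'
```

Because the timestamp must be nanoseconds as a string, passing it through `--arg` (which always produces a JSON string) is what keeps Loki's parser happy.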
system/zram.nix (new file, 8 lines)
@@ -0,0 +1,8 @@
# Compressed swap in RAM
#
# Provides overflow memory during Nix builds and upgrades.
# Prevents OOM kills on low-memory hosts (2GB VMs).
{ ... }:
{
  zramSwap.enable = true;
}
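If the defaults ever prove too tight on the 2GB VMs, the NixOS `zramSwap` module exposes sizing knobs; a sketch of the relevant options (the values shown are the module defaults, not settings from this repo):

```nix
# Equivalent to zramSwap.enable = true with defaults spelled out.
zramSwap = {
  enable = true;
  memoryPercent = 50; # zram device sized at 50% of RAM (default)
  algorithm = "zstd"; # compression algorithm (default)
};
```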
@@ -66,19 +66,7 @@ locals {
      ]
    }

    "pgdb1" = {
      paths = [
        "secret/data/hosts/pgdb1/*",
      ]
    }

    # Wave 3: DNS servers
    "ns1" = {
      paths = [
        "secret/data/hosts/ns1/*",
        "secret/data/shared/dns/*",
      ]
    }

    # Wave 4: http-proxy
    "http-proxy" = {
@@ -26,6 +26,19 @@ locals {
        "secret/data/shared/dns/*",
      ]
    }
    "ns1" = {
      paths = [
        "secret/data/hosts/ns1/*",
        "secret/data/shared/dns/*",
        "secret/data/shared/homelab-deploy/*",
      ]
    }
    "kanidm01" = {
      paths = [
        "secret/data/hosts/kanidm01/*",
        "secret/data/kanidm/*",
      ]
    }
  }
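Path lists like the ones above are typically rendered into per-host Vault policies. A hypothetical rendering for `kanidm01` might look like this (the `read` capability is an assumption; the actual capabilities depend on the policy template elsewhere in the repo):

```hcl
# Sketch: what a generated Vault policy for kanidm01 could expand to.
path "secret/data/hosts/kanidm01/*" {
  capabilities = ["read"]
}
path "secret/data/kanidm/*" {
  capabilities = ["read"]
}
```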
@@ -102,6 +102,12 @@ locals {
      auto_generate = false
      data          = { nkey = var.homelab_deploy_admin_deployer_nkey }
    }

    # Kanidm idm_admin password
    "kanidm/idm-admin-password" = {
      auto_generate   = true
      password_length = 32
    }
  }
}
@@ -65,6 +65,20 @@ locals {
      disk_size           = "20G"
      vault_wrapped_token = "s.3nran1e1Uim4B1OomIWCoS4T"
    }
    "ns1" = {
      ip                  = "10.69.13.5/24"
      cpu_cores           = 2
      memory              = 2048
      disk_size           = "20G"
      vault_wrapped_token = "s.b6ge0KMtNQctdKkvm0RNxGdt"
    }
    "kanidm01" = {
      ip                  = "10.69.13.23/24"
      cpu_cores           = 2
      memory              = 2048
      disk_size           = "20G"
      vault_wrapped_token = "s.OOqjEECeIV7dNgCS6jNmyY3K"
    }
  }

  # Compute VM configurations with defaults applied