Compare commits

34 Commits (ec4ac1477e...pipe-to-lo)
Commits: 78eb04205f, 19cb61ebbc, 9ed09c9a9c, b31c64f1b9, 54b6e37420, b845a8bb8b, bfbf0cea68, 3abe5e83a7, 67c27555f3, 1674b6a844, 311be282b6, 11cbb64097, e2dd21c994, 463342133e, de36b9d016, 3f1d966919, 7fcc043a4d, 70ec5f8109, c2ec34cab9, 8fbf1224fa, 8959829f77, 93dbb45802, 538c2ad097, d99c82c74c, ca0e3fd629, 732e9b8c22, 3a14ffd6b5, f9a3961457, 003d4ccf03, 735b8a9ee3, 94feae82a0, 3f94f7ee95, b7e398c9a7, 8ec2a083bd
180
.claude/agents/auditor.md
Normal file
@@ -0,0 +1,180 @@
---
name: auditor
description: Analyzes audit logs to investigate user activity, command execution, and suspicious behavior on hosts. Can be used standalone for security reviews or called by other agents for behavioral context.
tools: Read, Grep, Glob
mcpServers:
  - lab-monitoring
---

You are a security auditor for a NixOS homelab infrastructure. Your task is to analyze audit logs and reconstruct user activity on hosts.

## Input

You may receive:
- A host or list of hosts to investigate
- A time window (e.g., "last hour", "today", "between 14:00 and 15:00")
- Optional context: specific events to look for, a user to focus on, or suspicious activity to investigate
- Optional context from a parent investigation (e.g., "a service stopped at 14:32, what happened around that time?")

## Audit Log Structure

Logs are shipped to Loki via promtail. Audit events use these labels:
- `host` - hostname
- `systemd_unit` - typically `auditd.service` for audit logs
- `job` - typically `systemd-journal`

Audit log entries contain structured data:
- `EXECVE` - command execution with full arguments
- `USER_LOGIN` / `USER_LOGOUT` - session start/end
- `USER_CMD` - sudo command execution
- `CRED_ACQ` / `CRED_DISP` - credential acquisition/disposal
- `SERVICE_START` / `SERVICE_STOP` - systemd service events

## Investigation Techniques

### 1. SSH Session Activity

Find SSH logins and session activity:

```logql
{host="<hostname>", systemd_unit="sshd.service"}
```

Look for:
- Accepted/failed authentication
- Sessions opened/closed
- Unusual source IPs or users

### 2. Command Execution

Query executed commands (filter out noise):

```logql
{host="<hostname>"} |= "EXECVE" != "PATH item" != "PROCTITLE" != "SYSCALL" != "BPF"
```

Further filtering:
- Exclude systemd noise: `!= "systemd" != "/nix/store"`
- Focus on specific commands: `|= "rm" |= "-rf"`
- Focus on a specific user: `|= "uid=1000"`

### 3. Sudo Activity

Check for privilege escalation:

```logql
{host="<hostname>"} |= "sudo" |= "COMMAND"
```

Or via audit:

```logql
{host="<hostname>"} |= "USER_CMD"
```

### 4. Service Manipulation

Check whether services were manually stopped/started:

```logql
{host="<hostname>"} |= "EXECVE" |= "systemctl"
```

### 5. File Operations

Look for file modifications (if auditd rules are configured):

```logql
{host="<hostname>"} |= "EXECVE" |= "vim"
{host="<hostname>"} |= "EXECVE" |= "nano"
{host="<hostname>"} |= "EXECVE" |= "rm"
```

## Query Guidelines

**Start narrow, expand if needed:**
- Begin with `limit: 20-30`
- Use tight time windows: `start: "15m"` or `start: "30m"`
- Add filters progressively
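
As a sketch of progressive narrowing (the uid value is illustrative):

```logql
{host="<hostname>"} |= "EXECVE"                                # broad first pass
{host="<hostname>"} |= "EXECVE" != "/nix/store" != "systemd"   # drop build/service noise
{host="<hostname>"} |= "EXECVE" != "/nix/store" |= "uid=1000"  # narrow to one user
```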

**Avoid:**
- Querying all audit logs without an EXECVE filter (extremely verbose)
- Large time ranges without specific filters
- Limits over 50 without tight filters

**Time-bounded queries:**

When investigating around a specific event:

```logql
{host="<hostname>"} |= "EXECVE" != "systemd"
```

With `start: "2026-02-08T14:30:00Z"` and `end: "2026-02-08T14:35:00Z"`

## Suspicious Patterns to Watch For

1. **Unusual login times** - Activity outside normal hours
2. **Failed authentication** - Brute force attempts
3. **Privilege escalation** - Unexpected sudo usage
4. **Reconnaissance commands** - `whoami`, `id`, `uname`, `cat /etc/passwd`
5. **Data exfiltration indicators** - `curl`, `wget`, `scp`, `rsync` to external destinations
6. **Persistence mechanisms** - Cron modifications, systemd service creation
7. **Log tampering** - Commands targeting log files
8. **Lateral movement** - SSH to other internal hosts
9. **Service manipulation** - Stopping security services, disabling firewalls
10. **Cleanup activity** - Deleting bash history, clearing logs
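
Several of these patterns can be screened for with simple substring/regex line filters; a sketch (the specific command names are illustrative, not exhaustive):

```logql
# Reconnaissance commands
{host="<hostname>"} |= "EXECVE" |~ "whoami|uname|/etc/passwd"

# Potential exfiltration tools
{host="<hostname>"} |= "EXECVE" |~ "curl|wget|scp|rsync"

# Cleanup activity
{host="<hostname>"} |= "EXECVE" |~ "bash_history|history -c"
```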

## Output Format

### For Standalone Security Reviews

```
## Activity Summary

**Host:** <hostname>
**Time Period:** <start> to <end>
**Sessions Found:** <count>

## User Sessions

### Session 1: <user> from <source_ip>
- **Login:** HH:MM:SSZ
- **Logout:** HH:MM:SSZ (or ongoing)
- **Commands executed:**
  - HH:MM:SSZ - <command>
  - HH:MM:SSZ - <command>

## Suspicious Activity

[If any patterns from the watch list were detected]
- **Finding:** <description>
- **Evidence:** <log entries>
- **Risk Level:** Low / Medium / High

## Summary

[Overall assessment: normal activity, concerning patterns, or clear malicious activity]
```

### When Called by Another Agent

Provide a focused response addressing the specific question:

```
## Audit Findings

**Query:** <what was asked>
**Time Window:** <investigated period>

## Relevant Activity

[Chronological list of relevant events]
- HH:MM:SSZ - <event>
- HH:MM:SSZ - <event>

## Assessment

[Direct answer to the question with supporting evidence]
```

## Guidelines

- Reconstruct timelines chronologically
- Correlate events (login → commands → logout)
- Note gaps or missing data
- Distinguish between automated (systemd, cron) and interactive activity
- Consider the host's role and tier when assessing severity
- When called by another agent, focus on answering their specific question
- Don't speculate without evidence - state what the logs show and don't show
211
.claude/agents/investigate-alarm.md
Normal file
@@ -0,0 +1,211 @@
---
name: investigate-alarm
description: Investigates a single system alarm by querying Prometheus metrics and Loki logs, analyzing configuration files for affected hosts/services, and providing root cause analysis.
tools: Read, Grep, Glob
mcpServers:
  - lab-monitoring
  - git-explorer
---

You are an alarm investigation specialist for a NixOS homelab infrastructure. Your task is to analyze a single alarm and determine its root cause.

## Input

You will receive information about an alarm, which may include:
- Alert name and severity
- Affected host or service
- Alert expression/threshold
- Current value or status
- When it started firing

## Investigation Process

### 1. Understand the Alert Context

Start by understanding what the alert is measuring:
- Use `get_alert` if you have a fingerprint, or `list_alerts` to find matching alerts
- Use `get_metric_metadata` to understand the metric being monitored
- Use `search_metrics` to find related metrics

### 2. Query Current State

Gather evidence about the current system state:
- Use `query` to check the current metric values and related metrics
- Use `list_targets` to verify the host/service is being scraped successfully
- Look for correlated metrics that might explain the issue
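
A first-pass state check might look like this (the second metric is just an example for a disk alert):

```promql
# Is the target being scraped at all?
up{hostname="<hostname>"}

# Current value of the alerting metric, e.g. for a disk-space alert
node_filesystem_avail_bytes{hostname="<hostname>", mountpoint="/"}
```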

### 3. Check Service Logs

Search for relevant log entries using `query_logs`. Focus on service-specific logs and errors.

**Query strategies (start narrow, expand if needed):**
- Start with `limit: 20-30`, increase only if needed
- Use tight time windows: `start: "15m"` or `start: "30m"` initially
- Filter to specific services: `{host="<hostname>", systemd_unit="<service>.service"}`
- Search for errors: `{host="<hostname>"} |= "error"` or `|= "failed"`

**Common patterns:**
- Service logs: `{host="<hostname>", systemd_unit="<service>.service"}`
- All errors on a host: `{host="<hostname>"} |= "error"`
- Journal for a unit: `{host="<hostname>", systemd_unit="nginx.service"} |= "failed"`

**Avoid:**
- Using `start: "1h"` with no filters on busy hosts
- Limits over 50 without specific filters

### 4. Investigate User Activity

For any analysis of user activity, **always spawn the `auditor` agent**. Do not query audit logs (EXECVE, USER_LOGIN, etc.) directly - delegate this to the auditor.

**Always call the auditor when:**
- A service stopped unexpectedly (may have been manually stopped)
- A process was killed or a config was changed
- You need to know who was logged in around the time of an incident
- You need to understand what commands led to the current state
- The cause isn't obvious from service logs alone

**Do NOT try to query audit logs yourself.** The auditor is specialized for:
- Parsing EXECVE records and reconstructing command lines
- Correlating SSH sessions with commands executed
- Identifying suspicious patterns
- Filtering out systemd/nix-store noise

**Example prompt for the auditor:**

```
Investigate user activity on <hostname> between <start_time> and <end_time>.
Context: The prometheus-node-exporter service stopped at 14:32.
Determine if it was manually stopped and by whom.
```

Incorporate the auditor's findings into your timeline and root cause analysis.

### 5. Check Configuration (if relevant)

If the alert relates to a NixOS-managed service:
- Check host configuration in `/hosts/<hostname>/`
- Check service modules in `/services/<service>/`
- Look for thresholds, resource limits, or misconfigurations
- Check `homelab.host` options for tier/priority/role metadata

### 6. Check for Configuration Drift

Use the git-explorer MCP server to compare the host's deployed configuration against the current master branch. This helps identify:
- Hosts running outdated configurations
- Recent changes that might have caused the issue
- Whether a fix has already been committed but not deployed

**Step 1: Get the deployed revision from Prometheus**

```promql
nixos_flake_info{hostname="<hostname>"}
```

The `current_rev` label contains the deployed git commit hash.

**Step 2: Check if the host is behind master**

```
resolve_ref("master")          # Get current master commit
is_ancestor(deployed, master)  # Check if host is behind
```

**Step 3: See what commits are missing**

```
commits_between(deployed, master)  # List commits not yet deployed
```

**Step 4: Check which files changed**

```
get_diff_files(deployed, master)  # Files modified since deployment
```

Look for files in `hosts/<hostname>/`, `services/<relevant-service>/`, or `system/` that affect this host.

**Step 5: View configuration at the deployed revision**

```
get_file_at_commit(deployed, "services/<service>/default.nix")
```

Compare against the current file to understand differences.

**Step 6: Find when something changed**

```
search_commits("<service-name>")  # Find commits mentioning the service
get_commit_info(<hash>)           # Get full details of a specific change
```

**Example workflow for a service-related alert:**

1. Query `nixos_flake_info{hostname="monitoring01"}` → `current_rev: 8959829`
2. `resolve_ref("master")` → `4633421`
3. `is_ancestor("8959829", "4633421")` → Yes, host is behind
4. `commits_between("8959829", "4633421")` → 7 commits missing
5. `get_diff_files("8959829", "4633421")` → Check if relevant service files changed
6. If a fix was committed after the deployed rev, recommend deployment

### 7. Consider Common Causes

For infrastructure alerts, common causes include:
- **Manual intervention**: Service manually stopped/restarted (call the auditor to confirm)
- **Configuration drift**: Host running an outdated config, fix already in master
- **Disk space**: Nix store growth, logs, temp files
- **Memory pressure**: Service memory leaks, insufficient limits
- **CPU**: Runaway processes, build jobs
- **Network**: DNS issues, connectivity problems
- **Service restarts**: Failed upgrades, configuration errors
- **Scrape failures**: Service down, firewall issues, port changes

**Note:** If a service stopped unexpectedly and service logs don't show a crash or error, it was likely manual intervention - call the auditor to investigate.
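
As a quick triage sketch for the scrape-failure case:

```promql
# All targets that are currently failing to scrape
up == 0

# Narrowed to the affected host
up{hostname="<hostname>"} == 0
```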

## Output Format

Provide a concise report with one of two outcomes:

### If Root Cause Identified:

```
## Root Cause
[1-2 sentence summary of the root cause]

## Timeline
[Chronological sequence of relevant events leading to the alert]
- HH:MM:SSZ - [Event description]
- HH:MM:SSZ - [Event description]
- HH:MM:SSZ - [Alert fired]

### Timeline sources
- HH:MM:SSZ - [Source for this event: which metric or log file]
- HH:MM:SSZ - [Source for this event: which metric or log file]
- HH:MM:SSZ - [Alert fired]

## Evidence
- [Specific metric values or log entries that support the conclusion]
- [Configuration details if relevant]

## Recommended Actions
1. [Specific remediation step]
2. [Follow-up actions if any]
```

### If Root Cause Unclear:

```
## Investigation Summary
[What was checked and what was found]

## Possible Causes
- [Hypothesis 1 with supporting/contradicting evidence]
- [Hypothesis 2 with supporting/contradicting evidence]

## Additional Information Needed
- [Specific data, logs, or access that would help]
- [Suggested queries or checks for the operator]
```

## Guidelines

- Be concise and actionable
- Reference specific metric names and values as evidence
- Include log snippets when they're informative
- Don't speculate without evidence
- If the alert is a false positive or expected behavior, explain why
- Consider the host's tier (test vs prod) when assessing severity
- Build a timeline from log timestamps and metrics to show the sequence of events
- **Query logs incrementally**: start with narrow filters and small limits, expand only if needed
- **Always delegate to the auditor agent** for any user activity analysis - never query EXECVE or audit logs directly
@@ -32,7 +32,7 @@ Use the `lab-monitoring` MCP server tools:
Available labels for log queries:
- `host` - Hostname (e.g., `ns1`, `monitoring01`, `ha1`)
- `systemd_unit` - Systemd unit name (e.g., `nsd.service`, `nixos-upgrade.service`)
- `job` - Either `systemd-journal` (most logs), `varlog` (file-based logs), or `bootstrap` (VM bootstrap logs)
- `filename` - For `varlog` job, the log file path
- `hostname` - Alternative to `host` for some streams

@@ -102,6 +102,36 @@ Useful systemd units for troubleshooting:
- `sshd.service` - SSH daemon
- `nix-gc.service` - Nix garbage collection

### Bootstrap Logs

VMs provisioned from template2 send bootstrap progress directly to Loki via curl (before promtail is available). These logs use `job="bootstrap"` with additional labels:

- `host` - Target hostname
- `branch` - Git branch being deployed
- `stage` - Bootstrap stage (see table below)

**Bootstrap stages:**

| Stage | Message | Meaning |
|-------|---------|---------|
| `starting` | Bootstrap starting for \<host\> (branch: \<branch\>) | Bootstrap service has started |
| `network_ok` | Network connectivity confirmed | Can reach git server |
| `vault_ok` | Vault credentials unwrapped and stored | AppRole credentials provisioned |
| `vault_skip` | No Vault token provided - skipping credential setup | No wrapped token was provided |
| `vault_warn` | Failed to unwrap Vault token - continuing without secrets | Token unwrap failed (expired/used) |
| `building` | Starting nixos-rebuild boot | NixOS build starting |
| `success` | Build successful - rebooting into new configuration | Build complete, rebooting |
| `failed` | nixos-rebuild failed - manual intervention required | Build failed |

**Bootstrap queries:**

```logql
{job="bootstrap"}                              # All bootstrap logs
{job="bootstrap", host="myhost"}               # Specific host
{job="bootstrap", stage="failed"}              # All failures
{job="bootstrap", stage=~"building|success"}   # Track build progress
```

### Extracting JSON Fields

Parse JSON and filter on fields:
@@ -175,15 +205,39 @@ Disk space (root filesystem):
node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}
```

### Prometheus Jobs

All available Prometheus job names:

**System exporters (on all/most hosts):**
- `node-exporter` - System metrics (CPU, memory, disk, network)
- `nixos-exporter` - NixOS flake revision and generation info
- `systemd-exporter` - Systemd unit status metrics
- `homelab-deploy` - Deployment listener metrics

**Service-specific exporters:**
- `caddy` - Reverse proxy metrics (http-proxy)
- `nix-cache_caddy` - Nix binary cache metrics
- `home-assistant` - Home automation metrics (ha1)
- `jellyfin` - Media server metrics (jelly01)
- `kanidm` - Authentication server metrics (kanidm01)
- `nats` - NATS messaging metrics (nats1)
- `openbao` - Secrets management metrics (vault01)
- `unbound` - DNS resolver metrics (ns1, ns2)
- `wireguard` - VPN tunnel metrics (http-proxy)

**Monitoring stack (localhost on monitoring01):**
- `prometheus` - Prometheus self-metrics
- `loki` - Loki self-metrics
- `grafana` - Grafana self-metrics
- `alertmanager` - Alertmanager metrics
- `pushgateway` - Push-based metrics gateway

**External/infrastructure:**
- `pve-exporter` - Proxmox hypervisor metrics
- `smartctl` - Disk SMART health (gunter)
- `restic_rest` - Backup server metrics
- `ghettoptt` - PTT service metrics (gunter)

### Target Labels

@@ -237,6 +291,7 @@ Current host labels:
| ns2 | `role=dns`, `dns_role=secondary` |
| nix-cache01 | `role=build-host` |
| vault01 | `role=vault` |
| kanidm01 | `role=auth`, `tier=test` |
| testvm01/02/03 | `tier=test` |

---
@@ -265,6 +320,17 @@ Current host labels:
3. Check service logs for startup issues
4. Check service metrics are being scraped

### Monitor VM Bootstrap

When provisioning new VMs, track bootstrap progress:

1. Watch bootstrap logs: `{job="bootstrap", host="<hostname>"}`
2. Check for failures: `{job="bootstrap", host="<hostname>", stage="failed"}`
3. After success, verify the host appears in metrics: `up{hostname="<hostname>"}`
4. Check logs are flowing: `{host="<hostname>"}`

See [docs/host-creation.md](../../../docs/host-creation.md) for the full host creation pipeline.

### Debug SSH/Access Issues

```logql
@@ -33,6 +33,13 @@
        "--nats-url", "nats://nats1.home.2rjus.net:4222",
        "--nkey-file", "/home/torjus/.config/homelab-deploy/test-deployer.nkey"
      ]
    },
    "git-explorer": {
      "command": "nix",
      "args": ["run", "git+https://git.t-juice.club/torjus/labmcp#git-explorer", "--", "serve"],
      "env": {
        "GIT_REPO_PATH": "/home/torjus/git/nixos-servers"
      }
    }
  }
}
103
CLAUDE.md
@@ -35,6 +35,10 @@ nix build .#create-host
Do not automatically deploy changes. Deployments are usually done by updating the master branch, and then triggering the auto update on the specific host.

### SSH Commands

Do not run SSH commands directly. If a command needs to be run on a remote host, provide the command to the user and ask them to run it manually.

### Testing Feature Branches on Hosts

All hosts have the `nixos-rebuild-test` helper script for testing feature branches before merging:
@@ -152,82 +156,16 @@ Two MCP servers are available for searching NixOS options and packages:
This ensures documentation matches the exact nixpkgs version (currently NixOS 25.11) used by this flake.

### Lab Monitoring

The **lab-monitoring** MCP server provides access to Prometheus metrics and Loki logs. Use the `/observability` skill for detailed reference on:

- Available Prometheus jobs and exporters
- Loki labels and LogQL query syntax
- Bootstrap log monitoring for new VMs
- Common troubleshooting workflows

The skill contains up-to-date information about all scrape targets, host labels, and example queries.

### Deploying to Test Hosts
@@ -496,20 +434,11 @@ This means:
### Adding a New Host

See [docs/host-creation.md](docs/host-creation.md) for the complete host creation pipeline, including:
- Using the `create-host` script to generate host configurations
- Deploying VMs and secrets with OpenTofu
- Monitoring the bootstrap process via Loki
- Verification and troubleshooting steps

**Note:** DNS A records and Prometheus node-exporter scrape targets are auto-generated from the host's `systemd.network.networks` static IP configuration. No manual zone file or Prometheus config editing is required.

@@ -13,7 +13,6 @@ NixOS Flake-based configuration repository for a homelab infrastructure. All hos
| `monitoring01` | Prometheus, Grafana, Loki, Tempo, Pyroscope |
| `jelly01` | Jellyfin media server |
| `nix-cache01` | Nix binary cache |
-| `pgdb1` | PostgreSQL |
| `nats1` | NATS messaging |
| `vault01` | OpenBao (Vault) secrets management |
| `template1`, `template2` | VM templates for cloning new hosts |
21 common/ssh-audit.nix Normal file
@@ -0,0 +1,21 @@
# SSH session command auditing
#
# Logs all commands executed by users who logged in interactively (SSH).
# System services and nix builds are excluded via auid filter.
#
# Logs are sent to journald and forwarded to Loki via promtail.
# Query with: {host="<hostname>"} |= "EXECVE"
{
  # Enable Linux audit subsystem
  security.audit.enable = true;
  security.auditd.enable = true;

  # Log execve syscalls only from interactive login sessions
  # auid!=4294967295 means "audit login uid is set" (excludes system services, nix builds)
  security.audit.rules = [
    "-a exit,always -F arch=b64 -S execve -F auid!=4294967295"
  ];

  # Forward audit logs to journald (so promtail ships them to Loki)
  services.journald.audit = true;
}
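With these rules in place, the shipped events can be narrowed further in Loki. For example (label names follow how promtail labels journal logs in this setup; treat them as assumptions if your labels differ):

```
# All commands from interactive sessions on one host
{host="<hostname>", systemd_unit="auditd.service"} |= "EXECVE"

# Only events for a specific audit login uid
{host="<hostname>", systemd_unit="auditd.service"} |= "EXECVE" |= "auid=1000"
```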
217 docs/host-creation.md Normal file
@@ -0,0 +1,217 @@
# Host Creation Pipeline

This document describes the process for creating new hosts in the homelab infrastructure.

## Overview

We use the `create-host` script to create new hosts, which generates default configurations from a template. We then use OpenTofu to deploy both secrets and VMs. The VMs boot using a template image (built from `hosts/template2`), which starts a bootstrap process. This bootstrap process applies the host's NixOS configuration and then reboots into the new config.

## Prerequisites

All tools are available in the devshell: `create-host`, `bao` (OpenBao CLI), `tofu`.

```bash
nix develop
```

## Steps

Steps marked with **USER** must be performed by the user due to credential requirements.
1. **USER**: Run `create-host --hostname <name> --ip <ip/prefix>`
2. Edit the auto-generated configurations in `hosts/<hostname>/` to import whatever modules are needed for its purpose
3. Add any secrets needed to `terraform/vault/`
4. Edit the VM specs in `terraform/vms.tf` if needed. To deploy from a branch other than master, add `flake_branch = "<branch>"` to the VM definition
5. Push the configuration to master (or the branch specified by `flake_branch`)
6. **USER**: Apply terraform:
   ```bash
   nix develop -c tofu -chdir=terraform/vault apply
   nix develop -c tofu -chdir=terraform apply
   ```
7. Once terraform completes, a VM boots in Proxmox using the template image
8. The VM runs the `nixos-bootstrap` service, which applies the host config and reboots
9. After reboot, the host should be operational
10. Trigger auto-upgrade on `ns1` and `ns2` to propagate DNS records for the new host
11. Trigger auto-upgrade on `monitoring01` to add the host to Prometheus scrape targets
## Tier Specification

New hosts should set `homelab.host.tier` in their configuration:

```nix
homelab.host.tier = "test"; # or "prod"
```

- **test** - Test-tier hosts can receive remote deployments via the `homelab-deploy` MCP server and have different credential access. Use for staging/testing.
- **prod** - Production hosts. Deployments require direct access or the CLI with appropriate credentials.
## Observability

During the bootstrap process, status updates are sent to Loki. Query bootstrap logs with:

```
{job="bootstrap", host="<hostname>"}
```

### Bootstrap Stages

The bootstrap process reports these stages via the `stage` label:

| Stage | Message | Meaning |
|-------|---------|---------|
| `starting` | Bootstrap starting for \<host\> (branch: \<branch\>) | Bootstrap service has started |
| `network_ok` | Network connectivity confirmed | Can reach git server |
| `vault_ok` | Vault credentials unwrapped and stored | AppRole credentials provisioned |
| `vault_skip` | No Vault token provided - skipping credential setup | No wrapped token was provided |
| `vault_warn` | Failed to unwrap Vault token - continuing without secrets | Token unwrap failed (expired/used) |
| `building` | Starting nixos-rebuild boot | NixOS build starting |
| `success` | Build successful - rebooting into new configuration | Build complete, rebooting |
| `failed` | nixos-rebuild failed - manual intervention required | Build failed |

### Useful Queries

```
# All bootstrap activity for a host
{job="bootstrap", host="myhost"}

# Track all failures
{job="bootstrap", stage="failed"}

# Monitor builds in progress
{job="bootstrap", stage=~"building|success"}
```

Once the VM reboots with its full configuration, it will start publishing metrics to Prometheus and logs to Loki via Promtail.
## Verification

1. Check bootstrap completed successfully:
   ```
   {job="bootstrap", host="<hostname>", stage="success"}
   ```

2. Verify the host is up and reporting metrics:
   ```promql
   up{instance=~"<hostname>.*"}
   ```

3. Verify the correct flake revision is deployed:
   ```promql
   nixos_flake_info{instance=~"<hostname>.*"}
   ```

4. Check logs are flowing:
   ```
   {host="<hostname>"}
   ```

5. Confirm expected services are running and producing logs
## Troubleshooting

### Bootstrap Failed

#### Common Issues

* The VM struggles to run the initial nixos-rebuild. This usually happens when it has to compile packages from scratch because they are not available in our local nix-cache.

#### Troubleshooting

1. Check bootstrap logs in Loki - if they never progress past `building`, the rebuild likely consumed all resources:
   ```
   {job="bootstrap", host="<hostname>"}
   ```

2. **USER**: SSH into the host and check the bootstrap service:
   ```bash
   ssh root@<hostname>
   journalctl -u nixos-bootstrap.service
   ```

3. If the build failed due to resource constraints, increase VM specs in `terraform/vms.tf` and redeploy, or manually run the rebuild:
   ```bash
   nixos-rebuild boot --flake git+https://git.t-juice.club/torjus/nixos-servers.git#<hostname>
   ```

4. If the host config doesn't exist in the flake, ensure step 5 was completed (config pushed to the correct branch).
### Vault Credentials Not Working

Usually caused by running the `create-host` script without proper credentials, or by a wrapped token that has expired or already been used.

#### Troubleshooting

1. Check if credentials exist on the host:
   ```bash
   ssh root@<hostname>
   ls -la /var/lib/vault/approle/
   ```

2. Check bootstrap logs for vault-related stages:
   ```
   {job="bootstrap", host="<hostname>", stage=~"vault.*"}
   ```

3. **USER**: Regenerate and provision credentials manually:
   ```bash
   nix develop -c ansible-playbook playbooks/provision-approle.yml -e hostname=<hostname>
   ```
### Host Not Appearing in DNS

Usually caused by not having deployed the commit with the new host to ns1/ns2.

#### Troubleshooting

1. Verify the host config has a static IP configured in `systemd.network.networks`

2. Check that `homelab.dns.enable` is not set to `false`

3. **USER**: Trigger auto-upgrade on DNS servers:
   ```bash
   ssh root@ns1 systemctl start nixos-upgrade.service
   ssh root@ns2 systemctl start nixos-upgrade.service
   ```

4. Verify DNS resolution after upgrade completes:
   ```bash
   dig @ns1.home.2rjus.net <hostname>.home.2rjus.net
   ```
### Host Not Being Scraped by Prometheus

Usually caused by not having deployed the commit with the new host to the monitoring host.

#### Troubleshooting

1. Check that `homelab.monitoring.enable` is not set to `false`

2. **USER**: Trigger auto-upgrade on monitoring01:
   ```bash
   ssh root@monitoring01 systemctl start nixos-upgrade.service
   ```

3. Verify the target appears in Prometheus:
   ```promql
   up{instance=~"<hostname>.*"}
   ```

4. If the target is down, check that node-exporter is running on the host:
   ```bash
   ssh root@<hostname> systemctl status prometheus-node-exporter.service
   ```
## Related Files

| Path | Description |
|------|-------------|
| `scripts/create-host/` | The `create-host` script that generates host configurations |
| `hosts/template2/` | Template VM configuration (base image for new VMs) |
| `hosts/template2/bootstrap.nix` | Bootstrap service that applies NixOS config on first boot |
| `terraform/vms.tf` | VM definitions (specs, IPs, branch overrides) |
| `terraform/cloud-init.tf` | Cloud-init configuration (passes hostname, branch, vault token) |
| `terraform/vault/approle.tf` | AppRole policies for each host |
| `terraform/vault/secrets.tf` | Secret definitions in Vault |
| `terraform/vault/hosts-generated.tf` | Auto-generated wrapped tokens for VM bootstrap |
| `playbooks/provision-approle.yml` | Ansible playbook for manual credential provisioning |
| `flake.nix` | Flake with all host configurations (add new hosts here) |
@@ -2,7 +2,7 @@
## Overview

-Replace the current auth01 setup (LLDAP + Authelia) with a modern, unified authentication solution. The current setup is not in active use, making this a good time to evaluate alternatives.
+Deploy a modern, unified authentication solution for the homelab. Provides central user management, SSO for web services, and consistent UID/GID mapping for NAS permissions.

## Goals
@@ -11,66 +11,9 @@ Replace the current auth01 setup (LLDAP + Authelia) with a modern, unified authe
3. **UID/GID consistency** - Proper POSIX attributes for NAS share permissions
4. **OIDC provider** - Single sign-on for homelab web services (Grafana, etc.)

-## Options Evaluated
+## Solution: Kanidm

-### OpenLDAP (raw)
+Kanidm was chosen for the following reasons:

-- **NixOS Support:** Good (`services.openldap` with `declarativeContents`)
-- **Pros:** Most widely supported, very flexible
-- **Cons:** LDIF format is painful, schema management is complex, no built-in OIDC, requires SSSD on each client
-- **Verdict:** Doesn't address LDAP complexity concerns
-
-### LLDAP + Authelia (current)
-
-- **NixOS Support:** Both have good modules
-- **Pros:** Already configured, lightweight, nice web UIs
-- **Cons:** Two services to manage, limited POSIX attribute support in LLDAP, requires SSSD on every client host
-- **Verdict:** Workable but has friction for NAS/UID goals
-
-### FreeIPA
-
-- **NixOS Support:** None
-- **Pros:** Full enterprise solution (LDAP + Kerberos + DNS + CA)
-- **Cons:** Extremely heavy, wants to own DNS, designed for Red Hat ecosystems, massive overkill for homelab
-- **Verdict:** Overkill, no NixOS support
-
-### Keycloak
-
-- **NixOS Support:** None
-- **Pros:** Good OIDC/SAML, nice UI
-- **Cons:** Primarily an identity broker not a user directory, poor POSIX support, heavy (Java)
-- **Verdict:** Wrong tool for Linux user management
-
-### Authentik
-
-- **NixOS Support:** None (would need Docker)
-- **Pros:** All-in-one with LDAP outpost and OIDC, modern UI
-- **Cons:** Heavy stack (Python + PostgreSQL + Redis), LDAP is a separate component
-- **Verdict:** Would work but requires Docker and is heavy
-
-### Kanidm
-
-- **NixOS Support:** Excellent - first-class module with PAM/NSS integration
-- **Pros:**
-  - Native PAM/NSS module (no SSSD needed)
-  - Built-in OIDC provider
-  - Optional LDAP interface for legacy services
-  - Declarative provisioning via NixOS (users, groups, OAuth2 clients)
-  - Modern, written in Rust
-  - Single service handles everything
-- **Cons:** Newer project, smaller community than LDAP
-- **Verdict:** Best fit for requirements
-
-### Pocket-ID
-
-- **NixOS Support:** Unknown
-- **Pros:** Very lightweight, passkey-first
-- **Cons:** No LDAP, no PAM/NSS integration - purely OIDC for web apps
-- **Verdict:** Doesn't solve Linux user management goal
-
-## Recommendation: Kanidm
-
-Kanidm is the recommended solution for the following reasons:

| Requirement | Kanidm Support |
|-------------|----------------|
@@ -82,42 +25,10 @@ Kanidm is the recommended solution for the following reasons:
| Simplicity | Modern API, LDAP optional |
| NixOS integration | First-class |

-### Key NixOS Features
+### Configuration Files

-**Server configuration:**
-```nix
-services.kanidm.enableServer = true;
-services.kanidm.serverSettings = {
-  domain = "home.2rjus.net";
-  origin = "https://auth.home.2rjus.net";
-  ldapbindaddress = "0.0.0.0:636"; # Optional LDAP interface
-};
-```
+- **Host configuration:** `hosts/kanidm01/`
+- **Service module:** `services/kanidm/default.nix`

-**Declarative user provisioning:**
-```nix
-services.kanidm.provision.enable = true;
-services.kanidm.provision.persons.torjus = {
-  displayName = "Torjus";
-  groups = [ "admins" "nas-users" ];
-};
-```
-
-**Declarative OAuth2 clients:**
-```nix
-services.kanidm.provision.systems.oauth2.grafana = {
-  displayName = "Grafana";
-  originUrl = "https://grafana.home.2rjus.net/login/generic_oauth";
-  originLanding = "https://grafana.home.2rjus.net";
-};
-```
-
-**Client host configuration (add to system/):**
-```nix
-services.kanidm.enableClient = true;
-services.kanidm.enablePam = true;
-services.kanidm.clientSettings.uri = "https://auth.home.2rjus.net";
-```
## NAS Integration
@@ -148,42 +59,103 @@ This future migration path is a strong argument for Kanidm over LDAP-only soluti
## Implementation Steps

-1. **Create Kanidm service module** in `services/kanidm/`
-   - Server configuration
-   - TLS via internal ACME
-   - Vault secrets for admin passwords
+1. **Create kanidm01 host and service module** ✅
+   - Host: `kanidm01.home.2rjus.net` (10.69.13.23, test tier)
+   - Service module: `services/kanidm/`
+   - TLS via internal ACME (`auth.home.2rjus.net`)
+   - Vault integration for idm_admin password
+   - LDAPS on port 636

-2. **Configure declarative provisioning**
-   - Define initial users and groups
-   - Set up POSIX attributes (UID/GID ranges)
+2. **Configure provisioning** ✅
+   - Groups provisioned declaratively: `admins`, `users`, `ssh-users`
+   - Users managed imperatively via CLI (allows setting POSIX passwords in one step)
+   - POSIX attributes enabled (UID/GID range 65,536-69,999)

-3. **Add OIDC clients** for homelab services
-   - Grafana
-   - Other services as needed
-
-4. **Create client module** in `system/` for PAM/NSS
-   - Enable on all hosts that need central auth
-   - Configure trusted CA
-
-5. **Test NAS integration**
+3. **Test NAS integration** (in progress)
+   - ✅ LDAP interface verified working
   - Configure TrueNAS LDAP client to connect to Kanidm
   - Verify UID/GID mapping works with NFS shares

-6. **Migrate auth01**
-   - Remove LLDAP and Authelia services
-   - Deploy Kanidm
-   - Update DNS CNAMEs if needed
+4. **Add OIDC clients** for homelab services
+   - Grafana
+   - Other services as needed

-7. **Documentation**
-   - User management procedures
-   - Adding new OAuth2 clients
-   - Troubleshooting PAM/NSS issues
+5. **Create client module** in `system/` for PAM/NSS ✅
+   - Module: `system/kanidm-client.nix`
+   - `homelab.kanidm.enable = true` enables PAM/NSS
+   - Short usernames (not SPN format)
+   - Home directory symlinks via `home_alias`
+   - Enabled on test tier: testvm01, testvm02, testvm03

-## Open Questions
+6. **Documentation** ✅
+   - `docs/user-management.md` - CLI workflows, troubleshooting
+   - User/group creation procedures verified working
-- What UID/GID range should be reserved for Kanidm-managed users?
-- Which hosts should have PAM/NSS enabled initially?
-- What OAuth2 clients are needed at launch?
+## Progress
+
+### Completed (2026-02-08)
+
+**Kanidm server deployed on kanidm01 (test tier):**
+- Host: `kanidm01.home.2rjus.net` (10.69.13.23)
+- WebUI: `https://auth.home.2rjus.net`
+- LDAPS: port 636
+- Valid certificate from internal CA
+
+**Configuration:**
+- Kanidm 1.8 with secret provisioning support
+- Daily backups at 22:00 (7 versions retained)
+- Vault integration for idm_admin password
+- Prometheus monitoring scrape target configured
+
+**Provisioned entities:**
+- Groups: `admins`, `users`, `ssh-users` (declarative)
+- Users managed via CLI (imperative)
+
+**Verified working:**
+- WebUI login with idm_admin
+- LDAP bind and search with POSIX-enabled user
+- LDAPS with valid internal CA certificate
+
+### Completed (2026-02-08) - PAM/NSS Client
+
+**Client module deployed (`system/kanidm-client.nix`):**
+- `homelab.kanidm.enable = true` enables PAM/NSS integration
+- Connects to auth.home.2rjus.net
+- Short usernames (`torjus` instead of `torjus@home.2rjus.net`)
+- Home directory symlinks (`/home/torjus` → UUID-based dir)
+- Login restricted to `ssh-users` group
+
+**Enabled on test tier:**
+- testvm01, testvm02, testvm03
+
+**Verified working:**
+- User/group resolution via `getent`
+- SSH login with Kanidm unix passwords
+- Home directory creation with symlinks
+- Imperative user/group creation via CLI
+
+**Documentation:**
+- `docs/user-management.md` with full CLI workflows
+- Password requirements (min 10 chars)
+- Troubleshooting guide (nscd, cache invalidation)
+
+### UID/GID Range (Resolved)
+
+**Range: 65,536 - 69,999** (manually allocated)
+
+- Users: 65,536 - 67,999 (up to ~2500 users)
+- Groups: 68,000 - 69,999 (up to ~2000 groups)
+
+Rationale:
+- Starts at Kanidm's recommended minimum (65,536)
+- Well above NixOS system users (typically <1000)
+- Avoids Podman/container issues with very high GIDs
+
+### Next Steps
+
+1. Enable PAM/NSS on production hosts (after test tier validation)
+2. Configure TrueNAS LDAP client for NAS integration testing
+3. Add OAuth2 clients (Grafana first)
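For the Grafana OIDC client, provisioning could be declared along the lines an earlier revision of this plan sketched (option path from the NixOS Kanidm module; the URLs follow this lab's naming and are otherwise assumptions):

```nix
{
  # Hypothetical sketch - exact client settings (scopes, claims) still to be decided
  services.kanidm.provision.systems.oauth2.grafana = {
    displayName = "Grafana";
    originUrl = "https://grafana.home.2rjus.net/login/generic_oauth";
    originLanding = "https://grafana.home.2rjus.net";
  };
}
```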
## References
107 docs/plans/completed/ns1-recreation.md Normal file
@@ -0,0 +1,107 @@
# ns1 Recreation Plan

## Overview

Recreate ns1 using the OpenTofu workflow after the existing VM entered emergency mode due to incorrect hardware-configuration.nix (hardcoded UUIDs that don't match actual disk layout).

## Current ns1 Configuration to Preserve

- **IP:** 10.69.13.5/24
- **Gateway:** 10.69.13.1
- **Role:** Primary DNS (authoritative + resolver)
- **Services:**
  - `../../services/ns/master-authorative.nix`
  - `../../services/ns/resolver.nix`
- **Metadata:**
  - `homelab.host.role = "dns"`
  - `homelab.host.labels.dns_role = "primary"`
- **Vault:** enabled
- **Deploy:** enabled
## Execution Steps

### Phase 1: Remove Old Configuration

```bash
nix develop -c create-host --remove --hostname ns1 --force
```

This removes:
- `hosts/ns1/` directory
- Entry from `flake.nix`
- Any terraform entries (none exist currently)

### Phase 2: Create New Configuration

```bash
nix develop -c create-host --hostname ns1 --ip 10.69.13.5/24
```

This creates:
- `hosts/ns1/` with template2-based configuration
- Entry in `flake.nix`
- Entry in `terraform/vms.tf`
- Vault wrapped token for bootstrap

### Phase 3: Customize Configuration

After create-host, manually update `hosts/ns1/configuration.nix` to add:

1. DNS service imports:
   ```nix
   ../../services/ns/master-authorative.nix
   ../../services/ns/resolver.nix
   ```

2. Host metadata:
   ```nix
   homelab.host = {
     tier = "prod";
     role = "dns";
     labels.dns_role = "primary";
   };
   ```

3. Disable resolved (conflicts with Unbound):
   ```nix
   services.resolved.enable = false;
   ```
### Phase 4: Commit Changes

```bash
git add -A
git commit -m "ns1: recreate with OpenTofu workflow

Old VM had incorrect hardware-configuration.nix with hardcoded UUIDs
that didn't match actual disk layout, causing boot failure.

Recreated using template2-based configuration for OpenTofu provisioning."
```

### Phase 5: Infrastructure

1. Delete old ns1 VM in Proxmox (it's broken anyway)
2. Run `nix develop -c tofu -chdir=terraform apply`
3. Wait for bootstrap to complete
4. Verify ns1 is functional:
   - DNS resolution working
   - Zone transfer to ns2 working
   - All exporters responding

### Phase 6: Finalize

- Push to master
- Move this plan to `docs/plans/completed/`

## Rollback

If the new VM fails:
1. ns2 is still operational as secondary DNS
2. Can recreate with different settings if needed

## Notes

- ns2 will continue serving DNS during the migration
- Zone data is generated from flake, so no data loss
- The old VM's disk can be kept briefly in Proxmox as backup if desired
@@ -9,13 +9,13 @@ hosts are decommissioned or deferred.
## Current State

-Hosts already managed by OpenTofu: `vault01`, `testvm01`, `testvm02`, `testvm03`, `ns2`
+Hosts already managed by OpenTofu: `vault01`, `testvm01`, `testvm02`, `testvm03`, `ns2`, `ns1`

Hosts to migrate:

| Host | Category | Notes |
|------|----------|-------|
-| ns1 | Stateless | Primary DNS, recreate |
+| ~~ns1~~ | ~~Stateless~~ | ✓ Complete |
| nix-cache01 | Stateless | Binary cache, recreate |
| http-proxy | Stateless | Reverse proxy, recreate |
| nats1 | Stateless | Messaging, recreate |
@@ -75,11 +75,11 @@ Migrate stateless hosts in an order that minimizes disruption:
1. **nix-cache01** — low risk, no downstream dependencies during migration
2. **nats1** — low risk, verify no persistent JetStream streams first
3. **http-proxy** — brief disruption to proxied services, migrate during low-traffic window
-4. **ns1** — ns2 already migrated, verify AXFR works after ns1 migration
+4. ~~**ns1** — ns2 already migrated, verify AXFR works after ns1 migration~~ ✓ Complete

-~~For ns1/ns2: migrate ns2 first (secondary), verify AXFR works, then migrate ns1.~~ ns2
-migration complete. All hosts use both ns1 and ns2 as resolvers, so ns1 being down briefly
-during migration is tolerable.
+~~For ns1/ns2: migrate ns2 first (secondary), verify AXFR works, then migrate ns1.~~ Both ns1
+and ns2 migration complete. Zone transfer (AXFR) verified working between ns1 (primary) and
+ns2 (secondary).

## Phase 3: Stateful Host Migration
@@ -163,17 +163,19 @@ Host was already removed from flake.nix and VM destroyed. Configuration cleaned
Host configuration, services, and VM already removed.

-### pgdb1
+### pgdb1 (in progress)

-Only consumer was Open WebUI on gunter, which is being migrated to use local PostgreSQL.
+Only consumer was Open WebUI on gunter, which has been migrated to use local PostgreSQL.

-1. Verify Open WebUI on gunter is using local PostgreSQL (not pgdb1)
-2. Remove host configuration from `hosts/pgdb1/`
-3. Remove `services/postgres/` (only used by pgdb1)
-4. Remove from `flake.nix`
-5. Remove Vault AppRole from `terraform/vault/approle.tf`
-6. Destroy the VM in Proxmox
-7. Commit cleanup
+1. ~~Verify Open WebUI on gunter is using local PostgreSQL (not pgdb1)~~ ✓
+2. ~~Remove host configuration from `hosts/pgdb1/`~~ ✓
+3. ~~Remove `services/postgres/` (only used by pgdb1)~~ ✓
+4. ~~Remove from `flake.nix`~~ ✓
+5. ~~Remove Vault AppRole from `terraform/vault/approle.tf`~~ ✓
+6. Destroy the VM in Proxmox
+7. ~~Commit cleanup~~ ✓
+
+See `docs/plans/pgdb1-decommission.md` for detailed plan.

## Phase 5: Decommission ca Host ✓ COMPLETE
116 docs/plans/memory-issues-follow-up.md Normal file
@@ -0,0 +1,116 @@
# Memory Issues Follow-up

Tracking the zram change to verify it resolves OOM issues during nixos-upgrade on low-memory hosts.

## Background

On 2026-02-08, ns2 (2GB RAM) experienced an OOM kill during nixos-upgrade. The Nix evaluation process consumed ~1.6GB before being killed by the kernel. ns1 (manually increased to 4GB) succeeded with the same upgrade.

Root cause: 2GB RAM is insufficient for Nix flake evaluation without swap.

## Fix Applied

**Commit:** `1674b6a` - system: enable zram swap for all hosts

**Merged:** 2026-02-08 ~12:15 UTC

**Change:** Added `zramSwap.enable = true` to `system/zram.nix`, providing ~2GB compressed swap on all hosts.
|
## Timeline
|
||||||
|
|
||||||
|
| Time (UTC) | Event |
|
||||||
|
|------------|-------|
|
||||||
|
| 05:00:46 | ns2 nixos-upgrade OOM killed |
|
||||||
|
| 05:01:47 | `nixos_upgrade_failed` alert fired |
|
||||||
|
| 12:15 | zram commit merged to master |
|
||||||
|
| 12:19 | ns2 rebooted with zram enabled |
|
||||||
|
| 12:20 | ns1 rebooted (memory reduced to 2GB via tofu) |
|
||||||
|
|
||||||
## Hosts Affected

All 2GB VMs that run nixos-upgrade:

- ns1, ns2 (DNS)
- vault01
- testvm01, testvm02, testvm03
- kanidm01

## Metrics to Monitor

Check these in Grafana or via PromQL to verify the fix:

### Swap availability (should be ~2GB after upgrade)

```promql
node_memory_SwapTotal_bytes / 1024 / 1024
```

### Swap usage during upgrades

```promql
(node_memory_SwapTotal_bytes - node_memory_SwapFree_bytes) / 1024 / 1024
```

### Zswap compressed bytes

Note: `node_memory_Zswap_bytes` reports the zswap subsystem, not zram; zram activity shows up as regular swap usage above. This query is only useful if zswap is also enabled.

```promql
node_memory_Zswap_bytes / 1024 / 1024
```

### Upgrade failures (should be 0)

```promql
node_systemd_unit_state{name="nixos-upgrade.service", state="failed"}
```

### Memory available during upgrades

```promql
node_memory_MemAvailable_bytes / 1024 / 1024
```

## Verification Steps

After a few days (allow auto-upgrades to run on all hosts):

1. Check all hosts have swap enabled:

   ```promql
   node_memory_SwapTotal_bytes > 0
   ```

2. Check for any upgrade failures since the fix:

   ```promql
   count_over_time(ALERTS{alertname="nixos_upgrade_failed"}[7d])
   ```

3. Review whether any hosts used swap during upgrades (check historical graphs)
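For a quick ad-hoc check without going through Grafana, swap can also be read straight from `/proc/meminfo` on a host. A sketch (the parsing helper is hypothetical; the demo runs against a canned sample so the numbers are illustrative):

```bash
#!/usr/bin/env bash
# Report total swap in MiB by parsing /proc/meminfo-style input.
# On a live host, run: swap_total_mib /proc/meminfo
swap_total_mib() {
  awk '/^SwapTotal:/ { printf "%d\n", $2 / 1024 }' "$1"
}

# Demo against a canned sample (values in kB, as /proc/meminfo reports):
sample=$(mktemp)
cat > "$sample" <<'EOF'
MemTotal:        2097152 kB
MemAvailable:     524288 kB
SwapTotal:       2097152 kB
SwapFree:        2097152 kB
EOF

swap_total_mib "$sample"   # prints 2048 for the sample above
rm -f "$sample"
```

On a host, anything well below ~2048 suggests the zram device did not come up.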
## Success Criteria

- No `nixos_upgrade_failed` alerts due to OOM after 2026-02-08
- All hosts show ~2GB swap available
- Upgrades complete successfully on 2GB VMs

## Fallback Options

If zram is insufficient:

1. **Increase VM memory** - Update `terraform/vms.tf` to 4GB for affected hosts
2. **Enable memory ballooning** - Configure VMs with dynamic memory allocation (see below)
3. **Use remote builds** - Configure `nix.buildMachines` to offload evaluation
4. **Reduce flake size** - Split configurations to reduce evaluation memory

### Memory Ballooning

Proxmox supports memory ballooning, which allows VMs to dynamically grow/shrink their memory allocation based on demand. The balloon driver inside the guest communicates with the hypervisor to release or reclaim memory pages.

Configuration in `terraform/vms.tf`:

```hcl
memory  = 4096 # maximum memory
balloon = 2048 # minimum memory (shrinks to this when idle)
```

Pros:

- VMs get memory on-demand without reboots
- Better host memory utilization
- Solves upgrade OOM without permanently allocating 4GB

Cons:

- Requires the QEMU guest agent running in the guest
- Guest can experience memory pressure if the host is overcommitted

Ballooning and zram are complementary: ballooning provides headroom from the host, zram provides overflow within the guest.
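The zram headroom itself can be bounded with back-of-the-envelope arithmetic. A sketch, where the device-size percentage and the 2:1 compression ratio are assumptions (real ratios are workload dependent): the zram device exposes swap sized as a percentage of RAM, but the compressed pages live in RAM, so the usable total depends on how well pages compress.

```bash
#!/usr/bin/env bash
# Back-of-the-envelope zram headroom estimate (all values illustrative).
ram_mib=2048        # physical RAM of the VM
device_pct=100      # zram device size as % of RAM (assumed setting)
ratio=2             # assumed compression ratio

swap_mib=$(( ram_mib * device_pct / 100 ))             # swap space the kernel sees
backing_mib=$(( swap_mib / ratio ))                    # RAM consumed when swap is full
effective_mib=$(( ram_mib - backing_mib + swap_mib ))  # rough usable total

echo "swap device:   ${swap_mib} MiB"
echo "RAM cost full: ${backing_mib} MiB"
echo "effective:     ${effective_mib} MiB"
```

Under these assumptions a 2GB VM gains roughly 1GB of effective headroom, which matches why the ns2 evaluation (~1.6GB) now fits.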
219 docs/plans/monitoring-migration-victoriametrics.md Normal file
@@ -0,0 +1,219 @@
# Monitoring Stack Migration to VictoriaMetrics

## Overview

Migrate from Prometheus to VictoriaMetrics on a new host (monitoring02) to gain better compression and longer retention. Run in parallel with monitoring01 until validated, then switch over using a `monitoring` CNAME for a seamless transition.

## Current State

**monitoring01** (10.69.13.13):

- 4 CPU cores, 4GB RAM, 33GB disk
- Prometheus with 30-day retention (15s scrape interval)
- Alertmanager (routes to alerttonotify webhook)
- Grafana (dashboards, datasources)
- Loki (log aggregation from all hosts via Promtail)
- Tempo (distributed tracing)
- Pyroscope (continuous profiling)

**Hardcoded references to monitoring01:**

- `system/monitoring/logs.nix` - Promtail sends logs to `http://monitoring01.home.2rjus.net:3100`
- `hosts/template2/bootstrap.nix` - Bootstrap logs to Loki (keep as-is until decommission)
- `services/http-proxy/proxy.nix` - Caddy proxies Prometheus, Alertmanager, Grafana, Pyroscope, Pushgateway

**Auto-generated:**

- Prometheus scrape targets (from `lib/monitoring.nix` + `homelab.monitoring.scrapeTargets`)
- Node-exporter targets (from all hosts with static IPs)

## Decision: VictoriaMetrics

Per `docs/plans/long-term-metrics-storage.md`, VictoriaMetrics is the recommended starting point:

- Single-binary replacement for Prometheus
- 5-10x better compression (30 days could become 180+ days in the same space)
- Same PromQL query language (Grafana dashboards work unchanged)
- Same scrape config format (existing auto-generated configs work)

If multi-year retention with downsampling becomes necessary later, Thanos can be evaluated.
## Architecture

```
                 ┌─────────────────┐
                 │  monitoring02   │
                 │ VictoriaMetrics │
                 │ + Grafana       │
monitoring       │ + Loki          │
CNAME ──────────▶│ + Tempo         │
                 │ + Pyroscope     │
                 │ + Alertmanager  │
                 │   (vmalert)     │
                 └─────────────────┘
                          ▲
                          │ scrapes
          ┌───────────────┼───────────────┐
          │               │               │
     ┌────┴────┐     ┌────┴────┐     ┌────┴────┐
     │   ns1   │     │   ha1   │     │   ...   │
     │  :9100  │     │  :9100  │     │  :9100  │
     └─────────┘     └─────────┘     └─────────┘
```
## Implementation Plan

### Phase 1: Create monitoring02 Host

Use the `create-host` script, which handles `flake.nix` and `terraform/vms.tf` automatically.

1. **Run create-host**: `nix develop -c create-host monitoring02 10.69.13.24`
2. **Update VM resources** in `terraform/vms.tf`:
   - 4 cores (same as monitoring01)
   - 8GB RAM (double, for VictoriaMetrics headroom)
   - 100GB disk (for 3+ months retention with compression)
3. **Update host configuration**: Import monitoring services
4. **Create Vault AppRole**: Add to `terraform/vault/approle.tf`

### Phase 2: Set Up VictoriaMetrics Stack

Create a new service module at `services/monitoring/victoriametrics/` for testing alongside the existing Prometheus config. Once validated, it can replace the Prometheus module.

1. **VictoriaMetrics** (port 8428):
   - `services.victoriametrics.enable = true`
   - `services.victoriametrics.retentionPeriod = "3m"` (3 months, increase later based on disk usage)
   - Migrate scrape configs via `prometheusConfig`
   - Use native push support (replaces Pushgateway)

2. **vmalert** for alerting rules:
   - `services.vmalert.enable = true`
   - Point to VictoriaMetrics for metrics evaluation
   - Keep rules in a separate `rules.yml` file (same format as Prometheus)
   - No receiver configured during parallel operation (prevents duplicate alerts)

3. **Alertmanager** (port 9093):
   - Keep existing configuration (alerttonotify webhook routing)
   - Only enable the receiver after cutover from monitoring01

4. **Loki** (port 3100):
   - Same configuration as current

5. **Grafana** (port 3000):
   - Define dashboards declaratively via NixOS options (not imported from monitoring01)
   - Reference existing dashboards on monitoring01 for content inspiration
   - Configure VictoriaMetrics datasource (port 8428)
   - Configure Loki datasource

6. **Tempo** (ports 3200, 3201):
   - Same configuration

7. **Pyroscope** (port 4040):
   - Same Docker-based deployment

### Phase 3: Parallel Operation

Run both monitoring01 and monitoring02 simultaneously:

1. **Dual scraping**: Both hosts scrape the same targets
   - Validates that VictoriaMetrics is collecting data correctly
2. **Dual log shipping**: Configure Promtail to send logs to both Loki instances
   - Add a second client in `system/monitoring/logs.nix` pointing to monitoring02
3. **Validate dashboards**: Access Grafana on monitoring02, verify dashboards work
4. **Validate alerts**: Verify vmalert evaluates rules correctly (no receiver = no notifications)
5. **Compare resource usage**: Monitor disk/memory consumption on both hosts
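One way to sanity-check dual scraping is to compare total active series on both backends via the standard `/api/v1/query` endpoint, which VictoriaMetrics also implements. A sketch: the crude `sed` extraction and the commented-out live queries are illustrative, and the demo runs against a canned response.

```bash
#!/usr/bin/env bash
# Extract the scalar value from a Prometheus-style instant-query response.
# Crude sed-based parsing to avoid a jq dependency; assumes a single result.
extract_value() {
  sed -n 's/.*"value":\[[^,]*,"\([0-9.]*\)"\].*/\1/p'
}

# Demo on a canned response shaped like the instant-query API output:
resp='{"status":"success","data":{"resultType":"vector","result":[{"metric":{},"value":[1739000000,"48231"]}]}}'
echo "$resp" | extract_value   # prints 48231

# Against the live hosts (illustrative URLs):
# prom=$(curl -s 'http://monitoring01.home.2rjus.net:9090/api/v1/query' \
#   --data-urlencode 'query=count({__name__!=""})' | extract_value)
# vm=$(curl -s 'http://monitoring02.home.2rjus.net:8428/api/v1/query' \
#   --data-urlencode 'query=count({__name__!=""})' | extract_value)
# echo "prometheus=$prom victoriametrics=$vm"
```

The two counts will never match exactly (scrapes are not synchronized), but they should be in the same ballpark.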
### Phase 4: Add monitoring CNAME

Add a CNAME to monitoring02 once validated:

```nix
# hosts/monitoring02/configuration.nix
homelab.dns.cnames = [ "monitoring" ];
```

This creates `monitoring.home.2rjus.net` pointing to monitoring02.

### Phase 5: Update References

Update hardcoded references to use the CNAME:

1. **system/monitoring/logs.nix**:
   - Remove dual-shipping, point only to `http://monitoring.home.2rjus.net:3100`
2. **services/http-proxy/proxy.nix**: Update reverse proxy backends:
   - prometheus.home.2rjus.net -> monitoring.home.2rjus.net:8428
   - alertmanager.home.2rjus.net -> monitoring.home.2rjus.net:9093
   - grafana.home.2rjus.net -> monitoring.home.2rjus.net:3000
   - pyroscope.home.2rjus.net -> monitoring.home.2rjus.net:4040

Note: `hosts/template2/bootstrap.nix` stays pointed at monitoring01 until decommission.

### Phase 6: Enable Alerting

Once ready to cut over:

1. Enable the Alertmanager receiver on monitoring02
2. Verify test alerts route correctly

### Phase 7: Cutover and Decommission

1. **Stop monitoring01**: Prevent duplicate alerts during the transition
2. **Update bootstrap.nix**: Point to `monitoring.home.2rjus.net`
3. **Verify all targets scraped**: Check the VictoriaMetrics UI
4. **Verify logs flowing**: Check Loki on monitoring02
5. **Decommission monitoring01**:
   - Remove from flake.nix
   - Remove host configuration
   - Destroy VM in Proxmox
   - Remove from terraform state

## Open Questions

- [ ] What disk size for monitoring02? 100GB should allow 3+ months with VictoriaMetrics compression
- [ ] Which dashboards to recreate declaratively? (Review monitoring01 Grafana for the current set)
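The disk-size question can be bounded with rough arithmetic. A sketch: the series count and the ~1 byte/sample figure are assumptions to be replaced with observed numbers from the parallel run.

```bash
#!/usr/bin/env bash
# Rough retention estimate for VictoriaMetrics on a given disk.
active_series=50000    # assumed total active series across all targets
scrape_interval=15     # seconds (matches current config)
bytes_per_sample=1     # assumed compressed size per sample
disk_gib=100

samples_per_day=$(( active_series * 86400 / scrape_interval ))
bytes_per_day=$(( samples_per_day * bytes_per_sample ))
days=$(( disk_gib * 1024 * 1024 * 1024 / bytes_per_day ))

echo "samples/day: ${samples_per_day}"
echo "estimated retention: ${days} days"
```

Even with these conservative assumptions, 100GB comfortably covers the 3-month target and leaves room to raise `retentionPeriod` later.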
## VictoriaMetrics Service Configuration

Example NixOS configuration for monitoring02:

```nix
# VictoriaMetrics replaces Prometheus
services.victoriametrics = {
  enable = true;
  retentionPeriod = "3m"; # 3 months, increase based on disk usage
  prometheusConfig = {
    global.scrape_interval = "15s";
    scrape_configs = [
      # Auto-generated node-exporter targets
      # Service-specific scrape targets
      # External targets
    ];
  };
};

# vmalert for alerting rules (no receiver during parallel operation)
services.vmalert = {
  enable = true;
  datasource.url = "http://localhost:8428";
  # notifier.alertmanager.url = "http://localhost:9093"; # Enable after cutover
  rule = [ ./rules.yml ];
};
```

## Rollback Plan

If issues arise after cutover:

1. Move the `monitoring` CNAME back to monitoring01
2. Restart monitoring01 services
3. Revert the Promtail config to point only to monitoring01
4. Revert the http-proxy backends

## Notes

- VictoriaMetrics uses port 8428 vs Prometheus 9090
- PromQL compatibility is excellent
- VictoriaMetrics native push replaces Pushgateway (remove from http-proxy if not needed)
- monitoring02 deployed via OpenTofu using the `create-host` script
- Grafana dashboards defined declaratively via NixOS, not imported from monitoring01 state
113 docs/plans/pgdb1-decommission.md Normal file
@@ -0,0 +1,113 @@
# pgdb1 Decommissioning Plan

## Overview

Decommission the pgdb1 PostgreSQL server. The only consumer was Open WebUI on gunter, which has been migrated to use a local PostgreSQL instance.

## Pre-flight Verification

Before proceeding, verify that gunter is no longer using pgdb1:

1. Check that Open WebUI on gunter is configured for local PostgreSQL (not 10.69.13.16)
2. Optionally, check pgdb1 for recent connection activity:

   ```bash
   ssh pgdb1 'sudo -u postgres psql -c "SELECT * FROM pg_stat_activity WHERE datname IS NOT NULL;"'
   ```

## Files to Remove

### Host Configuration

- `hosts/pgdb1/default.nix`
- `hosts/pgdb1/configuration.nix`
- `hosts/pgdb1/hardware-configuration.nix`
- `hosts/pgdb1/` (directory)

### Service Module

- `services/postgres/postgres.nix`
- `services/postgres/default.nix`
- `services/postgres/` (directory)

Note: This service module is only used by pgdb1, so it can be removed entirely.

### Flake Entry

Remove from `flake.nix` (lines 131-138):

```nix
pgdb1 = nixpkgs.lib.nixosSystem {
  inherit system;
  specialArgs = {
    inherit inputs self;
  };
  modules = commonModules ++ [
    ./hosts/pgdb1
  ];
};
```

### Vault AppRole

Remove from `terraform/vault/approle.tf` (lines 69-73):

```hcl
"pgdb1" = {
  paths = [
    "secret/data/hosts/pgdb1/*",
  ]
}
```

### Monitoring Rules

Remove the `postgres_down` alert from `services/monitoring/rules.yml` (lines 359-365):

```yaml
- name: postgres_rules
  rules:
    - alert: postgres_down
      expr: node_systemd_unit_state{instance="pgdb1.home.2rjus.net:9100", name="postgresql.service", state="active"} == 0
      for: 5m
      labels:
        severity: critical
```

### Utility Scripts

Delete `rebuild-all.sh` entirely (obsolete script).

## Execution Steps

### Phase 1: Verification

- [ ] Confirm Open WebUI on gunter uses local PostgreSQL
- [ ] Verify no active connections to pgdb1

### Phase 2: Code Cleanup

- [ ] Create feature branch: `git checkout -b decommission-pgdb1`
- [ ] Remove `hosts/pgdb1/` directory
- [ ] Remove `services/postgres/` directory
- [ ] Remove pgdb1 entry from `flake.nix`
- [ ] Remove postgres alert from `services/monitoring/rules.yml`
- [ ] Delete `rebuild-all.sh` (obsolete)
- [ ] Run `nix flake check` to verify no broken references
- [ ] Commit changes
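Before committing, a blunt grep catches references the checklist might miss. A sketch (the helper and its exclusions are illustrative; the demo runs on a scratch tree rather than the repo):

```bash
#!/usr/bin/env bash
# List files under a directory that still mention pgdb1.
# Run against the repo root; .git and this plan itself are excluded.
find_leftovers() {
  grep -rl 'pgdb1' "$1" \
    --exclude-dir=.git \
    --exclude='pgdb1-decommission.md' || true
}

# Demo on a scratch tree:
tmp=$(mktemp -d)
echo 'pgdb1 = nixpkgs.lib.nixosSystem {' > "$tmp/flake.nix"
echo 'unrelated' > "$tmp/other.nix"
find_leftovers "$tmp"   # prints only the file still mentioning pgdb1
rm -rf "$tmp"
```

An empty result before the final commit means the tree is clean.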
### Phase 3: Terraform Cleanup

- [ ] Remove pgdb1 from `terraform/vault/approle.tf`
- [ ] Run `tofu plan` in `terraform/vault/` to preview changes
- [ ] Run `tofu apply` to remove the AppRole
- [ ] Commit terraform changes

### Phase 4: Infrastructure Cleanup

- [ ] Shut down pgdb1 VM in Proxmox
- [ ] Delete the VM from Proxmox
- [ ] (Optional) Remove any DNS entries if not auto-generated

### Phase 5: Finalize

- [ ] Merge feature branch to master
- [ ] Trigger auto-upgrade on DNS servers (ns1, ns2) to remove the DNS entry
- [ ] Move this plan to `docs/plans/completed/`

## Rollback

If issues arise after decommissioning:

1. The VM can be recreated from the template using git history
2. Database data would need to be restored from backup (if any exists)

## Notes

- pgdb1 IP: 10.69.13.16
- The postgres service allowed connections from gunter (10.69.30.105)
- No restic backup was configured for this host
224 docs/plans/security-hardening.md Normal file
@@ -0,0 +1,224 @@
# Security Hardening Plan

## Overview

Address security gaps identified in the infrastructure review. Focus areas: SSH hardening, network security, logging improvements, and secrets management.

## Current State

- SSH allows password auth and unrestricted root login (`system/sshd.nix`)
- Firewall disabled on all hosts (`networking.firewall.enable = false`)
- Promtail ships logs over HTTP to Loki
- Loki has no authentication (`auth_enabled = false`)
- AppRole secret-IDs never expire (`secret_id_ttl = 0`)
- Vault TLS verification disabled by default (`skipTlsVerify = true`)
- Audit logging exists (`common/ssh-audit.nix`) but is not applied globally
- Alert rules focus on availability; no security event detection

## Priority Matrix

| Issue | Severity | Effort | Priority |
|-------|----------|--------|----------|
| SSH password auth | High | Low | **P1** |
| Firewall disabled | High | Medium | **P1** |
| Promtail HTTP (no TLS) | High | Medium | **P2** |
| No security alerting | Medium | Low | **P2** |
| Audit logging not global | Low | Low | **P2** |
| Loki no auth | Medium | Medium | **P3** |
| Secret-ID TTL | Medium | Medium | **P3** |
| Vault skipTlsVerify | Medium | Low | **P3** |

## Phase 1: Quick Wins (P1)

### 1.1 SSH Hardening

Edit `system/sshd.nix`:

```nix
services.openssh = {
  enable = true;
  settings = {
    PermitRootLogin = "prohibit-password"; # Key-only root login
    PasswordAuthentication = false;
    KbdInteractiveAuthentication = false;
  };
};
```

**Prerequisite:** Verify all hosts have SSH keys deployed for root.

### 1.2 Enable Firewall

Create `system/firewall.nix` with a default-deny policy:

```nix
{ ... }: {
  networking.firewall.enable = true;

  # Use openssh's built-in firewall integration
  services.openssh.openFirewall = true;
}
```

**Useful firewall options:**

| Option | Description |
|--------|-------------|
| `networking.firewall.trustedInterfaces` | Accept all traffic from these interfaces (e.g., `[ "lo" ]`) |
| `networking.firewall.interfaces.<name>.allowedTCPPorts` | Per-interface port rules |
| `networking.firewall.extraInputRules` | Custom nftables rules (for complex filtering) |

**Network range restrictions:** Consider restricting SSH to the infrastructure subnet (`10.69.13.0/24`) using `extraInputRules` for defense in depth. However, this adds complexity and may not be necessary given the trusted network model.

#### Per-Interface Rules (http-proxy WireGuard)

The `http-proxy` host has a WireGuard interface (`wg0`) that may need different rules than the LAN interface. Use `networking.firewall.interfaces` to apply per-interface policies:

```nix
# Example: http-proxy with different rules per interface
networking.firewall = {
  enable = true;

  # Default: only SSH (via openFirewall)
  allowedTCPPorts = [ ];

  # LAN interface: allow HTTP/HTTPS
  interfaces.ens18 = {
    allowedTCPPorts = [ 80 443 ];
  };

  # WireGuard interface: restrict to specific services or trust fully
  interfaces.wg0 = {
    allowedTCPPorts = [ 80 443 ];
    # Or use trustedInterfaces = [ "wg0" ] if fully trusted
  };
};
```

**TODO:** Investigate current WireGuard usage on http-proxy to determine appropriate rules.

Then, per host, open the required ports:

| Host | Additional Ports |
|------|------------------|
| ns1/ns2 | 53 (TCP/UDP) |
| vault01 | 8200 |
| monitoring01 | 3100, 9090, 3000, 9093 |
| http-proxy | 80, 443 |
| nats1 | 4222 |
| ha1 | 1883, 8123 |
| jelly01 | 8096 |
| nix-cache01 | 5000 |
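The table can double as data for an audit script that feeds an external port scan. A sketch (the host-to-port table mirrors the one above, with 22 added everywhere on the assumption that `openFirewall` opens SSH on every host; the lookup helper is hypothetical):

```bash
#!/usr/bin/env bash
# Per-host allowed-port table; useful as input for an external scan
# (e.g. nmap) to confirm only the expected ports answer after enabling
# the firewall.
port_table() {
cat <<'EOF'
ns1 22 53
ns2 22 53
vault01 22 8200
monitoring01 22 3100 9090 3000 9093
http-proxy 22 80 443
nats1 22 4222
ha1 22 1883 8123
jelly01 22 8096
nix-cache01 22 5000
EOF
}

# Ports expected open on a given host:
ports_for() {
  port_table | awk -v h="$1" '$1 == h { $1 = ""; sub(/^ /, ""); print }'
}

ports_for vault01   # prints: 22 8200
```

Keeping this list next to the plan makes the "document firewall ports" step in the implementation order checkable rather than aspirational.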
## Phase 2: Logging & Detection (P2)

### 2.1 Enable TLS for Promtail → Loki

Update `system/monitoring/logs.nix`:

```nix
clients = [{
  url = "https://monitoring01.home.2rjus.net:3100/loki/api/v1/push";
  tls_config = {
    ca_file = "/etc/ssl/certs/homelab-root-ca.pem";
  };
}];
```

Requires:

- Configuring Loki with a TLS certificate (use internal ACME)
- Ensuring all hosts trust the root CA (already done via `system/pki/root-ca.nix`)

### 2.2 Security Alert Rules

Add to `services/monitoring/rules.yml`:

```yaml
- name: security_rules
  rules:
    - alert: ssh_auth_failures
      expr: increase(node_logind_sessions_total[5m]) > 20
      for: 0m
      labels:
        severity: warning
      annotations:
        summary: "Unusual login activity on {{ $labels.instance }}"

    - alert: vault_secret_fetch_failure
      expr: increase(vault_secret_failures[5m]) > 5
      for: 0m
      labels:
        severity: warning
      annotations:
        summary: "Vault secret fetch failures on {{ $labels.instance }}"
```

Also add Loki-based alerts for:

- Failed SSH attempts: `{job="systemd-journal"} |= "Failed password"`
- sudo usage: `{job="systemd-journal"} |= "sudo"`

### 2.3 Global Audit Logging

Add the `./common/ssh-audit.nix` import to `system/default.nix`:

```nix
imports = [
  # ... existing imports
  ../common/ssh-audit.nix
];
```

## Phase 3: Defense in Depth (P3)

### 3.1 Loki Authentication

Options:

1. **Basic auth via reverse proxy** - Put Loki behind Caddy with auth
2. **Loki multi-tenancy** - Enable `auth_enabled = true` and use tenant IDs
3. **Network isolation** - Bind Loki only to localhost, expose via an authenticated proxy

Recommendation: Option 1 (reverse proxy) is the simplest for a homelab.

### 3.2 AppRole Secret Rotation

Update `terraform/vault/approle.tf`:

```hcl
secret_id_ttl = 2592000 # 30 days
```

Add documentation for a manual rotation procedure, or implement automated rotation via the existing `restartTrigger` mechanism in `vault-secrets.nix`.

### 3.3 Enable Vault TLS Verification

Change the default in `system/vault-secrets.nix`:

```nix
skipTlsVerify = mkOption {
  type = types.bool;
  default = false; # Changed from true
};
```

**Prerequisite:** Verify all hosts trust the internal CA that signed the Vault certificate.

## Implementation Order

1. **Test on test-tier first** - Deploy phases 1-2 to testvm01/02/03
2. **Validate SSH access** - Ensure key-based login works before disabling passwords
3. **Document firewall ports** - Create a reference of ports per host before enabling
4. **Phase prod rollout** - Deploy to prod hosts one at a time, verifying each

## Open Questions

- [ ] Do all hosts have SSH keys configured for root access?
- [ ] Should firewall rules be per-host, or use a central definition with roles?
- [ ] Should Loki authentication use the existing Kanidm setup?

**Resolved:** Password-based SSH access for recovery is not required. Most hosts have console access through Proxmox or physical access, which provides an out-of-band recovery path if SSH keys fail.

## Notes

- Firewall changes are the highest risk - test thoroughly on the test tier
- SSH hardening must not lock out access - verify keys first
- Consider creating a "break glass" procedure for emergency access if keys fail
267 docs/user-management.md Normal file
@@ -0,0 +1,267 @@
# User Management with Kanidm

Central authentication for the homelab using Kanidm.

## Overview

- **Server**: kanidm01.home.2rjus.net (auth.home.2rjus.net)
- **WebUI**: https://auth.home.2rjus.net
- **LDAPS**: port 636

## CLI Setup

The `kanidm` CLI is available in the devshell:

```bash
nix develop

# Login as idm_admin
kanidm login --name idm_admin --url https://auth.home.2rjus.net
```

## User Management

POSIX users are managed imperatively via the `kanidm` CLI. This allows setting all attributes (including the UNIX password) in one workflow.

### Creating a POSIX User

```bash
# Create the person
kanidm person create <username> "<Display Name>"

# Add to groups
kanidm group add-members ssh-users <username>

# Enable POSIX (UID is auto-assigned)
kanidm person posix set <username>

# Set UNIX password (required for SSH login, min 10 characters)
kanidm person posix set-password <username>

# Optionally set login shell
kanidm person posix set <username> --shell /bin/zsh
```

### Example: Full User Creation

```bash
kanidm person create testuser "Test User"
kanidm group add-members ssh-users testuser
kanidm person posix set testuser
kanidm person posix set-password testuser
kanidm person get testuser
```

After creation, verify on a client host:

```bash
getent passwd testuser
ssh testuser@testvm01.home.2rjus.net
```

### Viewing User Details

```bash
kanidm person get <username>
```

### Removing a User

```bash
kanidm person delete <username>
```

## Group Management

Groups for POSIX access are also managed via the CLI.

### Creating a POSIX Group

```bash
# Create the group
kanidm group create <group-name>

# Enable POSIX with a specific GID
kanidm group posix set <group-name> --gidnumber <gid>
```

### Adding Members

```bash
kanidm group add-members <group-name> <username>
```

### Viewing Group Details

```bash
kanidm group get <group-name>
kanidm group list-members <group-name>
```

### Example: Full Group Creation

```bash
kanidm group create testgroup
kanidm group posix set testgroup --gidnumber 68010
kanidm group add-members testgroup testuser
kanidm group get testgroup
```

After creation, verify on a client host:

```bash
getent group testgroup
```

### Current Groups

| Group | GID | Purpose |
|-------|-----|---------|
| ssh-users | 68000 | SSH login access |
| admins | 68001 | Administrative access |
| users | 68002 | General users |

### UID/GID Allocation

Kanidm auto-assigns UIDs/GIDs from its configured range. For manually assigned GIDs:

| Range | Purpose |
|-------|---------|
| 65,536+ | Users (auto-assigned) |
| 68,000 - 68,999 | Groups (manually assigned) |
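A tiny guard for the manual range can catch typos before a GID is handed to `kanidm group posix set`. A sketch (the helper is hypothetical; the bounds mirror the allocation table above):

```bash
#!/usr/bin/env bash
# Check that a manually assigned GID falls in the 68000-68999 group range.
valid_group_gid() {
  [ "$1" -ge 68000 ] && [ "$1" -le 68999 ]
}

valid_group_gid 68010 && echo "68010 ok"        # in the manual range
valid_group_gid 65536 || echo "65536 rejected"  # auto-assign range, not manual
```

Wiring this into a wrapper around the group-creation commands keeps the allocation table authoritative.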
|
## PAM/NSS Client Configuration

Enable central authentication on a host:

```nix
homelab.kanidm.enable = true;
```

This configures:

- `services.kanidm.enablePam = true`
- Client connection to auth.home.2rjus.net
- Login authorization for the `ssh-users` group
- Short usernames (`torjus` instead of `torjus@home.2rjus.net`)
- Home directory symlinks (`/home/torjus` → UUID-based directory)
### Enabled Hosts

- testvm01, testvm02, testvm03 (test tier)

### Options

```nix
homelab.kanidm = {
  enable = true;
  server = "https://auth.home.2rjus.net"; # default
  allowedLoginGroups = [ "ssh-users" ]; # default
};
```
### Home Directories

Home directories use UUID-based paths for stability (so renaming a user doesn't require moving their home directory). Symlinks provide convenient access:

```
/home/torjus -> /home/e4f4c56c-4aee-4c20-846f-90cb69807733
```

The symlinks are created by `kanidm-unixd-tasks` on first login.
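To confirm the symlink exists after a user's first login, one possible check (the username is an example):

```shell
# Show the convenience symlink and the UUID directory it points at
ls -ld /home/torjus
readlink /home/torjus
```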

## Testing

### Verify NSS Resolution

```bash
# Check user resolution
getent passwd <username>

# Check group resolution
getent group <group-name>
```

### Test SSH Login

```bash
ssh <username>@<hostname>.home.2rjus.net
```
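For a quick scripted check, batch mode makes ssh fail immediately instead of prompting when key or PAM auth is broken (username and hostname are placeholders):

```shell
# Exits non-zero if authentication does not succeed non-interactively
ssh -o BatchMode=yes <username>@<hostname>.home.2rjus.net true && echo "login OK"
```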

## Troubleshooting

### "PAM user mismatch" error

SSH fails with "fatal: PAM user mismatch" in the logs. This happens when Kanidm returns usernames in SPN format (`torjus@home.2rjus.net`) but SSH expects short names (`torjus`).

**Solution**: Configure `uid_attr_map = "name"` in `unixSettings` (already set in our module).

Check the current format:

```bash
getent passwd torjus
# Should show: torjus:x:65536:...
# NOT: torjus@home.2rjus.net:x:65536:...
```
### User resolves but SSH fails immediately

The user's login group (e.g., `ssh-users`) likely doesn't have POSIX enabled:

```bash
# Check if group has POSIX
getent group ssh-users

# If empty, enable POSIX on the server
kanidm group posix set ssh-users --gidnumber 68000
```
### User doesn't resolve via getent

1. Check kanidm-unixd service is running:

   ```bash
   systemctl status kanidm-unixd
   ```

2. Check unixd can reach server:

   ```bash
   kanidm-unix status
   # Should show: system: online, Kanidm: online
   ```

3. Check client can reach server:

   ```bash
   curl -s https://auth.home.2rjus.net/status
   ```

4. Check user has POSIX enabled on server:

   ```bash
   kanidm person get <username>
   ```

5. Restart nscd to clear stale cache:

   ```bash
   systemctl restart nscd
   ```

6. Invalidate kanidm cache:

   ```bash
   kanidm-unix cache-invalidate
   ```
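The checks above can be strung together into a quick client-side triage script (a sketch; the script name and argument handling are illustrative, and it assumes the kanidm-unixd client tools from this module are installed):

```shell
#!/usr/bin/env bash
# Quick triage for "user doesn't resolve" issues on a client host
set -u
user="${1:?usage: kanidm-triage <username>}"

systemctl is-active kanidm-unixd || echo "kanidm-unixd is not running"
kanidm-unix status || echo "unixd cannot reach the server"
curl -fs https://auth.home.2rjus.net/status > /dev/null || echo "server unreachable"
getent passwd "$user" || echo "$user does not resolve via NSS"
```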
### Changes not taking effect after deployment

NixOS uses nsncd (a Rust reimplementation of nscd) for NSS caching. After deploying kanidm-unixd config changes, you may need to restart both services:

```bash
systemctl restart kanidm-unixd
systemctl restart nscd
```
### Test PAM authentication directly

Use the kanidm-unix CLI to test PAM auth without SSH:

```bash
kanidm-unix auth-test --name <username>
```
**flake.nix** (37 lines changed)

```diff
@@ -65,15 +65,6 @@
     in
     {
       nixosConfigurations = {
-        ns1 = nixpkgs.lib.nixosSystem {
-          inherit system;
-          specialArgs = {
-            inherit inputs self;
-          };
-          modules = commonModules ++ [
-            ./hosts/ns1
-          ];
-        };
         ha1 = nixpkgs.lib.nixosSystem {
           inherit system;
           specialArgs = {
@@ -128,15 +119,6 @@
             ./hosts/nix-cache01
           ];
         };
-        pgdb1 = nixpkgs.lib.nixosSystem {
-          inherit system;
-          specialArgs = {
-            inherit inputs self;
-          };
-          modules = commonModules ++ [
-            ./hosts/pgdb1
-          ];
-        };
         nats1 = nixpkgs.lib.nixosSystem {
           inherit system;
           specialArgs = {
@@ -191,6 +173,24 @@
             ./hosts/ns2
           ];
         };
+        ns1 = nixpkgs.lib.nixosSystem {
+          inherit system;
+          specialArgs = {
+            inherit inputs self;
+          };
+          modules = commonModules ++ [
+            ./hosts/ns1
+          ];
+        };
+        kanidm01 = nixpkgs.lib.nixosSystem {
+          inherit system;
+          specialArgs = {
+            inherit inputs self;
+          };
+          modules = commonModules ++ [
+            ./hosts/kanidm01
+          ];
+        };
       };
       packages = forAllSystems (
         { pkgs }:
@@ -207,6 +207,7 @@
           pkgs.ansible
           pkgs.opentofu
           pkgs.openbao
+          pkgs.kanidm_1_8
           (pkgs.callPackage ./scripts/create-host { })
           homelab-deploy.packages.${pkgs.system}.default
         ];
```

```diff
@@ -64,9 +64,5 @@
   vault.enable = true;
   homelab.deploy.enable = true;
 
-  zramSwap = {
-    enable = true;
-  };
-
   system.stateVersion = "23.11"; # Did you read the comment?
 }
```

```diff
@@ -1,25 +1,39 @@
 {
+  config,
+  lib,
   pkgs,
   ...
 }:
 
 {
   imports = [
-    ./hardware-configuration.nix
+    ../template2/hardware-configuration.nix
 
     ../../system
     ../../common/vm
+    ../../services/kanidm
   ];
 
-  nixpkgs.config.allowUnfree = true;
-  # Use the systemd-boot EFI boot loader.
-  boot.loader.grub = {
-    enable = true;
-    device = "/dev/sda";
-    configurationLimit = 3;
+  # Host metadata
+  homelab.host = {
+    tier = "test";
+    role = "auth";
   };
 
-  networking.hostName = "pgdb1";
+  # DNS CNAME for auth.home.2rjus.net
+  homelab.dns.cnames = [ "auth" ];
+
+  # Enable Vault integration
+  vault.enable = true;
+
+  # Enable remote deployment via NATS
+  homelab.deploy.enable = true;
+
+  nixpkgs.config.allowUnfree = true;
+  boot.loader.grub.enable = true;
+  boot.loader.grub.device = "/dev/vda";
+
+  networking.hostName = "kanidm01";
   networking.domain = "home.2rjus.net";
   networking.useNetworkd = true;
   networking.useDHCP = false;
@@ -33,7 +47,7 @@
   systemd.network.networks."ens18" = {
     matchConfig.Name = "ens18";
     address = [
-      "10.69.13.16/24"
+      "10.69.13.23/24"
     ];
     routes = [
       { Gateway = "10.69.13.1"; }
@@ -59,8 +73,5 @@
   # Or disable the firewall altogether.
   networking.firewall.enable = false;
 
-  vault.enable = true;
-  homelab.deploy.enable = true;
-
-  system.stateVersion = "23.11"; # Did you read the comment?
+  system.stateVersion = "25.11"; # Did you read the comment?
 }
```

```diff
@@ -1,7 +1,5 @@
-{ ... }:
-{
+{ ... }: {
   imports = [
     ./configuration.nix
-    ../../services/postgres
   ];
 }
```

```diff
@@ -4,6 +4,5 @@
     ./configuration.nix
     ../../services/nix-cache
     ../../services/actions-runner
-    ./zram.nix
   ];
 }
```

```diff
@@ -1,6 +0,0 @@
-{ ... }:
-{
-  zramSwap = {
-    enable = true;
-  };
-}
```

```diff
@@ -7,23 +7,38 @@
 
 {
   imports = [
-    ./hardware-configuration.nix
+    ../template2/hardware-configuration.nix
 
     ../../system
+    ../../common/vm
+
+    # DNS services
     ../../services/ns/master-authorative.nix
     ../../services/ns/resolver.nix
-    ../../common/vm
   ];
 
+  # Host metadata
+  homelab.host = {
+    tier = "prod";
+    role = "dns";
+    labels.dns_role = "primary";
+  };
+
+  # Enable Vault integration
+  vault.enable = true;
+
+  # Enable remote deployment via NATS
+  homelab.deploy.enable = true;
+
   nixpkgs.config.allowUnfree = true;
-  # Use the systemd-boot EFI boot loader.
   boot.loader.grub.enable = true;
-  boot.loader.grub.device = "/dev/sda";
+  boot.loader.grub.device = "/dev/vda";
 
   networking.hostName = "ns1";
   networking.domain = "home.2rjus.net";
   networking.useNetworkd = true;
   networking.useDHCP = false;
+  # Disable resolved - conflicts with Unbound resolver
   services.resolved.enable = false;
   networking.nameservers = [
     "10.69.13.5"
@@ -47,14 +62,6 @@
     "nix-command"
     "flakes"
   ];
-  vault.enable = true;
-  homelab.deploy.enable = true;
-
-  homelab.host = {
-    role = "dns";
-    labels.dns_role = "primary";
-  };
-
   nix.settings.tarball-ttl = 0;
   environment.systemPackages = with pkgs; [
     vim
@@ -68,5 +75,5 @@
   # Or disable the firewall altogether.
   networking.firewall.enable = false;
 
-  system.stateVersion = "23.11"; # Did you read the comment?
+  system.stateVersion = "25.11"; # Did you read the comment?
 }
```

```diff
@@ -2,4 +2,4 @@
   imports = [
     ./configuration.nix
   ];
 }
```

```diff
@@ -1,36 +0,0 @@
-{ config, lib, pkgs, modulesPath, ... }:
-
-{
-  imports =
-    [
-      (modulesPath + "/profiles/qemu-guest.nix")
-    ];
-
-  boot.initrd.availableKernelModules = [ "ata_piix" "uhci_hcd" "virtio_pci" "virtio_scsi" "sd_mod" "sr_mod" ];
-  boot.initrd.kernelModules = [ ];
-  # boot.kernelModules = [ ];
-  # boot.extraModulePackages = [ ];
-
-  fileSystems."/" =
-    {
-      device = "/dev/disk/by-uuid/6889aba9-61ed-4687-ab10-e5cf4017ac8d";
-      fsType = "xfs";
-    };
-
-  fileSystems."/boot" =
-    {
-      device = "/dev/disk/by-uuid/BC07-3B7A";
-      fsType = "vfat";
-    };
-
-  swapDevices =
-    [{ device = "/dev/disk/by-uuid/64e5757b-6625-4dd2-aa2a-66ca93444d23"; }];
-
-  # Enables DHCP on each ethernet and wireless interface. In case of scripted networking
-  # (the default) this is the recommended approach. When using systemd-networkd it's
-  # still possible to use this option, but it's recommended to use it in conjunction
-  # with explicit per-interface declarations with `networking.interfaces.<interface>.useDHCP`.
-  # networking.interfaces.ens18.useDHCP = lib.mkDefault true;
-
-  nixpkgs.hostPlatform = lib.mkDefault "x86_64-linux";
-}
```

```diff
@@ -1,42 +0,0 @@
-{
-  config,
-  lib,
-  pkgs,
-  modulesPath,
-  ...
-}:
-
-{
-  imports = [
-    (modulesPath + "/profiles/qemu-guest.nix")
-  ];
-  boot.initrd.availableKernelModules = [
-    "ata_piix"
-    "uhci_hcd"
-    "virtio_pci"
-    "virtio_scsi"
-    "sd_mod"
-    "sr_mod"
-  ];
-  boot.initrd.kernelModules = [ "dm-snapshot" ];
-  boot.kernelModules = [
-    "ptp_kvm"
-  ];
-  boot.extraModulePackages = [ ];
-
-  fileSystems."/" = {
-    device = "/dev/disk/by-label/root";
-    fsType = "xfs";
-  };
-
-  swapDevices = [ { device = "/dev/disk/by-label/swap"; } ];
-
-  # Enables DHCP on each ethernet and wireless interface. In case of scripted networking
-  # (the default) this is the recommended approach. When using systemd-networkd it's
-  # still possible to use this option, but it's recommended to use it in conjunction
-  # with explicit per-interface declarations with `networking.interfaces.<interface>.useDHCP`.
-  networking.useDHCP = lib.mkDefault true;
-  # networking.interfaces.ens18.useDHCP = lib.mkDefault true;
-
-  nixpkgs.hostPlatform = lib.mkDefault "x86_64-linux";
-}
```

```diff
@@ -58,6 +58,14 @@
     "flakes"
   ];
   nix.settings.tarball-ttl = 0;
+  nix.settings.substituters = [
+    "https://nix-cache.home.2rjus.net"
+    "https://cache.nixos.org"
+  ];
+  nix.settings.trusted-public-keys = [
+    "nix-cache.home.2rjus.net-1:2kowZOG6pvhoK4AHVO3alBlvcghH20wchzoR0V86UWI="
+    "cache.nixos.org-1:6NCHdD59X431o0gWypbMrAURkbJ16ZPMQFGspcDShjY="
+  ];
   environment.systemPackages = with pkgs; [
     age
     vim
@@ -71,5 +79,8 @@
   # Or disable the firewall altogether.
   networking.firewall.enable = false;
 
+  # Compressed swap in RAM - prevents OOM during bootstrap nixos-rebuild
+  zramSwap.enable = true;
+
   system.stateVersion = "25.11";
 }
```

```diff
@@ -11,6 +11,7 @@
 
     ../../system
     ../../common/vm
+    ../../common/ssh-audit.nix
   ];
 
   # Host metadata (adjust as needed)
@@ -24,6 +25,9 @@
   # Enable remote deployment via NATS
   homelab.deploy.enable = true;
 
+  # Enable Kanidm PAM/NSS for central authentication
+  homelab.kanidm.enable = true;
+
   nixpkgs.config.allowUnfree = true;
   boot.loader.grub.enable = true;
   boot.loader.grub.device = "/dev/vda";
```

```diff
@@ -11,6 +11,7 @@
 
     ../../system
     ../../common/vm
+    ../../common/ssh-audit.nix
   ];
 
   # Host metadata (adjust as needed)
@@ -24,6 +25,9 @@
   # Enable remote deployment via NATS
   homelab.deploy.enable = true;
 
+  # Enable Kanidm PAM/NSS for central authentication
+  homelab.kanidm.enable = true;
+
   nixpkgs.config.allowUnfree = true;
   boot.loader.grub.enable = true;
   boot.loader.grub.device = "/dev/vda";
```

```diff
@@ -11,6 +11,7 @@
 
     ../../system
     ../../common/vm
+    ../../common/ssh-audit.nix
   ];
 
   # Host metadata (adjust as needed)
@@ -24,6 +25,9 @@
   # Enable remote deployment via NATS
   homelab.deploy.enable = true;
 
+  # Enable Kanidm PAM/NSS for central authentication
+  homelab.kanidm.enable = true;
+
   nixpkgs.config.allowUnfree = true;
   boot.loader.grub.enable = true;
   boot.loader.grub.device = "/dev/vda";
```

```diff
@@ -1,19 +0,0 @@
-#!/usr/bin/env bash
-set -euo pipefail
-
-# array of hosts
-HOSTS=(
-  "ns1"
-  "ns2"
-  "ha1"
-  "http-proxy"
-  "jelly01"
-  "monitoring01"
-  "nix-cache01"
-  "pgdb1"
-)
-
-for host in "${HOSTS[@]}"; do
-  echo "Rebuilding $host"
-  nixos-rebuild boot --flake .#${host} --target-host root@${host}
-done
```
**services/kanidm/default.nix** (new file, 65 lines)

```nix
{ config, lib, pkgs, ... }:
{
  services.kanidm = {
    package = pkgs.kanidmWithSecretProvisioning_1_8;
    enableServer = true;
    serverSettings = {
      domain = "home.2rjus.net";
      origin = "https://auth.home.2rjus.net";
      bindaddress = "0.0.0.0:443";
      ldapbindaddress = "0.0.0.0:636";
      tls_chain = "/var/lib/acme/auth.home.2rjus.net/fullchain.pem";
      tls_key = "/var/lib/acme/auth.home.2rjus.net/key.pem";
      online_backup = {
        path = "/var/lib/kanidm/backups";
        schedule = "00 22 * * *";
        versions = 7;
      };
    };

    # Provision base groups only - users are managed via CLI
    # See docs/user-management.md for details
    provision = {
      enable = true;
      idmAdminPasswordFile = config.vault.secrets.kanidm-idm-admin.outputDir;

      groups = {
        admins = { };
        users = { };
        ssh-users = { };
      };

      # Regular users (persons) are managed imperatively via kanidm CLI
    };
  };

  # Grant kanidm access to ACME certificates
  users.users.kanidm.extraGroups = [ "acme" ];

  # ACME certificate from internal CA
  # Include both the CNAME (auth) and A record (kanidm01) for Prometheus scraping
  security.acme.certs."auth.home.2rjus.net" = {
    listenHTTP = ":80";
    reloadServices = [ "kanidm" ];
    extraDomainNames = [ "${config.networking.hostName}.home.2rjus.net" ];
  };

  # Vault secret for idm_admin password (used for provisioning)
  vault.secrets.kanidm-idm-admin = {
    secretPath = "kanidm/idm-admin-password";
    extractKey = "password";
    services = [ "kanidm" ];
    owner = "kanidm";
    group = "kanidm";
  };

  # Note: Kanidm does not expose Prometheus metrics
  # If metrics support is added in the future, uncomment:
  # homelab.monitoring.scrapeTargets = [
  #   {
  #     job_name = "kanidm";
  #     port = 443;
  #     scheme = "https";
  #   }
  # ];
}
```

```diff
@@ -356,32 +356,6 @@ groups:
       annotations:
         summary: "Proxmox VM {{ $labels.id }} is stopped"
         description: "Proxmox VM {{ $labels.id }} ({{ $labels.name }}) has onboot=1 but is stopped."
-  - name: postgres_rules
-    rules:
-      - alert: postgres_down
-        expr: node_systemd_unit_state{instance="pgdb1.home.2rjus.net:9100", name="postgresql.service", state="active"} == 0
-        for: 5m
-        labels:
-          severity: critical
-        annotations:
-          summary: "PostgreSQL not running on {{ $labels.instance }}"
-          description: "PostgreSQL has been down on {{ $labels.instance }} more than 5 minutes."
-      - alert: postgres_exporter_down
-        expr: up{job="postgres"} == 0
-        for: 5m
-        labels:
-          severity: warning
-        annotations:
-          summary: "PostgreSQL exporter down on {{ $labels.instance }}"
-          description: "Cannot scrape PostgreSQL metrics from {{ $labels.instance }}."
-      - alert: postgres_high_connections
-        expr: pg_stat_activity_count / pg_settings_max_connections > 0.8
-        for: 5m
-        labels:
-          severity: warning
-        annotations:
-          summary: "PostgreSQL connection pool near exhaustion on {{ $labels.instance }}"
-          description: "PostgreSQL is using over 80% of max_connections on {{ $labels.instance }}."
   - name: jellyfin_rules
     rules:
       - alert: jellyfin_down
```

```diff
@@ -1,6 +0,0 @@
-{ ... }:
-{
-  imports = [
-    ./postgres.nix
-  ];
-}
```

```diff
@@ -1,23 +0,0 @@
-{ pkgs, ... }:
-{
-  homelab.monitoring.scrapeTargets = [{
-    job_name = "postgres";
-    port = 9187;
-  }];
-
-  services.prometheus.exporters.postgres = {
-    enable = true;
-    runAsLocalSuperUser = true; # Use peer auth as postgres user
-  };
-
-  services.postgresql = {
-    enable = true;
-    enableJIT = true;
-    enableTCPIP = true;
-    extensions = ps: with ps; [ pgvector ];
-    authentication = ''
-      # Allow access to everything from gunter
-      host all all 10.69.30.105/32 scram-sha-256
-    '';
-  };
-}
```

```diff
@@ -4,13 +4,16 @@
     ./acme.nix
     ./autoupgrade.nix
     ./homelab-deploy.nix
+    ./kanidm-client.nix
     ./monitoring
     ./motd.nix
     ./packages.nix
     ./nix.nix
+    ./pipe-to-loki.nix
     ./root-user.nix
     ./pki/root-ca.nix
     ./sshd.nix
     ./vault-secrets.nix
+    ./zram.nix
   ];
 }
```
**system/kanidm-client.nix** (new file, 42 lines)

```nix
{ lib, config, pkgs, ... }:
let
  cfg = config.homelab.kanidm;
in
{
  options.homelab.kanidm = {
    enable = lib.mkEnableOption "Kanidm PAM/NSS client for central authentication";

    server = lib.mkOption {
      type = lib.types.str;
      default = "https://auth.home.2rjus.net";
      description = "URI of the Kanidm server";
    };

    allowedLoginGroups = lib.mkOption {
      type = lib.types.listOf lib.types.str;
      default = [ "ssh-users" ];
      description = "Groups allowed to log in via PAM";
    };
  };

  config = lib.mkIf cfg.enable {
    services.kanidm = {
      package = pkgs.kanidm_1_8;
      enablePam = true;

      clientSettings = {
        uri = cfg.server;
      };

      unixSettings = {
        pam_allowed_login_groups = cfg.allowedLoginGroups;
        # Use short names (torjus) instead of SPN format (torjus@home.2rjus.net)
        # This prevents "PAM user mismatch" errors with SSH
        uid_attr_map = "name";
        gid_attr_map = "name";
        # Create symlink /home/torjus -> /home/torjus@home.2rjus.net
        home_alias = "name";
      };
    };
  };
}
```
**system/pipe-to-loki.nix** (new file, 140 lines)

```nix
{
  config,
  pkgs,
  lib,
  ...
}:
let
  pipe-to-loki = pkgs.writeShellApplication {
    name = "pipe-to-loki";
    runtimeInputs = with pkgs; [
      curl
      jq
      util-linux
      coreutils
    ];
    text = ''
      set -euo pipefail

      LOKI_URL="http://monitoring01.home.2rjus.net:3100/loki/api/v1/push"
      HOSTNAME=$(hostname)
      SESSION_ID=""
      RECORD_MODE=false

      usage() {
        echo "Usage: pipe-to-loki [--id ID] [--record]"
        echo ""
        echo "Send command output or interactive sessions to Loki."
        echo ""
        echo "Options:"
        echo "  --id ID    Set custom session ID (default: auto-generated)"
        echo "  --record   Start interactive recording session"
        echo ""
        echo "Examples:"
        echo "  command | pipe-to-loki            # Pipe command output"
        echo "  command | pipe-to-loki --id foo   # Pipe with custom ID"
        echo "  pipe-to-loki --record             # Start recording session"
        exit 1
      }

      generate_id() {
        local random_chars
        random_chars=$(head -c 2 /dev/urandom | od -An -tx1 | tr -d ' \n')
        echo "''${HOSTNAME}-$(date +%s)-''${random_chars}"
      }

      send_to_loki() {
        local content="$1"
        local type="$2"
        local timestamp_ns
        timestamp_ns=$(date +%s%N)

        local payload
        payload=$(jq -n \
          --arg job "pipe-to-loki" \
          --arg host "$HOSTNAME" \
          --arg type "$type" \
          --arg id "$SESSION_ID" \
          --arg ts "$timestamp_ns" \
          --arg content "$content" \
          '{
            streams: [{
              stream: {
                job: $job,
                host: $host,
                type: $type,
                id: $id
              },
              values: [[$ts, $content]]
            }]
          }')

        if curl -s -X POST "$LOKI_URL" \
          -H "Content-Type: application/json" \
          -d "$payload" > /dev/null; then
          return 0
        else
          echo "Error: Failed to send to Loki" >&2
          return 1
        fi
      }

      # Parse arguments
      while [[ $# -gt 0 ]]; do
        case $1 in
          --id)
            SESSION_ID="$2"
            shift 2
            ;;
          --record)
            RECORD_MODE=true
            shift
            ;;
          --help|-h)
            usage
            ;;
          *)
            echo "Unknown option: $1" >&2
            usage
            ;;
        esac
      done

      # Generate ID if not provided
      if [[ -z "$SESSION_ID" ]]; then
        SESSION_ID=$(generate_id)
      fi

      if $RECORD_MODE; then
        # Session recording mode
        SCRIPT_FILE=$(mktemp)
        trap 'rm -f "$SCRIPT_FILE"' EXIT

        echo "Recording session $SESSION_ID... (exit to send)"

        # Use script to record the session
        script -q "$SCRIPT_FILE"

        # Read the transcript and send to Loki
        content=$(cat "$SCRIPT_FILE")
        if send_to_loki "$content" "session"; then
          echo "Session $SESSION_ID sent to Loki"
        fi
      else
        # Pipe mode - read from stdin
        if [[ -t 0 ]]; then
          echo "Error: No input provided. Pipe a command or use --record for interactive mode." >&2
          exit 1
        fi

        content=$(cat)
        if send_to_loki "$content" "command"; then
          echo "Sent to Loki with id: $SESSION_ID"
        fi
      fi
    '';
  };
in
{
  environment.systemPackages = [ pipe-to-loki ];
}
```
**system/zram.nix** (new file, 8 lines)

```nix
# Compressed swap in RAM
#
# Provides overflow memory during Nix builds and upgrades.
# Prevents OOM kills on low-memory hosts (2GB VMs).
{ ... }:
{
  zramSwap.enable = true;
}
```

```diff
@@ -66,19 +66,7 @@ locals {
       ]
     }
 
-    "pgdb1" = {
-      paths = [
-        "secret/data/hosts/pgdb1/*",
-      ]
-    }
-
     # Wave 3: DNS servers
-    "ns1" = {
-      paths = [
-        "secret/data/hosts/ns1/*",
-        "secret/data/shared/dns/*",
-      ]
-    }
 
     # Wave 4: http-proxy
     "http-proxy" = {
```

```diff
@@ -26,6 +26,19 @@ locals {
         "secret/data/shared/dns/*",
       ]
     }
+    "ns1" = {
+      paths = [
+        "secret/data/hosts/ns1/*",
+        "secret/data/shared/dns/*",
+        "secret/data/shared/homelab-deploy/*",
+      ]
+    }
+    "kanidm01" = {
+      paths = [
+        "secret/data/hosts/kanidm01/*",
+        "secret/data/kanidm/*",
+      ]
+    }
 
   }
 
```

```diff
@@ -102,6 +102,12 @@ locals {
       auto_generate = false
       data = { nkey = var.homelab_deploy_admin_deployer_nkey }
     }
+
+    # Kanidm idm_admin password
+    "kanidm/idm-admin-password" = {
+      auto_generate = true
+      password_length = 32
+    }
   }
 }
 
```

```diff
@@ -65,6 +65,20 @@ locals {
       disk_size = "20G"
       vault_wrapped_token = "s.3nran1e1Uim4B1OomIWCoS4T"
     }
+    "ns1" = {
+      ip = "10.69.13.5/24"
+      cpu_cores = 2
+      memory = 2048
+      disk_size = "20G"
+      vault_wrapped_token = "s.b6ge0KMtNQctdKkvm0RNxGdt"
+    }
+    "kanidm01" = {
+      ip = "10.69.13.23/24"
+      cpu_cores = 2
+      memory = 2048
+      disk_size = "20G"
+      vault_wrapped_token = "s.OOqjEECeIV7dNgCS6jNmyY3K"
+    }
   }
 
   # Compute VM configurations with defaults applied
```