11 Commits

Author SHA1 Message Date
9ed09c9a9c docs: add user-management documentation
All checks were successful
Run nix flake check / flake-check (pull_request) Successful in 3m33s
Run nix flake check / flake-check (push) Successful in 2m0s
- CLI workflows for creating users and groups
- Troubleshooting guide (nscd, cache invalidation)
- Home directory behavior (UUID-based with symlinks)
- Update auth-system-replacement plan with progress

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-08 15:14:21 +01:00
b31c64f1b9 kanidm: remove declarative user provisioning
Keep base groups (admins, users, ssh-users) provisioned declaratively
but manage regular users via the kanidm CLI. This allows setting POSIX
attributes and passwords in a single workflow.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-08 15:14:03 +01:00
54b6e37420 flake: add kanidm to devshell
Add kanidm_1_8 CLI for administering the Kanidm server.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-08 15:12:19 +01:00
b845a8bb8b system: add kanidm PAM/NSS client module
Add homelab.kanidm.enable option for central authentication via Kanidm.
The module configures:
- PAM/NSS integration with kanidm-unixd
- Client connection to auth.home.2rjus.net
- Login authorization for ssh-users group

Enable on testvm01-03 for testing.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-08 15:12:19 +01:00
bfbf0cea68 template2: enable zram for bootstrap
Some checks failed
Run nix flake check / flake-check (push) Failing after 3m34s
Prevents OOM during initial nixos-rebuild on 2GB VMs.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-08 13:34:08 +01:00
3abe5e83a7 docs: add memory ballooning as fallback option
All checks were successful
Run nix flake check / flake-check (push) Successful in 2m5s
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-08 13:29:42 +01:00
67c27555f3 docs: add memory issues follow-up plan
All checks were successful
Run nix flake check / flake-check (push) Successful in 2m2s
Track zram change effectiveness for OOM prevention during upgrades.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-08 13:26:31 +01:00
1674b6a844 system: enable zram swap for all hosts
Some checks failed
Run nix flake check / flake-check (push) Failing after 12m6s
Provides compressed swap in RAM to prevent OOM kills during
nixos-rebuild on low-memory VMs (2GB). Removes duplicate zram
configs from jelly01 and nix-cache01.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-08 13:02:58 +01:00
311be282b6 docs: add security hardening plan
Some checks failed
Run nix flake check / flake-check (push) Failing after 2s
Based on security review findings, covering SSH hardening, firewall
enablement, log transport TLS, security alerting, and secrets management.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-08 05:26:15 +01:00
11cbb64097 claude: make auditor delegation explicit in investigate-alarm
Some checks failed
Run nix flake check / flake-check (push) Failing after 1s
- Changed section 4 from "if needed" to always spawn auditor
- Added explicit "Do NOT query audit logs yourself" guidance
- Listed specific scenarios requiring auditor (service stopped, etc.)
- Added manual intervention as first common cause
- Updated guidelines to emphasize mandatory delegation

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-08 05:11:09 +01:00
e2dd21c994 claude: add auditor agent and git-explorer MCP
Add new auditor agent for security-focused audit log analysis:
- SSH session tracking, command execution, sudo usage
- Suspicious activity detection patterns
- Can be used standalone or as sub-agent by investigate-alarm

Update investigate-alarm to delegate audit analysis to auditor
and add git-explorer MCP for configuration drift detection.

Add git-explorer to .mcp.json for repository inspection.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-08 04:48:55 +01:00
19 changed files with 988 additions and 45 deletions

180
.claude/agents/auditor.md Normal file
View File

@@ -0,0 +1,180 @@
---
name: auditor
description: Analyzes audit logs to investigate user activity, command execution, and suspicious behavior on hosts. Can be used standalone for security reviews or called by other agents for behavioral context.
tools: Read, Grep, Glob
mcpServers:
- lab-monitoring
---
You are a security auditor for a NixOS homelab infrastructure. Your task is to analyze audit logs and reconstruct user activity on hosts.
## Input
You may receive:
- A host or list of hosts to investigate
- A time window (e.g., "last hour", "today", "between 14:00 and 15:00")
- Optional context: specific events to look for, user to focus on, or suspicious activity to investigate
- Optional context from a parent investigation (e.g., "a service stopped at 14:32, what happened around that time?")
## Audit Log Structure
Logs are shipped to Loki via promtail. Audit events use these labels:
- `host` - hostname
- `systemd_unit` - typically `auditd.service` for audit logs
- `job` - typically `systemd-journal`
Audit log entries contain structured data:
- `EXECVE` - command execution with full arguments
- `USER_LOGIN` / `USER_LOGOUT` - session start/end
- `USER_CMD` - sudo command execution
- `CRED_ACQ` / `CRED_DISP` - credential acquisition/disposal
- `SERVICE_START` / `SERVICE_STOP` - systemd service events
## Investigation Techniques
### 1. SSH Session Activity
Find SSH logins and session activity:
```logql
{host="<hostname>", systemd_unit="sshd.service"}
```
Look for:
- Accepted/Failed authentication
- Session opened/closed
- Unusual source IPs or users
### 2. Command Execution
Query executed commands (filter out noise):
```logql
{host="<hostname>"} |= "EXECVE" != "PATH item" != "PROCTITLE" != "SYSCALL" != "BPF"
```
Further filtering:
- Exclude systemd noise: `!= "systemd" != "/nix/store"`
- Focus on specific commands: `|= "rm" |= "-rf"`
- Focus on specific user: `|= "uid=1000"`
### 3. Sudo Activity
Check for privilege escalation:
```logql
{host="<hostname>"} |= "sudo" |= "COMMAND"
```
Or via audit:
```logql
{host="<hostname>"} |= "USER_CMD"
```
### 4. Service Manipulation
Check if services were manually stopped/started:
```logql
{host="<hostname>"} |= "EXECVE" |= "systemctl"
```
### 5. File Operations
Look for file modifications (if auditd rules are configured):
```logql
{host="<hostname>"} |= "EXECVE" |= "vim"
{host="<hostname>"} |= "EXECVE" |= "nano"
{host="<hostname>"} |= "EXECVE" |= "rm"
```
## Query Guidelines
**Start narrow, expand if needed:**
- Begin with `limit: 20-30`
- Use tight time windows: `start: "15m"` or `start: "30m"`
- Add filters progressively
**Avoid:**
- Querying all audit logs without EXECVE filter (extremely verbose)
- Large time ranges without specific filters
- Limits over 50 without tight filters
**Time-bounded queries:**
When investigating around a specific event:
```logql
{host="<hostname>"} |= "EXECVE" != "systemd"
```
With `start: "2026-02-08T14:30:00Z"` and `end: "2026-02-08T14:35:00Z"`
## Suspicious Patterns to Watch For
1. **Unusual login times** - Activity outside normal hours
2. **Failed authentication** - Brute force attempts
3. **Privilege escalation** - Unexpected sudo usage
4. **Reconnaissance commands** - `whoami`, `id`, `uname`, `cat /etc/passwd`
5. **Data exfiltration indicators** - `curl`, `wget`, `scp`, `rsync` to external destinations
6. **Persistence mechanisms** - Cron modifications, systemd service creation
7. **Log tampering** - Commands targeting log files
8. **Lateral movement** - SSH to other internal hosts
9. **Service manipulation** - Stopping security services, disabling firewalls
10. **Cleanup activity** - Deleting bash history, clearing logs
## Output Format
### For Standalone Security Reviews
```
## Activity Summary
**Host:** <hostname>
**Time Period:** <start> to <end>
**Sessions Found:** <count>
## User Sessions
### Session 1: <user> from <source_ip>
- **Login:** HH:MM:SSZ
- **Logout:** HH:MM:SSZ (or ongoing)
- **Commands executed:**
- HH:MM:SSZ - <command>
- HH:MM:SSZ - <command>
## Suspicious Activity
[If any patterns from the watch list were detected]
- **Finding:** <description>
- **Evidence:** <log entries>
- **Risk Level:** Low / Medium / High
## Summary
[Overall assessment: normal activity, concerning patterns, or clear malicious activity]
```
### When Called by Another Agent
Provide a focused response addressing the specific question:
```
## Audit Findings
**Query:** <what was asked>
**Time Window:** <investigated period>
## Relevant Activity
[Chronological list of relevant events]
- HH:MM:SSZ - <event>
- HH:MM:SSZ - <event>
## Assessment
[Direct answer to the question with supporting evidence]
```
## Guidelines
- Reconstruct timelines chronologically
- Correlate events (login → commands → logout)
- Note gaps or missing data
- Distinguish between automated (systemd, cron) and interactive activity
- Consider the host's role and tier when assessing severity
- When called by another agent, focus on answering their specific question
- Don't speculate without evidence - state what the logs show and don't show

View File

@@ -4,6 +4,7 @@ description: Investigates a single system alarm by querying Prometheus metrics a
tools: Read, Grep, Glob
mcpServers:
- lab-monitoring
- git-explorer
---
You are an alarm investigation specialist for a NixOS homelab infrastructure. Your task is to analyze a single alarm and determine its root cause.
@@ -33,9 +34,9 @@ Gather evidence about the current system state:
- Use `list_targets` to verify the host/service is being scraped successfully
- Look for correlated metrics that might explain the issue
### 3. Check Logs
### 3. Check Service Logs
Search for relevant log entries using `query_logs`. **Be careful to avoid overly broad queries that return too much data.**
Search for relevant log entries using `query_logs`. Focus on service-specific logs and errors.
**Query strategies (start narrow, expand if needed):**
- Start with `limit: 20-30`, increase only if needed
@@ -43,23 +44,42 @@ Search for relevant log entries using `query_logs`. **Be careful to avoid overly
- Filter to specific services: `{host="<hostname>", systemd_unit="<service>.service"}`
- Search for errors: `{host="<hostname>"} |= "error"` or `|= "failed"`
**For audit logs (SSH sessions, command execution):**
- Filter to just commands: `{host="<hostname>"} |= "EXECVE"`
- Exclude verbose noise: `!= "PATH item" != "PROCTITLE" != "SYSCALL" != "BPF"`
- Example: `{host="testvm01"} |= "EXECVE" != "systemd"` (user commands only)
**Common patterns:**
- Service logs: `{host="<hostname>", systemd_unit="<service>.service"}`
- SSH activity: `{host="<hostname>", systemd_unit="sshd.service"}`
- All errors on host: `{host="<hostname>"} |= "error"`
- Specific command: `{host="<hostname>"} |= "EXECVE" |= "stress"`
- Journal for a unit: `{host="<hostname>", systemd_unit="nginx.service"} |= "failed"`
**Avoid:**
- Querying all audit logs without filtering (very verbose)
- Using `start: "1h"` with no filters on busy hosts
- Limits over 50 without specific filters
### 4. Check Configuration (if relevant)
### 4. Investigate User Activity
For any analysis of user activity, **always spawn the `auditor` agent**. Do not query audit logs (EXECVE, USER_LOGIN, etc.) directly - delegate this to the auditor.
**Always call the auditor when:**
- A service stopped unexpectedly (may have been manually stopped)
- A process was killed or a config was changed
- You need to know who was logged in around the time of an incident
- You need to understand what commands led to the current state
- The cause isn't obvious from service logs alone
**Do NOT try to query audit logs yourself.** The auditor is specialized for:
- Parsing EXECVE records and reconstructing command lines
- Correlating SSH sessions with commands executed
- Identifying suspicious patterns
- Filtering out systemd/nix-store noise
**Example prompt for auditor:**
```
Investigate user activity on <hostname> between <start_time> and <end_time>.
Context: The prometheus-node-exporter service stopped at 14:32.
Determine if it was manually stopped and by whom.
```
Incorporate the auditor's findings into your timeline and root cause analysis.
### 5. Check Configuration (if relevant)
If the alert relates to a NixOS-managed service:
- Check host configuration in `/hosts/<hostname>/`
@@ -67,9 +87,61 @@ If the alert relates to a NixOS-managed service:
- Look for thresholds, resource limits, or misconfigurations
- Check `homelab.host` options for tier/priority/role metadata
### 5. Consider Common Causes
### 6. Check for Configuration Drift
Use the git-explorer MCP server to compare the host's deployed configuration against the current master branch. This helps identify:
- Hosts running outdated configurations
- Recent changes that might have caused the issue
- Whether a fix has already been committed but not deployed
**Step 1: Get the deployed revision from Prometheus**
```promql
nixos_flake_info{hostname="<hostname>"}
```
The `current_rev` label contains the deployed git commit hash.
**Step 2: Check if the host is behind master**
```
resolve_ref("master") # Get current master commit
is_ancestor(deployed, master) # Check if host is behind
```
**Step 3: See what commits are missing**
```
commits_between(deployed, master) # List commits not yet deployed
```
**Step 4: Check which files changed**
```
get_diff_files(deployed, master) # Files modified since deployment
```
Look for files in `hosts/<hostname>/`, `services/<relevant-service>/`, or `system/` that affect this host.
**Step 5: View configuration at the deployed revision**
```
get_file_at_commit(deployed, "services/<service>/default.nix")
```
Compare against the current file to understand differences.
**Step 6: Find when something changed**
```
search_commits("<service-name>") # Find commits mentioning the service
get_commit_info(<hash>) # Get full details of a specific change
```
**Example workflow for a service-related alert:**
1. Query `nixos_flake_info{hostname="monitoring01"}``current_rev: 8959829`
2. `resolve_ref("master")``4633421`
3. `is_ancestor("8959829", "4633421")` → Yes, host is behind
4. `commits_between("8959829", "4633421")` → 7 commits missing
5. `get_diff_files("8959829", "4633421")` → Check if relevant service files changed
6. If a fix was committed after the deployed rev, recommend deployment
### 7. Consider Common Causes
For infrastructure alerts, common causes include:
- **Manual intervention**: Service manually stopped/restarted (call auditor to confirm)
- **Configuration drift**: Host running outdated config, fix already in master
- **Disk space**: Nix store growth, logs, temp files
- **Memory pressure**: Service memory leaks, insufficient limits
- **CPU**: Runaway processes, build jobs
@@ -77,6 +149,8 @@ For infrastructure alerts, common causes include:
- **Service restarts**: Failed upgrades, configuration errors
- **Scrape failures**: Service down, firewall issues, port changes
**Note:** If a service stopped unexpectedly and service logs don't show a crash or error, it was likely manual intervention - call the auditor to investigate.
## Output Format
Provide a concise report with one of two outcomes:
@@ -133,6 +207,5 @@ Provide a concise report with one of two outcomes:
- If the alert is a false positive or expected behavior, explain why
- Consider the host's tier (test vs prod) when assessing severity
- Build a timeline from log timestamps and metrics to show the sequence of events
- Include precursor events (logins, config changes, restarts) that led to the issue
- **Query logs incrementally**: start with narrow filters and small limits, expand only if needed
- **Avoid broad audit log queries**: always filter to EXECVE and exclude noise (PATH, SYSCALL, BPF)
- **Always delegate to the auditor agent** for any user activity analysis - never query EXECVE or audit logs directly

View File

@@ -33,6 +33,13 @@
"--nats-url", "nats://nats1.home.2rjus.net:4222",
"--nkey-file", "/home/torjus/.config/homelab-deploy/test-deployer.nkey"
]
},
"git-explorer": {
"command": "nix",
"args": ["run", "git+https://git.t-juice.club/torjus/labmcp#git-explorer", "--", "serve"],
"env": {
"GIT_REPO_PATH": "/home/torjus/git/nixos-servers"
}
}
}
}

View File

@@ -66,9 +66,9 @@ This future migration path is a strong argument for Kanidm over LDAP-only soluti
- Vault integration for idm_admin password
- LDAPS on port 636
2. **Configure declarative provisioning**
- Groups: `admins`, `users`, `ssh-users`
- User: `torjus` (member of all groups)
2. **Configure provisioning**
- Groups provisioned declaratively: `admins`, `users`, `ssh-users`
- Users managed imperatively via CLI (allows setting POSIX passwords in one step)
- POSIX attributes enabled (UID/GID range 65,536-69,999)
3. **Test NAS integration** (in progress)
@@ -80,14 +80,16 @@ This future migration path is a strong argument for Kanidm over LDAP-only soluti
- Grafana
- Other services as needed
5. **Create client module** in `system/` for PAM/NSS
- Enable on all hosts that need central auth
- Configure trusted CA
5. **Create client module** in `system/` for PAM/NSS
- Module: `system/kanidm-client.nix`
- `homelab.kanidm.enable = true` enables PAM/NSS
- Short usernames (not SPN format)
- Home directory symlinks via `home_alias`
- Enabled on test tier: testvm01, testvm02, testvm03
6. **Documentation**
- User management procedures
- Adding new OAuth2 clients
- Troubleshooting PAM/NSS issues
6. **Documentation**
- `docs/user-management.md` - CLI workflows, troubleshooting
- User/group creation procedures verified working
## Progress
@@ -106,14 +108,37 @@ This future migration path is a strong argument for Kanidm over LDAP-only soluti
- Prometheus monitoring scrape target configured
**Provisioned entities:**
- Groups: `admins`, `users`, `ssh-users`
- User: `torjus` (member of all groups, POSIX enabled with GID 65536)
- Groups: `admins`, `users`, `ssh-users` (declarative)
- Users managed via CLI (imperative)
**Verified working:**
- WebUI login with idm_admin
- LDAP bind and search with POSIX-enabled user
- LDAPS with valid internal CA certificate
### Completed (2026-02-08) - PAM/NSS Client
**Client module deployed (`system/kanidm-client.nix`):**
- `homelab.kanidm.enable = true` enables PAM/NSS integration
- Connects to auth.home.2rjus.net
- Short usernames (`torjus` instead of `torjus@home.2rjus.net`)
- Home directory symlinks (`/home/torjus` → UUID-based dir)
- Login restricted to `ssh-users` group
**Enabled on test tier:**
- testvm01, testvm02, testvm03
**Verified working:**
- User/group resolution via `getent`
- SSH login with Kanidm unix passwords
- Home directory creation with symlinks
- Imperative user/group creation via CLI
**Documentation:**
- `docs/user-management.md` with full CLI workflows
- Password requirements (min 10 chars)
- Troubleshooting guide (nscd, cache invalidation)
### UID/GID Range (Resolved)
**Range: 65,536 - 69,999** (manually allocated)
@@ -128,10 +153,9 @@ Rationale:
### Next Steps
1. Deploy to monitoring01 to enable Prometheus scraping
1. Enable PAM/NSS on production hosts (after test tier validation)
2. Configure TrueNAS LDAP client for NAS integration testing
3. Add OAuth2 clients (Grafana first)
4. Create PAM/NSS client module for other hosts
## References

View File

@@ -0,0 +1,116 @@
# Memory Issues Follow-up
Tracking the zram change to verify it resolves OOM issues during nixos-upgrade on low-memory hosts.
## Background
On 2026-02-08, ns2 (2GB RAM) experienced an OOM kill during nixos-upgrade. The Nix evaluation process consumed ~1.6GB before being killed by the kernel. ns1 (manually increased to 4GB) succeeded with the same upgrade.
Root cause: 2GB RAM is insufficient for Nix flake evaluation without swap.
## Fix Applied
**Commit:** `1674b6a` - system: enable zram swap for all hosts
**Merged:** 2026-02-08 ~12:15 UTC
**Change:** Added `zramSwap.enable = true` to `system/zram.nix`, providing ~2GB compressed swap on all hosts.
## Timeline
| Time (UTC) | Event |
|------------|-------|
| 05:00:46 | ns2 nixos-upgrade OOM killed |
| 05:01:47 | `nixos_upgrade_failed` alert fired |
| 12:15 | zram commit merged to master |
| 12:19 | ns2 rebooted with zram enabled |
| 12:20 | ns1 rebooted (memory reduced to 2GB via tofu) |
## Hosts Affected
All 2GB VMs that run nixos-upgrade:
- ns1, ns2 (DNS)
- vault01
- testvm01, testvm02, testvm03
- kanidm01
## Metrics to Monitor
Check these in Grafana or via PromQL to verify the fix:
### Swap availability (should be ~2GB after upgrade)
```promql
node_memory_SwapTotal_bytes / 1024 / 1024
```
### Swap usage during upgrades
```promql
(node_memory_SwapTotal_bytes - node_memory_SwapFree_bytes) / 1024 / 1024
```
### Zswap compressed bytes (active compression)
```promql
node_memory_Zswap_bytes / 1024 / 1024
```
### Upgrade failures (should be 0)
```promql
node_systemd_unit_state{name="nixos-upgrade.service", state="failed"}
```
### Memory available during upgrades
```promql
node_memory_MemAvailable_bytes / 1024 / 1024
```
## Verification Steps
After a few days (allow auto-upgrades to run on all hosts):
1. Check all hosts have swap enabled:
```promql
node_memory_SwapTotal_bytes > 0
```
2. Check for any upgrade failures since the fix:
```promql
count_over_time(ALERTS{alertname="nixos_upgrade_failed"}[7d])
```
3. Review if any hosts used swap during upgrades (check historical graphs)
## Success Criteria
- No `nixos_upgrade_failed` alerts due to OOM after 2026-02-08
- All hosts show ~2GB swap available
- Upgrades complete successfully on 2GB VMs
## Fallback Options
If zram is insufficient:
1. **Increase VM memory** - Update `terraform/vms.tf` to 4GB for affected hosts
2. **Enable memory ballooning** - Configure VMs with dynamic memory allocation (see below)
3. **Use remote builds** - Configure `nix.buildMachines` to offload evaluation
4. **Reduce flake size** - Split configurations to reduce evaluation memory
### Memory Ballooning
Proxmox supports memory ballooning, which allows VMs to dynamically grow/shrink memory allocation based on demand. The balloon driver inside the guest communicates with the hypervisor to release or reclaim memory pages.
Configuration in `terraform/vms.tf`:
```hcl
memory = 4096 # maximum memory
balloon = 2048 # minimum memory (shrinks to this when idle)
```
Pros:
- VMs get memory on-demand without reboots
- Better host memory utilization
- Solves upgrade OOM without permanently allocating 4GB
Cons:
- Requires QEMU guest agent running in guest
- Guest can experience memory pressure if host is overcommitted
Ballooning and zram are complementary - ballooning provides headroom from the host, zram provides overflow within the guest.

View File

@@ -0,0 +1,224 @@
# Security Hardening Plan
## Overview
Address security gaps identified in infrastructure review. Focus areas: SSH hardening, network security, logging improvements, and secrets management.
## Current State
- SSH allows password auth and unrestricted root login (`system/sshd.nix`)
- Firewall disabled on all hosts (`networking.firewall.enable = false`)
- Promtail ships logs over HTTP to Loki
- Loki has no authentication (`auth_enabled = false`)
- AppRole secret-IDs never expire (`secret_id_ttl = 0`)
- Vault TLS verification disabled by default (`skipTlsVerify = true`)
- Audit logging exists (`common/ssh-audit.nix`) but not applied globally
- Alert rules focus on availability, no security event detection
## Priority Matrix
| Issue | Severity | Effort | Priority |
|-------|----------|--------|----------|
| SSH password auth | High | Low | **P1** |
| Firewall disabled | High | Medium | **P1** |
| Promtail HTTP (no TLS) | High | Medium | **P2** |
| No security alerting | Medium | Low | **P2** |
| Audit logging not global | Low | Low | **P2** |
| Loki no auth | Medium | Medium | **P3** |
| Secret-ID TTL | Medium | Medium | **P3** |
| Vault skipTlsVerify | Medium | Low | **P3** |
## Phase 1: Quick Wins (P1)
### 1.1 SSH Hardening
Edit `system/sshd.nix`:
```nix
services.openssh = {
enable = true;
settings = {
PermitRootLogin = "prohibit-password"; # Key-only root login
PasswordAuthentication = false;
KbdInteractiveAuthentication = false;
};
};
```
**Prerequisite:** Verify all hosts have SSH keys deployed for root.
### 1.2 Enable Firewall
Create `system/firewall.nix` with default deny policy:
```nix
{ ... }: {
networking.firewall.enable = true;
# Use openssh's built-in firewall integration
services.openssh.openFirewall = true;
}
```
**Useful firewall options:**
| Option | Description |
|--------|-------------|
| `networking.firewall.trustedInterfaces` | Accept all traffic from these interfaces (e.g., `[ "lo" ]`) |
| `networking.firewall.interfaces.<name>.allowedTCPPorts` | Per-interface port rules |
| `networking.firewall.extraInputRules` | Custom nftables rules (for complex filtering) |
**Network range restrictions:** Consider restricting SSH to the infrastructure subnet (`10.69.13.0/24`) using `extraInputRules` for defense in depth. However, this adds complexity and may not be necessary given the trusted network model.
#### Per-Interface Rules (http-proxy WireGuard)
The `http-proxy` host has a WireGuard interface (`wg0`) that may need different rules than the LAN interface. Use `networking.firewall.interfaces` to apply per-interface policies:
```nix
# Example: http-proxy with different rules per interface
networking.firewall = {
enable = true;
# Default: only SSH (via openFirewall)
allowedTCPPorts = [ ];
# LAN interface: allow HTTP/HTTPS
interfaces.ens18 = {
allowedTCPPorts = [ 80 443 ];
};
# WireGuard interface: restrict to specific services or trust fully
interfaces.wg0 = {
allowedTCPPorts = [ 80 443 ];
# Or use trustedInterfaces = [ "wg0" ] if fully trusted
};
};
```
**TODO:** Investigate current WireGuard usage on http-proxy to determine appropriate rules.
Then per-host, open required ports:
| Host | Additional Ports |
|------|------------------|
| ns1/ns2 | 53 (TCP/UDP) |
| vault01 | 8200 |
| monitoring01 | 3100, 9090, 3000, 9093 |
| http-proxy | 80, 443 |
| nats1 | 4222 |
| ha1 | 1883, 8123 |
| jelly01 | 8096 |
| nix-cache01 | 5000 |
## Phase 2: Logging & Detection (P2)
### 2.1 Enable TLS for Promtail → Loki
Update `system/monitoring/logs.nix`:
```nix
clients = [{
url = "https://monitoring01.home.2rjus.net:3100/loki/api/v1/push";
tls_config = {
ca_file = "/etc/ssl/certs/homelab-root-ca.pem";
};
}];
```
Requires:
- Configure Loki with TLS certificate (use internal ACME)
- Ensure all hosts trust root CA (already done via `system/pki/root-ca.nix`)
### 2.2 Security Alert Rules
Add to `services/monitoring/rules.yml`:
```yaml
- name: security_rules
rules:
- alert: ssh_auth_failures
expr: increase(node_logind_sessions_total[5m]) > 20
for: 0m
labels:
severity: warning
annotations:
summary: "Unusual login activity on {{ $labels.instance }}"
- alert: vault_secret_fetch_failure
expr: increase(vault_secret_failures[5m]) > 5
for: 0m
labels:
severity: warning
annotations:
summary: "Vault secret fetch failures on {{ $labels.instance }}"
```
Also add Loki-based alerts for:
- Failed SSH attempts: `{job="systemd-journal"} |= "Failed password"`
- sudo usage: `{job="systemd-journal"} |= "sudo"`
### 2.3 Global Audit Logging
Add `./common/ssh-audit.nix` import to `system/default.nix`:
```nix
imports = [
# ... existing imports
../common/ssh-audit.nix
];
```
## Phase 3: Defense in Depth (P3)
### 3.1 Loki Authentication
Options:
1. **Basic auth via reverse proxy** - Put Loki behind Caddy with auth
2. **Loki multi-tenancy** - Enable `auth_enabled = true` and use tenant IDs
3. **Network isolation** - Bind Loki only to localhost, expose via authenticated proxy
Recommendation: Option 1 (reverse proxy) is simplest for homelab.
### 3.2 AppRole Secret Rotation
Update `terraform/vault/approle.tf`:
```hcl
secret_id_ttl = 2592000 # 30 days
```
Add documentation for manual rotation procedure or implement automated rotation via the existing `restartTrigger` mechanism in `vault-secrets.nix`.
### 3.3 Enable Vault TLS Verification
Change default in `system/vault-secrets.nix`:
```nix
skipTlsVerify = mkOption {
type = types.bool;
default = false; # Changed from true
};
```
**Prerequisite:** Verify all hosts trust the internal CA that signed the Vault certificate.
## Implementation Order
1. **Test on test-tier first** - Deploy phases 1-2 to testvm01/02/03
2. **Validate SSH access** - Ensure key-based login works before disabling passwords
3. **Document firewall ports** - Create reference of ports per host before enabling
4. **Phase prod rollout** - Deploy to prod hosts one at a time, verify each
## Open Questions
- [ ] Do all hosts have SSH keys configured for root access?
- [ ] Should firewall rules be per-host or use a central definition with roles?
- [ ] Should Loki authentication use the existing Kanidm setup?
**Resolved:** Password-based SSH access for recovery is not required - most hosts have console access through Proxmox or physical access, which provides an out-of-band recovery path if SSH keys fail.
## Notes
- Firewall changes are the highest risk - test thoroughly on test-tier
- SSH hardening must not lock out access - verify keys first
- Consider creating a "break glass" procedure for emergency access if keys fail

267
docs/user-management.md Normal file
View File

@@ -0,0 +1,267 @@
# User Management with Kanidm
Central authentication for the homelab using Kanidm.
## Overview
- **Server**: kanidm01.home.2rjus.net (auth.home.2rjus.net)
- **WebUI**: https://auth.home.2rjus.net
- **LDAPS**: port 636
## CLI Setup
The `kanidm` CLI is available in the devshell:
```bash
nix develop
# Login as idm_admin
kanidm login --name idm_admin --url https://auth.home.2rjus.net
```
## User Management
POSIX users are managed imperatively via the `kanidm` CLI. This allows setting
all attributes (including UNIX password) in one workflow.
### Creating a POSIX User
```bash
# Create the person
kanidm person create <username> "<Display Name>"
# Add to groups
kanidm group add-members ssh-users <username>
# Enable POSIX (UID is auto-assigned)
kanidm person posix set <username>
# Set UNIX password (required for SSH login, min 10 characters)
kanidm person posix set-password <username>
# Optionally set login shell
kanidm person posix set <username> --shell /bin/zsh
```
### Example: Full User Creation
```bash
kanidm person create testuser "Test User"
kanidm group add-members ssh-users testuser
kanidm person posix set testuser
kanidm person posix set-password testuser
kanidm person get testuser
```
After creation, verify on a client host:
```bash
getent passwd testuser
ssh testuser@testvm01.home.2rjus.net
```
### Viewing User Details
```bash
kanidm person get <username>
```
### Removing a User
```bash
kanidm person delete <username>
```
## Group Management
Groups for POSIX access are also managed via CLI.
### Creating a POSIX Group
```bash
# Create the group
kanidm group create <group-name>
# Enable POSIX with a specific GID
kanidm group posix set <group-name> --gidnumber <gid>
```
### Adding Members
```bash
kanidm group add-members <group-name> <username>
```
### Viewing Group Details
```bash
kanidm group get <group-name>
kanidm group list-members <group-name>
```
### Example: Full Group Creation
```bash
kanidm group create testgroup
kanidm group posix set testgroup --gidnumber 68010
kanidm group add-members testgroup testuser
kanidm group get testgroup
```
After creation, verify on a client host:
```bash
getent group testgroup
```
### Current Groups
| Group | GID | Purpose |
|-------|-----|---------|
| ssh-users | 68000 | SSH login access |
| admins | 68001 | Administrative access |
| users | 68002 | General users |
### UID/GID Allocation
Kanidm auto-assigns UIDs/GIDs from its configured range. For manually assigned GIDs:
| Range | Purpose |
|-------|---------|
| 65,536+ | Users (auto-assigned) |
| 68,000 - 68,999 | Groups (manually assigned) |
## PAM/NSS Client Configuration
Enable central authentication on a host:
```nix
homelab.kanidm.enable = true;
```
This configures:
- `services.kanidm.enablePam = true`
- Client connection to auth.home.2rjus.net
- Login authorization for `ssh-users` group
- Short usernames (`torjus` instead of `torjus@home.2rjus.net`)
- Home directory symlinks (`/home/torjus` → UUID-based directory)
### Enabled Hosts
- testvm01, testvm02, testvm03 (test tier)
### Options
```nix
homelab.kanidm = {
enable = true;
server = "https://auth.home.2rjus.net"; # default
allowedLoginGroups = [ "ssh-users" ]; # default
};
```
### Home Directories
Home directories use UUID-based paths for stability (so renaming a user doesn't
require moving their home directory). Symlinks provide convenient access:
```
/home/torjus -> /home/e4f4c56c-4aee-4c20-846f-90cb69807733
```
The symlinks are created by `kanidm-unixd-tasks` on first login.
## Testing
### Verify NSS Resolution
```bash
# Check user resolution
getent passwd <username>
# Check group resolution
getent group <group-name>
```
### Test SSH Login
```bash
ssh <username>@<hostname>.home.2rjus.net
```
## Troubleshooting
### "PAM user mismatch" error
SSH fails with "fatal: PAM user mismatch" in logs. This happens when Kanidm returns
usernames in SPN format (`torjus@home.2rjus.net`) but SSH expects short names (`torjus`).
**Solution**: Configure `uid_attr_map = "name"` in unixSettings (already set in our module).
Check current format:
```bash
getent passwd torjus
# Should show: torjus:x:65536:...
# NOT: torjus@home.2rjus.net:x:65536:...
```
### User resolves but SSH fails immediately
The user's login group (e.g., `ssh-users`) likely doesn't have POSIX enabled:
```bash
# Check if group has POSIX
getent group ssh-users
# If empty, enable POSIX on the server
kanidm group posix set ssh-users --gidnumber 68000
```
### User doesn't resolve via getent
1. Check kanidm-unixd service is running:
```bash
systemctl status kanidm-unixd
```
2. Check unixd can reach server:
```bash
kanidm-unix status
# Should show: system: online, Kanidm: online
```
3. Check client can reach server:
```bash
curl -s https://auth.home.2rjus.net/status
```
4. Check user has POSIX enabled on server:
```bash
kanidm person get <username>
```
5. Restart nscd to clear stale cache:
```bash
systemctl restart nscd
```
6. Invalidate kanidm cache:
```bash
kanidm-unix cache-invalidate
```
### Changes not taking effect after deployment
NixOS uses nsncd (a Rust reimplementation of nscd) for NSS caching. After deploying
kanidm-unixd config changes, you may need to restart both services:
```bash
systemctl restart kanidm-unixd
systemctl restart nscd
```
### Test PAM authentication directly
Use the kanidm-unix CLI to test PAM auth without SSH:
```bash
kanidm-unix auth-test --name <username>
```

View File

@@ -207,6 +207,7 @@
pkgs.ansible
pkgs.opentofu
pkgs.openbao
pkgs.kanidm_1_8
(pkgs.callPackage ./scripts/create-host { })
homelab-deploy.packages.${pkgs.system}.default
];

View File

@@ -64,9 +64,5 @@
vault.enable = true;
homelab.deploy.enable = true;
zramSwap = {
enable = true;
};
system.stateVersion = "23.11"; # Did you read the comment?
}

View File

@@ -4,6 +4,5 @@
./configuration.nix
../../services/nix-cache
../../services/actions-runner
./zram.nix
];
}

View File

@@ -1,6 +0,0 @@
{ ... }:
{
zramSwap = {
enable = true;
};
}

View File

@@ -79,5 +79,8 @@
# Or disable the firewall altogether.
networking.firewall.enable = false;
# Compressed swap in RAM - prevents OOM during bootstrap nixos-rebuild
zramSwap.enable = true;
system.stateVersion = "25.11";
}

View File

@@ -25,6 +25,9 @@
# Enable remote deployment via NATS
homelab.deploy.enable = true;
# Enable Kanidm PAM/NSS for central authentication
homelab.kanidm.enable = true;
nixpkgs.config.allowUnfree = true;
boot.loader.grub.enable = true;
boot.loader.grub.device = "/dev/vda";

View File

@@ -25,6 +25,9 @@
# Enable remote deployment via NATS
homelab.deploy.enable = true;
# Enable Kanidm PAM/NSS for central authentication
homelab.kanidm.enable = true;
nixpkgs.config.allowUnfree = true;
boot.loader.grub.enable = true;
boot.loader.grub.device = "/dev/vda";

View File

@@ -25,6 +25,9 @@
# Enable remote deployment via NATS
homelab.deploy.enable = true;
# Enable Kanidm PAM/NSS for central authentication
homelab.kanidm.enable = true;
nixpkgs.config.allowUnfree = true;
boot.loader.grub.enable = true;
boot.loader.grub.device = "/dev/vda";

View File

@@ -17,7 +17,8 @@
};
};
# Provisioning - initial users/groups
# Provision base groups only - users are managed via CLI
# See docs/user-management.md for details
provision = {
enable = true;
idmAdminPasswordFile = config.vault.secrets.kanidm-idm-admin.outputDir;
@@ -28,10 +29,7 @@
ssh-users = { };
};
persons.torjus = {
displayName = "Torjus";
groups = [ "admins" "users" "ssh-users" ];
};
# Regular users (persons) are managed imperatively via kanidm CLI
};
};
@@ -46,7 +44,7 @@
extraDomainNames = [ "${config.networking.hostName}.home.2rjus.net" ];
};
# Vault secret for idm_admin password
# Vault secret for idm_admin password (used for provisioning)
vault.secrets.kanidm-idm-admin = {
secretPath = "kanidm/idm-admin-password";
extractKey = "password";

View File

@@ -4,6 +4,7 @@
./acme.nix
./autoupgrade.nix
./homelab-deploy.nix
./kanidm-client.nix
./monitoring
./motd.nix
./packages.nix
@@ -12,5 +13,6 @@
./pki/root-ca.nix
./sshd.nix
./vault-secrets.nix
./zram.nix
];
}

42
system/kanidm-client.nix Normal file
View File

@@ -0,0 +1,42 @@
{ lib, config, pkgs, ... }:
let
cfg = config.homelab.kanidm;
in
{
options.homelab.kanidm = {
enable = lib.mkEnableOption "Kanidm PAM/NSS client for central authentication";
server = lib.mkOption {
type = lib.types.str;
default = "https://auth.home.2rjus.net";
description = "URI of the Kanidm server";
};
allowedLoginGroups = lib.mkOption {
type = lib.types.listOf lib.types.str;
default = [ "ssh-users" ];
description = "Groups allowed to log in via PAM";
};
};
config = lib.mkIf cfg.enable {
services.kanidm = {
package = pkgs.kanidm_1_8;
enablePam = true;
clientSettings = {
uri = cfg.server;
};
unixSettings = {
pam_allowed_login_groups = cfg.allowedLoginGroups;
# Use short names (torjus) instead of SPN format (torjus@home.2rjus.net)
# This prevents "PAM user mismatch" errors with SSH
uid_attr_map = "name";
gid_attr_map = "name";
# Create symlink /home/torjus -> /home/torjus@home.2rjus.net
home_alias = "name";
};
};
};
}

8
system/zram.nix Normal file
View File

@@ -0,0 +1,8 @@
# Compressed swap in RAM
#
# Provides overflow memory during Nix builds and upgrades.
# Prevents OOM kills on low-memory hosts (2GB VMs).
{ ... }:
{
zramSwap.enable = true;
}