Compare commits: `463342133e...kanidm-pam` (11 commits)
| Author | SHA1 | Date |
|---|---|---|
| | 9ed09c9a9c | |
| | b31c64f1b9 | |
| | 54b6e37420 | |
| | b845a8bb8b | |
| | bfbf0cea68 | |
| | 3abe5e83a7 | |
| | 67c27555f3 | |
| | 1674b6a844 | |
| | 311be282b6 | |
| | 11cbb64097 | |
| | e2dd21c994 | |
180
.claude/agents/auditor.md
Normal file
@@ -0,0 +1,180 @@
|
|||||||
|
---
|
||||||
|
name: auditor
|
||||||
|
description: Analyzes audit logs to investigate user activity, command execution, and suspicious behavior on hosts. Can be used standalone for security reviews or called by other agents for behavioral context.
|
||||||
|
tools: Read, Grep, Glob
|
||||||
|
mcpServers:
|
||||||
|
- lab-monitoring
|
||||||
|
---
|
||||||
|
|
||||||
|
You are a security auditor for a NixOS homelab infrastructure. Your task is to analyze audit logs and reconstruct user activity on hosts.
|
||||||
|
|
||||||
|
## Input
|
||||||
|
|
||||||
|
You may receive:
|
||||||
|
- A host or list of hosts to investigate
|
||||||
|
- A time window (e.g., "last hour", "today", "between 14:00 and 15:00")
|
||||||
|
- Optional context: specific events to look for, user to focus on, or suspicious activity to investigate
|
||||||
|
- Optional context from a parent investigation (e.g., "a service stopped at 14:32, what happened around that time?")
|
||||||
|
|
||||||
|
## Audit Log Structure
|
||||||
|
|
||||||
|
Logs are shipped to Loki via promtail. Audit events use these labels:
|
||||||
|
- `host` - hostname
|
||||||
|
- `systemd_unit` - typically `auditd.service` for audit logs
|
||||||
|
- `job` - typically `systemd-journal`
|
||||||
|
|
||||||
|
Audit log entries contain structured data:
|
||||||
|
- `EXECVE` - command execution with full arguments
|
||||||
|
- `USER_LOGIN` / `USER_LOGOUT` - session start/end
|
||||||
|
- `USER_CMD` - sudo command execution
|
||||||
|
- `CRED_ACQ` / `CRED_DISP` - credential acquisition/disposal
|
||||||
|
- `SERVICE_START` / `SERVICE_STOP` - systemd service events
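A minimal sketch of how an `EXECVE` record can be turned back into a command line, assuming the usual auditd layout in which arguments appear as `a0=`, `a1=`, ... and values containing spaces or special characters are hex-encoded instead of quoted. The function name `decode_execve` and the sample record are illustrative, not part of this repository, and very long arguments that auditd splits into `a0[0]=`, `a0[1]=` chunks are not handled.

```python
import re

# a0="ls" (quoted) or a2=726D202D7266 (hex-encoded bytes, unquoted).
_ARG_RE = re.compile(r'\ba(\d+)=(?:"([^"]*)"|((?:[0-9A-Fa-f]{2})+))')


def decode_execve(record: str) -> list[str]:
    """Return argv from a single EXECVE audit record line.

    Other record types (for example SYSCALL) also carry a0..a3 fields
    with a different meaning, so they are ignored here.
    """
    if "EXECVE" not in record:
        return []
    args: dict[int, str] = {}
    for match in _ARG_RE.finditer(record):
        index = int(match.group(1))
        if match.group(2) is not None:
            # Quoted value: taken literally.
            args[index] = match.group(2)
        else:
            # Unquoted value: hex-encoded bytes.
            raw = bytes.fromhex(match.group(3))
            args[index] = raw.decode("utf-8", errors="replace")
    return [args[i] for i in sorted(args)]


if __name__ == "__main__":
    sample = (
        'type=EXECVE msg=audit(1770561120.123:456): argc=3 '
        'a0="sh" a1="-c" a2=726D202D7266202F746D702F78'
    )
    print(" ".join(decode_execve(sample)))
```

Decoding the hex form matters when filtering: a plain-text filter such as `|= "rm"` cannot match an argument that auditd stored hex-encoded.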
|
||||||
|
|
||||||
|
## Investigation Techniques
|
||||||
|
|
||||||
|
### 1. SSH Session Activity
|
||||||
|
|
||||||
|
Find SSH logins and session activity:
|
||||||
|
```logql
|
||||||
|
{host="<hostname>", systemd_unit="sshd.service"}
|
||||||
|
```
|
||||||
|
|
||||||
|
Look for:
|
||||||
|
- Accepted/Failed authentication
|
||||||
|
- Session opened/closed
|
||||||
|
- Unusual source IPs or users
|
||||||
|
|
||||||
|
### 2. Command Execution
|
||||||
|
|
||||||
|
Query executed commands (filter out noise):
|
||||||
|
```logql
|
||||||
|
{host="<hostname>"} |= "EXECVE" != "PATH item" != "PROCTITLE" != "SYSCALL" != "BPF"
|
||||||
|
```
|
||||||
|
|
||||||
|
Further filtering:
|
||||||
|
- Exclude systemd noise: `!= "systemd" != "/nix/store"`
|
||||||
|
- Focus on specific commands: `|= "rm" |= "-rf"`
|
||||||
|
- Focus on specific user: `|= "uid=1000"`
|
||||||
|
|
||||||
|
### 3. Sudo Activity
|
||||||
|
|
||||||
|
Check for privilege escalation:
|
||||||
|
```logql
|
||||||
|
{host="<hostname>"} |= "sudo" |= "COMMAND"
|
||||||
|
```
|
||||||
|
|
||||||
|
Or via audit:
|
||||||
|
```logql
|
||||||
|
{host="<hostname>"} |= "USER_CMD"
|
||||||
|
```
|
||||||
|
|
||||||
|
### 4. Service Manipulation
|
||||||
|
|
||||||
|
Check if services were manually stopped/started:
|
||||||
|
```logql
|
||||||
|
{host="<hostname>"} |= "EXECVE" |= "systemctl"
|
||||||
|
```
|
||||||
|
|
||||||
|
### 5. File Operations
|
||||||
|
|
||||||
|
Look for file modifications (if auditd rules are configured):
|
||||||
|
```logql
|
||||||
|
{host="<hostname>"} |= "EXECVE" |= "vim"
|
||||||
|
{host="<hostname>"} |= "EXECVE" |= "nano"
|
||||||
|
{host="<hostname>"} |= "EXECVE" |= "rm"
|
||||||
|
```
|
||||||
|
|
||||||
|
## Query Guidelines
|
||||||
|
|
||||||
|
**Start narrow, expand if needed:**
|
||||||
|
- Begin with `limit: 20-30`
|
||||||
|
- Use tight time windows: `start: "15m"` or `start: "30m"`
|
||||||
|
- Add filters progressively
|
||||||
|
|
||||||
|
**Avoid:**
|
||||||
|
- Querying all audit logs without EXECVE filter (extremely verbose)
|
||||||
|
- Large time ranges without specific filters
|
||||||
|
- Limits over 50 without tight filters
|
||||||
|
|
||||||
|
**Time-bounded queries:**
|
||||||
|
When investigating around a specific event:
|
||||||
|
```logql
|
||||||
|
{host="<hostname>"} |= "EXECVE" != "systemd"
|
||||||
|
```
|
||||||
|
With `start: "2026-02-08T14:30:00Z"` and `end: "2026-02-08T14:35:00Z"`
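As an illustration of how such a window might be derived from a single event time, here is a small sketch. The helper name `window_around` and the default offsets (2 minutes before, 3 minutes after) are assumptions chosen to reproduce the example above; they are not values prescribed by the tooling.

```python
from datetime import datetime, timedelta, timezone

_FMT = "%Y-%m-%dT%H:%M:%SZ"


def window_around(event: str, before_min: int = 2, after_min: int = 3) -> tuple[str, str]:
    """Return (start, end) UTC timestamps bracketing an event time."""
    ts = datetime.strptime(event, _FMT).replace(tzinfo=timezone.utc)
    start = ts - timedelta(minutes=before_min)
    end = ts + timedelta(minutes=after_min)
    return start.strftime(_FMT), end.strftime(_FMT)


if __name__ == "__main__":
    # A service stopped at 14:32; look slightly before and after.
    print(window_around("2026-02-08T14:32:00Z"))
```

Starting a little before the event catches the login or command that preceded it, while the short tail keeps the result set small.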
|
||||||
|
|
||||||
|
## Suspicious Patterns to Watch For
|
||||||
|
|
||||||
|
1. **Unusual login times** - Activity outside normal hours
|
||||||
|
2. **Failed authentication** - Brute force attempts
|
||||||
|
3. **Privilege escalation** - Unexpected sudo usage
|
||||||
|
4. **Reconnaissance commands** - `whoami`, `id`, `uname`, `cat /etc/passwd`
|
||||||
|
5. **Data exfiltration indicators** - `curl`, `wget`, `scp`, `rsync` to external destinations
|
||||||
|
6. **Persistence mechanisms** - Cron modifications, systemd service creation
|
||||||
|
7. **Log tampering** - Commands targeting log files
|
||||||
|
8. **Lateral movement** - SSH to other internal hosts
|
||||||
|
9. **Service manipulation** - Stopping security services, disabling firewalls
|
||||||
|
10. **Cleanup activity** - Deleting bash history, clearing logs
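A rough sketch of how a decoded command could be tagged against part of this watch list. The mapping below covers only the commands named explicitly in items 4 and 5; the names `PATTERNS` and `tag_command` are hypothetical, and a match is a prompt for closer inspection, not evidence of malicious intent.

```python
# Command names taken from the watch list above (items 4 and 5 only).
PATTERNS = {
    "reconnaissance": {"whoami", "id", "uname"},
    "data exfiltration indicator": {"curl", "wget", "scp", "rsync"},
}


def tag_command(argv: list[str]) -> list[str]:
    """Return the watch-list labels whose command names match argv[0]."""
    if not argv:
        return []
    # Binaries are often invoked by absolute path; compare the basename only.
    name = argv[0].rsplit("/", 1)[-1]
    return sorted(label for label, names in PATTERNS.items() if name in names)


if __name__ == "__main__":
    print(tag_command(["/run/current-system/sw/bin/curl", "https://example.org"]))
    print(tag_command(["ls", "-la"]))
```

Whether a tagged command is actually suspicious still depends on context such as the destination, the user, and the host's role.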
|
||||||
|
|
||||||
|
## Output Format
|
||||||
|
|
||||||
|
### For Standalone Security Reviews
|
||||||
|
|
||||||
|
```
|
||||||
|
## Activity Summary
|
||||||
|
|
||||||
|
**Host:** <hostname>
|
||||||
|
**Time Period:** <start> to <end>
|
||||||
|
**Sessions Found:** <count>
|
||||||
|
|
||||||
|
## User Sessions
|
||||||
|
|
||||||
|
### Session 1: <user> from <source_ip>
|
||||||
|
- **Login:** HH:MM:SSZ
|
||||||
|
- **Logout:** HH:MM:SSZ (or ongoing)
|
||||||
|
- **Commands executed:**
|
||||||
|
- HH:MM:SSZ - <command>
|
||||||
|
- HH:MM:SSZ - <command>
|
||||||
|
|
||||||
|
## Suspicious Activity
|
||||||
|
|
||||||
|
[If any patterns from the watch list were detected]
|
||||||
|
- **Finding:** <description>
|
||||||
|
- **Evidence:** <log entries>
|
||||||
|
- **Risk Level:** Low / Medium / High
|
||||||
|
|
||||||
|
## Summary
|
||||||
|
|
||||||
|
[Overall assessment: normal activity, concerning patterns, or clear malicious activity]
|
||||||
|
```
|
||||||
|
|
||||||
|
### When Called by Another Agent
|
||||||
|
|
||||||
|
Provide a focused response addressing the specific question:
|
||||||
|
|
||||||
|
```
|
||||||
|
## Audit Findings
|
||||||
|
|
||||||
|
**Query:** <what was asked>
|
||||||
|
**Time Window:** <investigated period>
|
||||||
|
|
||||||
|
## Relevant Activity
|
||||||
|
|
||||||
|
[Chronological list of relevant events]
|
||||||
|
- HH:MM:SSZ - <event>
|
||||||
|
- HH:MM:SSZ - <event>
|
||||||
|
|
||||||
|
## Assessment
|
||||||
|
|
||||||
|
[Direct answer to the question with supporting evidence]
|
||||||
|
```
|
||||||
|
|
||||||
|
## Guidelines
|
||||||
|
|
||||||
|
- Reconstruct timelines chronologically
|
||||||
|
- Correlate events (login → commands → logout)
|
||||||
|
- Note gaps or missing data
|
||||||
|
- Distinguish between automated (systemd, cron) and interactive activity
|
||||||
|
- Consider the host's role and tier when assessing severity
|
||||||
|
- When called by another agent, focus on answering their specific question
|
||||||
|
- Don't speculate without evidence - state what the logs show and don't show
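The correlation guideline (login → commands → logout) can be sketched as a time-interval match. This assumes timestamps are UTC strings in one fixed format so that they sort lexicographically; `Session` and `assign_commands` are illustrative names. Commands that fall outside every session, or inside more than one, are returned separately instead of being guessed at, in line with the last guideline.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Session:
    user: str
    login: str                      # e.g. "2026-02-08T14:30:00Z"
    logout: Optional[str] = None    # None means the session is still open
    commands: list = field(default_factory=list)


def assign_commands(sessions: list, commands: list) -> list:
    """Attach (timestamp, command) pairs to the one session covering them.

    Returns the pairs that could not be attributed unambiguously.
    """
    unmatched = []
    for ts, cmd in sorted(commands):
        owners = [
            s for s in sessions
            if s.login <= ts and (s.logout is None or ts <= s.logout)
        ]
        if len(owners) == 1:
            owners[0].commands.append((ts, cmd))
        else:
            # No session, or overlapping sessions: report, do not guess.
            unmatched.append((ts, cmd))
    return unmatched


if __name__ == "__main__":
    s = Session("torjus", "2026-02-08T14:30:00Z", "2026-02-08T14:40:00Z")
    rest = assign_commands([s], [("2026-02-08T14:32:00Z", "systemctl stop example.service")])
    print(s.commands, rest)
```

Unmatched commands are often automated activity (systemd, cron), which is worth stating explicitly in the report.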
|
||||||
@@ -4,6 +4,7 @@ description: Investigates a single system alarm by querying Prometheus metrics a
|
|||||||
tools: Read, Grep, Glob
|
tools: Read, Grep, Glob
|
||||||
mcpServers:
|
mcpServers:
|
||||||
- lab-monitoring
|
- lab-monitoring
|
||||||
|
- git-explorer
|
||||||
---
|
---
|
||||||
|
|
||||||
You are an alarm investigation specialist for a NixOS homelab infrastructure. Your task is to analyze a single alarm and determine its root cause.
|
You are an alarm investigation specialist for a NixOS homelab infrastructure. Your task is to analyze a single alarm and determine its root cause.
|
||||||
@@ -33,9 +34,9 @@ Gather evidence about the current system state:
|
|||||||
- Use `list_targets` to verify the host/service is being scraped successfully
|
- Use `list_targets` to verify the host/service is being scraped successfully
|
||||||
- Look for correlated metrics that might explain the issue
|
- Look for correlated metrics that might explain the issue
|
||||||
|
|
||||||
### 3. Check Logs
|
### 3. Check Service Logs
|
||||||
|
|
||||||
Search for relevant log entries using `query_logs`. **Be careful to avoid overly broad queries that return too much data.**
|
Search for relevant log entries using `query_logs`. Focus on service-specific logs and errors.
|
||||||
|
|
||||||
**Query strategies (start narrow, expand if needed):**
|
**Query strategies (start narrow, expand if needed):**
|
||||||
- Start with `limit: 20-30`, increase only if needed
|
- Start with `limit: 20-30`, increase only if needed
|
||||||
@@ -43,23 +44,42 @@ Search for relevant log entries using `query_logs`. **Be careful to avoid overly
|
|||||||
- Filter to specific services: `{host="<hostname>", systemd_unit="<service>.service"}`
|
- Filter to specific services: `{host="<hostname>", systemd_unit="<service>.service"}`
|
||||||
- Search for errors: `{host="<hostname>"} |= "error"` or `|= "failed"`
|
- Search for errors: `{host="<hostname>"} |= "error"` or `|= "failed"`
|
||||||
|
|
||||||
**For audit logs (SSH sessions, command execution):**
|
|
||||||
- Filter to just commands: `{host="<hostname>"} |= "EXECVE"`
|
|
||||||
- Exclude verbose noise: `!= "PATH item" != "PROCTITLE" != "SYSCALL" != "BPF"`
|
|
||||||
- Example: `{host="testvm01"} |= "EXECVE" != "systemd"` (user commands only)
|
|
||||||
|
|
||||||
**Common patterns:**
|
**Common patterns:**
|
||||||
- Service logs: `{host="<hostname>", systemd_unit="<service>.service"}`
|
- Service logs: `{host="<hostname>", systemd_unit="<service>.service"}`
|
||||||
- SSH activity: `{host="<hostname>", systemd_unit="sshd.service"}`
|
|
||||||
- All errors on host: `{host="<hostname>"} |= "error"`
|
- All errors on host: `{host="<hostname>"} |= "error"`
|
||||||
- Specific command: `{host="<hostname>"} |= "EXECVE" |= "stress"`
|
- Journal for a unit: `{host="<hostname>", systemd_unit="nginx.service"} |= "failed"`
|
||||||
|
|
||||||
**Avoid:**
|
**Avoid:**
|
||||||
- Querying all audit logs without filtering (very verbose)
|
|
||||||
- Using `start: "1h"` with no filters on busy hosts
|
- Using `start: "1h"` with no filters on busy hosts
|
||||||
- Limits over 50 without specific filters
|
- Limits over 50 without specific filters
|
||||||
|
|
||||||
### 4. Check Configuration (if relevant)
|
### 4. Investigate User Activity
|
||||||
|
|
||||||
|
For any analysis of user activity, **always spawn the `auditor` agent**. Do not query audit logs (EXECVE, USER_LOGIN, etc.) directly - delegate this to the auditor.
|
||||||
|
|
||||||
|
**Always call the auditor when:**
|
||||||
|
- A service stopped unexpectedly (may have been manually stopped)
|
||||||
|
- A process was killed or a config was changed
|
||||||
|
- You need to know who was logged in around the time of an incident
|
||||||
|
- You need to understand what commands led to the current state
|
||||||
|
- The cause isn't obvious from service logs alone
|
||||||
|
|
||||||
|
**Do NOT try to query audit logs yourself.** The auditor is specialized for:
|
||||||
|
- Parsing EXECVE records and reconstructing command lines
|
||||||
|
- Correlating SSH sessions with commands executed
|
||||||
|
- Identifying suspicious patterns
|
||||||
|
- Filtering out systemd/nix-store noise
|
||||||
|
|
||||||
|
**Example prompt for auditor:**
|
||||||
|
```
|
||||||
|
Investigate user activity on <hostname> between <start_time> and <end_time>.
|
||||||
|
Context: The prometheus-node-exporter service stopped at 14:32.
|
||||||
|
Determine if it was manually stopped and by whom.
|
||||||
|
```
|
||||||
|
|
||||||
|
Incorporate the auditor's findings into your timeline and root cause analysis.
|
||||||
|
|
||||||
|
### 5. Check Configuration (if relevant)
|
||||||
|
|
||||||
If the alert relates to a NixOS-managed service:
|
If the alert relates to a NixOS-managed service:
|
||||||
- Check host configuration in `/hosts/<hostname>/`
|
- Check host configuration in `/hosts/<hostname>/`
|
||||||
@@ -67,9 +87,61 @@ If the alert relates to a NixOS-managed service:
|
|||||||
- Look for thresholds, resource limits, or misconfigurations
|
- Look for thresholds, resource limits, or misconfigurations
|
||||||
- Check `homelab.host` options for tier/priority/role metadata
|
- Check `homelab.host` options for tier/priority/role metadata
|
||||||
|
|
||||||
### 5. Consider Common Causes
|
### 6. Check for Configuration Drift
|
||||||
|
|
||||||
|
Use the git-explorer MCP server to compare the host's deployed configuration against the current master branch. This helps identify:
|
||||||
|
- Hosts running outdated configurations
|
||||||
|
- Recent changes that might have caused the issue
|
||||||
|
- Whether a fix has already been committed but not deployed
|
||||||
|
|
||||||
|
**Step 1: Get the deployed revision from Prometheus**
|
||||||
|
```promql
|
||||||
|
nixos_flake_info{hostname="<hostname>"}
|
||||||
|
```
|
||||||
|
The `current_rev` label contains the deployed git commit hash.
|
||||||
|
|
||||||
|
**Step 2: Check if the host is behind master**
|
||||||
|
```
|
||||||
|
resolve_ref("master") # Get current master commit
|
||||||
|
is_ancestor(deployed, master) # Check if host is behind
|
||||||
|
```
|
||||||
|
|
||||||
|
**Step 3: See what commits are missing**
|
||||||
|
```
|
||||||
|
commits_between(deployed, master) # List commits not yet deployed
|
||||||
|
```
|
||||||
|
|
||||||
|
**Step 4: Check which files changed**
|
||||||
|
```
|
||||||
|
get_diff_files(deployed, master) # Files modified since deployment
|
||||||
|
```
|
||||||
|
Look for files in `hosts/<hostname>/`, `services/<relevant-service>/`, or `system/` that affect this host.
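The path check described in Step 4 can be sketched as a simple prefix filter over the changed-file list. The function name `relevant_changes` is hypothetical, and the input is assumed to be a list of repository-relative paths such as the diff-files call might return.

```python
def relevant_changes(changed_files: list[str], hostname: str, service: str) -> list[str]:
    """Keep only changed paths that can affect the given host or service."""
    prefixes = (
        f"hosts/{hostname}/",    # host-specific configuration
        f"services/{service}/",  # the service under investigation
        "system/",               # modules shared by all hosts
    )
    return [path for path in changed_files if path.startswith(prefixes)]


if __name__ == "__main__":
    files = [
        "hosts/monitoring01/configuration.nix",
        "hosts/ns1/configuration.nix",
        "system/zram.nix",
        "docs/plans/security-hardening.md",
    ]
    print(relevant_changes(files, "monitoring01", "monitoring"))
```

An empty result suggests the host is behind master only in ways that do not touch its own configuration, which lowers the likelihood that drift explains the alert.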
|
||||||
|
|
||||||
|
**Step 5: View configuration at the deployed revision**
|
||||||
|
```
|
||||||
|
get_file_at_commit(deployed, "services/<service>/default.nix")
|
||||||
|
```
|
||||||
|
Compare against the current file to understand differences.
|
||||||
|
|
||||||
|
**Step 6: Find when something changed**
|
||||||
|
```
|
||||||
|
search_commits("<service-name>") # Find commits mentioning the service
|
||||||
|
get_commit_info(<hash>) # Get full details of a specific change
|
||||||
|
```
|
||||||
|
|
||||||
|
**Example workflow for a service-related alert:**
|
||||||
|
1. Query `nixos_flake_info{hostname="monitoring01"}` → `current_rev: 8959829`
|
||||||
|
2. `resolve_ref("master")` → `4633421`
|
||||||
|
3. `is_ancestor("8959829", "4633421")` → Yes, host is behind
|
||||||
|
4. `commits_between("8959829", "4633421")` → 7 commits missing
|
||||||
|
5. `get_diff_files("8959829", "4633421")` → Check if relevant service files changed
|
||||||
|
6. If a fix was committed after the deployed rev, recommend deployment
|
||||||
|
|
||||||
|
### 7. Consider Common Causes
|
||||||
|
|
||||||
For infrastructure alerts, common causes include:
|
For infrastructure alerts, common causes include:
|
||||||
|
- **Manual intervention**: Service manually stopped/restarted (call auditor to confirm)
|
||||||
|
- **Configuration drift**: Host running outdated config, fix already in master
|
||||||
- **Disk space**: Nix store growth, logs, temp files
|
- **Disk space**: Nix store growth, logs, temp files
|
||||||
- **Memory pressure**: Service memory leaks, insufficient limits
|
- **Memory pressure**: Service memory leaks, insufficient limits
|
||||||
- **CPU**: Runaway processes, build jobs
|
- **CPU**: Runaway processes, build jobs
|
||||||
@@ -77,6 +149,8 @@ For infrastructure alerts, common causes include:
|
|||||||
- **Service restarts**: Failed upgrades, configuration errors
|
- **Service restarts**: Failed upgrades, configuration errors
|
||||||
- **Scrape failures**: Service down, firewall issues, port changes
|
- **Scrape failures**: Service down, firewall issues, port changes
|
||||||
|
|
||||||
|
**Note:** If a service stopped unexpectedly and service logs don't show a crash or error, it was likely manual intervention - call the auditor to investigate.
|
||||||
|
|
||||||
## Output Format
|
## Output Format
|
||||||
|
|
||||||
Provide a concise report with one of two outcomes:
|
Provide a concise report with one of two outcomes:
|
||||||
@@ -133,6 +207,5 @@ Provide a concise report with one of two outcomes:
|
|||||||
- If the alert is a false positive or expected behavior, explain why
|
- If the alert is a false positive or expected behavior, explain why
|
||||||
- Consider the host's tier (test vs prod) when assessing severity
|
- Consider the host's tier (test vs prod) when assessing severity
|
||||||
- Build a timeline from log timestamps and metrics to show the sequence of events
|
- Build a timeline from log timestamps and metrics to show the sequence of events
|
||||||
- Include precursor events (logins, config changes, restarts) that led to the issue
|
|
||||||
- **Query logs incrementally**: start with narrow filters and small limits, expand only if needed
|
- **Query logs incrementally**: start with narrow filters and small limits, expand only if needed
|
||||||
- **Avoid broad audit log queries**: always filter to EXECVE and exclude noise (PATH, SYSCALL, BPF)
|
- **Always delegate to the auditor agent** for any user activity analysis - never query EXECVE or audit logs directly
|
||||||
|
|||||||
@@ -33,6 +33,13 @@
|
|||||||
"--nats-url", "nats://nats1.home.2rjus.net:4222",
|
"--nats-url", "nats://nats1.home.2rjus.net:4222",
|
||||||
"--nkey-file", "/home/torjus/.config/homelab-deploy/test-deployer.nkey"
|
"--nkey-file", "/home/torjus/.config/homelab-deploy/test-deployer.nkey"
|
||||||
]
|
]
|
||||||
|
},
|
||||||
|
"git-explorer": {
|
||||||
|
"command": "nix",
|
||||||
|
"args": ["run", "git+https://git.t-juice.club/torjus/labmcp#git-explorer", "--", "serve"],
|
||||||
|
"env": {
|
||||||
|
"GIT_REPO_PATH": "/home/torjus/git/nixos-servers"
|
||||||
|
}
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
|
|||||||
@@ -66,9 +66,9 @@ This future migration path is a strong argument for Kanidm over LDAP-only soluti
|
|||||||
- Vault integration for idm_admin password
|
- Vault integration for idm_admin password
|
||||||
- LDAPS on port 636
|
- LDAPS on port 636
|
||||||
|
|
||||||
2. **Configure declarative provisioning** ✅
|
2. **Configure provisioning** ✅
|
||||||
- Groups: `admins`, `users`, `ssh-users`
|
- Groups provisioned declaratively: `admins`, `users`, `ssh-users`
|
||||||
- User: `torjus` (member of all groups)
|
- Users managed imperatively via CLI (allows setting POSIX passwords in one step)
|
||||||
- POSIX attributes enabled (UID/GID range 65,536-69,999)
|
- POSIX attributes enabled (UID/GID range 65,536-69,999)
|
||||||
|
|
||||||
3. **Test NAS integration** (in progress)
|
3. **Test NAS integration** (in progress)
|
||||||
@@ -80,14 +80,16 @@ This future migration path is a strong argument for Kanidm over LDAP-only soluti
|
|||||||
- Grafana
|
- Grafana
|
||||||
- Other services as needed
|
- Other services as needed
|
||||||
|
|
||||||
5. **Create client module** in `system/` for PAM/NSS
|
5. **Create client module** in `system/` for PAM/NSS ✅
|
||||||
- Enable on all hosts that need central auth
|
- Module: `system/kanidm-client.nix`
|
||||||
- Configure trusted CA
|
- `homelab.kanidm.enable = true` enables PAM/NSS
|
||||||
|
- Short usernames (not SPN format)
|
||||||
|
- Home directory symlinks via `home_alias`
|
||||||
|
- Enabled on test tier: testvm01, testvm02, testvm03
|
||||||
|
|
||||||
6. **Documentation**
|
6. **Documentation** ✅
|
||||||
- User management procedures
|
- `docs/user-management.md` - CLI workflows, troubleshooting
|
||||||
- Adding new OAuth2 clients
|
- User/group creation procedures verified working
|
||||||
- Troubleshooting PAM/NSS issues
|
|
||||||
|
|
||||||
## Progress
|
## Progress
|
||||||
|
|
||||||
@@ -106,14 +108,37 @@ This future migration path is a strong argument for Kanidm over LDAP-only soluti
|
|||||||
- Prometheus monitoring scrape target configured
|
- Prometheus monitoring scrape target configured
|
||||||
|
|
||||||
**Provisioned entities:**
|
**Provisioned entities:**
|
||||||
- Groups: `admins`, `users`, `ssh-users`
|
- Groups: `admins`, `users`, `ssh-users` (declarative)
|
||||||
- User: `torjus` (member of all groups, POSIX enabled with GID 65536)
|
- Users managed via CLI (imperative)
|
||||||
|
|
||||||
**Verified working:**
|
**Verified working:**
|
||||||
- WebUI login with idm_admin
|
- WebUI login with idm_admin
|
||||||
- LDAP bind and search with POSIX-enabled user
|
- LDAP bind and search with POSIX-enabled user
|
||||||
- LDAPS with valid internal CA certificate
|
- LDAPS with valid internal CA certificate
|
||||||
|
|
||||||
|
### Completed (2026-02-08) - PAM/NSS Client
|
||||||
|
|
||||||
|
**Client module deployed (`system/kanidm-client.nix`):**
|
||||||
|
- `homelab.kanidm.enable = true` enables PAM/NSS integration
|
||||||
|
- Connects to auth.home.2rjus.net
|
||||||
|
- Short usernames (`torjus` instead of `torjus@home.2rjus.net`)
|
||||||
|
- Home directory symlinks (`/home/torjus` → UUID-based dir)
|
||||||
|
- Login restricted to `ssh-users` group
|
||||||
|
|
||||||
|
**Enabled on test tier:**
|
||||||
|
- testvm01, testvm02, testvm03
|
||||||
|
|
||||||
|
**Verified working:**
|
||||||
|
- User/group resolution via `getent`
|
||||||
|
- SSH login with Kanidm unix passwords
|
||||||
|
- Home directory creation with symlinks
|
||||||
|
- Imperative user/group creation via CLI
|
||||||
|
|
||||||
|
**Documentation:**
|
||||||
|
- `docs/user-management.md` with full CLI workflows
|
||||||
|
- Password requirements (min 10 chars)
|
||||||
|
- Troubleshooting guide (nscd, cache invalidation)
|
||||||
|
|
||||||
### UID/GID Range (Resolved)
|
### UID/GID Range (Resolved)
|
||||||
|
|
||||||
**Range: 65,536 - 69,999** (manually allocated)
|
**Range: 65,536 - 69,999** (manually allocated)
|
||||||
@@ -128,10 +153,9 @@ Rationale:
|
|||||||
|
|
||||||
### Next Steps
|
### Next Steps
|
||||||
|
|
||||||
1. Deploy to monitoring01 to enable Prometheus scraping
|
1. Enable PAM/NSS on production hosts (after test tier validation)
|
||||||
2. Configure TrueNAS LDAP client for NAS integration testing
|
2. Configure TrueNAS LDAP client for NAS integration testing
|
||||||
3. Add OAuth2 clients (Grafana first)
|
3. Add OAuth2 clients (Grafana first)
|
||||||
4. Create PAM/NSS client module for other hosts
|
|
||||||
|
|
||||||
## References
|
## References
|
||||||
|
|
||||||
|
|||||||
116
docs/plans/memory-issues-follow-up.md
Normal file
@@ -0,0 +1,116 @@
|
|||||||
|
# Memory Issues Follow-up
|
||||||
|
|
||||||
|
Tracking the zram change to verify it resolves OOM issues during nixos-upgrade on low-memory hosts.
|
||||||
|
|
||||||
|
## Background
|
||||||
|
|
||||||
|
On 2026-02-08, ns2 (2GB RAM) experienced an OOM kill during nixos-upgrade. The Nix evaluation process consumed ~1.6GB before being killed by the kernel. ns1 (manually increased to 4GB) succeeded with the same upgrade.
|
||||||
|
|
||||||
|
Root cause: 2GB RAM is insufficient for Nix flake evaluation without swap.
|
||||||
|
|
||||||
|
## Fix Applied
|
||||||
|
|
||||||
|
**Commit:** `1674b6a` - system: enable zram swap for all hosts
|
||||||
|
|
||||||
|
**Merged:** 2026-02-08 ~12:15 UTC
|
||||||
|
|
||||||
|
**Change:** Added `zramSwap.enable = true` to `system/zram.nix`, providing ~2GB compressed swap on all hosts.
|
||||||
|
|
||||||
|
## Timeline
|
||||||
|
|
||||||
|
| Time (UTC) | Event |
|
||||||
|
|------------|-------|
|
||||||
|
| 05:00:46 | ns2 nixos-upgrade OOM killed |
|
||||||
|
| 05:01:47 | `nixos_upgrade_failed` alert fired |
|
||||||
|
| 12:15 | zram commit merged to master |
|
||||||
|
| 12:19 | ns2 rebooted with zram enabled |
|
||||||
|
| 12:20 | ns1 rebooted (memory reduced to 2GB via tofu) |
|
||||||
|
|
||||||
|
## Hosts Affected
|
||||||
|
|
||||||
|
All 2GB VMs that run nixos-upgrade:
|
||||||
|
- ns1, ns2 (DNS)
|
||||||
|
- vault01
|
||||||
|
- testvm01, testvm02, testvm03
|
||||||
|
- kanidm01
|
||||||
|
|
||||||
|
## Metrics to Monitor
|
||||||
|
|
||||||
|
Check these in Grafana or via PromQL to verify the fix:
|
||||||
|
|
||||||
|
### Swap availability (should be ~2GB after upgrade)
|
||||||
|
```promql
|
||||||
|
node_memory_SwapTotal_bytes / 1024 / 1024
|
||||||
|
```
|
||||||
|
|
||||||
|
### Swap usage during upgrades
|
||||||
|
```promql
|
||||||
|
(node_memory_SwapTotal_bytes - node_memory_SwapFree_bytes) / 1024 / 1024
|
||||||
|
```
|
||||||
|
|
||||||
|
### Zswap compressed bytes (zswap only; expected to stay at 0 on zram-only hosts, where usage shows up in the swap metrics above)
|
||||||
|
```promql
|
||||||
|
node_memory_Zswap_bytes / 1024 / 1024
|
||||||
|
```
|
||||||
|
|
||||||
|
### Upgrade failures (should be 0)
|
||||||
|
```promql
|
||||||
|
node_systemd_unit_state{name="nixos-upgrade.service", state="failed"}
|
||||||
|
```
|
||||||
|
|
||||||
|
### Memory available during upgrades
|
||||||
|
```promql
|
||||||
|
node_memory_MemAvailable_bytes / 1024 / 1024
|
||||||
|
```
|
||||||
|
|
||||||
|
## Verification Steps
|
||||||
|
|
||||||
|
After a few days (allow auto-upgrades to run on all hosts):
|
||||||
|
|
||||||
|
1. Check all hosts have swap enabled:
|
||||||
|
```promql
|
||||||
|
node_memory_SwapTotal_bytes > 0
|
||||||
|
```
|
||||||
|
|
||||||
|
2. Check for any upgrade failures since the fix:
|
||||||
|
```promql
|
||||||
|
count_over_time(ALERTS{alertname="nixos_upgrade_failed"}[7d])
|
||||||
|
```
|
||||||
|
|
||||||
|
3. Review if any hosts used swap during upgrades (check historical graphs)
|
||||||
|
|
||||||
|
## Success Criteria
|
||||||
|
|
||||||
|
- No `nixos_upgrade_failed` alerts due to OOM after 2026-02-08
|
||||||
|
- All hosts show ~2GB swap available
|
||||||
|
- Upgrades complete successfully on 2GB VMs
|
||||||
|
|
||||||
|
## Fallback Options
|
||||||
|
|
||||||
|
If zram is insufficient:
|
||||||
|
|
||||||
|
1. **Increase VM memory** - Update `terraform/vms.tf` to 4GB for affected hosts
|
||||||
|
2. **Enable memory ballooning** - Configure VMs with dynamic memory allocation (see below)
|
||||||
|
3. **Use remote builds** - Configure `nix.buildMachines` to offload builds (flake evaluation still runs locally, so this may not prevent evaluation OOMs on its own)
|
||||||
|
4. **Reduce flake size** - Split configurations to reduce evaluation memory
|
||||||
|
|
||||||
|
### Memory Ballooning
|
||||||
|
|
||||||
|
Proxmox supports memory ballooning, which allows VMs to dynamically grow/shrink memory allocation based on demand. The balloon driver inside the guest communicates with the hypervisor to release or reclaim memory pages.
|
||||||
|
|
||||||
|
Configuration in `terraform/vms.tf`:
|
||||||
|
```hcl
|
||||||
|
memory = 4096 # maximum memory
|
||||||
|
balloon = 2048 # minimum memory (shrinks to this when idle)
|
||||||
|
```
|
||||||
|
|
||||||
|
Pros:
|
||||||
|
- VMs get memory on-demand without reboots
|
||||||
|
- Better host memory utilization
|
||||||
|
- Solves upgrade OOM without permanently allocating 4GB
|
||||||
|
|
||||||
|
Cons:
|
||||||
|
- Requires the virtio balloon driver in the guest
|
||||||
|
- Guest can experience memory pressure if host is overcommitted
|
||||||
|
|
||||||
|
Ballooning and zram are complementary - ballooning provides headroom from the host, zram provides overflow within the guest.
|
||||||
224
docs/plans/security-hardening.md
Normal file
@@ -0,0 +1,224 @@
|
|||||||
|
# Security Hardening Plan
|
||||||
|
|
||||||
|
## Overview
|
||||||
|
|
||||||
|
Address security gaps identified in infrastructure review. Focus areas: SSH hardening, network security, logging improvements, and secrets management.
|
||||||
|
|
||||||
|
## Current State
|
||||||
|
|
||||||
|
- SSH allows password auth and unrestricted root login (`system/sshd.nix`)
|
||||||
|
- Firewall disabled on all hosts (`networking.firewall.enable = false`)
|
||||||
|
- Promtail ships logs over HTTP to Loki
|
||||||
|
- Loki has no authentication (`auth_enabled = false`)
|
||||||
|
- AppRole secret-IDs never expire (`secret_id_ttl = 0`)
|
||||||
|
- Vault TLS verification disabled by default (`skipTlsVerify = true`)
|
||||||
|
- Audit logging exists (`common/ssh-audit.nix`) but not applied globally
|
||||||
|
- Alert rules focus on availability, no security event detection
|
||||||
|
|
||||||
|
## Priority Matrix
|
||||||
|
|
||||||
|
| Issue | Severity | Effort | Priority |
|
||||||
|
|-------|----------|--------|----------|
|
||||||
|
| SSH password auth | High | Low | **P1** |
|
||||||
|
| Firewall disabled | High | Medium | **P1** |
|
||||||
|
| Promtail HTTP (no TLS) | High | Medium | **P2** |
|
||||||
|
| No security alerting | Medium | Low | **P2** |
|
||||||
|
| Audit logging not global | Low | Low | **P2** |
|
||||||
|
| Loki no auth | Medium | Medium | **P3** |
|
||||||
|
| Secret-ID TTL | Medium | Medium | **P3** |
|
||||||
|
| Vault skipTlsVerify | Medium | Low | **P3** |
|
||||||
|
|
||||||
|
## Phase 1: Quick Wins (P1)
|
||||||
|
|
||||||
|
### 1.1 SSH Hardening
|
||||||
|
|
||||||
|
Edit `system/sshd.nix`:
|
||||||
|
|
||||||
|
```nix
|
||||||
|
services.openssh = {
|
||||||
|
enable = true;
|
||||||
|
settings = {
|
||||||
|
PermitRootLogin = "prohibit-password"; # Key-only root login
|
||||||
|
PasswordAuthentication = false;
|
||||||
|
KbdInteractiveAuthentication = false;
|
||||||
|
};
|
||||||
|
};
|
||||||
|
```
|
||||||
|
|
||||||
|
**Prerequisite:** Verify all hosts have SSH keys deployed for root.
|
||||||
|
|
||||||
|
### 1.2 Enable Firewall
|
||||||
|
|
||||||
|
Create `system/firewall.nix` with default deny policy:
|
||||||
|
|
||||||
|
```nix
|
||||||
|
{ ... }: {
|
||||||
|
networking.firewall.enable = true;
|
||||||
|
|
||||||
|
# Use openssh's built-in firewall integration
|
||||||
|
services.openssh.openFirewall = true;
|
||||||
|
}
|
||||||
|
```
|
||||||
|
|
||||||
|
**Useful firewall options:**
|
||||||
|
|
||||||
|
| Option | Description |
|
||||||
|
|--------|-------------|
|
||||||
|
| `networking.firewall.trustedInterfaces` | Accept all traffic from these interfaces (e.g., `[ "lo" ]`) |
|
||||||
|
| `networking.firewall.interfaces.<name>.allowedTCPPorts` | Per-interface port rules |
|
||||||
|
| `networking.firewall.extraInputRules` | Custom nftables rules for complex filtering (only applies when `networking.nftables.enable = true`) |
|
||||||
|
|
||||||
|
**Network range restrictions:** Consider restricting SSH to the infrastructure subnet (`10.69.13.0/24`) using `extraInputRules` for defense in depth. However, this adds complexity and may not be necessary given the trusted network model.
|
||||||
|
|
||||||
|
#### Per-Interface Rules (http-proxy WireGuard)
|
||||||
|
|
||||||
|
The `http-proxy` host has a WireGuard interface (`wg0`) that may need different rules than the LAN interface. Use `networking.firewall.interfaces` to apply per-interface policies:
|
||||||
|
|
||||||
|
```nix
|
||||||
|
# Example: http-proxy with different rules per interface
|
||||||
|
networking.firewall = {
|
||||||
|
enable = true;
|
||||||
|
|
||||||
|
# Default: only SSH (via openFirewall)
|
||||||
|
allowedTCPPorts = [ ];
|
||||||
|
|
||||||
|
# LAN interface: allow HTTP/HTTPS
|
||||||
|
interfaces.ens18 = {
|
||||||
|
allowedTCPPorts = [ 80 443 ];
|
||||||
|
};
|
||||||
|
|
||||||
|
# WireGuard interface: restrict to specific services or trust fully
|
||||||
|
interfaces.wg0 = {
|
||||||
|
allowedTCPPorts = [ 80 443 ];
|
||||||
|
# Or set the top-level networking.firewall.trustedInterfaces = [ "wg0" ] if fully trusted
|
||||||
|
};
|
||||||
|
};
|
||||||
|
```
|
||||||
|
|
||||||
|
**TODO:** Investigate current WireGuard usage on http-proxy to determine appropriate rules.
|
||||||
|
|
||||||
|
Then per-host, open required ports:
|
||||||
|
|
||||||
|
| Host | Additional Ports |
|
||||||
|
|------|------------------|
|
||||||
|
| ns1/ns2 | 53 (TCP/UDP) |
|
||||||
|
| vault01 | 8200 |
|
||||||
|
| monitoring01 | 3100, 9090, 3000, 9093 |
|
||||||
|
| http-proxy | 80, 443 |
|
||||||
|
| nats1 | 4222 |
|
||||||
|
| ha1 | 1883, 8123 |
|
||||||
|
| jelly01 | 8096 |
|
||||||
|
| nix-cache01 | 5000 |
|
||||||
|
|
||||||
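As an illustration of one row from the table, a per-host snippet for `ns1` might look like the following. Where this lives in the repository is an assumption (a host-level configuration file is the obvious candidate), and for the other hosts the table lists no protocol, so they would presumably be TCP-only entries.

```nix
# Hypothetical host-level configuration for ns1 (DNS on TCP and UDP).
{
  networking.firewall = {
    allowedTCPPorts = [ 53 ];
    allowedUDPPorts = [ 53 ];
  };
}
```

Because `allowedTCPPorts` is a list option, these entries merge with any ports opened elsewhere in the configuration rather than replacing them.
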
## Phase 2: Logging & Detection (P2)

### 2.1 Enable TLS for Promtail → Loki

Update `system/monitoring/logs.nix`:

```nix
clients = [{
  url = "https://monitoring01.home.2rjus.net:3100/loki/api/v1/push";
  tls_config = {
    ca_file = "/etc/ssl/certs/homelab-root-ca.pem";
  };
}];
```

Requires:
- Configure Loki with TLS certificate (use internal ACME)
- Ensure all hosts trust root CA (already done via `system/pki/root-ca.nix`)

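The Loki side of the first requirement might be sketched as below. The certificate name, the use of `security.acme` for the internal ACME certificate, and the group membership are all assumptions made for illustration; the actual certificate setup in this repository may differ.

```nix
{ config, ... }:
let
  # Hypothetical certificate name; adjust to the actual ACME configuration.
  certDir = config.security.acme.certs."monitoring01.home.2rjus.net".directory;
in
{
  services.loki.configuration.server = {
    http_listen_port = 3100;

    # Serve the HTTP API (including /loki/api/v1/push) over TLS.
    http_tls_config = {
      cert_file = "${certDir}/fullchain.pem";
      key_file = "${certDir}/key.pem";
    };
  };

  # Assumption: certificates are owned by the default "acme" group,
  # so the loki user needs membership to read the private key.
  users.users.loki.extraGroups = [ "acme" ];
}
```

Anything else that queries Loki directly (Grafana data sources, for example) would need its URL switched to `https://` at the same time.
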
### 2.2 Security Alert Rules

Add to `services/monitoring/rules.yml`:

```yaml
- name: security_rules
  rules:
    - alert: ssh_auth_failures
      expr: increase(node_logind_sessions_total[5m]) > 20
      for: 0m
      labels:
        severity: warning
      annotations:
        summary: "Unusual login activity on {{ $labels.instance }}"

    - alert: vault_secret_fetch_failure
      expr: increase(vault_secret_failures[5m]) > 5
      for: 0m
      labels:
        severity: warning
      annotations:
        summary: "Vault secret fetch failures on {{ $labels.instance }}"
```

Also add Loki-based alerts for:
- Failed SSH attempts: `{job="systemd-journal"} |= "Failed password"`
- sudo usage: `{job="systemd-journal"} |= "sudo"`

### 2.3 Global Audit Logging

Import `common/ssh-audit.nix` from `system/default.nix` (the path is relative to the `system/` directory):

```nix
imports = [
  # ... existing imports
  ../common/ssh-audit.nix
];
```

## Phase 3: Defense in Depth (P3)

### 3.1 Loki Authentication

Options:
1. **Basic auth via reverse proxy** - Put Loki behind Caddy with auth
2. **Loki multi-tenancy** - Enable `auth_enabled = true` and use tenant IDs
3. **Network isolation** - Bind Loki only to localhost, expose via authenticated proxy

Recommendation: Option 1 (reverse proxy) is simplest for homelab.

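A minimal sketch of option 1 follows, combined with the localhost binding from option 3. The virtual host name, the `promtail` user name, and the hash placeholder are hypothetical, and the Caddy directive is spelled `basic_auth` in recent Caddy releases but `basicauth` in older ones, so check the deployed version before using it.

```nix
{
  # Bind Loki to loopback so it is reachable only through the proxy.
  services.loki.configuration.server.http_listen_address = "127.0.0.1";

  services.caddy = {
    enable = true;

    # Hypothetical hostname for the authenticated Loki endpoint.
    virtualHosts."loki.home.2rjus.net".extraConfig = ''
      basic_auth {
        # Placeholder: replace with the output of `caddy hash-password`.
        promtail REPLACE_WITH_BCRYPT_HASH
      }
      reverse_proxy 127.0.0.1:3100
    '';
  };
}
```

Promtail clients would then need matching credentials in their `clients` entry (Promtail supports a `basic_auth` block with `username` and `password_file`), ideally with the password delivered through the existing Vault secrets mechanism.
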
### 3.2 AppRole Secret Rotation

Update `terraform/vault/approle.tf`:

```hcl
secret_id_ttl = 2592000  # 30 days
```

Add documentation for manual rotation procedure or implement automated rotation via the existing `restartTrigger` mechanism in `vault-secrets.nix`.

### 3.3 Enable Vault TLS Verification

Change default in `system/vault-secrets.nix`:

```nix
skipTlsVerify = mkOption {
  type = types.bool;
  default = false;  # Changed from true
};
```

**Prerequisite:** Verify all hosts trust the internal CA that signed the Vault certificate.

## Implementation Order

1. **Test on test-tier first** - Deploy phases 1-2 to testvm01/02/03
2. **Validate SSH access** - Ensure key-based login works before disabling passwords
3. **Document firewall ports** - Create reference of ports per host before enabling
4. **Phase prod rollout** - Deploy to prod hosts one at a time, verify each

## Open Questions

- [ ] Do all hosts have SSH keys configured for root access?
- [ ] Should firewall rules be per-host or use a central definition with roles?
- [ ] Should Loki authentication use the existing Kanidm setup?

**Resolved:** Password-based SSH access for recovery is not required - most hosts have console access through Proxmox or physical access, which provides an out-of-band recovery path if SSH keys fail.

## Notes

- Firewall changes are the highest risk - test thoroughly on test-tier
- SSH hardening must not lock out access - verify keys first
- Consider creating a "break glass" procedure for emergency access if keys fail
267
docs/user-management.md
Normal file
@@ -0,0 +1,267 @@
# User Management with Kanidm

Central authentication for the homelab using Kanidm.

## Overview

- **Server**: kanidm01.home.2rjus.net (auth.home.2rjus.net)
- **WebUI**: https://auth.home.2rjus.net
- **LDAPS**: port 636

## CLI Setup

The `kanidm` CLI is available in the devshell:

```bash
nix develop

# Login as idm_admin
kanidm login --name idm_admin --url https://auth.home.2rjus.net
```

## User Management

POSIX users are managed imperatively via the `kanidm` CLI. This allows setting
all attributes (including UNIX password) in one workflow.

### Creating a POSIX User

```bash
# Create the person
kanidm person create <username> "<Display Name>"

# Add to groups
kanidm group add-members ssh-users <username>

# Enable POSIX (UID is auto-assigned)
kanidm person posix set <username>

# Set UNIX password (required for SSH login, min 10 characters)
kanidm person posix set-password <username>

# Optionally set login shell
kanidm person posix set <username> --shell /bin/zsh
```

### Example: Full User Creation

```bash
kanidm person create testuser "Test User"
kanidm group add-members ssh-users testuser
kanidm person posix set testuser
kanidm person posix set-password testuser
kanidm person get testuser
```

After creation, verify on a client host:
```bash
getent passwd testuser
ssh testuser@testvm01.home.2rjus.net
```

### Viewing User Details

```bash
kanidm person get <username>
```

### Removing a User

```bash
kanidm person delete <username>
```

## Group Management

Groups for POSIX access are also managed via CLI.

### Creating a POSIX Group

```bash
# Create the group
kanidm group create <group-name>

# Enable POSIX with a specific GID
kanidm group posix set <group-name> --gidnumber <gid>
```

### Adding Members

```bash
kanidm group add-members <group-name> <username>
```

### Viewing Group Details

```bash
kanidm group get <group-name>
kanidm group list-members <group-name>
```

### Example: Full Group Creation

```bash
kanidm group create testgroup
kanidm group posix set testgroup --gidnumber 68010
kanidm group add-members testgroup testuser
kanidm group get testgroup
```

After creation, verify on a client host:
```bash
getent group testgroup
```

### Current Groups

| Group | GID | Purpose |
|-------|-----|---------|
| ssh-users | 68000 | SSH login access |
| admins | 68001 | Administrative access |
| users | 68002 | General users |

### UID/GID Allocation

Kanidm auto-assigns UIDs/GIDs from its configured range. For manually assigned GIDs:

| Range | Purpose |
|-------|---------|
| 65,536+ | Users (auto-assigned) |
| 68,000 - 68,999 | Groups (manually assigned) |

## PAM/NSS Client Configuration

Enable central authentication on a host:

```nix
homelab.kanidm.enable = true;
```

This configures:
- `services.kanidm.enablePam = true`
- Client connection to auth.home.2rjus.net
- Login authorization for `ssh-users` group
- Short usernames (`torjus` instead of `torjus@home.2rjus.net`)
- Home directory symlinks (`/home/torjus` → UUID-based directory)

### Enabled Hosts

- testvm01, testvm02, testvm03 (test tier)

### Options

```nix
homelab.kanidm = {
  enable = true;
  server = "https://auth.home.2rjus.net";  # default
  allowedLoginGroups = [ "ssh-users" ];    # default
};
```

### Home Directories

Home directories use UUID-based paths for stability (so renaming a user doesn't
require moving their home directory). Symlinks provide convenient access:

```
/home/torjus -> /home/e4f4c56c-4aee-4c20-846f-90cb69807733
```

The symlinks are created by `kanidm-unixd-tasks` on first login.

## Testing

### Verify NSS Resolution

```bash
# Check user resolution
getent passwd <username>

# Check group resolution
getent group <group-name>
```

### Test SSH Login

```bash
ssh <username>@<hostname>.home.2rjus.net
```

## Troubleshooting

### "PAM user mismatch" error

SSH fails with "fatal: PAM user mismatch" in logs. This happens when Kanidm returns
usernames in SPN format (`torjus@home.2rjus.net`) but SSH expects short names (`torjus`).

**Solution**: Configure `uid_attr_map = "name"` in unixSettings (already set in our module).

Check current format:
```bash
getent passwd torjus
# Should show: torjus:x:65536:...
# NOT: torjus@home.2rjus.net:x:65536:...
```

### User resolves but SSH fails immediately

The user's login group (e.g., `ssh-users`) likely doesn't have POSIX enabled:

```bash
# Check if group has POSIX
getent group ssh-users

# If empty, enable POSIX on the server
kanidm group posix set ssh-users --gidnumber 68000
```

### User doesn't resolve via getent

1. Check kanidm-unixd service is running:
   ```bash
   systemctl status kanidm-unixd
   ```

2. Check unixd can reach server:
   ```bash
   kanidm-unix status
   # Should show: system: online, Kanidm: online
   ```

3. Check client can reach server:
   ```bash
   curl -s https://auth.home.2rjus.net/status
   ```

4. Check user has POSIX enabled on server:
   ```bash
   kanidm person get <username>
   ```

5. Restart nscd to clear stale cache:
   ```bash
   systemctl restart nscd
   ```

6. Invalidate kanidm cache:
   ```bash
   kanidm-unix cache-invalidate
   ```

### Changes not taking effect after deployment

NixOS uses nsncd (a Rust reimplementation of nscd) for NSS caching. After deploying
kanidm-unixd config changes, you may need to restart both services:

```bash
systemctl restart kanidm-unixd
systemctl restart nscd
```

### Test PAM authentication directly

Use the kanidm-unix CLI to test PAM auth without SSH:

```bash
kanidm-unix auth-test --name <username>
```
@@ -207,6 +207,7 @@
   pkgs.ansible
   pkgs.opentofu
   pkgs.openbao
+  pkgs.kanidm_1_8
   (pkgs.callPackage ./scripts/create-host { })
   homelab-deploy.packages.${pkgs.system}.default
 ];
@@ -64,9 +64,5 @@
   vault.enable = true;
   homelab.deploy.enable = true;

-  zramSwap = {
-    enable = true;
-  };
-
   system.stateVersion = "23.11"; # Did you read the comment?
 }
@@ -4,6 +4,5 @@
     ./configuration.nix
     ../../services/nix-cache
     ../../services/actions-runner
-    ./zram.nix
   ];
 }
@@ -1,6 +0,0 @@
-{ ... }:
-{
-  zramSwap = {
-    enable = true;
-  };
-}
@@ -79,5 +79,8 @@
   # Or disable the firewall altogether.
   networking.firewall.enable = false;

+  # Compressed swap in RAM - prevents OOM during bootstrap nixos-rebuild
+  zramSwap.enable = true;
+
   system.stateVersion = "25.11";
 }
@@ -25,6 +25,9 @@
   # Enable remote deployment via NATS
   homelab.deploy.enable = true;

+  # Enable Kanidm PAM/NSS for central authentication
+  homelab.kanidm.enable = true;
+
   nixpkgs.config.allowUnfree = true;
   boot.loader.grub.enable = true;
   boot.loader.grub.device = "/dev/vda";
@@ -25,6 +25,9 @@
   # Enable remote deployment via NATS
   homelab.deploy.enable = true;

+  # Enable Kanidm PAM/NSS for central authentication
+  homelab.kanidm.enable = true;
+
   nixpkgs.config.allowUnfree = true;
   boot.loader.grub.enable = true;
   boot.loader.grub.device = "/dev/vda";
@@ -25,6 +25,9 @@
   # Enable remote deployment via NATS
   homelab.deploy.enable = true;

+  # Enable Kanidm PAM/NSS for central authentication
+  homelab.kanidm.enable = true;
+
   nixpkgs.config.allowUnfree = true;
   boot.loader.grub.enable = true;
   boot.loader.grub.device = "/dev/vda";
@@ -17,7 +17,8 @@
       };
     };

-    # Provisioning - initial users/groups
+    # Provision base groups only - users are managed via CLI
+    # See docs/user-management.md for details
     provision = {
       enable = true;
       idmAdminPasswordFile = config.vault.secrets.kanidm-idm-admin.outputDir;
@@ -28,10 +29,7 @@
         ssh-users = { };
       };

-      persons.torjus = {
-        displayName = "Torjus";
-        groups = [ "admins" "users" "ssh-users" ];
-      };
+      # Regular users (persons) are managed imperatively via kanidm CLI
     };
   };

@@ -46,7 +44,7 @@
     extraDomainNames = [ "${config.networking.hostName}.home.2rjus.net" ];
   };

-  # Vault secret for idm_admin password
+  # Vault secret for idm_admin password (used for provisioning)
   vault.secrets.kanidm-idm-admin = {
     secretPath = "kanidm/idm-admin-password";
     extractKey = "password";
@@ -4,6 +4,7 @@
     ./acme.nix
     ./autoupgrade.nix
     ./homelab-deploy.nix
+    ./kanidm-client.nix
     ./monitoring
     ./motd.nix
     ./packages.nix
@@ -12,5 +13,6 @@
     ./pki/root-ca.nix
     ./sshd.nix
     ./vault-secrets.nix
+    ./zram.nix
   ];
 }
42
system/kanidm-client.nix
Normal file
@@ -0,0 +1,42 @@
{ lib, config, pkgs, ... }:
let
  cfg = config.homelab.kanidm;
in
{
  options.homelab.kanidm = {
    enable = lib.mkEnableOption "Kanidm PAM/NSS client for central authentication";

    server = lib.mkOption {
      type = lib.types.str;
      default = "https://auth.home.2rjus.net";
      description = "URI of the Kanidm server";
    };

    allowedLoginGroups = lib.mkOption {
      type = lib.types.listOf lib.types.str;
      default = [ "ssh-users" ];
      description = "Groups allowed to log in via PAM";
    };
  };

  config = lib.mkIf cfg.enable {
    services.kanidm = {
      package = pkgs.kanidm_1_8;
      enablePam = true;

      clientSettings = {
        uri = cfg.server;
      };

      unixSettings = {
        pam_allowed_login_groups = cfg.allowedLoginGroups;
        # Use short names (torjus) instead of SPN format (torjus@home.2rjus.net)
        # This prevents "PAM user mismatch" errors with SSH
        uid_attr_map = "name";
        gid_attr_map = "name";
        # Create symlink /home/torjus -> UUID-based home directory
        home_alias = "name";
      };
    };
  };
}
8
system/zram.nix
Normal file
@@ -0,0 +1,8 @@
# Compressed swap in RAM
#
# Provides overflow memory during Nix builds and upgrades.
# Prevents OOM kills on low-memory hosts (2GB VMs).
{ ... }:
{
  zramSwap.enable = true;
}