docs: add security hardening plan
Some checks failed
Run nix flake check / flake-check (push) Failing after 2s
Some checks failed
Run nix flake check / flake-check (push) Failing after 2s
Based on security review findings, covering SSH hardening, firewall enablement, log transport TLS, security alerting, and secrets management. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
This commit is contained in:
224
docs/plans/security-hardening.md
Normal file
224
docs/plans/security-hardening.md
Normal file
@@ -0,0 +1,224 @@
|
||||
# Security Hardening Plan
|
||||
|
||||
## Overview
|
||||
|
||||
Address security gaps identified in infrastructure review. Focus areas: SSH hardening, network security, logging improvements, and secrets management.
|
||||
|
||||
## Current State
|
||||
|
||||
- SSH allows password auth and unrestricted root login (`system/sshd.nix`)
|
||||
- Firewall disabled on all hosts (`networking.firewall.enable = false`)
|
||||
- Promtail ships logs over HTTP to Loki
|
||||
- Loki has no authentication (`auth_enabled = false`)
|
||||
- AppRole secret-IDs never expire (`secret_id_ttl = 0`)
|
||||
- Vault TLS verification disabled by default (`skipTlsVerify = true`)
|
||||
- Audit logging exists (`common/ssh-audit.nix`) but not applied globally
|
||||
- Alert rules focus on availability, no security event detection
|
||||
|
||||
## Priority Matrix
|
||||
|
||||
| Issue | Severity | Effort | Priority |
|
||||
|-------|----------|--------|----------|
|
||||
| SSH password auth | High | Low | **P1** |
|
||||
| Firewall disabled | High | Medium | **P1** |
|
||||
| Promtail HTTP (no TLS) | High | Medium | **P2** |
|
||||
| No security alerting | Medium | Low | **P2** |
|
||||
| Audit logging not global | Low | Low | **P2** |
|
||||
| Loki no auth | Medium | Medium | **P3** |
|
||||
| Secret-ID TTL | Medium | Medium | **P3** |
|
||||
| Vault skipTlsVerify | Medium | Low | **P3** |
|
||||
|
||||
## Phase 1: Quick Wins (P1)
|
||||
|
||||
### 1.1 SSH Hardening
|
||||
|
||||
Edit `system/sshd.nix`:
|
||||
|
||||
```nix
|
||||
services.openssh = {
|
||||
enable = true;
|
||||
settings = {
|
||||
PermitRootLogin = "prohibit-password"; # Key-only root login
|
||||
PasswordAuthentication = false;
|
||||
KbdInteractiveAuthentication = false;
|
||||
};
|
||||
};
|
||||
```
|
||||
|
||||
**Prerequisite:** Verify all hosts have SSH keys deployed for root.
|
||||
|
||||
### 1.2 Enable Firewall
|
||||
|
||||
Create `system/firewall.nix` with default deny policy:
|
||||
|
||||
```nix
|
||||
{ ... }: {
|
||||
networking.firewall.enable = true;
|
||||
|
||||
# Use openssh's built-in firewall integration
|
||||
services.openssh.openFirewall = true;
|
||||
}
|
||||
```
|
||||
|
||||
**Useful firewall options:**
|
||||
|
||||
| Option | Description |
|
||||
|--------|-------------|
|
||||
| `networking.firewall.trustedInterfaces` | Accept all traffic from these interfaces (e.g., `[ "lo" ]`) |
|
||||
| `networking.firewall.interfaces.<name>.allowedTCPPorts` | Per-interface port rules |
|
||||
| `networking.firewall.extraInputRules` | Custom nftables rules (for complex filtering) |
|
||||
|
||||
**Network range restrictions:** Consider restricting SSH to the infrastructure subnet (`10.69.13.0/24`) using `extraInputRules` for defense in depth. However, this adds complexity and may not be necessary given the trusted network model.
|
||||
|
||||
#### Per-Interface Rules (http-proxy WireGuard)
|
||||
|
||||
The `http-proxy` host has a WireGuard interface (`wg0`) that may need different rules than the LAN interface. Use `networking.firewall.interfaces` to apply per-interface policies:
|
||||
|
||||
```nix
|
||||
# Example: http-proxy with different rules per interface
|
||||
networking.firewall = {
|
||||
enable = true;
|
||||
|
||||
# Default: only SSH (via openFirewall)
|
||||
allowedTCPPorts = [ ];
|
||||
|
||||
# LAN interface: allow HTTP/HTTPS
|
||||
interfaces.ens18 = {
|
||||
allowedTCPPorts = [ 80 443 ];
|
||||
};
|
||||
|
||||
# WireGuard interface: restrict to specific services or trust fully
|
||||
interfaces.wg0 = {
|
||||
allowedTCPPorts = [ 80 443 ];
|
||||
# Or use trustedInterfaces = [ "wg0" ] if fully trusted
|
||||
};
|
||||
};
|
||||
```
|
||||
|
||||
**TODO:** Investigate current WireGuard usage on http-proxy to determine appropriate rules.
|
||||
|
||||
Then per-host, open required ports:
|
||||
|
||||
| Host | Additional Ports |
|
||||
|------|------------------|
|
||||
| ns1/ns2 | 53 (TCP/UDP) |
|
||||
| vault01 | 8200 |
|
||||
| monitoring01 | 3100, 9090, 3000, 9093 |
|
||||
| http-proxy | 80, 443 |
|
||||
| nats1 | 4222 |
|
||||
| ha1 | 1883, 8123 |
|
||||
| jelly01 | 8096 |
|
||||
| nix-cache01 | 5000 |
|
||||
|
||||
## Phase 2: Logging & Detection (P2)
|
||||
|
||||
### 2.1 Enable TLS for Promtail → Loki
|
||||
|
||||
Update `system/monitoring/logs.nix`:
|
||||
|
||||
```nix
|
||||
clients = [{
|
||||
url = "https://monitoring01.home.2rjus.net:3100/loki/api/v1/push";
|
||||
tls_config = {
|
||||
ca_file = "/etc/ssl/certs/homelab-root-ca.pem";
|
||||
};
|
||||
}];
|
||||
```
|
||||
|
||||
Requires:
|
||||
- Configure Loki with TLS certificate (use internal ACME)
|
||||
- Ensure all hosts trust root CA (already done via `system/pki/root-ca.nix`)
|
||||
|
||||
### 2.2 Security Alert Rules
|
||||
|
||||
Add to `services/monitoring/rules.yml`:
|
||||
|
||||
```yaml
|
||||
- name: security_rules
|
||||
rules:
|
||||
- alert: ssh_auth_failures
|
||||
expr: increase(node_logind_sessions_total[5m]) > 20
|
||||
for: 0m
|
||||
labels:
|
||||
severity: warning
|
||||
annotations:
|
||||
summary: "Unusual login activity on {{ $labels.instance }}"
|
||||
|
||||
- alert: vault_secret_fetch_failure
|
||||
expr: increase(vault_secret_failures[5m]) > 5
|
||||
for: 0m
|
||||
labels:
|
||||
severity: warning
|
||||
annotations:
|
||||
summary: "Vault secret fetch failures on {{ $labels.instance }}"
|
||||
```
|
||||
|
||||
Also add Loki-based alerts for:
|
||||
- Failed SSH attempts: `{job="systemd-journal"} |= "Failed password"`
|
||||
- sudo usage: `{job="systemd-journal"} |= "sudo"`
|
||||
|
||||
### 2.3 Global Audit Logging
|
||||
|
||||
Add `./common/ssh-audit.nix` import to `system/default.nix`:
|
||||
|
||||
```nix
|
||||
imports = [
|
||||
# ... existing imports
|
||||
../common/ssh-audit.nix
|
||||
];
|
||||
```
|
||||
|
||||
## Phase 3: Defense in Depth (P3)
|
||||
|
||||
### 3.1 Loki Authentication
|
||||
|
||||
Options:
|
||||
1. **Basic auth via reverse proxy** - Put Loki behind Caddy with auth
|
||||
2. **Loki multi-tenancy** - Enable `auth_enabled = true` and use tenant IDs
|
||||
3. **Network isolation** - Bind Loki only to localhost, expose via authenticated proxy
|
||||
|
||||
Recommendation: Option 1 (reverse proxy) is simplest for homelab.
|
||||
|
||||
### 3.2 AppRole Secret Rotation
|
||||
|
||||
Update `terraform/vault/approle.tf`:
|
||||
|
||||
```hcl
|
||||
secret_id_ttl = 2592000 # 30 days
|
||||
```
|
||||
|
||||
Add documentation for manual rotation procedure or implement automated rotation via the existing `restartTrigger` mechanism in `vault-secrets.nix`.
|
||||
|
||||
### 3.3 Enable Vault TLS Verification
|
||||
|
||||
Change default in `system/vault-secrets.nix`:
|
||||
|
||||
```nix
|
||||
skipTlsVerify = mkOption {
|
||||
type = types.bool;
|
||||
default = false; # Changed from true
|
||||
};
|
||||
```
|
||||
|
||||
**Prerequisite:** Verify all hosts trust the internal CA that signed the Vault certificate.
|
||||
|
||||
## Implementation Order
|
||||
|
||||
1. **Test on test-tier first** - Deploy phases 1-2 to testvm01/02/03
|
||||
2. **Validate SSH access** - Ensure key-based login works before disabling passwords
|
||||
3. **Document firewall ports** - Create reference of ports per host before enabling
|
||||
4. **Phase prod rollout** - Deploy to prod hosts one at a time, verify each
|
||||
|
||||
## Open Questions
|
||||
|
||||
- [ ] Do all hosts have SSH keys configured for root access?
|
||||
- [ ] Should firewall rules be per-host or use a central definition with roles?
|
||||
- [ ] Should Loki authentication use the existing Kanidm setup?
|
||||
|
||||
**Resolved:** Password-based SSH access for recovery is not required - most hosts have console access through Proxmox or physical access, which provides an out-of-band recovery path if SSH keys fail.
|
||||
|
||||
## Notes
|
||||
|
||||
- Firewall changes are the highest risk - test thoroughly on test-tier
|
||||
- SSH hardening must not lock out access - verify keys first
|
||||
- Consider creating a "break glass" procedure for emergency access if keys fail
|
||||
Reference in New Issue
Block a user