Based on security review findings, covering SSH hardening, firewall enablement, log transport TLS, security alerting, and secrets management. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
# Security Hardening Plan

## Overview

Address security gaps identified in infrastructure review. Focus areas: SSH hardening, network security, logging improvements, and secrets management.

## Current State

- SSH allows password auth and unrestricted root login (`system/sshd.nix`)
- Firewall disabled on all hosts (`networking.firewall.enable = false`)
- Promtail ships logs over HTTP to Loki
- Loki has no authentication (`auth_enabled = false`)
- AppRole secret-IDs never expire (`secret_id_ttl = 0`)
- Vault TLS verification disabled by default (`skipTlsVerify = true`)
- Audit logging exists (`common/ssh-audit.nix`) but is not applied globally
- Alert rules focus on availability; there is no security event detection

## Priority Matrix

| Issue | Severity | Effort | Priority |
|-------|----------|--------|----------|
| SSH password auth | High | Low | **P1** |
| Firewall disabled | High | Medium | **P1** |
| Promtail HTTP (no TLS) | High | Medium | **P2** |
| No security alerting | Medium | Low | **P2** |
| Audit logging not global | Low | Low | **P2** |
| Loki no auth | Medium | Medium | **P3** |
| Secret-ID TTL | Medium | Medium | **P3** |
| Vault skipTlsVerify | Medium | Low | **P3** |

## Phase 1: Quick Wins (P1)

### 1.1 SSH Hardening

Edit `system/sshd.nix`:

```nix
services.openssh = {
  enable = true;
  settings = {
    PermitRootLogin = "prohibit-password";  # Key-only root login
    PasswordAuthentication = false;
    KbdInteractiveAuthentication = false;
  };
};
```

**Prerequisite:** Verify all hosts have SSH keys deployed for root.

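
One way to satisfy this prerequisite is to declare root's authorized keys in the NixOS configuration itself, so the keys are deployed together with the hardened `sshd` settings. A minimal sketch, assuming keys are managed declaratively; the key string is a placeholder, not a real key:

```nix
{ ... }: {
  # Placeholder public key (assumption): replace with the real admin key(s).
  users.users.root.openssh.authorizedKeys.keys = [
    "ssh-ed25519 AAAA...placeholder admin@example"
  ];
}
```
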
### 1.2 Enable Firewall

Create `system/firewall.nix` with default deny policy:

```nix
{ ... }: {
  networking.firewall.enable = true;

  # Use openssh's built-in firewall integration
  services.openssh.openFirewall = true;
}
```

**Useful firewall options:**

| Option | Description |
|--------|-------------|
| `networking.firewall.trustedInterfaces` | Accept all traffic from these interfaces (e.g., `[ "lo" ]`) |
| `networking.firewall.interfaces.<name>.allowedTCPPorts` | Per-interface port rules |
| `networking.firewall.extraInputRules` | Custom nftables rules for complex filtering (requires `networking.nftables.enable = true`) |

**Network range restrictions:** Consider restricting SSH to the infrastructure subnet (`10.69.13.0/24`) using `extraInputRules` for defense in depth. However, this adds complexity and may not be necessary given the trusted network model.

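
If the subnet restriction is adopted, it could look roughly like the sketch below. It assumes the hosts switch to the nftables-based firewall and that SSH listens on the default port 22. `openFirewall` is turned off here so that the subnet rule is the only rule admitting SSH:

```nix
{ ... }: {
  # extraInputRules only takes effect with the nftables backend.
  networking.nftables.enable = true;
  networking.firewall.enable = true;

  # Do not open port 22 to every source address.
  services.openssh.openFirewall = false;

  # Accept SSH only from the infrastructure subnet.
  networking.firewall.extraInputRules = ''
    ip saddr 10.69.13.0/24 tcp dport 22 accept
  '';
}
```

With this variant, a host reachable only from outside `10.69.13.0/24` would lose SSH access, so it belongs on the test tier first.
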
#### Per-Interface Rules (http-proxy WireGuard)

The `http-proxy` host has a WireGuard interface (`wg0`) that may need different rules than the LAN interface. Use `networking.firewall.interfaces` to apply per-interface policies:

```nix
# Example: http-proxy with different rules per interface
networking.firewall = {
  enable = true;

  # Default: only SSH (via openFirewall)
  allowedTCPPorts = [ ];

  # LAN interface: allow HTTP/HTTPS
  interfaces.ens18 = {
    allowedTCPPorts = [ 80 443 ];
  };

  # WireGuard interface: restrict to specific services or trust fully
  interfaces.wg0 = {
    allowedTCPPorts = [ 80 443 ];
    # Alternative if wg0 is fully trusted: drop this block and set
    # networking.firewall.trustedInterfaces = [ "wg0" ] instead
  };
};
```

**TODO:** Investigate current WireGuard usage on http-proxy to determine appropriate rules.

Then, per host, open the required ports:

| Host | Additional Ports |
|------|------------------|
| ns1/ns2 | 53 (TCP/UDP) |
| vault01 | 8200 |
| monitoring01 | 3100, 9090, 3000, 9093 |
| http-proxy | 80, 443 |
| nats1 | 4222 |
| ha1 | 1883, 8123 |
| jelly01 | 8096 |
| nix-cache01 | 5000 |

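
As an illustration of the per-host step, here is a sketch for the DNS servers. Where the snippet lives is an assumption; the port number comes from the table above:

```nix
# Hypothetical location: the host-specific configuration for ns1/ns2
{ ... }: {
  networking.firewall = {
    allowedTCPPorts = [ 53 ];
    allowedUDPPorts = [ 53 ];
  };
}
```

The other hosts would follow the same pattern, using `allowedUDPPorts` only where the service actually uses UDP.
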
## Phase 2: Logging & Detection (P2)

### 2.1 Enable TLS for Promtail → Loki

Update `system/monitoring/logs.nix`:

```nix
clients = [{
  url = "https://monitoring01.home.2rjus.net:3100/loki/api/v1/push";
  tls_config = {
    ca_file = "/etc/ssl/certs/homelab-root-ca.pem";
  };
}];
```

Requires:
- Configure Loki with TLS certificate (use internal ACME)
- Ensure all hosts trust root CA (already done via `system/pki/root-ca.nix`)

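
On the Loki side, the first requirement could be sketched as follows. This assumes a certificate for the monitoring host is issued through `security.acme` and that the Loki service user is allowed to read it (for example via the certificate's group). The certificate name is an assumption taken from the Promtail URL above:

```nix
{ config, ... }:
let
  # Assumption: a certificate with this name is defined under security.acme.certs.
  certDir = config.security.acme.certs."monitoring01.home.2rjus.net".directory;
in {
  services.loki.configuration.server = {
    http_listen_port = 3100;
    http_tls_config = {
      cert_file = "${certDir}/fullchain.pem";
      key_file = "${certDir}/key.pem";
    };
  };
}
```

Anything else that talks to Loki on port 3100 would need to switch to `https://` at the same time.
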
### 2.2 Security Alert Rules

Add to `services/monitoring/rules.yml`:

```yaml
  - name: security_rules
    rules:
      - alert: ssh_auth_failures
        # NOTE: this tracks login-session growth, not failed authentications.
        # Verify the metric name against the node_exporter logind collector.
        expr: increase(node_logind_sessions_total[5m]) > 20
        for: 0m
        labels:
          severity: warning
        annotations:
          summary: "Unusual login activity on {{ $labels.instance }}"

      - alert: vault_secret_fetch_failure
        expr: increase(vault_secret_failures[5m]) > 5
        for: 0m
        labels:
          severity: warning
        annotations:
          summary: "Vault secret fetch failures on {{ $labels.instance }}"
```

Also add Loki-based alerts for:
- Failed SSH attempts: `{job="systemd-journal"} |= "Failed password"`
- sudo usage: `{job="systemd-journal"} |= "sudo"`

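
The first log query above could be turned into a Loki ruler alert along these lines. This is a sketch under several assumptions: the Loki ruler is enabled with local rule storage (not shown), the rule name and the threshold of 10 matches in 5 minutes are illustrative, and the target path is hypothetical and must match the ruler's configured rules directory and tenant:

```nix
{ pkgs, ... }:
let
  # Render the rule group to a YAML file in the Nix store.
  securityRules = (pkgs.formats.yaml { }).generate "security-log-rules.yaml" {
    groups = [{
      name = "security_log_rules";
      rules = [{
        alert = "ssh_failed_password";
        # Illustrative threshold: more than 10 matching log lines in 5 minutes.
        expr = ''sum(count_over_time({job="systemd-journal"} |= "Failed password" [5m])) > 10'';
        for = "0m";
        labels.severity = "warning";
        annotations.summary = "Repeated failed SSH password attempts";
      }];
    }];
  };
in {
  # Hypothetical path; "fake" is the tenant ID Loki uses when auth_enabled = false.
  environment.etc."loki/rules/fake/security.yaml".source = securityRules;
}
```
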
### 2.3 Global Audit Logging

Import `common/ssh-audit.nix` from `system/default.nix` (the path is relative to `system/`):

```nix
imports = [
  # ... existing imports
  ../common/ssh-audit.nix
];
```

## Phase 3: Defense in Depth (P3)

### 3.1 Loki Authentication

Options:
1. **Basic auth via reverse proxy** - Put Loki behind Caddy with auth
2. **Loki multi-tenancy** - Enable `auth_enabled = true` and use tenant IDs
3. **Network isolation** - Bind Loki only to localhost, expose via authenticated proxy

Recommendation: Option 1 (reverse proxy) is the simplest for a homelab.

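
A rough sketch of the reverse-proxy option, combined with binding Loki to localhost. The virtual host name, the user name, and the password hash are placeholders. Older Caddy releases spell the directive `basicauth`:

```nix
{ ... }: {
  # Keep Loki off the network; only the local proxy can reach it.
  services.loki.configuration.server.http_listen_address = "127.0.0.1";

  services.caddy = {
    enable = true;
    # Hypothetical hostname for the authenticated endpoint.
    virtualHosts."loki.home.2rjus.net".extraConfig = ''
      basic_auth {
        # Placeholder: user name and a hash produced by `caddy hash-password`.
        promtail REPLACE_WITH_BCRYPT_HASH
      }
      reverse_proxy 127.0.0.1:3100
    '';
  };
}
```

Promtail's client block would then need matching `basic_auth` credentials and the proxy URL instead of port 3100.
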
### 3.2 AppRole Secret Rotation

Update `terraform/vault/approle.tf`:

```hcl
secret_id_ttl = 2592000  # 30 days
```

Add documentation for manual rotation procedure or implement automated rotation via the existing `restartTrigger` mechanism in `vault-secrets.nix`.

### 3.3 Enable Vault TLS Verification

Change default in `system/vault-secrets.nix`:

```nix
skipTlsVerify = mkOption {
  type = types.bool;
  default = false;  # Changed from true
};
```

**Prerequisite:** Verify all hosts trust the internal CA that signed the Vault certificate.

## Implementation Order

1. **Test on test-tier first** - Deploy phases 1-2 to testvm01/02/03
2. **Validate SSH access** - Ensure key-based login works before disabling passwords
3. **Document firewall ports** - Create a reference of ports per host before enabling the firewall
4. **Phased prod rollout** - Deploy to prod hosts one at a time, verifying each

## Open Questions

- [ ] Do all hosts have SSH keys configured for root access?
- [ ] Should firewall rules be per-host or use a central definition with roles?
- [ ] Should Loki authentication use the existing Kanidm setup?

**Resolved:** Password-based SSH access for recovery is not required - most hosts have console access through Proxmox or physical access, which provides an out-of-band recovery path if SSH keys fail.

## Notes

- Firewall changes are the highest risk - test thoroughly on test-tier
- SSH hardening must not lock out access - verify keys first
- Consider creating a "break glass" procedure for emergency access if keys fail