nixos-servers/docs/plans/security-hardening.md
Torjus Håkestad 311be282b6
docs: add security hardening plan
Based on security review findings, covering SSH hardening, firewall
enablement, log transport TLS, security alerting, and secrets management.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-08 05:26:15 +01:00

Security Hardening Plan

Overview

Address the security gaps identified in the infrastructure review. Focus areas: SSH hardening, network security, logging improvements, and secrets management.

Current State

  • SSH allows password auth and unrestricted root login (system/sshd.nix)
  • Firewall disabled on all hosts (networking.firewall.enable = false)
  • Promtail ships logs over HTTP to Loki
  • Loki has no authentication (auth_enabled = false)
  • AppRole secret-IDs never expire (secret_id_ttl = 0)
  • Vault TLS verification disabled by default (skipTlsVerify = true)
  • Audit logging exists (common/ssh-audit.nix) but is not applied globally
  • Alert rules focus on availability; there is no security event detection

Priority Matrix

| Issue | Severity | Effort | Priority |
|---|---|---|---|
| SSH password auth | High | Low | P1 |
| Firewall disabled | High | Medium | P1 |
| Promtail HTTP (no TLS) | High | Medium | P2 |
| No security alerting | Medium | Low | P2 |
| Audit logging not global | Low | Low | P2 |
| Loki no auth | Medium | Medium | P3 |
| Secret-ID TTL | Medium | Medium | P3 |
| Vault skipTlsVerify | Medium | Low | P3 |

Phase 1: Quick Wins (P1)

1.1 SSH Hardening

Edit system/sshd.nix:

services.openssh = {
  enable = true;
  settings = {
    PermitRootLogin = "prohibit-password";  # Key-only root login
    PasswordAuthentication = false;
    KbdInteractiveAuthentication = false;
  };
};

Prerequisite: Verify all hosts have SSH keys deployed for root.
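
The prerequisite can also be checked at build time. The following is a minimal sketch, assuming root's keys are declared through the standard users.users.root.openssh.authorizedKeys options; if keys are provisioned some other way (for example, copied onto the host by hand), this check would report a false failure and should not be used.

```nix
{ config, ... }:
let
  rootKeys = config.users.users.root.openssh.authorizedKeys;
in
{
  # Fail evaluation if a host would disable password auth without any
  # declaratively configured key for root.
  assertions = [
    {
      assertion = rootKeys.keys != [ ] || rootKeys.keyFiles != [ ];
      message = "SSH password auth is disabled but root has no authorized keys configured.";
    }
  ];
}
```

Placing this next to the settings in system/sshd.nix would make a missing key a build error rather than a lockout discovered after deployment.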

1.2 Enable Firewall

Create system/firewall.nix with a default-deny policy for inbound traffic:

{ ... }: {
  networking.firewall.enable = true;

  # Use openssh's built-in firewall integration
  services.openssh.openFirewall = true;
}

Useful firewall options:

| Option | Description |
|---|---|
| networking.firewall.trustedInterfaces | Accept all traffic from these interfaces (e.g., [ "lo" ]) |
| networking.firewall.interfaces.<name>.allowedTCPPorts | Per-interface port rules |
| networking.firewall.extraInputRules | Custom nftables rules for complex filtering (only applies when networking.nftables.enable = true) |

Network range restrictions: Consider restricting SSH to the infrastructure subnet (10.69.13.0/24) using extraInputRules for defense in depth. However, this adds complexity and may not be necessary given the trusted network model.
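
If the subnet restriction is adopted, it could look roughly like the sketch below. It assumes the hosts switch to the nftables-based firewall, which is a change in its own right and not something this plan currently calls for. The subnet is the one named above; everything else is illustrative.

```nix
{ ... }: {
  # extraInputRules is only honoured by the nftables-based firewall.
  networking.nftables.enable = true;

  # Do not open port 22 globally; the rule below opens it for one subnet.
  services.openssh.openFirewall = false;

  networking.firewall = {
    enable = true;
    extraInputRules = ''
      ip saddr 10.69.13.0/24 tcp dport 22 accept
    '';
  };
}
```

Note that this replaces the openFirewall = true setting from the base firewall module, so the two approaches should not be combined on the same host.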

Per-Interface Rules (http-proxy WireGuard)

The http-proxy host has a WireGuard interface (wg0) that may need different rules than the LAN interface. Use networking.firewall.interfaces to apply per-interface policies:

# Example: http-proxy with different rules per interface
networking.firewall = {
  enable = true;

  # Default: only SSH (via openFirewall)
  allowedTCPPorts = [ ];

  # LAN interface: allow HTTP/HTTPS
  interfaces.ens18 = {
    allowedTCPPorts = [ 80 443 ];
  };

  # WireGuard interface: restrict to specific services or trust fully
  interfaces.wg0 = {
    allowedTCPPorts = [ 80 443 ];
    # Or set networking.firewall.trustedInterfaces = [ "wg0" ] (a top-level option) if fully trusted
  };
};

TODO: Investigate current WireGuard usage on http-proxy to determine appropriate rules.

Then per-host, open required ports:

| Host | Additional Ports |
|---|---|
| ns1/ns2 | 53 (TCP/UDP) |
| vault01 | 8200 |
| monitoring01 | 3100, 9090, 3000, 9093 |
| http-proxy | 80, 443 |
| nats1 | 4222 |
| ha1 | 1883, 8123 |
| jelly01 | 8096 |
| nix-cache01 | 5000 |
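
As an illustration of a per-host entry from the list above, the DNS hosts might carry something like the following. Where the snippet lives (a host-specific module for ns1/ns2) is an assumption; the ports are the ones listed.

```nix
# Hypothetical per-host snippet for ns1/ns2.
{ ... }: {
  networking.firewall = {
    # DNS needs both transports.
    allowedTCPPorts = [ 53 ];
    allowedUDPPorts = [ 53 ];
  };
}
```

Because these options are lists, NixOS merges them with values set in other modules, so the SSH port opened by services.openssh.openFirewall stays open alongside the host-specific ports.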

Phase 2: Logging & Detection (P2)

2.1 Enable TLS for Promtail → Loki

Update system/monitoring/logs.nix:

clients = [{
  url = "https://monitoring01.home.2rjus.net:3100/loki/api/v1/push";
  tls_config = {
    ca_file = "/etc/ssl/certs/homelab-root-ca.pem";
  };
}];

Requires:

  • Configure Loki with TLS certificate (use internal ACME)
  • Ensure all hosts trust root CA (already done via system/pki/root-ca.nix)
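
On the Loki side, the first requirement might be met with a server block along these lines. This is a sketch only: it assumes Loki is configured through services.loki.configuration and that the internal ACME setup writes certificates under /var/lib/acme/<hostname>/, neither of which is confirmed by this plan.

```nix
{ ... }: {
  services.loki.configuration.server = {
    http_listen_port = 3100;
    http_tls_config = {
      # Assumed certificate locations; adjust to wherever ACME writes them.
      cert_file = "/var/lib/acme/monitoring01.home.2rjus.net/fullchain.pem";
      key_file = "/var/lib/acme/monitoring01.home.2rjus.net/key.pem";
    };
  };
}
```

The Loki service user would also need read access to the key file, and anything else that queries Loki on port 3100 (Grafana, for example) would have to switch to https at the same time.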

2.2 Security Alert Rules

Add to services/monitoring/rules.yml:

- name: security_rules
  rules:
    - alert: ssh_auth_failures
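      # NOTE: verify this metric before deploying. node_exporter's logind
      # collector (disabled by default) exposes a gauge, node_logind_sessions,
      # which counts sessions rather than failed logins.
      expr: increase(node_logind_sessions_total[5m]) > 20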
      for: 0m
      labels:
        severity: warning
      annotations:
        summary: "Unusual login activity on {{ $labels.instance }}"

    - alert: vault_secret_fetch_failure
      expr: increase(vault_secret_failures[5m]) > 5
      for: 0m
      labels:
        severity: warning
      annotations:
        summary: "Vault secret fetch failures on {{ $labels.instance }}"

Also add Loki-based alerts for:

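  • Failed SSH attempts: {job="systemd-journal"} |= "Failed password" (once password auth is disabled in Phase 1, sshd stops logging this message, so also match lines such as "Invalid user")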
  • sudo usage: {job="systemd-journal"} |= "sudo"
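
A Loki-based alert for the first query could be shipped as a ruler rule file generated from Nix. The sketch below makes several assumptions that are not part of this plan: the Loki ruler is enabled with local rule storage rooted at /etc/loki/rules, Loki runs single-tenant (so the tenant directory is "fake"), and a threshold of 20 matches per 5 minutes is a sensible starting point.

```nix
{ pkgs, ... }:
let
  yaml = pkgs.formats.yaml { };

  # Rule group in the Prometheus-style format that the Loki ruler reads.
  securityRules = yaml.generate "security-log-rules.yml" {
    groups = [
      {
        name = "security_log_rules";
        rules = [
          {
            alert = "ssh_failed_password";
            # Threshold is an illustrative assumption, not a tuned value.
            expr = ''sum(count_over_time({job="systemd-journal"} |= "Failed password" [5m])) > 20'';
            labels.severity = "warning";
            annotations.summary = "High rate of failed SSH password attempts";
          }
        ];
      }
    ];
  };
in
{
  # Assumed ruler layout: <rules directory>/<tenant>/<file>.
  environment.etc."loki/rules/fake/security.yml".source = securityRules;
}
```

The ruler section of the Loki configuration (storage directory and Alertmanager URL) is not shown here and would need to be added separately.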

2.3 Global Audit Logging

Add the common/ssh-audit.nix import to system/default.nix (the path is relative to the system/ directory):

imports = [
  # ... existing imports
  ../common/ssh-audit.nix
];

Phase 3: Defense in Depth (P3)

3.1 Loki Authentication

Options:

  1. Basic auth via reverse proxy - Put Loki behind Caddy with auth
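  2. Loki multi-tenancy - Enable auth_enabled = true and use tenant IDs (this only requires a tenant header on each request; Loki does not verify credentials itself, so a proxy is still needed for real authentication)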
  3. Network isolation - Bind Loki only to localhost, expose via authenticated proxy

Recommendation: Option 1 (reverse proxy) is simplest for homelab.
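
A rough sketch of Option 1 follows. The virtual host name loki.home.2rjus.net, the promtail username, and the hash placeholder are all hypothetical, and the sketch assumes a Caddy version that accepts the basic_auth directive (older releases spell it basicauth). How Caddy obtains its certificate from the internal ACME server is not shown.

```nix
{ ... }: {
  # Keep Loki itself reachable only from the local machine.
  services.loki.configuration.server.http_listen_address = "127.0.0.1";

  services.caddy = {
    enable = true;
    # Hypothetical hostname for the authenticated endpoint.
    virtualHosts."loki.home.2rjus.net".extraConfig = ''
      basic_auth {
        # Generate the hash with `caddy hash-password`.
        promtail REPLACE_WITH_BCRYPT_HASH
      }
      reverse_proxy 127.0.0.1:3100
    '';
  };
}
```

Promtail would then need matching credentials in its client block (a basic_auth section with username and password_file), with the password delivered through the existing secrets mechanism rather than written into the Nix store.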

3.2 AppRole Secret Rotation

Update terraform/vault/approle.tf:

secret_id_ttl  = 2592000  # 30 days

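Add documentation for a manual rotation procedure, or implement automated rotation via the existing restartTrigger mechanism in vault-secrets.nix. Either way, rotation must be in place before the TTL takes effect, because a host whose secret-ID has expired can no longer authenticate to Vault.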

3.3 Enable Vault TLS Verification

Change default in system/vault-secrets.nix:

skipTlsVerify = mkOption {
  type = types.bool;
  default = false;  # Changed from true
};

Prerequisite: Verify all hosts trust the internal CA that signed the Vault certificate.
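
For reference, trusting an internal CA system-wide on NixOS usually comes down to a single option. The sketch below shows the general shape only; the file path is hypothetical and this is not the contents of the repository's system/pki/root-ca.nix.

```nix
{ ... }: {
  # Hypothetical path to the root CA certificate in PEM format.
  security.pki.certificateFiles = [ ./homelab-root-ca.pem ];
}
```

A host that imports a module like this adds the CA to the system trust bundle, which is what TLS clients consult by default. Whether the Vault client used by vault-secrets.nix reads that bundle should be confirmed on a test host before flipping the default.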

Implementation Order

  1. Test on test-tier first - Deploy phases 1-2 to testvm01/02/03
  2. Validate SSH access - Ensure key-based login works before disabling passwords
  3. Document firewall ports - Create reference of ports per host before enabling
  4. Phased prod rollout - Deploy to prod hosts one at a time, verify each

Open Questions

  • Do all hosts have SSH keys configured for root access?
  • Should firewall rules be per-host or use a central definition with roles?
  • Should Loki authentication use the existing Kanidm setup?
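
To make the second question concrete, a central role-based definition might look like the sketch below. The option name homelab.firewallRoles and the role names are hypothetical; the ports come from the per-host list in Phase 1.

```nix
{ config, lib, ... }:
let
  # Central mapping from role name to the ports that role needs.
  rolePorts = {
    dns = { tcp = [ 53 ]; udp = [ 53 ]; };
    vault = { tcp = [ 8200 ]; udp = [ ]; };
    nats = { tcp = [ 4222 ]; udp = [ ]; };
  };
  cfg = config.homelab.firewallRoles;
in
{
  options.homelab.firewallRoles = lib.mkOption {
    type = lib.types.listOf (lib.types.enum (builtins.attrNames rolePorts));
    default = [ ];
    description = "Firewall roles assigned to this host.";
  };

  config.networking.firewall = {
    # Open the union of ports for all roles assigned to the host.
    allowedTCPPorts = lib.concatMap (role: rolePorts.${role}.tcp) cfg;
    allowedUDPPorts = lib.concatMap (role: rolePorts.${role}.udp) cfg;
  };
}
```

A host would then set homelab.firewallRoles = [ "dns" ]; instead of listing ports directly. The trade-off is one more layer of indirection in exchange for a single place to audit which ports each kind of host exposes.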

Resolved: Password-based SSH access for recovery is not required - most hosts have console access through Proxmox or physical access, which provides an out-of-band recovery path if SSH keys fail.

Notes

  • Firewall changes are the highest risk - test thoroughly on test-tier
  • SSH hardening must not lock out access - verify keys first
  • Consider creating a "break glass" procedure for emergency access if keys fail