diff --git a/docs/plans/new-services.md b/docs/plans/new-services.md
new file mode 100644
index 0000000..ce6d7b8
--- /dev/null
+++ b/docs/plans/new-services.md
@@ -0,0 +1,145 @@
+# New Service Candidates
+
+Ideas for additional services to deploy in the homelab. These lean more enterprise/obscure
+than the typical self-hosted fare.
+
+## Litestream
+
+Continuous SQLite replication to S3-compatible storage. Streams WAL changes in near-real-time,
+providing point-in-time recovery without scheduled backup jobs.
+
+**Why:** Several services use SQLite (Home Assistant, potentially others). Litestream would
+give continuous backup to Garage S3 with minimal resource overhead and near-zero configuration.
+Replaces cron-based backup scripts with a small daemon per database.
+
+**Integration points:**
+- Garage S3 as replication target (already deployed)
+- Home Assistant SQLite database is the primary candidate
+- Could also cover any future SQLite-backed services
+
+**Complexity:** Low. Single Go binary, minimal config (source DB path + S3 endpoint).
+
+**NixOS packaging:** Available in nixpkgs as `litestream`.
+
+---
+
+## ntopng
+
+Deep network traffic analysis and flow monitoring. Provides real-time visibility into bandwidth
+usage, protocol distribution, top talkers, and anomaly detection via a web UI.
+
+**Why:** We have host-level metrics (node-exporter) and logs (Loki) but no network-level
+visibility. ntopng would show traffic patterns across the infrastructure: NFS throughput to
+the NAS, DNS query volume, inter-host traffic, and bandwidth anomalies. Useful for capacity
+planning and debugging network issues.
+
+**Integration points:**
+- Could export metrics to Prometheus via its built-in exporter
+- Web UI behind http-proxy with Kanidm OIDC (if supported) or Pomerium
+- NetFlow/sFlow from managed switches (if available)
+- Passive traffic capture on a mirror port or the monitoring host itself
+
+**Complexity:** Medium. Needs network tap or mirror port for full visibility, or can run
+in host-local mode. May need a dedicated interface or VLAN mirror.
+
+**NixOS packaging:** Available in nixpkgs as `ntopng`.
+
+---
+
+## Renovate
+
+Automated dependency update bot that understands Nix flakes natively. Creates branches/PRs
+to bump flake inputs on a configurable schedule.
+
+**Why:** Currently `nix flake update` is manual. Renovate can automatically propose updates
+to individual flake inputs (nixpkgs, homelab-deploy, nixos-exporter, etc.), group related
+updates, and respect schedules. More granular than updating everything at once: it can bump
+nixpkgs weekly but hold back other inputs, auto-merge patch-level changes, etc.
+
+**Integration points:**
+- Runs against git.t-juice.club repositories
+- Understands `flake.lock` format natively
+- Could target both `nixos-servers` and `nixos` repos
+- Update branches would be validated by homelab-deploy builder
+
+**Complexity:** Medium. Needs git forge integration (Gitea/Forgejo API). Self-hosted runner
+mode available. Configuration via `renovate.json` in each repo.
+
+**NixOS packaging:** Available in nixpkgs as `renovate`.
+
+---
+
+## Pomerium
+
+Identity-aware reverse proxy implementing zero-trust access. Every request is authenticated
+and authorized based on identity, device, and context, not just network location.
+
+**Why:** Currently Caddy terminates TLS but doesn't enforce authentication on most services.
+Pomerium would put Kanidm OIDC authentication in front of every internal service, with
+per-route authorization policies (e.g., "only admins can access Prometheus," "require re-auth
+for Vault UI"). Directly addresses the security hardening plan's goals.
+
+**Integration points:**
+- Kanidm as OIDC identity provider (already deployed)
+- Could replace or sit in front of Caddy for internal services
+- Per-route policies based on Kanidm groups (admins, users, ssh-users)
+- Centralizes access logging and audit trail
+
+**Complexity:** Medium-high. Needs careful integration with existing Caddy reverse proxy.
+Decision needed on whether Pomerium replaces Caddy or works alongside it (Pomerium for
+auth, Caddy for TLS termination and routing, or Pomerium handles everything).
+
+**NixOS packaging:** Available in nixpkgs as `pomerium`.
+
+---
+
+## Apache Guacamole
+
+Clientless remote desktop and SSH gateway. Provides browser-based access to hosts via
+RDP, VNC, SSH, and Telnet with no client software required. Supports session recording
+and playback.
+
+**Why:** Provides an alternative remote access path that doesn't require VPN software or
+SSH keys on the client device. Useful for accessing hosts from untrusted machines (phone,
+borrowed laptop) or providing temporary access to others. Session recording gives an audit
+trail. Could complement the WireGuard remote access plan rather than replace it.
+
+**Integration points:**
+- Kanidm for authentication (OIDC or LDAP)
+- Behind http-proxy or Pomerium for TLS
+- SSH access to all hosts in the fleet
+- Session recordings could be stored on Garage S3
+- Could serve as the "emergency access" path when VPN is unavailable
+
+**Complexity:** Medium. Java web app plus the native guacd daemon, typically needs PostgreSQL for
+connection/user storage (already available). Docker is the common deployment method but
+native packaging exists.
+
+**NixOS packaging:** Available in nixpkgs as `guacamole-server` and `guacamole-client`.
+
+---
+
+## CrowdSec
+
+Collaborative intrusion prevention system with crowd-sourced threat intelligence.
+Parses logs to detect attack patterns, applies remediation (firewall bans, CAPTCHA),
+and shares/receives threat signals from a global community network.
+
+**Why:** Goes beyond fail2ban with behavioral detection, crowd-sourced IP reputation,
+and a scenario-based engine. Fits the security hardening plan. The community blocklist
+means we benefit from threat intelligence gathered across thousands of deployments.
+Could parse SSH logs, HTTP access logs, and other service logs to detect and block
+malicious activity.
+
+**Integration points:**
+- Could consume logs from Loki or directly from journald/log files
+- Firewall bouncer for iptables/nftables remediation
+- Caddy bouncer for HTTP-level blocking
+- Prometheus metrics exporter for alert integration
+- Scenarios available for SSH brute force, HTTP scanning, and more
+- Feeds into existing alerting pipeline (Alertmanager -> alerttonotify)
+
+**Complexity:** Medium. Agent (log parser + decision engine) on each host or centralized.
+Bouncers (enforcement) on edge hosts. Free community tier includes threat intel access.
+
+**NixOS packaging:** Available in nixpkgs as `crowdsec`.
diff --git a/docs/plans/nixos-router.md b/docs/plans/nixos-router.md
new file mode 100644
index 0000000..4a15a40
--- /dev/null
+++ b/docs/plans/nixos-router.md
@@ -0,0 +1,162 @@
+# NixOS Router: Replace EdgeRouter
+
+Replace the aging Ubiquiti EdgeRouter (gw, 10.69.10.1) with a NixOS-based router.
+The EdgeRouter is suspected to be a throughput bottleneck. A NixOS router integrates
+naturally with the existing fleet: same config management, same monitoring pipeline,
+same deployment workflow.
+
+## Goals
+
+- Eliminate the EdgeRouter throughput bottleneck
+- Full integration with existing monitoring (node-exporter, promtail, Prometheus, Loki)
+- Declarative firewall and routing config managed in the flake
+- Inter-VLAN routing for all existing subnets
+- DHCP server for client subnets
+- NetFlow/traffic accounting for future ntopng integration
+- Foundation for WireGuard remote access (see remote-access.md)
+
+## Current Network Topology
+
+**Subnets (known VLANs):**
+| VLAN/Subnet    | Purpose          | Notable hosts                          |
+|----------------|------------------|----------------------------------------|
+| 10.69.10.0/24  | Gateway          | gw (10.69.10.1)                        |
+| 10.69.12.0/24  | Core services    | nas, pve1, arr jails, restic           |
+| 10.69.13.0/24  | Infrastructure   | All NixOS servers (static IPs)         |
+| 10.69.22.0/24  | WLAN             | unifi-ctrl                             |
+| 10.69.30.0/24  | Workstations     | gunter                                 |
+| 10.69.31.0/24  | Media            | media                                  |
+| 10.69.99.0/24  | Management       | sw1 (MikroTik CRS326-24G-2S+)          |
+
+**DNS:** ns1 (10.69.13.5) and ns2 (10.69.13.6) handle all resolution. Upstream is
+Cloudflare/Google over DoT via Unbound.
+
+**Switch:** MikroTik CRS326-24G-2S+: L2 switching with VLAN trunking. Capable of
+L3 routing via RouterOS but not ideal for sustained routing throughput.
+
+## Hardware
+
+Needs a small x86 box with:
+- At least 2 NICs (WAN + LAN trunk). Dual 2.5GbE preferred.
+- Enough CPU for nftables NAT at line rate (any modern x86 is fine)
+- 4-8 GB RAM (plenty for routing + DHCP + NetFlow accounting)
+- Low power consumption, fanless preferred for always-on use
+
+Candidates:
+- Topton / CWWK mini PC with dual/quad Intel 2.5GbE (~100-150 EUR)
+- Protectli Vault (more expensive, ~200-300 EUR, proven in pfSense/OPNsense community)
+- Any mini PC with one onboard NIC + one USB 2.5GbE adapter (cheapest, less ideal)
+
+The LAN port would carry a VLAN trunk to the MikroTik switch, with sub-interfaces
+for each VLAN. WAN port connects to the ISP uplink.
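A minimal sketch of the trunk layout described above, expressed as systemd-networkd options. Several things here are assumptions rather than anything the plan confirms: the interface names `wan0` and `lan0`, the VLAN IDs (taken from the third octet of each subnet), the `.1` gateway address on each sub-interface, and DHCP on the WAN side (the uplink type is still listed as an open question). Only two of the seven subnets are shown.

```nix
{ ... }:
{
  # Assumption: the NICs have already been given the stable names wan0 / lan0.
  networking.useNetworkd = true;
  systemd.network.enable = true;

  # One VLAN netdev per subnet. IDs are hypothetical (third octet of the subnet).
  systemd.network.netdevs = {
    "20-vlan13" = {
      netdevConfig = { Kind = "vlan"; Name = "vlan13"; };
      vlanConfig.Id = 13;
    };
    "20-vlan30" = {
      netdevConfig = { Kind = "vlan"; Name = "vlan30"; };
      vlanConfig.Id = 30;
    };
  };

  systemd.network.networks = {
    # WAN uplink. DHCP is an assumption; PPPoE or static would look different.
    "10-wan" = {
      matchConfig.Name = "wan0";
      networkConfig.DHCP = "ipv4";
    };

    # LAN trunk: carries only tagged traffic and has no address of its own.
    "30-lan-trunk" = {
      matchConfig.Name = "lan0";
      vlan = [ "vlan13" "vlan30" ];
      networkConfig.LinkLocalAddressing = "no";
    };

    # The gateway address for each subnet lives on its VLAN sub-interface.
    "40-vlan13" = {
      matchConfig.Name = "vlan13";
      address = [ "10.69.13.1/24" ]; # hypothetical gateway address
    };
    "40-vlan30" = {
      matchConfig.Name = "vlan30";
      address = [ "10.69.30.1/24" ]; # hypothetical gateway address
    };
  };
}
```

Adding a subnet then means one more netdev, one more entry in the trunk's `vlan` list, and one more network unit, which keeps the per-VLAN gateway addresses visible in one place.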
+
+## NixOS Configuration
+
+### Stability Policy
+
+The router is treated differently from the rest of the fleet:
+- **No auto-upgrade**: `system.autoUpgrade.enable = false`
+- **No homelab-deploy listener**: `homelab.deploy.enable = false`
+- **Manual updates only**: update every few months, test-build first
+- **Use `nixos-rebuild boot`**: changes take effect on next deliberate reboot
+- **Tier: prod, priority: high**: alerts treated with highest priority
+
+### Core Services
+
+**Routing & NAT:**
+- `systemd-networkd` for all interface config (consistent with rest of fleet)
+- VLAN sub-interfaces on the LAN trunk (one per subnet)
+- `networking.nftables` for stateful firewall and NAT
+- IP forwarding enabled (`net.ipv4.ip_forward = 1`)
+- Masquerade outbound traffic on WAN interface
+
+**DHCP:**
+- Kea or dnsmasq for DHCP on client subnets (WLAN, workstations, media)
+- Infrastructure subnet (10.69.13.0/24) stays static; no DHCP needed
+- Static leases for known devices
+
+**Firewall (nftables):**
+- Default deny between VLANs
+- Explicit allow rules for known cross-VLAN traffic:
+  - All subnets → ns1/ns2 (DNS)
+  - All subnets → monitoring01 (metrics/logs)
+  - Infrastructure → all (management access)
+  - Workstations → media, core services
+- NAT masquerade on WAN
+- Rate limiting on WAN-facing services
+
+**Traffic Accounting:**
+- nftables flow accounting or softflowd for NetFlow export
+- Export to future ntopng instance (see new-services.md)
+
+### Monitoring Integration
+
+Since this is a NixOS host in the flake, it gets the standard monitoring stack for free:
+- node-exporter for system metrics (CPU, memory, NIC throughput per interface)
+- promtail shipping logs to Loki
+- Prometheus scrape target auto-registration
+- Alertmanager alerts for host-down, high CPU, etc.
+
+Additional router-specific monitoring:
+- Per-VLAN interface traffic metrics via node-exporter (automatic for all interfaces)
+- NAT connection tracking table size
+- WAN uplink status and throughput
+- DHCP lease metrics (if Kea, it has a Prometheus exporter)
+
+This is a significant advantage over the EdgeRouter: full observability through
+the existing Grafana dashboards and Loki log search, debuggable via the monitoring
+MCP tools.
+
+### WireGuard Integration
+
+The remote access plan (remote-access.md) currently proposes a separate `extgw01`
+gateway host. With a NixOS router, there's a decision to make:
+
+**Option A:** WireGuard terminates on the router itself. Simplest topology: the
+router is already the gateway, so VPN traffic doesn't need extra hops or firewall
+rules. But adds complexity to the router, which should stay simple.
+
+**Option B:** Keep extgw01 as a separate host (original plan). Router just routes
+traffic to it. Better separation of concerns, router stays minimal.
+
+Recommendation: Start with option B (keep it separate). The router should do routing
+and nothing else. WireGuard can move to the router later if extgw01 feels redundant.
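Stepping back to the Stability Policy and Core Services items earlier in this section, the following is a rough sketch of how the forwarding, default-deny, and masquerade pieces might fit together in one module. It is not a complete ruleset. The interface names `wan0` and `vlanNN` and the table names are hypothetical, `homelab.deploy.enable` is the fleet's own option as named in the plan, and the monitoring01 and workstation rules are omitted because their addresses and ports are not given.

```nix
{ ... }:
{
  # Stability policy flags, as named in the plan.
  system.autoUpgrade.enable = false;
  homelab.deploy.enable = false; # fleet-specific module, not part of nixpkgs

  # Let the kernel forward IPv4 packets between interfaces.
  boot.kernel.sysctl."net.ipv4.ip_forward" = 1;

  networking.nftables.enable = true;
  networking.nftables.ruleset = ''
    table inet router_filter {
      chain forward {
        # Default deny: anything not matched below is dropped.
        type filter hook forward priority filter; policy drop;

        # Return traffic for connections that were already allowed.
        ct state established,related accept

        # All subnets -> ns1/ns2 (DNS).
        ip daddr { 10.69.13.5, 10.69.13.6 } udp dport 53 accept
        ip daddr { 10.69.13.5, 10.69.13.6 } tcp dport 53 accept

        # Infrastructure -> all (management access); vlan13 is an assumed name.
        iifname "vlan13" accept

        # Any VLAN -> internet through the WAN interface.
        iifname "vlan*" oifname "wan0" accept
      }
    }

    table ip router_nat {
      chain postrouting {
        type nat hook postrouting priority srcnat; policy accept;
        # Masquerade everything leaving via WAN.
        oifname "wan0" masquerade
      }
    }
  '';
}
```

Keeping the forward chain at `policy drop` means each cross-VLAN allowance in the plan maps to one explicit rule, so the port forwards and rate limits listed under Open Questions can be added later without loosening the default.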
+
+## Migration Plan
+
+### Phase 1: Build and lab test
+- Acquire hardware
+- Create host config in the flake (routing, NAT, DHCP, firewall)
+- Test-build on workstation: `nix build .#nixosConfigurations.router01.config.system.build.toplevel`
+- Lab test with a temporary setup if possible (two NICs, isolated VLAN)
+
+### Phase 2: Prepare cutover
+- Pre-configure the MikroTik switch trunk port for the new router
+- Document current EdgeRouter config (port forwarding, NAT rules, DHCP leases)
+- Replicate all rules in the NixOS config
+- Verify DNS, DHCP, and inter-VLAN routing work in test
+
+### Phase 3: Cutover
+- Schedule a maintenance window (brief downtime expected)
+- Swap WAN cable from EdgeRouter to new router
+- Swap LAN trunk from EdgeRouter to new router
+- Verify connectivity from each VLAN
+- Verify internet access, DNS resolution, inter-VLAN routing
+- Monitor via Prometheus/Loki (immediately available since it's a fleet host)
+
+### Phase 4: Decommission EdgeRouter
+- Keep EdgeRouter available as fallback for a few weeks
+- Remove `gw` entry from external-hosts.nix, replace with flake-managed host
+- Update any references to 10.69.10.1 if the router IP changes
+
+## Open Questions
+
+- **Router IP:** Keep 10.69.10.1 or move to a different address? Each VLAN
+  sub-interface needs an IP (the gateway address for that subnet).
+- **ISP uplink:** What type of WAN connection? PPPoE, DHCP, static IP?
+- **Port forwarding:** What ports are currently forwarded on the EdgeRouter?
+  These need to be replicated in nftables.
+- **DHCP scope:** Which subnets currently get DHCP from the EdgeRouter vs
+  other sources (UniFi controller for WLAN?)?
+- **UPnP/NAT-PMP:** Needed for any devices? (gaming consoles, etc.)
+- **Hardware preference:** Fanless mini PC budget and preferred vendor?