docs: add new service candidates and NixOS router plans

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-14 13:21:34 +01:00
parent fd7caf7f00
commit 4d614d8716
2 changed files with 307 additions and 0 deletions
--- a/docs/plans/new-services.md
+++ b/docs/plans/new-services.md
@@ -0,0 +1,145 @@
+# New Service Candidates
+
+Ideas for additional services to deploy in the homelab. These lean more enterprise/obscure
+than the typical self-hosted fare.
+
+## Litestream
+
+Continuous SQLite replication to S3-compatible storage. Streams WAL changes in near-real-time,
+providing point-in-time recovery without scheduled backup jobs.
+
+**Why:** Several services use SQLite (Home Assistant, potentially others). Litestream would
+give continuous backup to Garage S3 with minimal resource overhead and near-zero configuration.
+Replaces cron-based backup scripts with a small daemon per database.
+
+**Integration points:**
+- Garage S3 as replication target (already deployed)
+- Home Assistant SQLite database is the primary candidate
+- Could also cover any future SQLite-backed services
+
+**Complexity:** Low. Single Go binary, minimal config (source DB path + S3 endpoint).
+
+**NixOS packaging:** Available in nixpkgs as `litestream`.
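+
+A minimal sketch of what this could look like with the `services.litestream` NixOS module.
+The database path, bucket name, endpoint URL, region, and secrets file are assumptions for
+illustration, not values from the current setup. The replica keys follow the Litestream
+0.3.x config layout and may differ in newer releases.
+
+```nix
+{ ... }:
+{
+  services.litestream = {
+    enable = true;
+    # Assumed path: a file exporting LITESTREAM_ACCESS_KEY_ID and
+    # LITESTREAM_SECRET_ACCESS_KEY for the Garage bucket.
+    environmentFile = "/run/secrets/litestream-env";
+    settings.dbs = [
+      {
+        # Assumed Home Assistant database location.
+        path = "/var/lib/hass/home-assistant_v2.db";
+        replicas = [
+          {
+            type = "s3";
+            bucket = "litestream";                      # hypothetical bucket
+            path = "home-assistant";
+            endpoint = "https://s3.example.internal";   # hypothetical Garage endpoint
+            region = "garage";
+            force-path-style = true;
+          }
+        ];
+      }
+    ];
+  };
+}
+```
+
+The litestream service user would also need read access to the database file and its
+WAL, which is not covered here.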
+
+---
+
+## ntopng
+
+Deep network traffic analysis and flow monitoring. Provides real-time visibility into bandwidth
+usage, protocol distribution, top talkers, and anomaly detection via a web UI.
+
+**Why:** We have host-level metrics (node-exporter) and logs (Loki) but no network-level
+visibility. ntopng would show traffic patterns across the infrastructure — NFS throughput to
+the NAS, DNS query volume, inter-host traffic, and bandwidth anomalies. Useful for capacity
+planning and debugging network issues.
+
+**Integration points:**
+- Could export metrics to Prometheus via its built-in exporter
+- Web UI behind http-proxy with Kanidm OIDC (if supported) or Pomerium
+- NetFlow/sFlow from managed switches (if available; ntopng ingests flows through nProbe
+  rather than collecting NetFlow directly)
+- Passive traffic capture on a mirror port or the monitoring host itself
+
+**Complexity:** Medium. Needs network tap or mirror port for full visibility, or can run
+in host-local mode. May need a dedicated interface or VLAN mirror.
+
+**NixOS packaging:** Available in nixpkgs as `ntopng`.
+
+---
+
+## Renovate
+
+Automated dependency update bot that understands Nix flakes natively. Creates branches/PRs
+to bump flake inputs on a configurable schedule.
+
+**Why:** Currently `nix flake update` is manual. Renovate can automatically propose updates
+to individual flake inputs (nixpkgs, homelab-deploy, nixos-exporter, etc.), group related
+updates, and respect schedules. More granular than updating everything at once — can bump
+nixpkgs weekly but hold back other inputs, auto-merge patch-level changes, etc.
+
+**Integration points:**
+- Runs against git.t-juice.club repositories
+- Understands `flake.lock` format natively
+- Could target both `nixos-servers` and `nixos` repos
+- Update branches would be validated by homelab-deploy builder
+
+**Complexity:** Medium. Needs git forge integration (Gitea/Forgejo API). Self-hosted runner
+mode available. Configuration via `renovate.json` in each repo.
+
+**NixOS packaging:** Available in nixpkgs as `renovate`.
+
+---
+
+## Pomerium
+
+Identity-aware reverse proxy implementing zero-trust access. Every request is authenticated
+and authorized based on identity, device, and context — not just network location.
+
+**Why:** Currently Caddy terminates TLS but doesn't enforce authentication on most services.
+Pomerium would put Kanidm OIDC authentication in front of every internal service, with
+per-route authorization policies (e.g., "only admins can access Prometheus," "require re-auth
+for Vault UI"). Directly addresses the security hardening plan's goals.
+
+**Integration points:**
+- Kanidm as OIDC identity provider (already deployed)
+- Could replace or sit in front of Caddy for internal services
+- Per-route policies based on Kanidm groups (admins, users, ssh-users)
+- Centralizes access logging and audit trail
+
+**Complexity:** Medium-high. Needs careful integration with existing Caddy reverse proxy.
+Decision needed on whether Pomerium replaces Caddy or works alongside it (Pomerium for
+auth, Caddy for TLS termination and routing, or Pomerium handles everything).
+
+**NixOS packaging:** Available in nixpkgs as `pomerium`.
+
+---
+
+## Apache Guacamole
+
+Clientless remote desktop and SSH gateway. Provides browser-based access to hosts via
+RDP, VNC, SSH, and Telnet with no client software required. Supports session recording
+and playback.
+
+**Why:** Provides an alternative remote access path that doesn't require VPN software or
+SSH keys on the client device. Useful for accessing hosts from untrusted machines (phone,
+borrowed laptop) or providing temporary access to others. Session recording gives an audit
+trail. Could complement the WireGuard remote access plan rather than replace it.
+
+**Integration points:**
+- Kanidm for authentication (OIDC or LDAP)
+- Behind http-proxy or Pomerium for TLS
+- SSH access to all hosts in the fleet
+- Session recordings could be stored on Garage S3
+- Could serve as the "emergency access" path when VPN is unavailable
+
+**Complexity:** Medium. Two components: guacd, a native daemon written in C, and a Java
+web application. Typically needs PostgreSQL for connection/user storage (already
+available). Docker is the common deployment method but native packaging exists.
+
+**NixOS packaging:** Available in nixpkgs as `guacamole-server` and `guacamole-client`.
+
+---
+
+## CrowdSec
+
+Collaborative intrusion prevention system with crowd-sourced threat intelligence.
+Parses logs to detect attack patterns, applies remediation (firewall bans, CAPTCHA),
+and shares/receives threat signals from a global community network.
+
+**Why:** Goes beyond fail2ban with behavioral detection, crowd-sourced IP reputation,
+and a scenario-based engine. Fits the security hardening plan. The community blocklist
+means we benefit from threat intelligence gathered across thousands of deployments.
+Could parse SSH logs, HTTP access logs, and other service logs to detect and block
+malicious activity.
+
+**Integration points:**
+- Could consume logs from Loki or directly from journald/log files
+- Firewall bouncer for iptables/nftables remediation
+- Caddy bouncer for HTTP-level blocking
+- Prometheus metrics exporter for alert integration
+- Scenarios available for SSH brute force, HTTP scanning, and more
+- Feeds into existing alerting pipeline (Alertmanager -> alerttonotify)
+
+**Complexity:** Medium. Agent (log parser + decision engine) on each host or centralized.
+Bouncers (enforcement) on edge hosts. Free community tier includes threat intel access.
+
+**NixOS packaging:** Available in nixpkgs as `crowdsec`.
--- a/docs/plans/nixos-router.md
+++ b/docs/plans/nixos-router.md
@@ -0,0 +1,162 @@
+# NixOS Router — Replace EdgeRouter
+
+Replace the aging Ubiquiti EdgeRouter (gw, 10.69.10.1) with a NixOS-based router.
+The EdgeRouter is suspected to be a throughput bottleneck. A NixOS router integrates
+naturally with the existing fleet: same config management, same monitoring pipeline,
+same deployment workflow.
+
+## Goals
+
+- Eliminate the EdgeRouter throughput bottleneck
+- Full integration with existing monitoring (node-exporter, promtail, Prometheus, Loki)
+- Declarative firewall and routing config managed in the flake
+- Inter-VLAN routing for all existing subnets
+- DHCP server for client subnets
+- NetFlow/traffic accounting for future ntopng integration
+- Foundation for WireGuard remote access (see remote-access.md)
+
+## Current Network Topology
+
+**Subnets (known VLANs):**
+
+| VLAN/Subnet    | Purpose          | Notable hosts                          |
+|----------------|------------------|----------------------------------------|
+| 10.69.10.0/24  | Gateway          | gw (10.69.10.1)                        |
+| 10.69.12.0/24  | Core services    | nas, pve1, arr jails, restic           |
+| 10.69.13.0/24  | Infrastructure   | All NixOS servers (static IPs)         |
+| 10.69.22.0/24  | WLAN             | unifi-ctrl                             |
+| 10.69.30.0/24  | Workstations     | gunter                                 |
+| 10.69.31.0/24  | Media            | media                                  |
+| 10.69.99.0/24  | Management       | sw1 (MikroTik CRS326-24G-2S+)          |
+
+**DNS:** ns1 (10.69.13.5) and ns2 (10.69.13.6) handle all resolution. Upstream is
+Cloudflare/Google over DoT via Unbound.
+
+**Switch:** MikroTik CRS326-24G-2S+ — L2 switching with VLAN trunking. Capable of
+L3 routing via RouterOS but not ideal for sustained routing throughput.
+
+## Hardware
+
+Needs a small x86 box with:
+- At least 2 NICs (WAN + LAN trunk). Dual 2.5GbE preferred.
+- Enough CPU for nftables NAT at line rate (any modern x86 is fine)
+- 4-8 GB RAM (plenty for routing + DHCP + NetFlow accounting)
+- Low power consumption, fanless preferred for always-on use
+
+Candidates:
+- Topton / CWWK mini PC with dual/quad Intel 2.5GbE (~100-150 EUR)
+- Protectli Vault (more expensive, ~200-300 EUR, proven in pfSense/OPNsense community)
+- Any mini PC with one onboard NIC + one USB 2.5GbE adapter (cheapest, less ideal)
+
+The LAN port would carry a VLAN trunk to the MikroTik switch, with sub-interfaces
+for each VLAN. WAN port connects to the ISP uplink.
+
+## NixOS Configuration
+
+### Stability Policy
+
+The router is treated differently from the rest of the fleet:
+- **No auto-upgrade** — `system.autoUpgrade.enable = false`
+- **No homelab-deploy listener** — `homelab.deploy.enable = false`
+- **Manual updates only** — update every few months, test-build first
+- **Use `nixos-rebuild boot`** — changes take effect on next deliberate reboot
+- **Tier: prod, priority: high** — alerts treated with highest priority
+
+### Core Services
+
+**Routing & NAT:**
+- `systemd-networkd` for all interface config (consistent with rest of fleet)
+- VLAN sub-interfaces on the LAN trunk (one per subnet)
+- `networking.nftables` for stateful firewall and NAT
+- IP forwarding enabled (`boot.kernel.sysctl."net.ipv4.ip_forward" = 1`)
+- Masquerade outbound traffic on WAN interface
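+
+A rough sketch of the interface side, showing one VLAN sub-interface on the trunk. The
+interface names (`wan0`, `lan0`), the VLAN ID (13), the gateway address 10.69.13.1, and
+DHCP on the WAN side are all assumptions; the actual VLAN IDs, router addresses, and
+ISP uplink type are still open questions.
+
+```nix
+{ ... }:
+{
+  networking.useNetworkd = true;
+  systemd.network.enable = true;
+
+  # Let the kernel route IPv4 between interfaces.
+  boot.kernel.sysctl."net.ipv4.ip_forward" = 1;
+
+  # VLAN device for the infrastructure subnet (VLAN ID assumed).
+  systemd.network.netdevs."20-vlan13" = {
+    netdevConfig = { Kind = "vlan"; Name = "vlan13"; };
+    vlanConfig.Id = 13;
+  };
+
+  systemd.network.networks = {
+    # WAN uplink; DHCP is an assumption, PPPoE or static would differ.
+    "10-wan" = {
+      matchConfig.Name = "wan0";
+      networkConfig.DHCP = "ipv4";
+    };
+    # LAN trunk carries only tagged VLANs, so it gets no address itself.
+    "10-lan" = {
+      matchConfig.Name = "lan0";
+      vlan = [ "vlan13" ];
+      networkConfig.LinkLocalAddressing = "no";
+    };
+    # Gateway address for the subnet lives on the sub-interface.
+    "20-vlan13" = {
+      matchConfig.Name = "vlan13";
+      address = [ "10.69.13.1/24" ];
+    };
+  };
+}
+```
+
+Each additional subnet would repeat the netdev/network pair and add its name to the
+trunk's `vlan` list.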
+
+**DHCP:**
+- Kea or dnsmasq for DHCP on client subnets (WLAN, workstations, media)
+- Infrastructure subnet (10.69.13.0/24) stays static — no DHCP needed
+- Static leases for known devices
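+
+If Kea is chosen, a single client subnet might be sketched as below. The interface name,
+pool range, gateway address, and the reservation's MAC and IP are hypothetical
+placeholders; only the subnet and the ns1/ns2 addresses come from the topology above.
+
+```nix
+{ ... }:
+{
+  services.kea.dhcp4 = {
+    enable = true;
+    settings = {
+      # Listen only on the workstation VLAN sub-interface (name assumed).
+      interfaces-config.interfaces = [ "vlan30" ];
+      lease-database = {
+        type = "memfile";
+        persist = true;
+        name = "/var/lib/kea/dhcp4.leases";
+      };
+      valid-lifetime = 3600;
+      subnet4 = [
+        {
+          id = 30;
+          subnet = "10.69.30.0/24";
+          pools = [ { pool = "10.69.30.100 - 10.69.30.200"; } ];
+          option-data = [
+            { name = "routers"; data = "10.69.30.1"; }
+            { name = "domain-name-servers"; data = "10.69.13.5, 10.69.13.6"; }
+          ];
+          # Static lease for a known device (MAC and IP are placeholders).
+          reservations = [
+            { hw-address = "00:00:5e:00:53:01"; ip-address = "10.69.30.10"; }
+          ];
+        }
+      ];
+    };
+  };
+}
+```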
+
+**Firewall (nftables):**
+- Default deny between VLANs
+- Explicit allow rules for known cross-VLAN traffic:
+  - All subnets → ns1/ns2 (DNS)
+  - All subnets → monitoring01 (metrics/logs)
+  - Infrastructure → all (management access)
+  - Workstations → media, core services
+- NAT masquerade on WAN
+- Rate limiting on WAN-facing services
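+
+The policy above could be sketched as an nftables ruleset along these lines. Interface
+names and VLAN numbering are assumptions, the monitoring01 rule is omitted because its
+address is not listed here, and only the forward and NAT paths are shown (no input
+chain, no rate limiting, no port forwards).
+
+```nix
+{ ... }:
+{
+  networking.nftables.enable = true;
+  networking.nftables.ruleset = ''
+    table inet router-filter {
+      chain forward {
+        type filter hook forward priority filter; policy drop;
+
+        # Return traffic for connections that were already allowed.
+        ct state established,related accept
+        ct state invalid drop
+
+        # All subnets -> ns1/ns2 (DNS)
+        ip daddr { 10.69.13.5, 10.69.13.6 } udp dport 53 accept
+        ip daddr { 10.69.13.5, 10.69.13.6 } tcp dport 53 accept
+
+        # Infrastructure -> all (management access)
+        iifname "vlan13" accept
+
+        # Workstations -> media, core services
+        iifname "vlan30" oifname { "vlan31", "vlan12" } accept
+
+        # Any VLAN -> internet
+        iifname "vlan*" oifname "wan0" accept
+      }
+    }
+
+    table ip router-nat {
+      chain postrouting {
+        type nat hook postrouting priority srcnat; policy accept;
+        oifname "wan0" masquerade
+      }
+    }
+  '';
+}
+```
+
+Anything not matched falls through to the chain's `drop` policy, which gives the
+default deny between VLANs.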
+
+**Traffic Accounting:**
+- nftables flow accounting or softflowd for NetFlow export
+- Export to future ntopng instance (see new-services.md)
+
+### Monitoring Integration
+
+Since this is a NixOS host in the flake, it gets the standard monitoring stack for free:
+- node-exporter for system metrics (CPU, memory, NIC throughput per interface)
+- promtail shipping logs to Loki
+- Prometheus scrape target auto-registration
+- Alertmanager alerts for host-down, high CPU, etc.
+
+Additional router-specific monitoring:
+- Per-VLAN interface traffic metrics via node-exporter (automatic for all interfaces)
+- NAT connection tracking table size
+- WAN uplink status and throughput
+- DHCP lease metrics (if Kea is used, a Prometheus exporter is available for it)
+
+This is a significant advantage over the EdgeRouter — full observability through
+the existing Grafana dashboards and Loki log search, debuggable via the monitoring
+MCP tools.
+
+### WireGuard Integration
+
+The remote access plan (remote-access.md) currently proposes a separate `extgw01`
+gateway host. With a NixOS router, there's a decision to make:
+
+**Option A:** WireGuard terminates on the router itself. Simplest topology — the
+router is already the gateway, so VPN traffic doesn't need extra hops or firewall
+rules. But adds complexity to the router, which should stay simple.
+
+**Option B:** Keep extgw01 as a separate host (original plan). Router just routes
+traffic to it. Better separation of concerns, router stays minimal.
+
+Recommendation: Start with option B (keep it separate). The router should do routing
+and nothing else. WireGuard can move to the router later if extgw01 feels redundant.
+
+## Migration Plan
+
+### Phase 1: Build and lab test
+- Acquire hardware
+- Create host config in the flake (routing, NAT, DHCP, firewall)
+- Test-build on workstation: `nix build .#nixosConfigurations.router01.config.system.build.toplevel`
+- Lab test with a temporary setup if possible (two NICs, isolated VLAN)
+
+### Phase 2: Prepare cutover
+- Pre-configure the MikroTik switch trunk port for the new router
+- Document current EdgeRouter config (port forwarding, NAT rules, DHCP leases)
+- Replicate all rules in the NixOS config
+- Verify DNS, DHCP, and inter-VLAN routing work in test
+
+### Phase 3: Cutover
+- Schedule a maintenance window (brief downtime expected)
+- Swap WAN cable from EdgeRouter to new router
+- Swap LAN trunk from EdgeRouter to new router
+- Verify connectivity from each VLAN
+- Verify internet access, DNS resolution, inter-VLAN routing
+- Monitor via Prometheus/Loki (immediately available since it's a fleet host)
+
+### Phase 4: Decommission EdgeRouter
+- Keep EdgeRouter available as fallback for a few weeks
+- Remove `gw` entry from external-hosts.nix, replace with flake-managed host
+- Update any references to 10.69.10.1 if the router IP changes
+
+## Open Questions
+
+- **Router IP:** Keep 10.69.10.1 or move to a different address? Each VLAN
+  sub-interface needs an IP (the gateway address for that subnet).
+- **ISP uplink:** What type of WAN connection? PPPoE, DHCP, static IP?
+- **Port forwarding:** What ports are currently forwarded on the EdgeRouter?
+  These need to be replicated in nftables.
+- **DHCP scope:** Which subnets currently get DHCP from the EdgeRouter, and which get
+  it from other sources (for example, the UniFi controller for WLAN)?
+- **UPnP/NAT-PMP:** Needed for any devices? (gaming consoles, etc.)
+- **Hardware preference:** Fanless mini PC budget and preferred vendor?