docs: add new service candidates and NixOS router plans
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
145
docs/plans/new-services.md
Normal file
@@ -0,0 +1,145 @@
# New Service Candidates

Ideas for additional services to deploy in the homelab. These lean more enterprise/obscure
than the typical self-hosted fare.

## Litestream

Continuous SQLite replication to S3-compatible storage. Streams WAL changes in near-real-time,
providing point-in-time recovery without scheduled backup jobs.

**Why:** Several services use SQLite (Home Assistant, potentially others). Litestream would
give continuous backup to Garage S3 with minimal resource overhead and near-zero configuration.
Replaces cron-based backup scripts with a small daemon per database.

**Integration points:**
- Garage S3 as replication target (already deployed)
- Home Assistant SQLite database is the primary candidate
- Could also cover any future SQLite-backed services

**Complexity:** Low. Single Go binary, minimal config (source DB path + S3 endpoint).

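As a rough sketch of how small that config could be, the fragment below uses the nixpkgs `services.litestream` module. The database path assumes Home Assistant's default NixOS state directory. The bucket name, Garage endpoint, and secrets file path are hypothetical placeholders, and the replica keys follow Litestream's S3 replica settings, which may differ between Litestream versions.

```nix
{
  services.litestream = {
    enable = true;
    # Assumed path: an env file that exports LITESTREAM_ACCESS_KEY_ID and
    # LITESTREAM_SECRET_ACCESS_KEY for a Garage key pair.
    environmentFile = "/run/secrets/litestream";
    settings.dbs = [
      {
        # Assumed default location of the Home Assistant recorder database.
        path = "/var/lib/hass/home-assistant_v2.db";
        replicas = [
          {
            type = "s3";
            bucket = "litestream";                     # hypothetical bucket
            path = "home-assistant";                   # prefix inside the bucket
            endpoint = "https://s3.example.internal";  # hypothetical Garage endpoint
            region = "garage";                         # Garage's default region name
            force-path-style = true;
          }
        ];
      }
    ];
  };
}
```

The sketch does not handle file permissions: the Litestream service user would still need read access to the Home Assistant database and its directory.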

**NixOS packaging:** Available in nixpkgs as `litestream`.

---

## ntopng

Deep network traffic analysis and flow monitoring. Provides real-time visibility into bandwidth
usage, protocol distribution, top talkers, and anomaly detection via a web UI.

**Why:** We have host-level metrics (node-exporter) and logs (Loki) but no network-level
visibility. ntopng would show traffic patterns across the infrastructure: NFS throughput to
the NAS, DNS query volume, inter-host traffic, and bandwidth anomalies. Useful for capacity
planning and debugging network issues.

**Integration points:**
- Could export metrics to Prometheus via its built-in exporter
- Web UI behind http-proxy with Kanidm OIDC (if supported) or Pomerium
- NetFlow/sFlow from managed switches (if available)
- Passive traffic capture on a mirror port or the monitoring host itself

**Complexity:** Medium. Needs network tap or mirror port for full visibility, or can run
in host-local mode. May need a dedicated interface or VLAN mirror.

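A minimal sketch of the host-local or mirror-port variant, assuming the nixpkgs `services.ntopng` module. The interface name `eth1` is a hypothetical placeholder for whichever NIC receives the mirrored traffic.

```nix
{
  services.ntopng = {
    enable = true;
    # Hypothetical interface name: point this at the mirror-port NIC for
    # full visibility, or at the host's own interface for host-local mode.
    interfaces = [ "eth1" ];
    # Port for the web UI, which would sit behind the reverse proxy.
    httpPort = 3000;
  };
}
```

Authentication in front of the web UI and any NetFlow/sFlow collection are left out here and would need separate configuration.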

**NixOS packaging:** Available in nixpkgs as `ntopng`.

---

## Renovate

Automated dependency update bot that understands Nix flakes natively. Creates branches/PRs
to bump flake inputs on a configurable schedule.

**Why:** Currently `nix flake update` is manual. Renovate can automatically propose updates
to individual flake inputs (nixpkgs, homelab-deploy, nixos-exporter, etc.), group related
updates, and respect schedules. More granular than updating everything at once: it can bump
nixpkgs weekly but hold back other inputs, auto-merge patch-level changes, etc.

**Integration points:**
- Runs against git.t-juice.club repositories
- Understands `flake.lock` format natively
- Could target both `nixos-servers` and `nixos` repos
- Update branches would be validated by homelab-deploy builder

**Complexity:** Medium. Needs git forge integration (Gitea/Forgejo API). Self-hosted runner
mode available. Configuration via `renovate.json` in each repo.

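One possible shape for the self-hosted side, sketched with the nixpkgs `services.renovate` module (present in recent NixOS releases). The repository owner `example-owner`, the token path, and the weekly schedule are assumptions, and the exact `endpoint` format should be checked against Renovate's Gitea platform documentation. Per-repo rules such as grouping or auto-merge would still live in each repo's `renovate.json`.

```nix
{ pkgs, ... }:
{
  services.renovate = {
    enable = true;
    # systemd calendar expression; "weekly" is an assumed cadence.
    schedule = "weekly";
    # Assumed path to a file holding a forge API token for the bot user.
    credentials.RENOVATE_TOKEN = "/run/secrets/renovate-token";
    # Renovate shells out to nix when it refreshes flake.lock.
    runtimePackages = [ pkgs.nix ];
    settings = {
      platform = "gitea";
      endpoint = "https://git.t-juice.club";
      # Hypothetical owner; the repo names come from this plan.
      repositories = [
        "example-owner/nixos-servers"
        "example-owner/nixos"
      ];
      # Renovate's Nix manager is opt-in.
      nix.enabled = true;
      lockFileMaintenance.enabled = true;
    };
  };
}
```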

**NixOS packaging:** Available in nixpkgs as `renovate`.

---

## Pomerium

Identity-aware reverse proxy implementing zero-trust access. Every request is authenticated
and authorized based on identity, device, and context, not just network location.

**Why:** Currently Caddy terminates TLS but doesn't enforce authentication on most services.
Pomerium would put Kanidm OIDC authentication in front of every internal service, with
per-route authorization policies (e.g., "only admins can access Prometheus," "require re-auth
for Vault UI"). Directly addresses the security hardening plan's goals.

**Integration points:**
- Kanidm as OIDC identity provider (already deployed)
- Could replace or sit in front of Caddy for internal services
- Per-route policies based on Kanidm groups (admins, users, ssh-users)
- Centralizes access logging and audit trail

**Complexity:** Medium-high. Needs careful integration with existing Caddy reverse proxy.
Decision needed on whether Pomerium replaces Caddy or works alongside it (Pomerium for
auth, Caddy for TLS termination and routing, or Pomerium handles everything).

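To make the per-route policy idea concrete, here is a sketch of a single route using the nixpkgs `services.pomerium` module. All hostnames, the OIDC client ID, and the secrets file path are hypothetical. The name of the groups claim and the format of the group values that Kanidm emits are assumptions that would need checking, and TLS certificate settings are omitted entirely.

```nix
{
  services.pomerium = {
    enable = true;
    # Assumed path: env file providing IDP_CLIENT_SECRET, SHARED_SECRET
    # and COOKIE_SECRET so they stay out of the Nix store.
    secretsFile = "/run/secrets/pomerium.env";
    settings = {
      authenticate_service_url = "https://auth.example.internal";
      idp_provider = "oidc";
      # Hypothetical Kanidm issuer URL for a client named "pomerium".
      idp_provider_url = "https://idm.example.internal/oauth2/openid/pomerium";
      idp_client_id = "pomerium";
      routes = [
        {
          from = "https://prometheus.example.internal";
          to = "http://127.0.0.1:9090";
          # "Only admins can access Prometheus", expressed as a policy
          # that matches on an assumed "groups" claim.
          policy = [
            { allow.and = [ { "claim/groups" = "admins"; } ]; }
          ];
        }
      ];
    };
  };
}
```

This shape works for either outcome of the Caddy decision: the `to` target can be the service itself or a Caddy listener in front of it.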

**NixOS packaging:** Available in nixpkgs as `pomerium`.

---

## Apache Guacamole

Clientless remote desktop and SSH gateway. Provides browser-based access to hosts via
RDP, VNC, SSH, and Telnet with no client software required. Supports session recording
and playback.

**Why:** Provides an alternative remote access path that doesn't require VPN software or
SSH keys on the client device. Useful for accessing hosts from untrusted machines (phone,
borrowed laptop) or providing temporary access to others. Session recording gives an audit
trail. Could complement the WireGuard remote access plan rather than replace it.

**Integration points:**
- Kanidm for authentication (OIDC or LDAP)
- Behind http-proxy or Pomerium for TLS
- SSH access to all hosts in the fleet
- Session recordings could be stored on Garage S3
- Could serve as the "emergency access" path when VPN is unavailable

**Complexity:** Medium. Two components (the native `guacd` proxy daemon plus a Java web
app), typically needs PostgreSQL for connection/user storage (already available). Docker is
the common deployment method but native packaging exists.

**NixOS packaging:** Available in nixpkgs as `guacamole-server` and `guacamole-client`.

---

## CrowdSec

Collaborative intrusion prevention system with crowd-sourced threat intelligence.
Parses logs to detect attack patterns, applies remediation (firewall bans, CAPTCHA),
and shares/receives threat signals from a global community network.

**Why:** Goes beyond fail2ban with behavioral detection, crowd-sourced IP reputation,
and a scenario-based engine. Fits the security hardening plan. The community blocklist
means we benefit from threat intelligence gathered across thousands of deployments.
Could parse SSH logs, HTTP access logs, and other service logs to detect and block
malicious activity.

**Integration points:**
- Could consume logs from Loki or directly from journald/log files
- Firewall bouncer for iptables/nftables remediation
- Caddy bouncer for HTTP-level blocking
- Prometheus metrics exporter for alert integration
- Scenarios available for SSH brute force, HTTP scanning, and more
- Feeds into existing alerting pipeline (Alertmanager -> alerttonotify)

**Complexity:** Medium. Agent (log parser + decision engine) on each host or centralized.
Bouncers (enforcement) on edge hosts. Free community tier includes threat intel access.

**NixOS packaging:** Available in nixpkgs as `crowdsec`.
162
docs/plans/nixos-router.md
Normal file
@@ -0,0 +1,162 @@
# NixOS Router: Replace EdgeRouter

Replace the aging Ubiquiti EdgeRouter (gw, 10.69.10.1) with a NixOS-based router.
The EdgeRouter is suspected to be a throughput bottleneck. A NixOS router integrates
naturally with the existing fleet: same config management, same monitoring pipeline,
same deployment workflow.

## Goals

- Eliminate the EdgeRouter throughput bottleneck
- Full integration with existing monitoring (node-exporter, promtail, Prometheus, Loki)
- Declarative firewall and routing config managed in the flake
- Inter-VLAN routing for all existing subnets
- DHCP server for client subnets
- NetFlow/traffic accounting for future ntopng integration
- Foundation for WireGuard remote access (see remote-access.md)

## Current Network Topology

**Subnets (known VLANs):**

| VLAN/Subnet    | Purpose          | Notable hosts                          |
|----------------|------------------|----------------------------------------|
| 10.69.10.0/24  | Gateway          | gw (10.69.10.1)                        |
| 10.69.12.0/24  | Core services    | nas, pve1, arr jails, restic           |
| 10.69.13.0/24  | Infrastructure   | All NixOS servers (static IPs)         |
| 10.69.22.0/24  | WLAN             | unifi-ctrl                             |
| 10.69.30.0/24  | Workstations     | gunter                                 |
| 10.69.31.0/24  | Media            | media                                  |
| 10.69.99.0/24  | Management       | sw1 (MikroTik CRS326-24G-2S+)          |

**DNS:** ns1 (10.69.13.5) and ns2 (10.69.13.6) handle all resolution. Upstream is
Cloudflare/Google over DoT via Unbound.

**Switch:** MikroTik CRS326-24G-2S+, L2 switching with VLAN trunking. Capable of
L3 routing via RouterOS but not ideal for sustained routing throughput.

## Hardware

Needs a small x86 box with:
- At least 2 NICs (WAN + LAN trunk). Dual 2.5GbE preferred.
- Enough CPU for nftables NAT at line rate (any modern x86 is fine)
- 4-8 GB RAM (plenty for routing + DHCP + NetFlow accounting)
- Low power consumption, fanless preferred for always-on use

Candidates:
- Topton / CWWK mini PC with dual/quad Intel 2.5GbE (~100-150 EUR)
- Protectli Vault (more expensive, ~200-300 EUR, proven in pfSense/OPNsense community)
- Any mini PC with one onboard NIC + one USB 2.5GbE adapter (cheapest, less ideal)

The LAN port would carry a VLAN trunk to the MikroTik switch, with sub-interfaces
for each VLAN. WAN port connects to the ISP uplink.

## NixOS Configuration

### Stability Policy

The router is treated differently from the rest of the fleet:
- **No auto-upgrade:** `system.autoUpgrade.enable = false`
- **No homelab-deploy listener:** `homelab.deploy.enable = false`
- **Manual updates only:** update every few months, test-build first
- **Use `nixos-rebuild boot`:** changes take effect on next deliberate reboot
- **Tier: prod, priority: high:** alerts treated with highest priority

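In the router's host config, the first two points of this policy would amount to a fragment like the following. Both option names are taken from the list above; `homelab.deploy.enable` is this fleet's own option, not an upstream NixOS one.

```nix
{
  # Never pull and apply updates unattended on the router.
  system.autoUpgrade.enable = false;

  # Fleet-specific option named in this plan: keep the deploy listener off
  # so nothing can push a new generation to the router remotely.
  homelab.deploy.enable = false;
}
```

A manual update would then be staged with something like `nixos-rebuild boot --flake .#router01`, assuming the `router01` host name used in the migration plan, and activated by a deliberate reboot.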

### Core Services

**Routing & NAT:**
- `systemd-networkd` for all interface config (consistent with rest of fleet)
- VLAN sub-interfaces on the LAN trunk (one per subnet)
- `networking.nftables` for stateful firewall and NAT
- IP forwarding enabled (`net.ipv4.ip_forward = 1`)
- Masquerade outbound traffic on WAN interface

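The interface side of this list could look roughly like the sketch below, shown for one VLAN only. The trunk interface name `lan0`, the VLAN ID 13 (assumed to match the third octet of the subnet), and the `.1` gateway address are all assumptions. WAN configuration is left out because the uplink type is still an open question.

```nix
{
  networking.useNetworkd = true;

  # Forward IPv4 packets between interfaces.
  boot.kernel.sysctl."net.ipv4.ip_forward" = 1;

  systemd.network = {
    enable = true;

    # One VLAN netdev per subnet; only the infrastructure VLAN is shown.
    netdevs."20-vlan13" = {
      netdevConfig = {
        Kind = "vlan";
        Name = "vlan13";
      };
      vlanConfig.Id = 13; # assumed VLAN ID
    };

    networks = {
      # Trunk port toward the MikroTik switch (hypothetical name lan0).
      # It carries tagged traffic only, so it gets no address itself.
      "10-lan0" = {
        matchConfig.Name = "lan0";
        vlan = [ "vlan13" ];
        networkConfig.LinkLocalAddressing = "no";
      };

      # Gateway address for the infrastructure subnet (assumed to be .1).
      "30-vlan13" = {
        matchConfig.Name = "vlan13";
        address = [ "10.69.13.1/24" ];
      };
    };
  };
}
```

The remaining subnets would repeat the `netdevs` and `networks` pair, which makes a small helper function over a list of VLANs a natural refactor.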

**DHCP:**
- Kea or dnsmasq for DHCP on client subnets (WLAN, workstations, media)
- Infrastructure subnet (10.69.13.0/24) stays static, no DHCP needed
- Static leases for known devices

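If Kea is chosen, a single client subnet might be declared as in this sketch, which uses the nixpkgs `services.kea.dhcp4` module. The interface name `vlan30`, the pool range, the `.1` gateway address, and the reservation (MAC and IP) are hypothetical; only the subnet and the DNS server addresses come from the topology table.

```nix
{
  services.kea.dhcp4 = {
    enable = true;
    settings = {
      # Only answer on the workstation VLAN (assumed interface name).
      interfaces-config.interfaces = [ "vlan30" ];

      lease-database = {
        type = "memfile";
        persist = true;
        name = "/var/lib/kea/dhcp4.leases";
      };
      valid-lifetime = 3600;

      subnet4 = [
        {
          id = 30;
          subnet = "10.69.30.0/24";
          pools = [ { pool = "10.69.30.100 - 10.69.30.200"; } ];
          option-data = [
            { name = "routers"; data = "10.69.30.1"; }
            # ns1 and ns2 from the topology section.
            { name = "domain-name-servers"; data = "10.69.13.5, 10.69.13.6"; }
          ];
          # Static lease for a known device; MAC and IP are placeholders.
          reservations = [
            {
              hw-address = "00:00:5e:00:53:01";
              ip-address = "10.69.30.10";
              hostname = "gunter";
            }
          ];
        }
      ];
    };
  };
}
```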

**Firewall (nftables):**
- Default deny between VLANs
- Explicit allow rules for known cross-VLAN traffic:
  - All subnets → ns1/ns2 (DNS)
  - All subnets → monitoring01 (metrics/logs)
  - Infrastructure → all (management access)
  - Workstations → media, core services
- NAT masquerade on WAN
- Rate limiting on WAN-facing services

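The rule set above could be sketched as one nftables table declared through `networking.nftables.tables`. Interface names (`wan0`, `vlanNN`) and the monitoring01 address are hypothetical, ports for metrics and logs are not narrowed down, and rate limiting and port forwards are omitted.

```nix
{
  networking.nftables.enable = true;

  networking.nftables.tables.router = {
    family = "inet";
    content = ''
      chain forward {
        type filter hook forward priority filter; policy drop;

        # Return traffic for connections that were already allowed.
        ct state established,related accept
        ct state invalid drop

        # All subnets -> ns1/ns2 (DNS)
        ip daddr { 10.69.13.5, 10.69.13.6 } udp dport 53 accept
        ip daddr { 10.69.13.5, 10.69.13.6 } tcp dport 53 accept

        # All subnets -> monitoring01 (hypothetical address, ports not narrowed)
        ip daddr 10.69.13.13 accept

        # Infrastructure -> all (management access)
        iifname "vlan13" accept

        # Workstations -> media, core services
        iifname "vlan30" oifname { "vlan12", "vlan31" } accept

        # Any VLAN -> internet
        iifname "vlan*" oifname "wan0" accept
      }

      chain postrouting {
        type nat hook postrouting priority srcnat; policy accept;

        # NAT masquerade on WAN
        oifname "wan0" masquerade
      }
    '';
  };
}
```

This table only covers forwarded traffic. Traffic addressed to the router itself would still be governed by the regular NixOS firewall settings or by an additional input chain.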

**Traffic Accounting:**
- nftables flow accounting or softflowd for NetFlow export
- Export to future ntopng instance (see new-services.md)

### Monitoring Integration

Since this is a NixOS host in the flake, it gets the standard monitoring stack for free:
- node-exporter for system metrics (CPU, memory, NIC throughput per interface)
- promtail shipping logs to Loki
- Prometheus scrape target auto-registration
- Alertmanager alerts for host-down, high CPU, etc.

Additional router-specific monitoring:
- Per-VLAN interface traffic metrics via node-exporter (automatic for all interfaces)
- NAT connection tracking table size
- WAN uplink status and throughput
- DHCP lease metrics (if Kea, it has a Prometheus exporter)

This is a significant advantage over the EdgeRouter: full observability through
the existing Grafana dashboards and Loki log search, debuggable via the monitoring
MCP tools.

### WireGuard Integration

The remote access plan (remote-access.md) currently proposes a separate `extgw01`
gateway host. With a NixOS router, there's a decision to make:

**Option A:** WireGuard terminates on the router itself. Simplest topology: the
router is already the gateway, so VPN traffic doesn't need extra hops or firewall
rules. But adds complexity to the router, which should stay simple.

**Option B:** Keep extgw01 as a separate host (original plan). Router just routes
traffic to it. Better separation of concerns, router stays minimal.

Recommendation: Start with option B (keep it separate). The router should do routing
and nothing else. WireGuard can move to the router later if extgw01 feels redundant.

## Migration Plan

### Phase 1: Build and lab test
- Acquire hardware
- Create host config in the flake (routing, NAT, DHCP, firewall)
- Test-build on workstation: `nix build .#nixosConfigurations.router01.config.system.build.toplevel`
- Lab test with a temporary setup if possible (two NICs, isolated VLAN)

### Phase 2: Prepare cutover
- Pre-configure the MikroTik switch trunk port for the new router
- Document current EdgeRouter config (port forwarding, NAT rules, DHCP leases)
- Replicate all rules in the NixOS config
- Verify DNS, DHCP, and inter-VLAN routing work in test

### Phase 3: Cutover
- Schedule a maintenance window (brief downtime expected)
- Swap WAN cable from EdgeRouter to new router
- Swap LAN trunk from EdgeRouter to new router
- Verify connectivity from each VLAN
- Verify internet access, DNS resolution, inter-VLAN routing
- Monitor via Prometheus/Loki (immediately available since it's a fleet host)

### Phase 4: Decommission EdgeRouter
- Keep EdgeRouter available as fallback for a few weeks
- Remove `gw` entry from external-hosts.nix, replace with flake-managed host
- Update any references to 10.69.10.1 if the router IP changes

## Open Questions

- **Router IP:** Keep 10.69.10.1 or move to a different address? Each VLAN
  sub-interface needs an IP (the gateway address for that subnet).
- **ISP uplink:** What type of WAN connection? PPPoE, DHCP, static IP?
- **Port forwarding:** What ports are currently forwarded on the EdgeRouter?
  These need to be replicated in nftables.
- **DHCP scope:** Which subnets currently get DHCP from the EdgeRouter vs
  other sources (UniFi controller for WLAN?)?
- **UPnP/NAT-PMP:** Needed for any devices? (gaming consoles, etc.)
- **Hardware preference:** Fanless mini PC budget and preferred vendor?