diff --git a/docs/infrastructure.md b/docs/infrastructure.md
new file mode 100644
index 0000000..114fb3d
--- /dev/null
+++ b/docs/infrastructure.md
@@ -0,0 +1,282 @@
+# Homelab Infrastructure
+
+This document describes the physical and virtual infrastructure components that support the NixOS-managed servers in this repository.
+
+## Overview
+
+The homelab consists of several core infrastructure components:
+- **Proxmox VE** - Hypervisor hosting all NixOS VMs
+- **TrueNAS** - Network storage and backup target
+- **Ubiquiti EdgeRouter** - Primary router and gateway
+- **Mikrotik Switch** - Core network switching
+
+All NixOS configurations in this repository run as VMs on Proxmox and rely on these underlying infrastructure components.
+
+## Network Topology
+
+### Subnets
+
+VLAN numbers are based on the third octet of the IP address.
+
+TODO: VLAN naming is currently inconsistent across router/switch/Proxmox configurations. Need to standardize VLAN names and update all device configs to use consistent naming.
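The third-octet convention stated above can be expressed as a small helper. This is a minimal sketch, assuming every homelab subnet is a /24 inside 10.69.0.0/16; the name `vlan_for_ip` and the constant `HOMELAB_NET` are hypothetical and not part of this repository.

```python
import ipaddress

# Assumption: all homelab subnets are /24s carved out of 10.69.0.0/16.
HOMELAB_NET = ipaddress.ip_network("10.69.0.0/16")


def vlan_for_ip(address: str) -> int:
    """Return the VLAN number implied by an address (its third octet).

    Hypothetical helper for illustration only. Raises ValueError for
    malformed addresses and for addresses outside the homelab range.
    """
    ip = ipaddress.ip_address(address)
    if ip not in HOMELAB_NET:
        raise ValueError(f"{address} is not in {HOMELAB_NET}")
    # An IPv4 address packs into four bytes; index 2 is the third octet.
    return ip.packed[2]


if __name__ == "__main__":
    print(vlan_for_ip("10.69.13.5"))   # ns1
    print(vlan_for_ip("10.69.12.75"))  # Proxmox management interface
```

Rejecting addresses outside the assumed range keeps the helper from returning a misleading VLAN number for unrelated networks such as 192.168.100.0/24.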
+
+- `10.69.8.x` - Kubernetes (no longer in use)
+- `10.69.12.x` - Core services
+- `10.69.13.x` - NixOS VMs and core services
+- `10.69.30.x` - Client network 1
+- `10.69.31.x` - Clients network 2
+- `10.69.99.x` - Management network
+
+### Core Network Services
+
+- **Gateway**: Web UI exposed on 10.69.10.1
+- **DNS**: ns1 (10.69.13.5), ns2 (10.69.13.6)
+- **Primary DNS Domain**: `home.2rjus.net`
+
+## Hardware Components
+
+### Proxmox Hypervisor
+
+**Purpose**: Hosts all NixOS VMs defined in this repository
+
+**Hardware**:
+- CPU: AMD Ryzen 9 3900X 12-Core Processor
+- RAM: 96GB (94Gi)
+- Storage: 1TB NVMe SSD (nvme0n1)
+
+**Management**:
+- Web UI: `https://pve1.home.2rjus.net:8006`
+- Cluster: Standalone
+- Version: Proxmox VE 8.4.16 (kernel 6.8.12-18-pve)
+
+**VM Provisioning**:
+- Template VM: ID 9000 (built from `hosts/template2`)
+- See the `/terraform` directory for automated VM deployment using OpenTofu
+
+**Storage**:
+- ZFS pool: `rpool` on NVMe partition (nvme0n1p3)
+  - Total capacity: ~900GB (232GB used, 667GB available)
+  - Configuration: Single disk (no RAID)
+  - Scrub status: Last scrub completed successfully with 0 errors
+
+**Networking**:
+- Management interface: `vmbr0` - 10.69.12.75/24 (VLAN 12 - Core services)
+- Physical interfaces: `enp9s0` (primary), `enp4s0` (unused)
+- VM bridges:
+  - `vmbr0` - Main bridge (bridged to enp9s0)
+  - `vmbr0v8` - VLAN 8 (Kubernetes - deprecated)
+  - `vmbr0v13` - VLAN 13 (NixOS VMs and core services)
+
+### TrueNAS
+
+**Purpose**: Network storage, backup target, media storage
+
+**Hardware**:
+- Model: Custom build
+- CPU: AMD Ryzen 5 5600G with Radeon Graphics
+- RAM: 32GB (31.2 GiB)
+- Disks:
+  - 2x Kingston SA400S37 240GB SSD (boot pool, mirrored)
+  - 2x Seagate ST16000NE000 16TB HDD (hdd-pool mirror-0)
+  - 2x WD WD80EFBX 8TB HDD (hdd-pool mirror-1)
+  - 2x Seagate ST8000VN004 8TB HDD (hdd-pool mirror-2)
+  - 1x NVMe 2TB (nvme-pool, no redundancy)
+
+**Management**:
+- Web UI: `https://nas.home.2rjus.net` (10.69.12.50)
+- Hostname: `nas.home.2rjus.net`
+- Version: TrueNAS-13.0-U6.1 (Core)
+
+**Networking**:
+- Primary interface: `mlxen0` - 10GbE (10Gbase-CX4) connected to sw1
+- IP: 10.69.12.50/24 (VLAN 12 - Core services)
+
+**ZFS Pools**:
+- `boot-pool`: 206GB (mirrored SSDs) - 4% used
+  - Mirror of 2x Kingston 240GB SSDs
+  - Last scrub: No errors
+- `hdd-pool`: 29.1TB total (three 2-way mirror vdevs, 28.4TB used, 658GB free) - 97% capacity
+  - mirror-0: 2x 16TB Seagate ST16000NE000
+  - mirror-1: 2x 8TB WD WD80EFBX
+  - mirror-2: 2x 8TB Seagate ST8000VN004
+  - Last scrub: No errors
+- `nvme-pool`: 1.81TB (single NVMe, 70.4GB used, 1.74TB free) - 3% capacity
+  - Single NVMe drive, no redundancy
+  - Last scrub: No errors
+
+**NFS Exports**:
+- `/mnt/hdd-pool/media` - Media storage (exported to 10.69.0.0/16, used by Jellyfin)
+- `/mnt/hdd-pool/virt/nfs-iso` - ISO storage for Proxmox
+- `/mnt/hdd-pool/virt/kube-prod-pvc` - Kubernetes storage (deprecated)
+
+**Jails**:
+TrueNAS runs several FreeBSD jails for media management and backups:
+- nzbget - Usenet downloader
+- restic-rest - Restic REST server for backups
+- radarr - Movie management
+- sonarr - TV show management
+
+### Ubiquiti EdgeRouter
+
+**Purpose**: Primary router, gateway, firewall, inter-VLAN routing
+
+**Model**: EdgeRouter X 5-Port
+
+**Hardware**:
+- Serial: F09FC20E1A4C
+
+**Management**:
+- SSH: `ssh ubnt@10.69.10.1`
+- Web UI: `https://10.69.10.1`
+- Version: EdgeOS v2.0.9-hotfix.6 (build 5574651, 12/30/22)
+
+**WAN Connection**:
+- Interface: eth0
+- Public IP: 84.213.73.123/20
+- Gateway: 84.213.64.1
+
+**Interface Layout**:
+- **eth0**: WAN (public IP)
+- **eth1**: 10.69.31.1/24 - Clients network 2
+- **eth2**: Unused (down)
+- **eth3**: 10.69.30.1/24 - Client network 1
+- **eth4**: Trunk port to Mikrotik switch (carries all VLANs)
+  - eth4.8: 10.69.8.1/24 - K8S (deprecated)
+  - eth4.10: 10.69.10.1/24 - TRUSTED (management access)
+  - eth4.12: 10.69.12.1/24 - SERVER (Proxmox, TrueNAS, core services)
+  - eth4.13: 10.69.13.1/24 - SVC (NixOS VMs)
+  - eth4.21: 10.69.21.1/24 - CLIENTS
+  - eth4.22: 10.69.22.1/24 - WLAN (wireless clients)
+  - eth4.23: 10.69.23.1/24 - IOT
+  - eth4.99: 10.69.99.1/24 - MGMT (device management)
+
+**Routing**:
+- Default route: 0.0.0.0/0 via 84.213.64.1 (WAN gateway)
+- Static route: 192.168.100.0/24 via eth0
+- All internal VLANs directly connected
+
+**DHCP Servers**:
+Active DHCP pools on all networks:
+- dhcp-8: VLAN 8 (K8S) - 91 addresses
+- dhcp-12: VLAN 12 (SERVER) - 51 addresses
+- dhcp-13: VLAN 13 (SVC) - 41 addresses
+- dhcp-21: VLAN 21 (CLIENTS) - 141 addresses
+- dhcp-22: VLAN 22 (WLAN) - 101 addresses
+- dhcp-23: VLAN 23 (IOT) - 191 addresses
+- dhcp-30: eth3 (Client network 1) - 101 addresses
+- dhcp-31: eth1 (Clients network 2) - 21 addresses
+- dhcp-mgmt: VLAN 99 (MGMT) - 51 addresses
+
+**NAT/Firewall**:
+- Masquerading on WAN interface (eth0)
+
+### Mikrotik Switch
+
+**Purpose**: Core Layer 2/3 switching
+
+**Model**: MikroTik CRS326-24G-2S+ (24x 1GbE + 2x 10GbE SFP+)
+
+**Hardware**:
+- CPU: ARMv7 @ 800MHz
+- RAM: 512MB
+- Uptime: 21+ weeks
+
+**Management**:
+- Hostname: `sw1.home.2rjus.net`
+- SSH access: `ssh admin@sw1.home.2rjus.net` (using gunter SSH key)
+- Management IP: 10.69.99.2/24 (VLAN 99)
+- Version: RouterOS 6.47.10 (long-term)
+
+**VLANs**:
+- VLAN 8: Kubernetes (deprecated)
+- VLAN 12: SERVERS - Core services subnet
+- VLAN 13: SVC - Services subnet
+- VLAN 21: CLIENTS
+- VLAN 22: WLAN - Wireless network
+- VLAN 23: IOT
+- VLAN 99: MGMT - Management network
+
+**Port Layout** (active ports):
+- **ether1**: Uplink to EdgeRouter (trunk, carries all VLANs)
+- **ether11**: virt-mini1 (VLAN 12 - SERVERS)
+- **ether12**: Home Assistant (VLAN 12 - SERVERS)
+- **ether24**: Wireless AP (VLAN 22 - WLAN)
+- **sfp-sfpplus1**: Media server/Jellyfin (VLAN 12) - 10Gbps, 7m copper DAC
+- **sfp-sfpplus2**: TrueNAS (VLAN 12) - 10Gbps, 1m copper DAC
+
+**Bridge Configuration**:
+- All ports bridged to main bridge interface
+- Hardware offloading enabled
+- VLAN filtering enabled on bridge
+
+## Backup & Disaster Recovery
+
+### Backup Strategy
+
+**NixOS VMs**:
+- Declarative configurations in this git repository
+- Secrets: SOPS-encrypted, backed up with repository
+- State/data: Some hosts are backed up to the `nas` host; coverage should be improved and expanded to more hosts.
+
+**Proxmox**:
+- VM backups: Not currently implemented
+
+**Critical Credentials**:
+
+TODO: Document this
+
+- OpenBao root token and unseal keys: _[offline secure storage location]_
+- Proxmox root password: _[secure storage]_
+- TrueNAS admin password: _[secure storage]_
+- Router admin credentials: _[secure storage]_
+
+### Disaster Recovery Procedures
+
+**Total Infrastructure Loss**:
+1. Restore Proxmox from installation media
+2. Restore TrueNAS from installation media, import ZFS pools
+3. Restore network configuration on EdgeRouter and Mikrotik
+4. Rebuild NixOS VMs from this repository using Proxmox template
+5. Restore stateful data from TrueNAS backups
+6. Re-initialize OpenBao and restore from backup if needed
+
+**Individual VM Loss**:
+1. Deploy new VM from template using OpenTofu (`terraform/`)
+2. Run `nixos-rebuild` with appropriate flake configuration
+3. Restore any stateful data from backups
+4. For vault01: follow re-provisioning steps in `docs/vault/auto-unseal.md`
+
+**Network Device Failure**:
+- EdgeRouter: _[config backup location, restoration procedure]_
+- Mikrotik: _[config backup location, restoration procedure]_
+
+## Future Additions
+
+- Additional Proxmox nodes for clustering
+- Proxmox Backup Server
+- Additional TrueNAS for replication
+
+## Maintenance Notes
+
+### Proxmox Updates
+
+- Update schedule: manual
+- Pre-update checklist: yolo
+
+### TrueNAS Updates
+
+- Update schedule: manual
+
+### Network Device Updates
+
+- EdgeRouter: manual
+- Mikrotik: manual
+
+## Monitoring
+
+**Infrastructure Monitoring**:
+
+TODO: Improve monitoring for physical hosts (proxmox, nas)
+TODO: Improve monitoring for networking equipment
+
+All NixOS VMs ship metrics to monitoring01 via node-exporter and logs via Promtail. See `/services/monitoring/` for the observability stack configuration.