# TODO

## MQTT TLS Hardening (Production)

- [ ] Enable TLS listener in `mosquitto/config/mosquitto.conf` (e.g., port 8883) while keeping 1883 only for temporary migration if needed.
- [ ] Generate and deploy server certificate + private key for Mosquitto (CA-signed or internal PKI).
- [ ] Add CA certificate distribution strategy for all clients and services (server, listener, scheduler, external monitors).
- [ ] Set strict file permissions for cert/key material (`chmod 600` for keys, least-privilege ownership).
- [ ] Update Docker Compose MQTT service to mount TLS cert/key/CA paths read-only.
- [ ] Add environment variables for TLS in `.env` / `.env.example`:
  - `MQTT_TLS_ENABLED=true`
  - `MQTT_TLS_CA_CERT=`
  - `MQTT_TLS_CERTFILE=` (if mutual TLS used)
  - `MQTT_TLS_KEYFILE=` (if mutual TLS used)
  - `MQTT_TLS_INSECURE=false`
- [ ] Switch internal services to TLS connection settings and verify authenticated reconnect behavior.
- [ ] Decide policy: TLS-only auth (username/password over TLS) vs mutual TLS + username/password.
- [ ] Disable non-TLS listener (1883) after all clients migrated.
- [ ] Restrict MQTT firewall ingress to trusted source ranges only.
- [ ] Add Mosquitto ACL file for topic-level permissions per role/client type.
- [ ] Add cert rotation process (renewal schedule, rollout, rollback steps).
- [ ] Add monitoring/alerting for certificate expiry and broker auth failures.
- [ ] Add runbook section for external monitoring clients (how to connect with CA validation).
- [ ] Perform a staged rollout (canary group first), then full migration.
- [ ] Document final TLS contract in `MQTT_EVENT_PAYLOAD_GUIDE.md` and deployment docs.

## Client Recovery Paths

### Path 1: Software running → restart via MQTT ✅

- Server-side fully implemented (`restart_app` action, command lifecycle, monitoring panel).
- [ ] Client team: handle `restart_app` action in command handler (soft app restart, no reboot).

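As a starting point for the client-side `restart_app` item, the following is a minimal sketch of a command handler that performs a soft restart by re-executing the current Python process. The payload shape (a JSON object with an `action` field) and the names `handle_command` and `soft_restart` are assumptions for illustration; the real command contract and client language may differ.

```python
import json
import os
import sys
from typing import Callable


def soft_restart() -> None:
    """Replace the current process with a fresh copy of itself (no host reboot)."""
    os.execv(sys.executable, [sys.executable, *sys.argv])


def handle_command(raw_payload: bytes,
                   restart: Callable[[], None] = soft_restart) -> str:
    """Dispatch one command message and return a short status string.

    Assumption: the payload is a JSON object with an "action" field.
    """
    try:
        command = json.loads(raw_payload)
    except (UnicodeDecodeError, json.JSONDecodeError):
        return "rejected: invalid JSON"
    if not isinstance(command, dict):
        return "rejected: not an object"

    action = command.get("action")
    if action == "restart_app":
        restart()  # never returns when the default soft_restart is used
        return "restart requested"
    return f"ignored: unsupported action {action!r}"
```

Passing the restart function as a parameter keeps the handler testable without actually restarting the process. A real client would probably also want to report command status to the server before calling `restart()`.
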
### Path 2: Software crashed → MQTT unavailable

- The robust solution is **systemd `Restart=always`** (or `Restart=on-failure`) on the client device: no server involvement is needed, because the OS init system restarts the process automatically.
- Server detects the crash via missing heartbeat (`process_status=crashed`), records it, and shows it in the monitoring panel. Recovery is confirmed when heartbeats resume.
- [ ] Client team: ensure the infoscreen service unit has `Restart=always` and `RestartSec=` configured in its systemd unit file.
- [ ] Evaluate whether MQTT `clean_session=False` + fixed `client_id` is worth adding for cases where the app crashes but the MQTT connection briefly survives (would allow QoS1 command delivery on reconnect).
- Note: the existing scheduler crash recovery (`reboot_host` via MQTT) is unreliable for a fully crashed app unless the client uses a persistent MQTT session. Revisit if client team enables `clean_session=False`.

### Path 3: OS crashed / hung → power cycle needed (customer-dependent)

- No software-based recovery path is possible when the OS is unresponsive.
- Recovery requires external hardware intervention; options depend on customer infrastructure:
  - Smart plug / PDU with API (e.g., Shelly, Tasmota, APC, Raritan)
  - IPMI / iDRAC / BMC (server-class hardware)
  - CEC power command from another device on the same HDMI chain
  - Wake-on-LAN after a scheduled power-cut (limited applicability)
- [ ] Clarify with customer which hardware is available / acceptable.
- [ ] If a smart plug or PDU API is chosen: design a server-side "hard power cycle" command type and integration (out of scope until hardware is confirmed).
- [ ] Document chosen solution and integrate into monitoring runbook once decided.

## Optional Security Follow-ups

- [ ] Move MQTT credentials to Docker secrets or a vault-backed secret source.
- [ ] Rotate `MQTT_USER`/`MQTT_PASSWORD` on a fixed schedule.
- [ ] Add fail2ban/rate-limiting protections for exposed broker ports.
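
For the Docker secrets item, one possible approach is to have services prefer a mounted secret file and fall back to the existing environment variable during migration. The sketch below assumes the secrets are named after the lowercased variable names (`mqtt_user`, `mqtt_password`) and mounted under Docker's default `/run/secrets` directory; the helper name `read_secret` is hypothetical.

```python
import os
from pathlib import Path
from typing import Optional


def read_secret(name: str, secrets_dir: str = "/run/secrets") -> Optional[str]:
    """Return the secret from a mounted file if present, else from the environment.

    Assumption: the secret file is named after the lowercased variable name.
    """
    secret_file = Path(secrets_dir) / name.lower()
    if secret_file.is_file():
        # Strip the trailing newline that secret files often carry.
        return secret_file.read_text(encoding="utf-8").strip()
    return os.environ.get(name)


mqtt_user = read_secret("MQTT_USER")
mqtt_password = read_secret("MQTT_PASSWORD")
```

The environment fallback lets services move to secrets one at a time; it could be removed once all credentials are delivered as files.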