# TODO

## MQTT TLS Hardening (Production)

- [ ] Enable a TLS listener in `mosquitto/config/mosquitto.conf` (e.g., port 8883); keep 1883 only as a temporary fallback during migration, if needed.
- [ ] Generate and deploy server certificate + private key for Mosquitto (CA-signed or internal PKI).
- [ ] Define a CA certificate distribution strategy for all clients and services (server, listener, scheduler, external monitors).
- [ ] Set strict file permissions for cert/key material (`chmod 600` for keys, least-privilege ownership).
- [ ] Update Docker Compose MQTT service to mount TLS cert/key/CA paths read-only.
- [ ] Add environment variables for TLS in `.env` / `.env.example`:
  - `MQTT_TLS_ENABLED=true`
  - `MQTT_TLS_CA_CERT=<path>`
  - `MQTT_TLS_CERTFILE=<path>` (only if mutual TLS is used)
  - `MQTT_TLS_KEYFILE=<path>` (only if mutual TLS is used)
  - `MQTT_TLS_INSECURE=false`
- [ ] Switch internal services to TLS connection settings and verify authenticated reconnect behavior.
- [ ] Decide the client authentication policy: server-authenticated TLS with username/password, or mutual TLS (client certificates) plus username/password.
- [ ] Disable the non-TLS listener (1883) after all clients have migrated.
- [ ] Restrict MQTT firewall ingress to trusted source ranges only.
- [ ] Add Mosquitto ACL file for topic-level permissions per role/client type.
- [ ] Add cert rotation process (renewal schedule, rollout, rollback steps).
- [ ] Add monitoring/alerting for certificate expiry and broker auth failures.
- [ ] Add runbook section for external monitoring clients (how to connect with CA validation).
- [ ] Perform a staged rollout (canary group first), then full migration.
- [ ] Document final TLS contract in `MQTT_EVENT_PAYLOAD_GUIDE.md` and deployment docs.
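
The `MQTT_TLS_*` variables listed above could be consumed by a client roughly as follows. This is a minimal sketch, not the project's actual implementation: the helper names (`MqttTlsSettings`, `load_tls_settings`, `build_ssl_context`) are hypothetical, and treating `MQTT_TLS_INSECURE=true` as "skip hostname verification only" is an assumption about the intended semantics.

```python
import os
import ssl
from dataclasses import dataclass
from typing import Mapping, Optional


def _as_bool(value: Optional[str]) -> bool:
    """Treat common truthy strings as True; missing or anything else is False."""
    return (value or "").strip().lower() in {"1", "true", "yes", "on"}


@dataclass(frozen=True)
class MqttTlsSettings:
    # Hypothetical container for the MQTT_TLS_* variables from the checklist.
    enabled: bool
    ca_cert: Optional[str]
    certfile: Optional[str]
    keyfile: Optional[str]
    insecure: bool


def load_tls_settings(env: Mapping[str, str] = os.environ) -> MqttTlsSettings:
    """Read the TLS variables; empty strings count as unset."""
    return MqttTlsSettings(
        enabled=_as_bool(env.get("MQTT_TLS_ENABLED")),
        ca_cert=env.get("MQTT_TLS_CA_CERT") or None,
        certfile=env.get("MQTT_TLS_CERTFILE") or None,
        keyfile=env.get("MQTT_TLS_KEYFILE") or None,
        insecure=_as_bool(env.get("MQTT_TLS_INSECURE")),
    )


def build_ssl_context(settings: MqttTlsSettings) -> Optional[ssl.SSLContext]:
    """Return a verifying client-side context, or None when TLS is disabled."""
    if not settings.enabled:
        return None
    if bool(settings.certfile) != bool(settings.keyfile):
        raise ValueError("MQTT_TLS_CERTFILE and MQTT_TLS_KEYFILE must be set together")
    # create_default_context() requires a valid certificate chain and checks
    # the hostname; without cafile it falls back to the system trust store.
    context = ssl.create_default_context(cafile=settings.ca_cert)
    context.minimum_version = ssl.TLSVersion.TLSv1_2
    if settings.certfile:
        # Client certificate for mutual TLS.
        context.load_cert_chain(settings.certfile, settings.keyfile)
    if settings.insecure:
        # Assumption: "insecure" only disables hostname checking; the
        # certificate chain is still validated against the CA.
        context.check_hostname = False
    return context
```

If the services use paho-mqtt (an assumption), the resulting context can be handed to the client with `tls_set_context(context)` before connecting.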

## Client Recovery Paths

### Path 1 — Software running → restart via MQTT ✅
- Server-side fully implemented (`restart_app` action, command lifecycle, monitoring panel).
- [ ] Client team: handle `restart_app` action in command handler (soft app restart, no reboot).

### Path 2 — Software crashed → MQTT unavailable
- The robust solution is **systemd `Restart=always`** (or `Restart=on-failure`) on the client device. It needs no server involvement: the OS init system restarts the process automatically.
- Server detects the crash via missing heartbeat (`process_status=crashed`), records it, and shows it in the monitoring panel. Recovery is confirmed when heartbeats resume.
- [ ] Client team: ensure the infoscreen service unit has `Restart=always` and `RestartSec=<delay>` configured in its systemd unit file.
- [ ] Evaluate whether a persistent MQTT session (`clean_session=False` + fixed `client_id`) is worth adding for cases where the app crashes but the MQTT connection briefly survives; the broker would then queue QoS 1 commands and deliver them on reconnect.
- Note: the existing scheduler crash recovery (`reboot_host` via MQTT) is unreliable for a fully crashed app unless the client uses a persistent MQTT session. Revisit if client team enables `clean_session=False`.
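
The heartbeat-based detection described above can be sketched as follows. This is an illustration under stated assumptions, not the server's actual code: `HeartbeatMonitor`, `ClientHealth`, the 90-second timeout, and the `"running"` status value are hypothetical; only `process_status=crashed` comes from the notes above.

```python
from dataclasses import dataclass
from typing import Dict, List

HEARTBEAT_TIMEOUT_S = 90.0  # hypothetical default; tune to the real heartbeat interval


@dataclass
class ClientHealth:
    last_heartbeat: float
    process_status: str = "running"  # "running" is an assumed status name


class HeartbeatMonitor:
    """Tracks the last heartbeat per client and flags clients that go silent."""

    def __init__(self, timeout_s: float = HEARTBEAT_TIMEOUT_S) -> None:
        self.timeout_s = timeout_s
        self.clients: Dict[str, ClientHealth] = {}

    def on_heartbeat(self, client_id: str, now: float) -> bool:
        """Record a heartbeat; return True if it confirms recovery from a crash."""
        previous = self.clients.get(client_id)
        recovered = previous is not None and previous.process_status == "crashed"
        self.clients[client_id] = ClientHealth(last_heartbeat=now)
        return recovered

    def sweep(self, now: float) -> List[str]:
        """Mark silent clients as crashed; return the IDs that changed state."""
        newly_crashed = []
        for client_id, health in self.clients.items():
            silent_for = now - health.last_heartbeat
            if health.process_status != "crashed" and silent_for > self.timeout_s:
                health.process_status = "crashed"
                newly_crashed.append(client_id)
        return newly_crashed
```

Timestamps are passed in explicitly so the logic can be tested without waiting; in production, `time.monotonic()` would be a natural source for `now`.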

### Path 3 — OS crashed / hung → power cycle needed (customer-dependent)
- No software-based recovery path is possible when the OS is unresponsive.
- Recovery requires external hardware intervention; options depend on customer infrastructure:
  - Smart plug / PDU with API (e.g., Shelly, Tasmota, APC, Raritan)
  - IPMI / iDRAC / BMC (server-class hardware)
  - CEC power command from another device on the same HDMI chain
  - Wake-on-LAN after a scheduled power cut (limited applicability)
- [ ] Clarify with customer which hardware is available / acceptable.
- [ ] If a smart plug or PDU API is chosen: design a server-side "hard power cycle" command type and integration (out of scope until hardware is confirmed).
- [ ] Document chosen solution and integrate into monitoring runbook once decided.
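
Until the hardware is confirmed, a "hard power cycle" command could be designed against a hardware-agnostic interface so that the smart plug or PDU integration can be plugged in later. The sketch below is purely illustrative: `PowerSwitch`, `hard_power_cycle`, and the 10-second off period are hypothetical, and no vendor API is assumed.

```python
import time
from typing import Callable, List, Protocol


class PowerSwitch(Protocol):
    """Hypothetical interface that a smart plug or PDU adapter would implement."""

    def set_power(self, on: bool) -> None: ...


def hard_power_cycle(
    switch: PowerSwitch,
    off_seconds: float = 10.0,  # assumed off period; depends on the device
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Cut power, wait, then restore it even if the wait is interrupted."""
    switch.set_power(False)
    try:
        sleep(off_seconds)
    finally:
        # Always try to restore power so the device is not left switched off.
        switch.set_power(True)


class RecordingSwitch:
    """Test double that records the requested power states."""

    def __init__(self) -> None:
        self.calls: List[bool] = []

    def set_power(self, on: bool) -> None:
        self.calls.append(on)
```

Injecting `sleep` keeps the function testable without real delays; a vendor-specific adapter would only need to provide `set_power`.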

## Optional Security Follow-ups

- [ ] Move MQTT credentials to Docker secrets or a vault-backed secret source.
- [ ] Rotate `MQTT_USER`/`MQTT_PASSWORD` on a fixed schedule.
- [ ] Add fail2ban/rate-limiting protections for exposed broker ports.
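
For the Docker secrets item, one common pattern is to let each service read a credential from a mounted file (Docker mounts secrets under `/run/secrets/<name>`) and fall back to the plain environment variable. A minimal sketch, assuming a hypothetical `<NAME>_FILE` convention such as `MQTT_PASSWORD_FILE`; the helper name `read_secret` is also an assumption:

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def read_secret(name: str, env: Mapping[str, str] = os.environ) -> Optional[str]:
    """Prefer the file named by <NAME>_FILE; otherwise use the variable <NAME>."""
    file_path = env.get(f"{name}_FILE")
    if file_path:
        # Strip only the trailing newline that editors and `echo` tend to add.
        return Path(file_path).read_text(encoding="utf-8").rstrip("\r\n")
    return env.get(name)
```

A service would then call `read_secret("MQTT_PASSWORD")` instead of reading `MQTT_PASSWORD` directly, which keeps the existing `.env` setup working during the transition.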