infoscreen/TODO.md
Olaf 03e3c11e90 feat: crash recovery, service_failed monitoring, broker health fields, command expiry sweep
- Add GET /api/clients/crashed endpoint (process_status=crashed or stale heartbeat)
- Add restart_app command action with same lifecycle + lockout as reboot_host
- Scheduler: crash auto-recovery loop (CRASH_RECOVERY_ENABLED flag, lockout, MQTT publish)
- Scheduler: unconditional command expiry sweep per poll cycle (sweep_expired_commands)
- Listener: subscribe to infoscreen/+/service_failed; persist service_failed_at + unit
- Listener: extract broker_connection block from health payload; persist reconnect_count + last_disconnect_at
- DB migration b1c2d3e4f5a6: service_failed_at, service_failed_unit, mqtt_reconnect_count, mqtt_last_disconnect_at on clients
- Add GET /api/clients/service_failed and POST /api/clients/<uuid>/clear_service_failed
- Monitoring overview API: include mqtt_reconnect_count + mqtt_last_disconnect_at per client
- Frontend: orange service-failed alert panel (hidden when empty, auto-refresh, acknowledge ("quittieren") action)
- Frontend: MQTT reconnect count + last disconnect in client detail panel
- MQTT auth hardening: listener/scheduler/server use env credentials; broker enforces allow_anonymous false
- Client command lifecycle foundation: ClientCommand model, reboot_host/shutdown_host, full ACK lifecycle
- Docs: TECH-CHANGELOG, DEV-CHANGELOG, MQTT_EVENT_PAYLOAD_GUIDE, copilot-instructions updated
- Add implementation-plans/, RESTART_VALIDATION_CHECKLIST.md, TODO.md
2026-04-05 10:17:56 +00:00


TODO

MQTT TLS Hardening (Production)

  • Enable a TLS listener in mosquitto/config/mosquitto.conf (e.g., port 8883); keep 1883 open only temporarily during migration, if needed.
  • Generate and deploy server certificate + private key for Mosquitto (CA-signed or internal PKI).
  • Add CA certificate distribution strategy for all clients and services (server, listener, scheduler, external monitors).
  • Set strict file permissions for cert/key material (chmod 600 for keys, least-privilege ownership).
  • Update Docker Compose MQTT service to mount TLS cert/key/CA paths read-only.
  • Add environment variables for TLS in .env / .env.example:
    • MQTT_TLS_ENABLED=true
    • MQTT_TLS_CA_CERT=<path>
    • MQTT_TLS_CERTFILE=<path> (if mutual TLS used)
    • MQTT_TLS_KEYFILE=<path> (if mutual TLS used)
    • MQTT_TLS_INSECURE=false
  • Switch internal services to TLS connection settings and verify authenticated reconnect behavior.
  • Decide policy: server-authenticated TLS with username/password vs. mutual TLS (client certificates) plus username/password.
  • Disable non-TLS listener (1883) after all clients migrated.
  • Restrict MQTT firewall ingress to trusted source ranges only.
  • Add Mosquitto ACL file for topic-level permissions per role/client type.
  • Add cert rotation process (renewal schedule, rollout, rollback steps).
  • Add monitoring/alerting for certificate expiry and broker auth failures.
  • Add runbook section for external monitoring clients (how to connect with CA validation).
  • Perform a staged rollout (canary group first), then full migration.
  • Document final TLS contract in MQTT_EVENT_PAYLOAD_GUIDE.md and deployment docs.
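A minimal sketch of how the internal services could read the TLS variables listed above and turn them into an `ssl.SSLContext`. The variable names come from this TODO; the helper names (`MqttTlsSettings`, `load_tls_settings`, `build_ssl_context`) and the accepted boolean spellings are assumptions for illustration, not existing project code.

```python
import ssl
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class MqttTlsSettings:
    """Hypothetical container for the MQTT_TLS_* variables from .env."""
    enabled: bool
    ca_cert: Optional[str]
    certfile: Optional[str]
    keyfile: Optional[str]
    insecure: bool

    @property
    def mutual_tls(self) -> bool:
        # Mutual TLS needs both a client certificate and its private key.
        return self.certfile is not None and self.keyfile is not None


def _as_bool(value: Optional[str]) -> bool:
    # Assumption: these spellings count as "true"; everything else is false.
    return (value or "").strip().lower() in {"1", "true", "yes", "on"}


def load_tls_settings(env: Mapping[str, str]) -> MqttTlsSettings:
    certfile = env.get("MQTT_TLS_CERTFILE") or None
    keyfile = env.get("MQTT_TLS_KEYFILE") or None
    if (certfile is None) != (keyfile is None):
        raise ValueError(
            "MQTT_TLS_CERTFILE and MQTT_TLS_KEYFILE must be set together"
        )
    return MqttTlsSettings(
        enabled=_as_bool(env.get("MQTT_TLS_ENABLED")),
        ca_cert=env.get("MQTT_TLS_CA_CERT") or None,
        certfile=certfile,
        keyfile=keyfile,
        insecure=_as_bool(env.get("MQTT_TLS_INSECURE")),
    )


def build_ssl_context(settings: MqttTlsSettings) -> Optional[ssl.SSLContext]:
    """Return a client-side context, or None when TLS is disabled."""
    if not settings.enabled:
        return None
    # With cafile=None the system trust store is used instead of a custom CA.
    context = ssl.create_default_context(cafile=settings.ca_cert)
    if settings.mutual_tls:
        context.load_cert_chain(settings.certfile, settings.keyfile)
    if settings.insecure:
        # Skips hostname verification only; the certificate chain is still checked.
        context.check_hostname = False
    return context
```

If the services use paho-mqtt (an assumption; the TODO does not name the client library), the resulting context could be passed to `Client.tls_set_context()` before connecting to port 8883.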

Client Recovery Paths

Path 1 — Software running → restart via MQTT

  • Server-side fully implemented (restart_app action, command lifecycle, monitoring panel).
  • Client team: handle restart_app action in command handler (soft app restart, no reboot).
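For the client-side item above, a rough sketch of a command dispatcher that accepts `restart_app` and produces an acknowledgement. The payload and ACK field names (`command_id`, `action`, `status`, `error`) and the status values are hypothetical; the actual contract is whatever MQTT_EVENT_PAYLOAD_GUIDE.md specifies.

```python
import json
from typing import Callable, Dict


def handle_command(raw_payload: bytes,
                   actions: Dict[str, Callable[[], None]]) -> dict:
    """Dispatch one command message and return an ACK dict to publish.

    `actions` maps an action name (e.g. "restart_app") to a callable.
    All field names here are illustrative, not the real payload contract.
    """
    command = json.loads(raw_payload)
    action = command.get("action")
    ack = {"command_id": command.get("command_id"), "action": action}

    handler = actions.get(action)
    if handler is None:
        # Unknown actions are rejected rather than silently ignored.
        ack["status"] = "rejected"
        ack["error"] = f"unsupported action: {action}"
        return ack

    try:
        handler()
    except Exception as exc:  # report any handler failure back to the server
        ack["status"] = "failed"
        ack["error"] = str(exc)
    else:
        ack["status"] = "done"
    return ack
```

Because a soft restart ends the process, the `restart_app` callable would likely only schedule the restart, so that the ACK can still be published before the app goes down.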

Path 2 — Software crashed → MQTT unavailable

  • The robust solution is systemd Restart=always (or Restart=on-failure) on the client device: the OS init system restarts the process automatically, with no server involvement.
  • Server detects the crash via missing heartbeat (process_status=crashed), records it, and shows it in the monitoring panel. Recovery is confirmed when heartbeats resume.
  • Client team: ensure the infoscreen service unit has Restart=always and RestartSec=<delay> configured in its systemd unit file.
  • Evaluate whether MQTT clean_session=False + fixed client_id is worth adding for cases where the app crashes but the MQTT connection briefly survives (would allow QoS1 command delivery on reconnect).
  • Note: the existing scheduler crash recovery (reboot_host via MQTT) is unreliable for a fully crashed app unless the client uses a persistent MQTT session. Revisit if client team enables clean_session=False.
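The server-side detection described in this path can be sketched as a pure function: a client counts as crashed when it reports process_status=crashed or when its last heartbeat is older than a threshold, and recovery is confirmed once a heartbeat arrives after the recorded crash. The 180-second threshold, the function names, and the handling of never-seen clients are assumptions, not values from the project.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Assumption: the real staleness threshold is not stated in this TODO.
STALE_AFTER = timedelta(seconds=180)


def is_crashed(process_status: Optional[str],
               last_heartbeat: Optional[datetime],
               now: datetime,
               stale_after: timedelta = STALE_AFTER) -> bool:
    """True if the client reported a crash or its heartbeat is stale."""
    if process_status == "crashed":
        return True
    if last_heartbeat is None:
        # Assumption: a client that never sent a heartbeat is "unknown",
        # not "crashed".
        return False
    return now - last_heartbeat > stale_after


def recovery_confirmed(crashed_at: datetime,
                       last_heartbeat: Optional[datetime]) -> bool:
    """Recovery is confirmed when a heartbeat arrives after the crash."""
    return last_heartbeat is not None and last_heartbeat > crashed_at


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    print(is_crashed("running", now - timedelta(minutes=10), now))
```

Keeping the check free of database or MQTT access makes it easy to unit-test and to reuse in both the crashed-clients endpoint and the scheduler loop.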

Path 3 — OS crashed / hung → power cycle needed (customer-dependent)

  • No software-based recovery path is possible when the OS is unresponsive.
  • Recovery requires external hardware intervention; options depend on customer infrastructure:
    • Smart plug / PDU with API (e.g., Shelly, Tasmota, APC, Raritan)
    • IPMI / iDRAC / BMC (server-class hardware)
    • CEC power command from another device on the same HDMI chain
    • Wake-on-LAN after a scheduled power-cut (limited applicability)
  • Clarify with customer which hardware is available / acceptable.
  • If a smart plug or PDU API is chosen: design a server-side "hard power cycle" command type and integration (out of scope until hardware is confirmed).
  • Document chosen solution and integrate into monitoring runbook once decided.

Optional Security Follow-ups

  • Move MQTT credentials to Docker secrets or a vault-backed secret source.
  • Rotate MQTT_USER/MQTT_PASSWORD on a fixed schedule.
  • Add fail2ban/rate-limiting protections for exposed broker ports.
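As an illustration of the first follow-up, a small sketch of a credential lookup that prefers a Docker secret file and falls back to the environment. Docker mounts secrets under /run/secrets/<secret_name> by default; the helper name `read_secret` and the lowercase secret naming (e.g. `mqtt_password`) are assumptions.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def read_secret(name: str,
                env: Mapping[str, str] = os.environ,
                secrets_dir: Path = Path("/run/secrets")) -> Optional[str]:
    """Return the secret from a mounted file, else from the environment.

    Assumption: the secret file is named after the lowercase variable
    name, e.g. MQTT_PASSWORD -> /run/secrets/mqtt_password.
    """
    secret_file = Path(secrets_dir) / name.lower()
    if secret_file.is_file():
        # Strip the trailing newline that editors and `echo` tend to add.
        return secret_file.read_text(encoding="utf-8").strip()
    return env.get(name)
```

With this fallback order, services keep working from .env during the transition and pick up the secret file as soon as it is mounted, e.g. `read_secret("MQTT_PASSWORD")`.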