feat: crash recovery, service_failed monitoring, broker health fields, command expiry sweep

- Add GET /api/clients/crashed endpoint (process_status=crashed or stale heartbeat)
- Add restart_app command action with same lifecycle + lockout as reboot_host
- Scheduler: crash auto-recovery loop (CRASH_RECOVERY_ENABLED flag, lockout, MQTT publish)
- Scheduler: unconditional command expiry sweep per poll cycle (sweep_expired_commands)
- Listener: subscribe to infoscreen/+/service_failed; persist service_failed_at + unit
- Listener: extract broker_connection block from health payload; persist reconnect_count + last_disconnect_at
- DB migration b1c2d3e4f5a6: service_failed_at, service_failed_unit, mqtt_reconnect_count, mqtt_last_disconnect_at on clients
- Add GET /api/clients/service_failed and POST /api/clients/<uuid>/clear_service_failed
- Monitoring overview API: include mqtt_reconnect_count + mqtt_last_disconnect_at per client
- Frontend: orange service-failed alert panel (hidden when empty, auto-refresh, acknowledge ("quittieren") action)
- Frontend: MQTT reconnect count + last disconnect in client detail panel
- MQTT auth hardening: listener/scheduler/server use env credentials; broker enforces allow_anonymous false
- Client command lifecycle foundation: ClientCommand model, reboot_host/shutdown_host, full ACK lifecycle
- Docs: TECH-CHANGELOG, DEV-CHANGELOG, MQTT_EVENT_PAYLOAD_GUIDE, copilot-instructions updated
- Add implementation-plans/, RESTART_VALIDATION_CHECKLIST.md, TODO.md
2026-04-05 10:17:56 +00:00
parent 4d652f0554
commit 03e3c11e90
35 changed files with 2511 additions and 80 deletions
--- a/implementation-plans/server-team-actions.md
+++ b/implementation-plans/server-team-actions.md
@@ -0,0 +1,127 @@
+# Server Team Action Items — Infoscreen Client
+
+This document lists everything the server/infrastructure/frontend team must implement to complete the client integration. The client-side code is production-ready for all items listed here.
+
+---
+
+## 1. MQTT Broker Hardening (prerequisite for everything else)
+
+- Disable anonymous access on the broker.
+- Create one broker account **per client device**:
+  - Username convention: `infoscreen-client-<uuid-prefix>` (e.g. `infoscreen-client-9b8d1856`)
+  - Provision the password to the device `.env` as `MQTT_PASSWORD_BROKER=`
+- Create a **server/publisher account** (e.g. `infoscreen-server`) for all server-side publishes.
+- Enforce ACLs:
+
+| Topic | Publisher |
+|---|---|
+| `infoscreen/{uuid}/commands` | server only |
+| `infoscreen/{uuid}/command` (alias) | server only |
+| `infoscreen/{uuid}/group_id` | server only |
+| `infoscreen/events/{group_id}` | server only |
+| `infoscreen/groups/+/power/intent` | server only |
+| `infoscreen/{uuid}/commands/ack` | client only |
+| `infoscreen/{uuid}/command/ack` | client only |
+| `infoscreen/{uuid}/heartbeat` | client only |
+| `infoscreen/{uuid}/health` | client only |
+| `infoscreen/{uuid}/logs/#` | client only |
+| `infoscreen/{uuid}/service_failed` | client only |
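One way to turn the ACL table above into per-device broker rules is sketched below. It assumes a Mosquitto-style ACL file (`user` / `topic read|write` lines) and that the `<uuid-prefix>` in the username is the first hyphen-separated segment of the client UUID. The helper names and the read rules for the group topics are illustrative, not part of this plan.

```python
# Sketch: render a Mosquitto-style ACL block for one client device.
# Assumption: "write" grants publish and "read" grants subscribe, as in Mosquitto ACL files.

# Topics the client publishes (server must not), relative to infoscreen/{uuid}/
CLIENT_PUBLISH_SUFFIXES = [
    "commands/ack", "command/ack", "heartbeat", "health", "logs/#", "service_failed",
]
# Topics only the server publishes; the client may only read them.
SERVER_PUBLISH_SUFFIXES = ["commands", "command", "group_id"]


def client_username(client_uuid: str) -> str:
    # Assumption: uuid-prefix == first segment of the UUID (e.g. "9b8d1856").
    return f"infoscreen-client-{client_uuid.split('-')[0]}"


def client_acl_block(client_uuid: str) -> str:
    lines = [f"user {client_username(client_uuid)}"]
    for suffix in CLIENT_PUBLISH_SUFFIXES:
        lines.append(f"topic write infoscreen/{client_uuid}/{suffix}")
    for suffix in SERVER_PUBLISH_SUFFIXES:
        lines.append(f"topic read infoscreen/{client_uuid}/{suffix}")
    # Group-level topics are server-published; clients only subscribe.
    lines.append("topic read infoscreen/events/+")
    lines.append("topic read infoscreen/groups/+/power/intent")
    return "\n".join(lines)


if __name__ == "__main__":
    print(client_acl_block("9b8d1856-ff34-4864-a726-12de072d0f77"))
```

The server account (`infoscreen-server`) would get the mirror image: write access to the server-only topics and read access to the client-published ones.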
+
+---
+
+## 2. Reboot / Shutdown Command — Ack Lifecycle
+
+The client publishes ack status updates to two topics per command (canonical and transitional alias):
+- `infoscreen/{uuid}/commands/ack`
+- `infoscreen/{uuid}/command/ack`
+
+**Ack payload schema (v1, frozen):**
+```json
+{
+  "command_id": "07aab032-53c2-45ef-a5a3-6aa58e9d9fae",
+  "status": "accepted | execution_started | completed | failed",
+  "error_code": null,
+  "error_message": null
+}
+```
+
+**Status lifecycle:**
+
+| Status | When | Notes |
+|---|---|---|
+| `accepted` | Command received and validated | Immediate |
+| `execution_started` | Helper invoked | Immediate after accepted |
+| `completed` | Execution confirmed | For `reboot_host`: arrives after reconnect (10–90 s after `execution_started`) |
+| `failed` | Helper returned error | `error_code` and `error_message` will be set |
+
+**Server must:**
+- Track `command_id` through the full lifecycle and update status in DB/UI.
+- Surface `failed` + `error_code` to the operator UI.
+- Expect `reboot_host` `completed` to arrive after a reconnect delay — do not treat the gap as a timeout.
+- Use `expires_at` from the original command to determine when to abandon waiting.
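The tracking rules above could be sketched roughly as follows. `CommandRecord`, `apply_ack`, `should_abandon` and the initial `"sent"` status are hypothetical names, not the actual `ClientCommand` model. Because each ack is published to both the canonical and the alias topic, the sketch ignores duplicate and out-of-order updates.

```python
import json
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional, Union

# Rank of each ack status; a higher rank may replace a lower one, never the reverse.
ORDER = {"accepted": 1, "execution_started": 2, "completed": 3, "failed": 3}
TERMINAL = {"completed", "failed"}


@dataclass
class CommandRecord:
    command_id: str
    expires_at: datetime
    status: str = "sent"  # hypothetical status before the first ack arrives
    error_code: Optional[str] = None
    error_message: Optional[str] = None


def apply_ack(record: CommandRecord, raw_payload: Union[str, bytes]) -> bool:
    """Apply one ack payload; return True only if the record changed."""
    ack = json.loads(raw_payload)
    status = ack.get("status")
    if ack.get("command_id") != record.command_id or status not in ORDER:
        return False
    # Drop duplicates (alias topic), regressions, and anything after a terminal state.
    if record.status in TERMINAL or ORDER[status] <= ORDER.get(record.status, 0):
        return False
    record.status = status
    if status == "failed":
        record.error_code = ack.get("error_code")
        record.error_message = ack.get("error_message")
    return True


def should_abandon(record: CommandRecord, now: datetime) -> bool:
    """True once expires_at has passed without a terminal ack."""
    return record.status not in TERMINAL and now >= record.expires_at


if __name__ == "__main__":
    rec = CommandRecord("07aab032-53c2-45ef-a5a3-6aa58e9d9fae",
                        datetime(2026, 4, 5, 8, 5, tzinfo=timezone.utc))
    apply_ack(rec, '{"command_id": "07aab032-53c2-45ef-a5a3-6aa58e9d9fae", "status": "accepted"}')
    print(rec.status)
```

Keying the abandon decision on `expires_at` rather than a fixed ack timeout means the 10–90 s reconnect gap for `reboot_host` is not treated as a failure.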
+
+---
+
+## 3. Health Dashboard — Broker Connection Fields (Gap 2)
+
+Every `infoscreen/{uuid}/health` payload now includes a `broker_connection` block:
+
+```json
+{
+  "timestamp": "2026-04-05T08:00:00.000000+00:00",
+  "expected_state": { "event_id": 42 },
+  "actual_state": {
+    "process": "display_manager",
+    "pid": 1234,
+    "status": "running"
+  },
+  "broker_connection": {
+    "broker_reachable": true,
+    "reconnect_count": 2,
+    "last_disconnect_at": "2026-04-04T10:30:00Z"
+  }
+}
+```
+
+**Server must:**
+- Display `reconnect_count` and `last_disconnect_at` per device in the health dashboard.
+- Implement alerting heuristic:
+  - **All** clients go silent simultaneously → likely broker outage, not device crash.
+  - **Single** client goes silent → device crash, network failure, or process hang.
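A minimal sketch of the alerting heuristic above, assuming the server keeps a last-seen timestamp per client. The function name, the result labels and the 180-second silence threshold are illustrative assumptions; this plan does not specify them.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, List, Tuple


def classify_silence(
    last_seen: Dict[str, datetime],
    now: datetime,
    silence_after: timedelta = timedelta(seconds=180),  # assumed threshold
) -> Tuple[str, List[str]]:
    """Return (verdict, silent_client_uuids) for the current fleet state."""
    silent = sorted(uuid for uuid, ts in last_seen.items() if now - ts > silence_after)
    if not silent:
        return "ok", []
    # Every client silent at once points at the broker, not the devices.
    # With a one-device fleet the two cases cannot be told apart, so it
    # falls through to the per-client verdict.
    if len(silent) == len(last_seen) and len(last_seen) > 1:
        return "broker_outage_suspected", silent
    return "client_silent", silent


if __name__ == "__main__":
    now = datetime(2026, 4, 5, 8, 0, tzinfo=timezone.utc)
    fleet = {"a": now - timedelta(seconds=10), "b": now - timedelta(minutes=10)}
    print(classify_silence(fleet, now))
```

The `reconnect_count` and `last_disconnect_at` fields can then add context to a `client_silent` verdict, for example a device whose reconnect count climbed shortly before it went quiet.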
+
+---
+
+## 4. Service-Failed MQTT Notification (Gap 3)
+
+When systemd gives up restarting a service after repeated crashes (`StartLimitBurst` exceeded), the client automatically publishes a **retained** message:
+
+**Topic:** `infoscreen/{uuid}/service_failed`
+
+**Payload:**
+```json
+{
+  "event": "service_failed",
+  "unit": "infoscreen-simclient.service",
+  "client_uuid": "9b8d1856-ff34-4864-a726-12de072d0f77",
+  "failed_at": "2026-04-05T08:00:00Z"
+}
+```
+
+**Server must:**
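+- Subscribe to `infoscreen/+/service_failed` on startup. The message is retained, so the broker delivers the last failure on subscribe even if the server was offline when it was published (surviving a broker restart additionally requires broker persistence to be enabled).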
+- Alert the operator immediately when this topic receives a payload.
+- **Clear the retained message** once the device is acknowledged or recovered:
+  ```
+  mosquitto_pub -t "infoscreen/{uuid}/service_failed" -n --retain
+  ```
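The listener side of these steps might look like the sketch below. `parse_service_failed` and `clear_service_failed` are hypothetical helpers. The `publish` argument is assumed to be a callable shaped like paho-mqtt's `Client.publish(topic, payload=None, qos=0, retain=False)`, where a `None` payload sends a zero-length message. The returned keys mirror the `service_failed_at` and `service_failed_unit` columns named in the commit message.

```python
import json
from typing import Callable, Optional


def parse_service_failed(topic: str, payload: bytes) -> Optional[dict]:
    """Extract the fields to persist, or None if the message should be ignored."""
    if not payload:
        # A zero-length payload is the "retained message cleared" signal.
        return None
    parts = topic.split("/")
    if len(parts) != 3 or parts[0] != "infoscreen" or parts[2] != "service_failed":
        return None
    data = json.loads(payload)
    if data.get("event") != "service_failed":
        return None
    return {
        "client_uuid": parts[1],  # taken from the topic, not the payload
        "service_failed_unit": data.get("unit"),
        "service_failed_at": data.get("failed_at"),
    }


def clear_service_failed(publish: Callable[..., object], client_uuid: str) -> None:
    """Clear the retained message by publishing an empty retained payload."""
    publish(f"infoscreen/{client_uuid}/service_failed", payload=None, qos=1, retain=True)


if __name__ == "__main__":
    msg = b'{"event": "service_failed", "unit": "infoscreen-simclient.service", "failed_at": "2026-04-05T08:00:00Z"}'
    print(parse_service_failed("infoscreen/9b8d1856-ff34-4864-a726-12de072d0f77/service_failed", msg))
```

Ignoring empty payloads matters because the server's own clear publish is delivered back to its `infoscreen/+/service_failed` subscription.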
+
+---
+
+## 5. No Server Action Required
+
+These items are fully implemented client-side and require no server changes:
+
+- systemd watchdog (`WatchdogSec=60`) — hangs detected and process restarted automatically.
+- Command deduplication — `command_id` deduplicated with 24-hour TTL.
+- Ack retry backoff — client retries ack publish on broker disconnect until `expires_at`.
+- Mock helper / test mode (`COMMAND_MOCK_REBOOT_IMMEDIATE_COMPLETE`) — development only.