- Command intake (reboot/shutdown) on infoscreen/{uuid}/commands with ack lifecycle
- MQTT_USER/MQTT_PASSWORD_BROKER split from identity vars; configure_mqtt_security() updated
- infoscreen-simclient.service: Type=notify, WatchdogSec=60, Restart=on-failure
- infoscreen-notify-failure@.service + script: retained MQTT alert when systemd gives up (Gap 3)
- _sd_notify() watchdog keepalive in simclient main loop (Gap 1)
- broker_connection block in health payload: reconnect_count, last_disconnect_at (Gap 2)
- COMMAND_MOCK_REBOOT_IMMEDIATE_COMPLETE canary flag with safety guard
- SERVER_TEAM_ACTIONS.md: server-side integration action items
- Docs: README, CHANGELOG, src/README, copilot-instructions updated
- 43 tests passing
# Server Team Action Items — Infoscreen Client

This document lists everything the server/infrastructure/frontend team must implement to complete the client integration. The client-side code is production-ready for all items listed here.

---

## 1. MQTT Broker Hardening (prerequisite for everything else)

- Disable anonymous access on the broker.
- Create one broker account **per client device**:
  - Username convention: `infoscreen-client-<uuid-prefix>` (e.g. `infoscreen-client-9b8d1856`)
  - Provision the password to the device `.env` as `MQTT_PASSWORD_BROKER=`
- Create a **server/publisher account** (e.g. `infoscreen-server`) for all server-side publishes.
- Enforce ACLs:

| Topic | Publisher |
|---|---|
| `infoscreen/{uuid}/commands` | server only |
| `infoscreen/{uuid}/command` (alias) | server only |
| `infoscreen/{uuid}/group_id` | server only |
| `infoscreen/events/{group_id}` | server only |
| `infoscreen/groups/+/power/intent` | server only |
| `infoscreen/{uuid}/commands/ack` | client only |
| `infoscreen/{uuid}/command/ack` | client only |
| `infoscreen/{uuid}/heartbeat` | client only |
| `infoscreen/{uuid}/health` | client only |
| `infoscreen/{uuid}/logs/#` | client only |
| `infoscreen/{uuid}/service_failed` | client only |
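The ACL table above can be turned into per-device broker rules. The following is a minimal sketch, assuming a Mosquitto-style ACL file made of `user` and `topic read|write` lines. The helper names `client_username` and `render_client_acl` are hypothetical, the read rules assume the client subscribes to every server-published topic (including all `infoscreen/events/+` groups), and the UUID prefix is assumed to be the first dash-separated segment of the device UUID, which matches the example username.

```python
# Topics the client may publish to, copied from the "client only" rows.
CLIENT_PUBLISH_TOPICS = [
    "infoscreen/{uuid}/commands/ack",
    "infoscreen/{uuid}/command/ack",
    "infoscreen/{uuid}/heartbeat",
    "infoscreen/{uuid}/health",
    "infoscreen/{uuid}/logs/#",
    "infoscreen/{uuid}/service_failed",
]

# Assumption: the client needs read access to every "server only" topic.
CLIENT_SUBSCRIBE_TOPICS = [
    "infoscreen/{uuid}/commands",
    "infoscreen/{uuid}/command",
    "infoscreen/{uuid}/group_id",
    "infoscreen/events/+",
    "infoscreen/groups/+/power/intent",
]


def client_username(uuid: str) -> str:
    """Build the broker username (assumes prefix = first UUID segment)."""
    return f"infoscreen-client-{uuid.split('-')[0]}"


def render_client_acl(uuid: str) -> str:
    """Render the ACL-file entries for a single client device."""
    lines = [f"user {client_username(uuid)}"]
    lines += [f"topic write {t.format(uuid=uuid)}" for t in CLIENT_PUBLISH_TOPICS]
    lines += [f"topic read {t.format(uuid=uuid)}" for t in CLIENT_SUBSCRIBE_TOPICS]
    return "\n".join(lines)


if __name__ == "__main__":
    print(render_client_acl("9b8d1856-ff34-4864-a726-12de072d0f77"))
```

Generating the block per device keeps the write rules scoped to that device's own `{uuid}` subtree, so one compromised client cannot publish acks or health data on behalf of another.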

---

## 2. Reboot / Shutdown Command — Ack Lifecycle

Client publishes ack status updates to two topics per command (canonical + transitional alias):
- `infoscreen/{uuid}/commands/ack`
- `infoscreen/{uuid}/command/ack`

**Ack payload schema (v1, frozen):**
```json
{
  "command_id": "07aab032-53c2-45ef-a5a3-6aa58e9d9fae",
  "status": "accepted | execution_started | completed | failed",
  "error_code": null,
  "error_message": null
}
```
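On the server side, each incoming ack can be checked against this schema before it touches the database. The following is a rough sketch only: `Ack` and `parse_ack` are hypothetical names, and the sketch assumes `error_code` and `error_message` are strings when they are set, which the schema does not state.

```python
import json
from dataclasses import dataclass
from typing import Optional, Union

# The four status values listed in the v1 ack schema.
ACK_STATUSES = {"accepted", "execution_started", "completed", "failed"}


@dataclass(frozen=True)
class Ack:
    command_id: str
    status: str
    error_code: Optional[str] = None     # assumed to be a string when set
    error_message: Optional[str] = None  # assumed to be a string when set


def parse_ack(raw: Union[bytes, str]) -> Ack:
    """Decode an ack payload and reject anything outside the v1 schema."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("ack payload must be a JSON object")
    command_id = data.get("command_id")
    status = data.get("status")
    if not isinstance(command_id, str) or not command_id:
        raise ValueError("ack is missing command_id")
    if not isinstance(status, str) or status not in ACK_STATUSES:
        raise ValueError(f"unknown ack status: {status!r}")
    return Ack(command_id, status, data.get("error_code"), data.get("error_message"))
```

Because the client publishes every ack to both the canonical and the alias topic, the server should expect to parse the same `command_id`/`status` pair twice.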

**Status lifecycle:**

| Status | When | Notes |
|---|---|---|
| `accepted` | Command received and validated | Immediate |
| `execution_started` | Helper invoked | Immediately after `accepted` |
| `completed` | Execution confirmed | For `reboot_host`: arrives after reconnect (10–90 s after `execution_started`) |
| `failed` | Helper returned error | `error_code` and `error_message` will be set |

**Server must:**
- Track `command_id` through the full lifecycle and update status in DB/UI.
- Surface `failed` + `error_code` to the operator UI.
- Expect `reboot_host` `completed` to arrive after a reconnect delay; do not treat the gap as a timeout.
- Use `expires_at` from the original command to determine when to abandon waiting.
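The tracking requirements above could be sketched as a small in-memory state tracker. This is an illustration under stated assumptions, not the server's actual design: `CommandTracker`, the `pending` and `expired` labels, and the in-memory dict are all hypothetical, and statuses are assumed to have been validated already.

```python
from datetime import datetime

# Assumed ordering of the lifecycle; "pending" is a hypothetical server-side
# label for a command that has been sent but not yet acknowledged.
_RANK = {"pending": 0, "accepted": 1, "execution_started": 2,
         "completed": 3, "failed": 3}
_TERMINAL = {"completed", "failed", "expired"}


class CommandTracker:
    def __init__(self) -> None:
        self._commands: dict = {}  # command_id -> tracking record

    def register(self, command_id: str, expires_at: datetime) -> None:
        """Record a command at publish time, with its expires_at."""
        self._commands[command_id] = {
            "status": "pending", "expires_at": expires_at, "error_code": None}

    def apply_ack(self, command_id: str, status: str, error_code=None) -> bool:
        """Apply an ack; return True only if the stored status advanced."""
        entry = self._commands.get(command_id)
        if entry is None or entry["status"] in _TERMINAL:
            return False
        if _RANK[status] <= _RANK[entry["status"]]:
            return False  # duplicate (e.g. from the alias topic) or stale ack
        entry["status"] = status
        entry["error_code"] = error_code
        return True

    def expire(self, now: datetime) -> list:
        """Mark unfinished commands past expires_at as expired."""
        expired = []
        for command_id, entry in self._commands.items():
            if entry["status"] not in _TERMINAL and now >= entry["expires_at"]:
                entry["status"] = "expired"
                expired.append(command_id)
        return expired

    def status(self, command_id: str) -> str:
        return self._commands[command_id]["status"]
```

Only `expires_at` ends the wait in this sketch, so the 10–90 s silence between `execution_started` and `completed` during a reboot never counts as a timeout on its own.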

---

## 3. Health Dashboard — Broker Connection Fields (Gap 2)

Every `infoscreen/{uuid}/health` payload now includes a `broker_connection` block:

```json
{
  "timestamp": "2026-04-05T08:00:00.000000+00:00",
  "expected_state": { "event_id": 42 },
  "actual_state": {
    "process": "display_manager",
    "pid": 1234,
    "status": "running"
  },
  "broker_connection": {
    "broker_reachable": true,
    "reconnect_count": 2,
    "last_disconnect_at": "2026-04-04T10:30:00Z"
  }
}
```

**Server must:**
- Display `reconnect_count` and `last_disconnect_at` per device in the health dashboard.
- Implement alerting heuristic:
  - **All** clients go silent simultaneously → likely broker outage, not device crash.
  - **Single** client goes silent → device crash, network failure, or process hang.
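The alerting heuristic above might look roughly like the following. This is a sketch under assumptions: `classify_silence`, the result labels, and the 180-second silence threshold are hypothetical, and `last_seen` is assumed to map each device UUID to the time of its most recent heartbeat or health message.

```python
from datetime import datetime, timedelta


def classify_silence(last_seen: dict, now: datetime,
                     timeout: timedelta = timedelta(seconds=180)):
    """Return (verdict, silent_uuids) for the current fleet state."""
    silent = sorted(uuid for uuid, seen in last_seen.items()
                    if now - seen > timeout)
    if not silent:
        return "ok", []
    # Every known client silent at once points at the broker, not the devices.
    # With a single-device fleet the two cases cannot be told apart, so that
    # case falls through to a device alert.
    if len(last_seen) > 1 and len(silent) == len(last_seen):
        return "suspected_broker_outage", silent
    return "device_alert", silent
```

A partial outage (several clients silent, but not all) is reported per device here; whether to treat that as a network-segment problem is left to the server team.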

---

## 4. Service-Failed MQTT Notification (Gap 3)

When systemd gives up restarting a service after repeated crashes (`StartLimitBurst` exceeded), the client automatically publishes a **retained** message:

**Topic:** `infoscreen/{uuid}/service_failed`

**Payload:**
```json
{
  "event": "service_failed",
  "unit": "infoscreen-simclient.service",
  "client_uuid": "9b8d1856-ff34-4864-a726-12de072d0f77",
  "failed_at": "2026-04-05T08:00:00Z"
}
```

**Server must:**
- Subscribe to `infoscreen/+/service_failed` on startup (the message is retained, so the broker delivers it to new subscribers; surviving a broker restart additionally requires broker persistence to be enabled).
- Alert the operator immediately when this topic receives a payload.
- **Clear the retained message** once the device is acknowledged or recovered:
  ```
  mosquitto_pub -t "infoscreen/{uuid}/service_failed" -n --retain
  ```
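A server-side handler for this topic could be sketched as follows. The function names are hypothetical, and `alert` and `publish` are injected callables rather than a specific MQTT library; `publish` is assumed to accept `(topic, payload, retain=...)`, a call shape that paho-mqtt's `Client.publish` also supports. Publishing a zero-length retained payload is the standard MQTT way to clear a retained message, which is what the `mosquitto_pub -n --retain` command above does.

```python
import json


def uuid_from_topic(topic: str) -> str:
    """Extract {uuid} from infoscreen/{uuid}/service_failed."""
    parts = topic.split("/")
    if len(parts) != 3 or parts[0] != "infoscreen" or parts[2] != "service_failed":
        raise ValueError(f"unexpected topic: {topic}")
    return parts[1]


def handle_service_failed(topic: str, payload: bytes, alert) -> bool:
    """Raise an operator alert; return False for the empty 'cleared' message."""
    if not payload:
        # Clearing the retained message is itself delivered to subscribers
        # as an empty payload, so it must not trigger a new alert.
        return False
    event = json.loads(payload)
    alert(uuid_from_topic(topic), event["unit"], event["failed_at"])
    return True


def clear_service_failed(publish, uuid: str) -> None:
    """Clear the retained alert once the device is acknowledged or recovered."""
    publish(f"infoscreen/{uuid}/service_failed", b"", retain=True)
```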

---

## 5. No Server Action Required

These items are fully implemented client-side and require no server changes:

- systemd watchdog (`WatchdogSec=60`): hangs are detected and the process is restarted automatically.
- Command deduplication: each `command_id` is deduplicated with a 24-hour TTL.
- Ack retry backoff: the client retries the ack publish after a broker disconnect until `expires_at`.
- Mock helper / test mode (`COMMAND_MOCK_REBOOT_IMMEDIATE_COMPLETE`): development only.
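For orientation only, the deduplication behaviour listed above can be illustrated with a small TTL cache. This is not the client's actual code: `CommandDeduplicator` is a hypothetical name, and the sketch keeps its state in memory, whereas how the real client stores seen IDs (for example across a reboot) is not described here.

```python
import time


class CommandDeduplicator:
    """Remember command IDs for a fixed TTL (24 hours by default)."""

    def __init__(self, ttl_seconds: float = 24 * 3600, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._seen: dict = {}  # command_id -> time first seen

    def seen_before(self, command_id: str) -> bool:
        """Return True for a repeat within the TTL, else record the ID."""
        now = self._clock()
        # Drop entries older than the TTL before checking.
        self._seen = {cid: t for cid, t in self._seen.items()
                      if now - t < self._ttl}
        if command_id in self._seen:
            return True
        self._seen[command_id] = now
        return False
```

The practical consequence for the server is that re-sending a command with the same `command_id` within 24 hours will not execute it a second time.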