feat: crash recovery, service_failed monitoring, broker health fields, command expiry sweep
- Add GET /api/clients/crashed endpoint (process_status=crashed or stale heartbeat)
- Add restart_app command action with same lifecycle + lockout as reboot_host
- Scheduler: crash auto-recovery loop (CRASH_RECOVERY_ENABLED flag, lockout, MQTT publish)
- Scheduler: unconditional command expiry sweep per poll cycle (sweep_expired_commands)
- Listener: subscribe to infoscreen/+/service_failed; persist service_failed_at + unit
- Listener: extract broker_connection block from health payload; persist reconnect_count + last_disconnect_at
- DB migration b1c2d3e4f5a6: service_failed_at, service_failed_unit, mqtt_reconnect_count, mqtt_last_disconnect_at on clients
- Add GET /api/clients/service_failed and POST /api/clients/<uuid>/clear_service_failed
- Monitoring overview API: include mqtt_reconnect_count + mqtt_last_disconnect_at per client
- Frontend: orange service-failed alert panel (hidden when empty, auto-refresh, acknowledge ("quittieren") action)
- Frontend: MQTT reconnect count + last disconnect in client detail panel
- MQTT auth hardening: listener/scheduler/server use env credentials; broker enforces allow_anonymous false
- Client command lifecycle foundation: ClientCommand model, reboot_host/shutdown_host, full ACK lifecycle
- Docs: TECH-CHANGELOG, DEV-CHANGELOG, MQTT_EVENT_PAYLOAD_GUIDE, copilot-instructions updated
- Add implementation-plans/, RESTART_VALIDATION_CHECKLIST.md, TODO.md
@@ -50,6 +50,91 @@ Contract notes:
- Heartbeat republishes keep `intent_id` stable while refreshing `issued_at` and `expires_at`.
- Expiry is poll-based: `max(3 x poll_interval_sec, 90)`.

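
The expiry rule in the contract notes above can be sketched as a small helper. The function name `expiry_window_sec` is hypothetical, and reading the result as seconds is an assumption based on the `_sec` suffix of `poll_interval_sec`.

```python
def expiry_window_sec(poll_interval_sec: int) -> int:
    """Expiry window per the contract note: max(3 x poll_interval_sec, 90).

    Assumption: both the input and the 90 floor are in seconds.
    """
    return max(3 * poll_interval_sec, 90)
```

With this rule, short poll intervals are floored at 90, while longer ones scale to three missed polls.
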
### Service-Failed Notification (client → server, retained)
- **Topic**: `infoscreen/{uuid}/service_failed`
- **QoS**: 1
- **Retained**: Yes
- **Direction**: client → server
- **Purpose**: Client signals that systemd has exhausted restart attempts (`StartLimitBurst` exceeded); manual intervention is required.

Example payload:

```json
{
  "event": "service_failed",
  "unit": "infoscreen-simclient.service",
  "client_uuid": "9b8d1856-ff34-4864-a726-12de072d0f77",
  "failed_at": "2026-04-05T08:00:00Z"
}
```

Contract notes:
- Message is retained so the server receives it even after a broker restart.
- Server persists `service_failed_at` and `service_failed_unit` to the `clients` table.
- To clear after resolution, call `POST /api/clients/<uuid>/clear_service_failed`, which clears the DB flag and publishes an empty retained payload to delete the retained message from the broker.
- An empty payload (zero bytes) on this topic is a retain-clear in transit; the listener ignores it.

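
The listener-side handling described in the contract notes above could look roughly like the following. This is a minimal sketch, not the project's listener: `parse_service_failed` and `clear_retained` are hypothetical names, taking the client UUID from the topic (rather than the payload) is an assumption, and `publish` stands in for whatever MQTT publish callable the server uses.

```python
import json
import re
from typing import Callable, Optional

# Topic shape from the section above: infoscreen/{uuid}/service_failed
_TOPIC_RE = re.compile(r"^infoscreen/([^/]+)/service_failed$")


def parse_service_failed(topic: str, payload: bytes) -> Optional[dict]:
    """Return the fields to persist, or None if the message should be ignored."""
    match = _TOPIC_RE.match(topic)
    if match is None:
        return None
    if not payload:
        # Empty payload is a retain-clear in transit; nothing to persist.
        return None
    try:
        data = json.loads(payload.decode("utf-8"))
    except (UnicodeDecodeError, json.JSONDecodeError):
        # Assumption: malformed payloads are dropped rather than raised.
        return None
    if not isinstance(data, dict) or data.get("event") != "service_failed":
        return None
    return {
        "client_uuid": match.group(1),
        "service_failed_at": data.get("failed_at"),
        "service_failed_unit": data.get("unit"),
    }


def clear_retained(publish: Callable[..., object], client_uuid: str) -> None:
    """Publish a zero-length retained message, which removes the retained one."""
    publish(f"infoscreen/{client_uuid}/service_failed", b"", qos=1, retain=True)
```

Keeping the parsing free of any MQTT client dependency makes the ignore-on-empty rule easy to unit test; the returned dict maps directly onto the two `clients` columns named above.
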
### Client Command Intent (Phase 1)
- **Topic**: `infoscreen/{uuid}/commands`
- **QoS**: 1
- **Retained**: No
- **Format**: JSON object
- **Purpose**: Per-client control commands (currently host reboot, host shutdown, and app restart)

Compatibility note:
- During the restart transition, the server also publishes the legacy restart command to `clients/{uuid}/restart` with payload `{ "action": "restart" }`.
- During the topic naming transition, the server also publishes the command payload to `infoscreen/{uuid}/command`.

Example payload:

```json
{
  "schema_version": "1.0",
  "command_id": "5d1f8b4b-7e85-44fb-8f38-3f5d5da5e2e4",
  "client_uuid": "9b8d1856-ff34-4864-a726-12de072d0f77",
  "action": "reboot_host",
  "issued_at": "2026-04-03T12:48:10Z",
  "expires_at": "2026-04-03T12:52:10Z",
  "requested_by": 1,
  "reason": "operator_request"
}
```

Contract notes:
- Clients must reject stale commands where local UTC time is greater than `expires_at`.
- Clients must deduplicate by `command_id` and never execute the same command twice.
- `schema_version` is required for forward-compatibility.
- Allowed command action values in v1: `reboot_host`, `shutdown_host`, `restart_app`.
- `restart_app` = soft app restart (no OS reboot); `reboot_host` = full OS reboot.
- API mapping for operators: restart endpoint emits `reboot_host`; shutdown endpoint emits `shutdown_host`.

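
The client-side checks in the contract notes above (required `schema_version`, allowed actions, deduplication, expiry) can be sketched as one gate function. This is an illustration under assumptions: `should_execute`, `parse_utc`, and the reason strings are hypothetical, and the in-memory `seen_ids` set stands in for whatever store the client actually uses.

```python
from datetime import datetime, timezone

ALLOWED_ACTIONS = {"reboot_host", "shutdown_host", "restart_app"}


def parse_utc(ts: str) -> datetime:
    # Example timestamps end in "Z"; normalise so fromisoformat accepts them.
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))


def should_execute(command: dict, seen_ids: set, now: datetime) -> tuple:
    """Return (ok, reason); records command_id in seen_ids when ok is True."""
    if "schema_version" not in command:
        return False, "missing_schema_version"
    if command.get("action") not in ALLOWED_ACTIONS:
        return False, "unknown_action"
    command_id = command.get("command_id")
    expires_at = command.get("expires_at")
    if not command_id or not expires_at:
        return False, "missing_field"
    if command_id in seen_ids:
        return False, "duplicate"
    if now > parse_utc(expires_at):
        return False, "expired"
    seen_ids.add(command_id)
    return True, "ok"
```

A caller would pass `datetime.now(timezone.utc)` as `now`. Note that a plain in-memory set is lost when the process or host restarts, so a real client executing `reboot_host` would likely need to persist seen IDs to honour the deduplication rule across the reboot.
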
### Client Command Acknowledgements (Phase 1)
- **Topic**: `infoscreen/{uuid}/commands/ack`
- **QoS**: 1 (recommended)
- **Retained**: No
- **Format**: JSON object
- **Purpose**: Client reports command lifecycle progression back to server

Compatibility note:
- During topic naming transition, listener also accepts acknowledgements from `infoscreen/{uuid}/command/ack`.

Example payload:

```json
{
  "command_id": "5d1f8b4b-7e85-44fb-8f38-3f5d5da5e2e4",
  "status": "execution_started",
  "error_code": null,
  "error_message": null
}
```

Allowed `status` values:
- `accepted`
- `execution_started`
- `completed`
- `failed`

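
As a rough illustration of the acknowledgement contract above, the sketch below builds an ack payload on the client side and extracts the client UUID from either the current or the transitional ack topic on the listener side. The names `build_ack` and `ack_client_uuid` are hypothetical and not taken from the codebase.

```python
import json
import re
from typing import Optional

ALLOWED_STATUSES = {"accepted", "execution_started", "completed", "failed"}

# Matches infoscreen/{uuid}/commands/ack and the transitional .../command/ack.
_ACK_TOPIC_RE = re.compile(r"^infoscreen/([^/]+)/commands?/ack$")


def build_ack(command_id: str, status: str,
              error_code: Optional[str] = None,
              error_message: Optional[str] = None) -> str:
    """Serialise an ack payload, rejecting statuses outside the allowed set."""
    if status not in ALLOWED_STATUSES:
        raise ValueError(f"unsupported status: {status}")
    return json.dumps({
        "command_id": command_id,
        "status": status,
        "error_code": error_code,
        "error_message": error_message,
    })


def ack_client_uuid(topic: str) -> Optional[str]:
    """Return the {uuid} segment for either ack topic, else None."""
    match = _ACK_TOPIC_RE.match(topic)
    return match.group(1) if match else None
```

Accepting both topic spellings in one pattern keeps the compatibility path in a single place, which should make it easy to drop once the naming transition ends.
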
## Message Structure

### General Principles