feat: crash recovery, service_failed monitoring, broker health fields, command expiry sweep

- Add GET /api/clients/crashed endpoint (process_status=crashed or stale heartbeat)
- Add restart_app command action with same lifecycle + lockout as reboot_host
- Scheduler: crash auto-recovery loop (CRASH_RECOVERY_ENABLED flag, lockout, MQTT publish)
- Scheduler: unconditional command expiry sweep per poll cycle (sweep_expired_commands)
- Listener: subscribe to infoscreen/+/service_failed; persist service_failed_at + unit
- Listener: extract broker_connection block from health payload; persist reconnect_count + last_disconnect_at
- DB migration b1c2d3e4f5a6: service_failed_at, service_failed_unit, mqtt_reconnect_count, mqtt_last_disconnect_at on clients
- Add GET /api/clients/service_failed and POST /api/clients/<uuid>/clear_service_failed
- Monitoring overview API: include mqtt_reconnect_count + mqtt_last_disconnect_at per client
- Frontend: orange service-failed alert panel (hidden when empty, auto-refresh, acknowledge ("quittieren") action)
- Frontend: MQTT reconnect count + last disconnect in client detail panel
- MQTT auth hardening: listener/scheduler/server use env credentials; broker enforces allow_anonymous false
- Client command lifecycle foundation: ClientCommand model, reboot_host/shutdown_host, full ACK lifecycle
- Docs: TECH-CHANGELOG, DEV-CHANGELOG, MQTT_EVENT_PAYLOAD_GUIDE, copilot-instructions updated
- Add implementation-plans/, RESTART_VALIDATION_CHECKLIST.md, TODO.md
@@ -0,0 +1,146 @@
## Client Team Implementation Spec (Raspberry Pi 5)

### Mission

Implement client-side command handling for reliable restart and shutdown with strict validation, idempotency, acknowledgements, and reboot recovery continuity.

### Ownership Boundaries

1. Client team owns command intake, execution, acknowledgement emission, and post-reboot continuity.
2. Platform team owns command issuance, lifecycle aggregation, and server-side escalation logic.
3. Client implementation must not assume managed PoE availability.

### Required Client Behaviors

### Frozen MQTT Topics and Schemas (v1)

1. Canonical command topic: infoscreen/{client_uuid}/commands.
2. Canonical ack topic: infoscreen/{client_uuid}/commands/ack.
3. Transitional compatibility topics during migration:
   - infoscreen/{client_uuid}/command
   - infoscreen/{client_uuid}/command/ack
4. QoS policy: command QoS 1, ack QoS 1 recommended.
5. Retain policy: commands and acks are non-retained.
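
The topic and delivery rules above can be captured in a small helper so that no other module builds topic strings by hand. This is only a sketch: the names `ClientTopics`, `topics_for`, `COMMAND_QOS`, `ACK_QOS`, and `RETAIN` are hypothetical and not part of the frozen contract.

```python
from dataclasses import dataclass

COMMAND_QOS = 1  # spec: command QoS 1
ACK_QOS = 1      # spec: ack QoS 1 recommended
RETAIN = False   # spec: commands and acks are non-retained


@dataclass(frozen=True)
class ClientTopics:
    command: str         # canonical command topic
    ack: str             # canonical ack topic
    legacy_command: str  # transitional topic, only during migration
    legacy_ack: str      # transitional ack topic, only during migration


def topics_for(client_uuid: str) -> ClientTopics:
    """Build the v1 topic names for a single client UUID."""
    base = f"infoscreen/{client_uuid}"
    return ClientTopics(
        command=f"{base}/commands",
        ack=f"{base}/commands/ack",
        legacy_command=f"{base}/command",
        legacy_ack=f"{base}/command/ack",
    )
```

A client would subscribe to `topics_for(uuid).command` (and, while the migration lasts, `legacy_command`) and publish acks with `qos=ACK_QOS, retain=RETAIN`.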

Frozen command payload schema:

```json
{
  "schema_version": "1.0",
  "command_id": "5d1f8b4b-7e85-44fb-8f38-3f5d5da5e2e4",
  "client_uuid": "9b8d1856-ff34-4864-a726-12de072d0f77",
  "action": "reboot_host",
  "issued_at": "2026-04-03T12:48:10Z",
  "expires_at": "2026-04-03T12:52:10Z",
  "requested_by": 1,
  "reason": "operator_request"
}
```
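
As an illustration, the command payload could be decoded into a typed object as follows. This is a sketch, not the authoritative validator (that role belongs to the JSON Schema file in implementation-plans/). The names `Command`, `parse_utc`, and `parse_command` are hypothetical, and the timestamp format is inferred from the example payload only.

```python
import json
from dataclasses import dataclass
from datetime import datetime, timezone

REQUIRED_FIELDS = (
    "schema_version", "command_id", "client_uuid", "action",
    "issued_at", "expires_at", "requested_by", "reason",
)


@dataclass(frozen=True)
class Command:
    schema_version: str
    command_id: str
    client_uuid: str
    action: str
    issued_at: datetime
    expires_at: datetime
    requested_by: int
    reason: str


def parse_utc(value: str) -> datetime:
    """Parse timestamps such as 2026-04-03T12:48:10Z into aware UTC datetimes."""
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def parse_command(raw: str) -> Command:
    """Decode a command payload; raise ValueError on bad JSON or missing fields."""
    data = json.loads(raw)  # json.JSONDecodeError is a ValueError subclass
    if not isinstance(data, dict):
        raise ValueError("payload must be a JSON object")
    missing = [name for name in REQUIRED_FIELDS if name not in data]
    if missing:
        raise ValueError("missing fields: " + ", ".join(missing))
    return Command(
        schema_version=data["schema_version"],
        command_id=data["command_id"],
        client_uuid=data["client_uuid"],
        action=data["action"],
        issued_at=parse_utc(data["issued_at"]),
        expires_at=parse_utc(data["expires_at"]),
        requested_by=data["requested_by"],
        reason=data["reason"],
    )
```

Keeping both timestamps as timezone-aware UTC values makes the later expiry comparison a plain `now > expires_at`.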

Frozen ack payload schema:

```json
{
  "command_id": "5d1f8b4b-7e85-44fb-8f38-3f5d5da5e2e4",
  "status": "execution_started",
  "error_code": null,
  "error_message": null
}
```

Allowed ack status values:

1. accepted
2. execution_started
3. completed
4. failed
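
A minimal sketch of an ack builder that only lets the four allowed status values through. The function name `build_ack` is hypothetical. Requiring an `error_code` on `failed` acks follows the acknowledgement contract in this spec; nothing else about the fields is assumed.

```python
import json

ACK_STATUSES = ("accepted", "execution_started", "completed", "failed")


def build_ack(command_id, status, error_code=None, error_message=None):
    """Serialize an ack payload with the four frozen fields."""
    if status not in ACK_STATUSES:
        raise ValueError(f"unknown ack status: {status!r}")
    if status == "failed" and error_code is None:
        raise ValueError("failed acks need a structured error_code")
    return json.dumps({
        "command_id": command_id,
        "status": status,
        "error_code": error_code,      # serialized as null when None
        "error_message": error_message,
    })
```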

Frozen command action values for v1:

1. reboot_host
2. shutdown_host

Reserved but not emitted by server in v1:

1. restart_service

Validation snippets for helper scripts:

1. Human-readable snippets: implementation-plans/reboot-command-payload-schemas.md
2. Machine-validated JSON Schema: implementation-plans/reboot-command-payload-schemas.json

### 1. Command Intake

1. Subscribe to the canonical command topic with QoS 1.
2. Parse required fields: schema_version, command_id, action, issued_at, expires_at, reason, requested_by, target metadata (client_uuid).
3. Reject invalid payloads with a failed acknowledgement that includes an error_code and a diagnostic message.
4. Reject stale commands when current time exceeds expires_at.
5. Ignore already-processed command_id values.
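
The staleness and duplicate checks from the list above might be combined into one decision function, sketched here. The name `intake_decision` and the `"accept"` / `"ignore"` / `"reject"` labels are hypothetical. Checking duplicates before staleness is also an assumption, chosen so that a redelivered command is ignored quietly instead of producing a second failed ack.

```python
from datetime import datetime, timezone


def intake_decision(command_id, expires_at, processed_ids, now=None):
    """Return (decision, error_code) for an already schema-valid command."""
    if now is None:
        now = datetime.now(timezone.utc)
    if command_id in processed_ids:
        # Already handled: ignore, never execute again.
        return "ignore", "duplicate_command"
    if now > expires_at:
        # Stale only once the current time exceeds expires_at.
        return "reject", "stale_command"
    return "accept", None
```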

### 2. Idempotency And Persistence

1. Persist processed command_id and execution result on local storage.
2. Persistence must survive service restart and full OS reboot.
3. On restart, reload dedupe cache before processing newly delivered commands.
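
One way to meet these persistence requirements is a small SQLite file, sketched below. The class name `ProcessedCommandStore`, the table layout, and the choice of SQLite are all assumptions made for illustration; the spec only requires durable local storage.

```python
import sqlite3


class ProcessedCommandStore:
    """Durable record of processed command_id values and their results."""

    def __init__(self, path):
        self._db = sqlite3.connect(path)
        self._db.execute(
            "CREATE TABLE IF NOT EXISTS processed_commands ("
            "command_id TEXT PRIMARY KEY, result TEXT NOT NULL)"
        )
        self._db.commit()

    def record(self, command_id, result):
        # The connection context manager commits on success, rolls back on error.
        with self._db:
            self._db.execute(
                "INSERT OR REPLACE INTO processed_commands (command_id, result) "
                "VALUES (?, ?)",
                (command_id, result),
            )

    def result_for(self, command_id):
        row = self._db.execute(
            "SELECT result FROM processed_commands WHERE command_id = ?",
            (command_id,),
        ).fetchone()
        return row[0] if row else None

    def seen(self, command_id):
        return self.result_for(command_id) is not None

    def close(self):
        self._db.close()
```

Opening the store at service start, before subscribing to the command topic, covers the "reload dedupe cache first" rule, because every lookup then reads the persisted table.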

### 3. Acknowledgement Contract Behavior

1. Emit accepted immediately after successful validation and dedupe pass.
2. Emit execution_started immediately before invoking the command action.
3. Emit completed only when local success condition is confirmed.
4. Emit failed with structured error_code on validation or execution failure.
5. If MQTT is temporarily unavailable, retry ack publish with bounded backoff until command expiry.
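
The retry rule in item 5 can be sketched as exponential backoff with a cap, stopping once the command has expired. The `publish` callable (returns `True` on success) stands in for whatever MQTT client wrapper the client uses. The function name and the delay defaults are hypothetical.

```python
import time
from datetime import datetime, timezone


def publish_ack_with_retry(publish, payload, expires_at,
                           base_delay=1.0, max_delay=30.0,
                           sleep=time.sleep, now=None):
    """Retry publish(payload) with capped backoff until success or expiry."""
    if now is None:
        now = lambda: datetime.now(timezone.utc)
    delay = base_delay
    while True:
        if publish(payload):
            return True
        remaining = (expires_at - now()).total_seconds()
        if remaining <= 0:
            return False  # command expired; stop retrying
        sleep(min(delay, remaining))          # never sleep past the expiry
        delay = min(delay * 2, max_delay)     # bounded exponential backoff
```

Injecting `sleep` and `now` keeps the loop testable without waiting in real time.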

### 4. Execution Security Model

1. Execute via systemd-managed privileged helper.
2. Allow only whitelisted operations:
   - reboot_host
   - shutdown_host
3. Optionally keep restart_service handler as reserved path, but do not require it for v1 conformance.
4. Disallow arbitrary shell commands and untrusted arguments.
5. Enforce per-command execution timeout and terminate hung child processes.
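
A sketch of the whitelist and timeout rules: each allowed action maps to a fixed argument vector, nothing from the payload reaches the command line, and no shell is involved. Calling `systemctl reboot` and `systemctl poweroff` directly is an assumption standing in for the privileged helper, whose interface this spec does not define. `run_action` and the 30-second default are hypothetical.

```python
import subprocess

# Fixed argv per whitelisted action (assumed stand-in for the privileged helper).
ALLOWED_ACTIONS = {
    "reboot_host": ("systemctl", "reboot"),
    "shutdown_host": ("systemctl", "poweroff"),
}


def run_action(action, timeout_s=30.0, runner=subprocess.run):
    """Run a whitelisted action; return None on success or an error code."""
    argv = ALLOWED_ACTIONS.get(action)
    if argv is None:
        raise ValueError(f"action not whitelisted: {action!r}")
    try:
        # No shell=True and no payload-derived arguments.
        # subprocess.run kills the child when the timeout expires.
        result = runner(list(argv), timeout=timeout_s, check=False)
    except subprocess.TimeoutExpired:
        return "execution_timeout"
    except OSError:
        return "execution_failed"
    return None if result.returncode == 0 else "execution_failed"
```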

### 5. Reboot Recovery Continuity

1. For reboot_host action:
   - send execution_started
   - trigger reboot promptly
2. During startup:
   - emit heartbeat early
   - emit process-health once service is ready
3. Keep last command execution state available after reboot for reconciliation.
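
Reconciliation after a reboot could look like the sketch below. The last command state is written durably before the reboot is triggered and compared against the current boot identifier at startup. Treating a changed boot ID as the local success condition for reboot_host is an assumption, not something this spec states. On Linux the identifier can be read from /proc/sys/kernel/random/boot_id. The function names and the state-file layout are hypothetical.

```python
import json
import os


def save_state(path, command_id, action, status, boot_id):
    """Write the last command state atomically and flush it to disk."""
    state = {"command_id": command_id, "action": action,
             "status": status, "boot_id": boot_id}
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())  # make sure the data reaches storage before reboot
    os.replace(tmp, path)     # atomic rename over the old state file


def reconcile_after_start(path, current_boot_id):
    """Return (command_id, "completed") if a pending reboot_host is confirmed."""
    if not os.path.exists(path):
        return None
    with open(path) as f:
        state = json.load(f)
    if (state["action"] == "reboot_host"
            and state["status"] == "execution_started"
            and state["boot_id"] != current_boot_id):
        return state["command_id"], "completed"
    return None
```

Calling `save_state` right after emitting execution_started and `reconcile_after_start` early during service startup gives the server a terminal ack for the reboot it requested.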

### 6. Time And Timeout Semantics

1. Use monotonic timers for local elapsed-time checks.
2. Use UTC wall-clock only for protocol timestamps and expiry comparisons.
3. Target reconnect baseline on Pi 5 USB-SATA SSD: 90 seconds.
4. Accept cold-boot and USB enumeration ceiling up to 150 seconds.
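
The split between monotonic and wall-clock time might be expressed as follows. `Deadline` uses `time.monotonic` for local elapsed-time checks, while `is_expired` compares UTC timestamps for the protocol expiry. The class, function, and label names are hypothetical; the 90 and 150 second values come from the list above.

```python
import time
from datetime import datetime, timezone

RECONNECT_BASELINE_S = 90   # target on Pi 5 USB-SATA SSD
RECONNECT_CEILING_S = 150   # cold boot and USB enumeration ceiling


class Deadline:
    """Local timeout based on a monotonic clock (immune to clock jumps)."""

    def __init__(self, seconds, clock=time.monotonic):
        self._clock = clock
        self._end = clock() + seconds

    def remaining(self):
        return max(0.0, self._end - self._clock())

    def expired(self):
        return self._clock() >= self._end


def is_expired(expires_at, now=None):
    """Protocol expiry check using UTC wall-clock time."""
    if now is None:
        now = datetime.now(timezone.utc)
    return now > expires_at


def classify_reconnect(elapsed_s):
    """Place a measured reconnect time against the baseline and ceiling."""
    if elapsed_s <= RECONNECT_BASELINE_S:
        return "within_baseline"
    if elapsed_s <= RECONNECT_CEILING_S:
        return "within_ceiling"
    return "exceeded"
```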

### 7. Capability Reporting

1. Report recovery capability class:
   - software_only
   - managed_poe_available
   - manual_only
2. Report watchdog enabled status.
3. Report boot-source metadata for diagnostics.

### 8. Error Codes Minimum Set

1. invalid_schema
2. missing_field
3. stale_command
4. duplicate_command
5. permission_denied_local
6. execution_timeout
7. execution_failed
8. broker_unavailable
9. internal_error
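
Keeping the minimum set in one enumeration helps prevent ad-hoc strings from leaking into acks. A sketch, with the hypothetical names `ErrorCode` and `failed_ack`:

```python
from enum import Enum


class ErrorCode(str, Enum):
    INVALID_SCHEMA = "invalid_schema"
    MISSING_FIELD = "missing_field"
    STALE_COMMAND = "stale_command"
    DUPLICATE_COMMAND = "duplicate_command"
    PERMISSION_DENIED_LOCAL = "permission_denied_local"
    EXECUTION_TIMEOUT = "execution_timeout"
    EXECUTION_FAILED = "execution_failed"
    BROKER_UNAVAILABLE = "broker_unavailable"
    INTERNAL_ERROR = "internal_error"


def failed_ack(command_id, code, message):
    """Build a failed ack dict; unknown codes raise ValueError."""
    return {
        "command_id": command_id,
        "status": "failed",
        "error_code": ErrorCode(code).value,
        "error_message": message,
    }
```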

### Acceptance Tests (Client Team)

1. Invalid schema payload is rejected and failed ack emitted.
2. Expired command is rejected and not executed.
3. Duplicate command_id is not executed twice.
4. reboot_host emits execution_started and reconnects with heartbeat within the expected window (90-second baseline, 150-second ceiling).
5. If the optional restart_service handler is implemented, the restart_service action completes without host reboot and emits completed.
6. MQTT outage during ack path retries correctly without duplicate execution.
7. Boot-loop protection cooperates with server-side lockout semantics.

### Delivery Artifacts

1. Client protocol conformance checklist.
2. Test evidence for all acceptance tests.
3. Runtime logs showing full lifecycle for one restart and one reboot scenario.
4. Known limitations list per image version.

### Definition Of Done

1. All acceptance tests pass on Pi 5 USB-SATA SSD test devices.
2. No duplicate execution observed under reconnect and retained-delivery edge cases.
3. Acknowledgement sequence is complete and machine-parseable for server correlation.
4. Reboot recovery continuity works without managed PoE dependencies.