infoscreen/implementation-plans/reboot-implementation-handoff-client-team.md
Olaf 03e3c11e90 feat: crash recovery, service_failed monitoring, broker health fields, command expiry sweep
- Add GET /api/clients/crashed endpoint (process_status=crashed or stale heartbeat)
- Add restart_app command action with same lifecycle + lockout as reboot_host
- Scheduler: crash auto-recovery loop (CRASH_RECOVERY_ENABLED flag, lockout, MQTT publish)
- Scheduler: unconditional command expiry sweep per poll cycle (sweep_expired_commands)
- Listener: subscribe to infoscreen/+/service_failed; persist service_failed_at + unit
- Listener: extract broker_connection block from health payload; persist reconnect_count + last_disconnect_at
- DB migration b1c2d3e4f5a6: service_failed_at, service_failed_unit, mqtt_reconnect_count, mqtt_last_disconnect_at on clients
- Add GET /api/clients/service_failed and POST /api/clients/<uuid>/clear_service_failed
- Monitoring overview API: include mqtt_reconnect_count + mqtt_last_disconnect_at per client
- Frontend: orange service-failed alert panel (hidden when empty, auto-refresh, quittieren action)
- Frontend: MQTT reconnect count + last disconnect in client detail panel
- MQTT auth hardening: listener/scheduler/server use env credentials; broker enforces allow_anonymous false
- Client command lifecycle foundation: ClientCommand model, reboot_host/shutdown_host, full ACK lifecycle
- Docs: TECH-CHANGELOG, DEV-CHANGELOG, MQTT_EVENT_PAYLOAD_GUIDE, copilot-instructions updated
- Add implementation-plans/, RESTART_VALIDATION_CHECKLIST.md, TODO.md
2026-04-05 10:17:56 +00:00


Client Team Implementation Spec (Raspberry Pi 5)

Mission

Implement client-side command handling for reliable restart and shutdown with strict validation, idempotency, acknowledgements, and reboot recovery continuity.

Ownership Boundaries

  1. Client team owns command intake, execution, acknowledgement emission, and post-reboot continuity.
  2. Platform team owns command issuance, lifecycle aggregation, and server-side escalation logic.
  3. Client implementation must not assume managed PoE availability.

Required Client Behaviors

Frozen MQTT Topics and Schemas (v1)

  1. Canonical command topic: infoscreen/{client_uuid}/commands.
  2. Canonical ack topic: infoscreen/{client_uuid}/commands/ack.
  3. Transitional compatibility topics during migration:
  • infoscreen/{client_uuid}/command
  • infoscreen/{client_uuid}/command/ack
  4. QoS policy: command QoS 1, ack QoS 1 recommended.
  5. Retain policy: commands and acks are non-retained.
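
The topic rules above can be captured in a few small helpers. This is a minimal sketch, not part of the frozen contract: the function and constant names (command_topics, ack_topic, COMMAND_QOS, ACK_QOS, RETAIN) are hypothetical, and normalising client_uuid through Python's uuid module is an assumption added here so that wildcard characters cannot leak into a topic string.

```python
import uuid

COMMAND_QOS = 1   # frozen: commands use QoS 1
ACK_QOS = 1       # recommended for acks
RETAIN = False    # commands and acks are non-retained


def _normalised(client_uuid: str) -> str:
    # Round-tripping through uuid.UUID yields only hex digits and hyphens,
    # so "+", "#" or "/" can never end up inside the topic.
    return str(uuid.UUID(client_uuid))


def command_topics(client_uuid: str) -> list:
    """Canonical command topic first, then the transitional one."""
    cid = _normalised(client_uuid)
    return [f"infoscreen/{cid}/commands", f"infoscreen/{cid}/command"]


def ack_topic(client_uuid: str, transitional: bool = False) -> str:
    """Ack topic matching the command topic the command arrived on."""
    cid = _normalised(client_uuid)
    suffix = "command/ack" if transitional else "commands/ack"
    return f"infoscreen/{cid}/{suffix}"
```

A client that still has to serve the migration period could subscribe to both entries of command_topics(...) and answer on the matching ack topic; whether acks should mirror the incoming topic is a design choice this spec does not settle.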

Frozen command payload schema:

{
	"schema_version": "1.0",
	"command_id": "5d1f8b4b-7e85-44fb-8f38-3f5d5da5e2e4",
	"client_uuid": "9b8d1856-ff34-4864-a726-12de072d0f77",
	"action": "reboot_host",
	"issued_at": "2026-04-03T12:48:10Z",
	"expires_at": "2026-04-03T12:52:10Z",
	"requested_by": 1,
	"reason": "operator_request"
}

Frozen ack payload schema:

{
	"command_id": "5d1f8b4b-7e85-44fb-8f38-3f5d5da5e2e4",
	"status": "execution_started",
	"error_code": null,
	"error_message": null
}

Allowed ack status values:

  1. accepted
  2. execution_started
  3. completed
  4. failed

Frozen command action values for v1:

  1. reboot_host
  2. shutdown_host

Reserved but not emitted by server in v1:

  1. restart_service

Validation snippets for helper scripts:

  1. Human-readable snippets: implementation-plans/reboot-command-payload-schemas.md
  2. Machine-validated JSON Schema: implementation-plans/reboot-command-payload-schemas.json

1. Command Intake

  1. Subscribe to the canonical command topic with QoS 1.
  2. Parse required fields: schema_version, command_id, action, issued_at, expires_at, reason, requested_by, and target metadata (client_uuid).
  3. Reject invalid payloads with failed acknowledgement including error_code and diagnostic message.
  4. Reject stale commands when current time exceeds expires_at.
  5. Ignore already-processed command_id values.
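
The intake checks above could look roughly like the following. It is a sketch under stated assumptions: check_command and its (error_code, message) return shape are hypothetical, the timestamp format is inferred from the example payload, and returning duplicate_command for a known command_id is one possible way to let the caller decide to ignore it silently. The JSON Schema file referenced earlier remains the authoritative validator.

```python
import json
from datetime import datetime, timezone

REQUIRED_FIELDS = (
    "schema_version", "command_id", "client_uuid", "action",
    "issued_at", "expires_at", "requested_by", "reason",
)
STRING_FIELDS = ("schema_version", "command_id", "action", "issued_at", "expires_at")
ALLOWED_ACTIONS = frozenset({"reboot_host", "shutdown_host"})  # frozen v1 actions


def parse_utc(ts: str) -> datetime:
    # Format assumed from the example payload, e.g. "2026-04-03T12:48:10Z".
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def check_command(raw, now: datetime, seen_ids):
    """Return (error_code, message); (None, None) means the command is accepted."""
    try:
        cmd = json.loads(raw)
    except ValueError as exc:  # covers JSONDecodeError and bad encodings
        return "invalid_schema", f"not valid JSON: {exc}"
    if not isinstance(cmd, dict):
        return "invalid_schema", "payload must be a JSON object"
    missing = [f for f in REQUIRED_FIELDS if f not in cmd]
    if missing:
        return "missing_field", "missing: " + ", ".join(missing)
    if not all(isinstance(cmd[f], str) for f in STRING_FIELDS):
        return "invalid_schema", "expected string values for: " + ", ".join(STRING_FIELDS)
    if cmd["schema_version"] != "1.0" or cmd["action"] not in ALLOWED_ACTIONS:
        return "invalid_schema", "unsupported schema_version or action"
    try:
        parse_utc(cmd["issued_at"])
        expires_at = parse_utc(cmd["expires_at"])
    except ValueError:
        return "invalid_schema", "timestamps must look like 2026-04-03T12:48:10Z"
    if now > expires_at:
        return "stale_command", "expired at " + cmd["expires_at"]
    if cmd["command_id"] in seen_ids:
        return "duplicate_command", "command_id already processed"
    return None, None
```

The current time and the set of known IDs are passed in rather than read globally, which keeps the function easy to exercise with fixed timestamps in the acceptance tests.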

2. Idempotency And Persistence

  1. Persist processed command_id and execution result on local storage.
  2. Persistence must survive service restart and full OS reboot.
  3. On restart, reload dedupe cache before processing newly delivered commands.
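
One way to meet these persistence requirements is a small SQLite file on persistent storage. The sketch below is illustrative only: the CommandStore class, its table layout, and the choice of SQLite are assumptions, not something this spec prescribes. It relies on the primary key to make claiming a command_id atomic, so a redelivered command cannot be claimed twice.

```python
import sqlite3


class CommandStore:
    """Durable record of processed command_id values (hypothetical helper)."""

    def __init__(self, path: str):
        # The path must live on storage that survives a reboot (not tmpfs).
        self._db = sqlite3.connect(path)
        self._db.execute(
            "CREATE TABLE IF NOT EXISTS commands ("
            " command_id TEXT PRIMARY KEY,"
            " status TEXT NOT NULL,"
            " updated_at TEXT NOT NULL)"
        )
        self._db.commit()

    def claim(self, command_id: str, now_iso: str) -> bool:
        """Record a new command as accepted; return False if it is already known."""
        cur = self._db.execute(
            "INSERT OR IGNORE INTO commands VALUES (?, 'accepted', ?)",
            (command_id, now_iso),
        )
        self._db.commit()
        return cur.rowcount == 1

    def set_status(self, command_id: str, status: str, now_iso: str) -> None:
        self._db.execute(
            "UPDATE commands SET status = ?, updated_at = ? WHERE command_id = ?",
            (status, now_iso, command_id),
        )
        self._db.commit()

    def get_status(self, command_id: str):
        row = self._db.execute(
            "SELECT status FROM commands WHERE command_id = ?", (command_id,)
        ).fetchone()
        return row[0] if row else None

    def close(self) -> None:
        self._db.close()
```

Opening the store before subscribing to the command topic gives the "reload dedupe cache first" ordering for free, because every lookup goes to the file rather than to an in-memory set.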

3. Acknowledgement Contract Behavior

  1. Emit accepted immediately after successful validation and dedupe pass.
  2. Emit execution_started immediately before invoking the command action.
  3. Emit completed only when local success condition is confirmed.
  4. Emit failed with structured error_code on validation or execution failure.
  5. If MQTT is temporarily unavailable, retry ack publish with bounded backoff until command expiry.
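
The ack payload and the bounded retry in item 5 can be sketched as follows. The names build_ack and publish_with_backoff are hypothetical, as are the 1 second base delay and 30 second cap. The publish argument stands in for whatever wrapper the client puts around its MQTT library and is assumed to return True once the broker has taken the message.

```python
import json
import time
from datetime import datetime, timezone

ACK_STATUSES = ("accepted", "execution_started", "completed", "failed")


def build_ack(command_id, status, error_code=None, error_message=None) -> str:
    """Serialize an ack in the frozen four-field shape."""
    if status not in ACK_STATUSES:
        raise ValueError(f"unknown ack status: {status}")
    if status == "failed" and not error_code:
        raise ValueError("failed acks need a structured error_code")
    return json.dumps({
        "command_id": command_id,
        "status": status,
        "error_code": error_code,
        "error_message": error_message,
    })


def publish_with_backoff(publish, payload, expires_at, base_delay=1.0, max_delay=30.0,
                         now=lambda: datetime.now(timezone.utc), sleep=time.sleep):
    """Retry publish(payload) with doubling delays until it succeeds or the command expires."""
    delay = base_delay
    while True:
        if publish(payload):
            return True
        remaining = (expires_at - now()).total_seconds()
        if remaining <= 0:
            return False  # give up at command expiry
        sleep(min(delay, remaining))
        delay = min(delay * 2, max_delay)
```

Only the publish step is retried here; the command itself is never re-run, which is what keeps an MQTT outage from causing duplicate execution.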

4. Execution Security Model

  1. Execute via systemd-managed privileged helper.
  2. Allow only whitelisted operations:
  • reboot_host
  • shutdown_host
  3. Optionally keep restart_service handler as reserved path, but do not require it for v1 conformance.
  4. Disallow arbitrary shell commands and untrusted arguments.
  5. Enforce per-command execution timeout and terminate hung child processes.
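
A whitelist with fixed argument vectors is one way to satisfy these rules. The following is a sketch, not the privileged helper itself: run_action and ACTION_ARGV are hypothetical names, mapping the actions to systemctl reboot and systemctl poweroff is an assumption about what the helper ultimately invokes, and reporting an unknown action as permission_denied_local is an illustrative choice. No shell is involved and nothing from the MQTT payload reaches the command line.

```python
import subprocess

# Assumed mapping: the spec names the actions, not the commands behind them.
ACTION_ARGV = {
    "reboot_host": ("systemctl", "reboot"),
    "shutdown_host": ("systemctl", "poweroff"),
}


def run_action(action, timeout_s=10.0, argv_table=ACTION_ARGV):
    """Run a whitelisted action; return None on success or an error_code string."""
    argv = argv_table.get(action)
    if argv is None:
        return "permission_denied_local"  # not whitelisted, nothing is executed
    try:
        proc = subprocess.run(
            list(argv),                 # fixed argv, never shell=True
            stdin=subprocess.DEVNULL,
            capture_output=True,
            timeout=timeout_s,          # run() kills the child when this elapses
            check=False,
        )
    except subprocess.TimeoutExpired:
        return "execution_timeout"
    except OSError:
        return "execution_failed"       # e.g. binary missing or not executable
    return None if proc.returncode == 0 else "execution_failed"
```

Passing a different argv_table makes the function testable on a development machine without rebooting it, for example with a table that points at harmless commands.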

5. Reboot Recovery Continuity

  1. For reboot_host action:
  • send execution_started
  • trigger reboot promptly
  2. During startup:
  • emit heartbeat early
  • emit process-health once service is ready
  3. Keep last command execution state available after reboot for reconciliation.

6. Time And Timeout Semantics

  1. Use monotonic timers for local elapsed-time checks.
  2. Use UTC wall-clock only for protocol timestamps and expiry comparisons.
  3. Target reconnect baseline on Pi 5 USB-SATA SSD: 90 seconds.
  4. Accept cold-boot and USB enumeration ceiling up to 150 seconds.

7. Capability Reporting

  1. Report recovery capability class:
  • software_only
  • managed_poe_available
  • manual_only
  2. Report watchdog enabled status.
  3. Report boot-source metadata for diagnostics.

8. Error Codes Minimum Set

  1. invalid_schema
  2. missing_field
  3. stale_command
  4. duplicate_command
  5. permission_denied_local
  6. execution_timeout
  7. execution_failed
  8. broker_unavailable
  9. internal_error

Acceptance Tests (Client Team)

  1. Invalid schema payload is rejected and failed ack emitted.
  2. Expired command is rejected and not executed.
  3. Duplicate command_id is not executed twice.
  4. reboot_host emits execution_started and reconnects with heartbeat in expected window.
  5. restart_service action (reserved in v1; applies only if the optional handler is implemented) completes without host reboot and emits completed.
  6. MQTT outage during ack path retries correctly without duplicate execution.
  7. Boot-loop protection cooperates with server-side lockout semantics.

Delivery Artifacts

  1. Client protocol conformance checklist.
  2. Test evidence for all acceptance tests.
  3. Runtime logs showing full lifecycle for one restart and one reboot scenario.
  4. Known limitations list per image version.

Definition Of Done

  1. All acceptance tests pass on Pi 5 USB-SATA SSD test devices.
  2. No duplicate execution observed under reconnect and retained-delivery edge cases.
  3. Acknowledgement sequence is complete and machine-parseable for server correlation.
  4. Reboot recovery continuity works without managed PoE dependencies.