- Command intake (reboot/shutdown) on infoscreen/{uuid}/commands with ack lifecycle
- MQTT_USER/MQTT_PASSWORD_BROKER split from identity vars; configure_mqtt_security() updated
- infoscreen-simclient.service: Type=notify, WatchdogSec=60, Restart=on-failure
- infoscreen-notify-failure@.service + script: retained MQTT alert when systemd gives up (Gap 3)
- _sd_notify() watchdog keepalive in simclient main loop (Gap 1)
- broker_connection block in health payload: reconnect_count, last_disconnect_at (Gap 2)
- COMMAND_MOCK_REBOOT_IMMEDIATE_COMPLETE canary flag with safety guard
- SERVER_TEAM_ACTIONS.md: server-side integration action items
- Docs: README, CHANGELOG, src/README, copilot-instructions updated
- 43 tests passing
Client Team Implementation Spec (Raspberry Pi 5)
Mission
Implement client-side command handling for reliable restart and shutdown with strict validation, idempotency, acknowledgements, and reboot recovery continuity.
Ownership Boundaries
- Client team owns command intake, execution, acknowledgement emission, and post-reboot continuity.
- Platform team owns command issuance, lifecycle aggregation, and server-side escalation logic.
- Client implementation must not assume managed PoE availability.
Required Client Behaviors
Frozen MQTT Topics and Schemas (v1)
- Canonical command topic: infoscreen/{client_uuid}/commands.
- Canonical ack topic: infoscreen/{client_uuid}/commands/ack.
- Transitional compatibility topics during migration:
  - infoscreen/{client_uuid}/command
  - infoscreen/{client_uuid}/command/ack
- QoS policy: command QoS 1, ack QoS 1 recommended.
- Retain policy: commands and acks are non-retained.
- Client migration behavior: subscribe to both command topics and publish to both ack topics during migration.
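The topic rules above can be sketched as two small helpers. This is a minimal illustration, not the shipped client code; the function names `command_topics` and `ack_topics` are hypothetical.

```python
def command_topics(client_uuid: str) -> list[str]:
    """Topics to subscribe to during migration, canonical topic first."""
    base = f"infoscreen/{client_uuid}"
    return [f"{base}/commands", f"{base}/command"]


def ack_topics(client_uuid: str) -> list[str]:
    """Topics each ack is published to during migration, canonical topic first."""
    return [f"{topic}/ack" for topic in command_topics(client_uuid)]
```

Because the client listens on both command topics, the same command may plausibly arrive twice during migration; the command_id dedupe described later is what keeps that from causing a second execution.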
Frozen command payload schema:
{
  "schema_version": "1.0",
  "command_id": "5d1f8b4b-7e85-44fb-8f38-3f5d5da5e2e4",
  "client_uuid": "9b8d1856-ff34-4864-a726-12de072d0f77",
  "action": "reboot_host",
  "issued_at": "2026-04-03T12:48:10Z",
  "expires_at": "2026-04-03T12:52:10Z",
  "requested_by": 1,
  "reason": "operator_request"
}
Frozen ack payload schema:
{
  "command_id": "5d1f8b4b-7e85-44fb-8f38-3f5d5da5e2e4",
  "status": "execution_started",
  "error_code": null,
  "error_message": null
}
Allowed ack status values:
- accepted
- execution_started
- completed
- failed
Frozen command action values for v1:
- reboot_host
- shutdown_host
Reserved but not emitted by server in v1:
- restart_service
Client Decision Defaults (v1)
- Privileged helper invocation: sudoers + local helper script (sudo /usr/local/bin/infoscreen-cmd-helper.sh).
- Dedupe retention: keep processed command IDs for 24 hours and cap store size to 5000 newest entries.
- Ack retry schedule while broker unavailable: 0.5s, 1s, 2s, 4s, then 5s cap until expires_at.
- Boot-loop handling: server remains authority for safety lockout; client enforces idempotency by command_id and reports local execution outcomes.
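The ack retry schedule listed above (0.5s, 1s, 2s, 4s, then a 5s cap until expires_at) could be generated as follows. This is a sketch under stated assumptions: the names `ack_retry_delays` and `delays_until_expiry` are hypothetical, and the caller is assumed to work out how many seconds remain before expires_at.

```python
from typing import Iterator


def ack_retry_delays(first: float = 0.5, cap: float = 5.0) -> Iterator[float]:
    """Yield 0.5, 1, 2, 4, then 5 forever (doubling, capped at `cap`)."""
    delay = first
    while True:
        yield min(delay, cap)
        delay = min(delay * 2, cap)


def delays_until_expiry(seconds_until_expiry: float) -> list[float]:
    """Return the retry delays that fit before the command expires."""
    delays: list[float] = []
    elapsed = 0.0
    for delay in ack_retry_delays():
        if elapsed + delay > seconds_until_expiry:
            break
        delays.append(delay)
        elapsed += delay
    return delays
```

For example, with 13 seconds left before expiry the schedule allows five retries (0.5 + 1 + 2 + 4 + 5 = 12.5 seconds) and then stops.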
MQTT Auth Hardening (Current Priority)
- Client must support authenticated MQTT connections for both command and event intake.
- Client must remain compatible with broker ACLs that restrict publish/subscribe rights per topic.
- Client should support TLS broker connections from environment configuration when certificates are provided.
- URL/domain allowlisting for web and webuntis events is explicitly deferred and tracked separately in TODO.md.
- Client credentials are loaded from the local /.env, not from tracked docs or templates.
Server-side prerequisites for this client work:
- Broker credentials must be provisioned for clients.
- Broker ACLs must allow each client to subscribe only to its own command topics and assigned event topics.
- Broker ACLs must allow each client to publish only its own ack, heartbeat, health, dashboard, and telemetry topics.
- Server-side publishers must move to authenticated broker access before production rollout.
Validation snippets for helper scripts:
- Human-readable snippets: implementation-plans/reboot-command-payload-schemas.md
- Machine-validated JSON Schema: implementation-plans/reboot-command-payload-schemas.json
1. Command Intake
- Subscribe to canonical and transitional command topics with QoS 1.
- Parse required fields exactly: schema_version, command_id, client_uuid, action, issued_at, expires_at, requested_by, reason.
- Reject invalid payloads with failed acknowledgement including error_code and diagnostic message.
- Reject stale commands when current time exceeds expires_at.
- Reject already-processed command_id values without re-execution.
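The intake checks above might be combined into a single validation function along these lines. It is a sketch only: the function names are hypothetical, and mapping a client_uuid mismatch or a non-whitelisted action to invalid_schema is an assumption, since the spec does not assign a code to those cases.

```python
from datetime import datetime, timezone

REQUIRED_FIELDS = ("schema_version", "command_id", "client_uuid", "action",
                   "issued_at", "expires_at", "requested_by", "reason")
ALLOWED_ACTIONS = frozenset({"reboot_host", "shutdown_host"})


def parse_utc(value: str) -> datetime:
    """Parse an ISO 8601 timestamp and require an explicit UTC offset."""
    # Python < 3.11 does not accept a trailing "Z" in fromisoformat().
    parsed = datetime.fromisoformat(value.replace("Z", "+00:00"))
    if parsed.tzinfo is None:
        raise ValueError("timestamp has no timezone")
    return parsed


def validate_command(payload, my_uuid, seen_ids, now=None):
    """Return None if the command may run, else (error_code, error_message)."""
    now = now or datetime.now(timezone.utc)
    if not isinstance(payload, dict):
        return ("invalid_schema", "payload is not a JSON object")
    for field in REQUIRED_FIELDS:
        if field not in payload:
            return ("missing_field", f"missing required field: {field}")
    if payload["schema_version"] != "1.0":
        return ("invalid_schema", "unsupported schema_version")
    if not isinstance(payload["command_id"], str):
        return ("invalid_schema", "command_id must be a string")
    # Assumption: uuid mismatch and unknown actions are reported as invalid_schema.
    if payload["client_uuid"] != my_uuid:
        return ("invalid_schema", "client_uuid does not match this device")
    action = payload["action"]
    if not isinstance(action, str) or action not in ALLOWED_ACTIONS:
        return ("invalid_schema", f"action not allowed in v1: {action!r}")
    try:
        parse_utc(payload["issued_at"])
        expires_at = parse_utc(payload["expires_at"])
    except (AttributeError, TypeError, ValueError):
        return ("invalid_schema", "issued_at/expires_at are not valid UTC timestamps")
    if now > expires_at:
        return ("stale_command", "command expired before processing")
    if payload["command_id"] in seen_ids:
        return ("duplicate_command", "command_id already processed")
    return None
```

A non-None result would be turned into a failed acknowledgement carrying the returned error_code and message; a None result lets the client emit accepted.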
2. Idempotency And Persistence
- Persist processed command_id and execution result on local storage.
- Persistence must survive service restart and full OS reboot.
- On restart, reload dedupe cache before processing newly delivered commands.
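One way to meet these persistence requirements is a small JSON file that is rewritten atomically, sketched below. The class name `DedupeStore`, the file layout, and the choice to start empty on a corrupt file are assumptions for illustration; the 24-hour retention and 5000-entry cap come from the client decision defaults.

```python
import json
import os
import tempfile
import time
from pathlib import Path

RETENTION_SECONDS = 24 * 3600
MAX_ENTRIES = 5000


class DedupeStore:
    """Processed command IDs and results, persisted across restarts and reboots."""

    def __init__(self, path):
        self.path = Path(path)
        self.entries = {}  # command_id -> {"result": str, "at": epoch seconds}
        if self.path.exists():
            try:
                self.entries = json.loads(self.path.read_text())
            except ValueError:
                # Assumption: a corrupt file is treated as an empty store.
                self.entries = {}

    def seen(self, command_id: str) -> bool:
        return command_id in self.entries

    def record(self, command_id: str, result: str, now=None) -> None:
        now = time.time() if now is None else now
        self.entries[command_id] = {"result": result, "at": now}
        self._prune(now)
        self._save()

    def _prune(self, now: float) -> None:
        fresh = {k: v for k, v in self.entries.items()
                 if now - v["at"] <= RETENTION_SECONDS}
        newest = sorted(fresh.items(), key=lambda kv: kv[1]["at"])[-MAX_ENTRIES:]
        self.entries = dict(newest)

    def _save(self) -> None:
        # Write to a temp file in the same directory, then rename atomically
        # so a power cut never leaves a half-written store behind.
        fd, tmp = tempfile.mkstemp(dir=str(self.path.parent))
        with os.fdopen(fd, "w") as handle:
            json.dump(self.entries, handle)
            handle.flush()
            os.fsync(handle.fileno())
        os.replace(tmp, self.path)
```

Retention here is measured with wall-clock epoch seconds because a monotonic clock restarts at boot and so cannot be compared across a reboot.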
3. Acknowledgement Contract Behavior
- Emit accepted immediately after successful validation and dedupe pass.
- Emit execution_started immediately before invoking the command action.
- Emit completed only when local success condition is confirmed.
- Emit failed with structured error_code on validation or execution failure.
- If MQTT is temporarily unavailable, retry ack publish with bounded backoff until command expiry.
- Ack payload fields are strict: command_id, status, error_code, error_message (no additional fields).
- For status failed, error_code and error_message must be non-null, non-empty strings.
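The strict ack shape described above could be enforced at the point where the payload is built. The following is a minimal sketch with a hypothetical `build_ack` helper; clearing the error fields for non-failed statuses is an assumption based on the frozen example, which shows them as null.

```python
import json

ACK_STATUSES = frozenset({"accepted", "execution_started", "completed", "failed"})


def build_ack(command_id: str, status: str,
              error_code=None, error_message=None) -> str:
    """Serialize an ack with exactly the four frozen fields."""
    if status not in ACK_STATUSES:
        raise ValueError(f"unknown ack status: {status!r}")
    if status == "failed":
        for name, value in (("error_code", error_code),
                            ("error_message", error_message)):
            if not isinstance(value, str) or not value:
                raise ValueError(f"{name} must be a non-empty string for failed acks")
    else:
        # Assumption: non-failed acks always carry null error fields.
        error_code = error_message = None
    return json.dumps({
        "command_id": command_id,
        "status": status,
        "error_code": error_code,
        "error_message": error_message,
    })
```

Building the dict from a fixed literal keeps stray fields out of the payload, which is what lets the server correlate acks without tolerant parsing.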
4. Execution Security Model
- Execute via systemd-managed privileged helper.
- Allow only whitelisted operations:
  - reboot_host
  - shutdown_host
- Do not execute restart_service in v1.
- Disallow arbitrary shell commands and untrusted arguments.
- Enforce per-command execution timeout and terminate hung child processes.
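The execution rules above might look like this in the client. It is a sketch under assumptions: the helper is assumed to take the action name as its only argument, the 15-second timeout is a placeholder, and `run_action` is a hypothetical name. The helper path is the one given in the client decision defaults.

```python
import subprocess

HELPER = ("sudo", "/usr/local/bin/infoscreen-cmd-helper.sh")
ALLOWED_ACTIONS = frozenset({"reboot_host", "shutdown_host"})


def run_action(action: str, timeout_s: float = 15.0, helper=HELPER):
    """Run a whitelisted action; return None on success or (error_code, message)."""
    if action not in ALLOWED_ACTIONS:
        # Validation should have rejected this already; refuse defensively.
        raise ValueError(f"action not whitelisted: {action!r}")
    try:
        # Argument list, no shell: the action is never interpreted as shell text.
        proc = subprocess.run([*helper, action], timeout=timeout_s,
                              capture_output=True, text=True, check=False)
    except subprocess.TimeoutExpired:
        # subprocess.run() kills and reaps the child before raising.
        return ("execution_timeout", f"helper exceeded {timeout_s}s")
    except OSError as exc:
        return ("execution_failed", str(exc))
    if proc.returncode != 0:
        message = proc.stderr.strip() or f"exit code {proc.returncode}"
        return ("execution_failed", message)
    return None
```

Note that the timeout only kills the direct child (here sudo); whether processes started by the helper are also terminated depends on how the helper is written, so that should be checked on the target image.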
5. Reboot Recovery Continuity
- For reboot_host action:
  - send execution_started
  - trigger reboot promptly
- During startup:
  - emit heartbeat early
  - emit process-health once service is ready
- Keep last command execution state available after reboot for reconciliation.
6. Time And Timeout Semantics
- Use monotonic timers for local elapsed-time checks.
- Use UTC wall-clock only for protocol timestamps and expiry comparisons.
- Target reconnect baseline on Pi 5 USB-SATA SSD: 90 seconds.
- Accept cold-boot and USB enumeration ceiling up to 150 seconds.
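The split between monotonic and wall-clock time can be illustrated as follows. This is a sketch only; `Deadline` and `utc_timestamp` are hypothetical names, and the timestamp format simply mirrors the issued_at/expires_at examples in the frozen schema.

```python
import time
from datetime import datetime, timezone


class Deadline:
    """Local elapsed-time check based on a monotonic clock."""

    def __init__(self, seconds: float, clock=time.monotonic):
        self._clock = clock
        self._end = clock() + seconds

    def remaining(self) -> float:
        return max(0.0, self._end - self._clock())

    def expired(self) -> bool:
        return self._clock() >= self._end


def utc_timestamp(now=None) -> str:
    """Protocol timestamp in UTC, e.g. 2026-04-03T12:48:10Z."""
    now = now or datetime.now(timezone.utc)
    return now.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
```

A `Deadline` is unaffected by NTP steps or manual clock changes, which makes it suitable for execution timeouts and retry pacing. It cannot span a reboot, however, so anything that must be compared across a reboot (such as expires_at) has to use UTC wall-clock time.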
7. Capability Reporting
- Report recovery capability class:
  - software_only
  - managed_poe_available
  - manual_only
- Report watchdog enabled status.
- Report boot-source metadata for diagnostics.
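A capability report limited to the three allowed classes could be assembled as below. The spec does not define the payload layout, so the field names `recovery_capability`, `watchdog_enabled`, and `boot_source`, as well as the function name, are assumptions for illustration.

```python
from enum import Enum


class RecoveryCapability(str, Enum):
    SOFTWARE_ONLY = "software_only"
    MANAGED_POE_AVAILABLE = "managed_poe_available"
    MANUAL_ONLY = "manual_only"


def capability_report(capability: str, watchdog_enabled: bool, boot_source: str) -> dict:
    """Build a capability report; raises ValueError for an unknown class."""
    return {
        # Hypothetical field names; only the class values come from the spec.
        "recovery_capability": RecoveryCapability(capability).value,
        "watchdog_enabled": bool(watchdog_enabled),
        "boot_source": boot_source,
    }
```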
8. Error Codes Minimum Set
- invalid_schema
- missing_field
- stale_command
- duplicate_command
- permission_denied_local
- execution_timeout
- execution_failed
- broker_unavailable
- internal_error
Acceptance Tests (Client Team)
- Invalid schema payload is rejected and failed ack emitted.
- Expired command is rejected and not executed.
- Duplicate command_id is not executed twice.
- reboot_host emits execution_started and reconnects with heartbeat in expected window.
- shutdown_host is accepted and invokes the local privileged helper; non-whitelisted actions are rejected and never reach the helper.
- MQTT outage during ack path retries correctly without duplicate execution.
- Client idempotency cooperates with server-side lockout semantics (no local reboot-rate policy).
- Client connects successfully to an authenticated broker and still receives commands and event topics permitted by ACLs.
Delivery Artifacts
- Client protocol conformance checklist.
- Test evidence for all acceptance tests.
- Runtime logs showing full lifecycle for one shutdown and one reboot scenario.
- Known limitations list per image version.
Definition Of Done
- All acceptance tests pass on Pi 5 USB-SATA SSD test devices.
- No duplicate execution observed under reconnect and retained-delivery edge cases.
- Acknowledgement sequence is complete and machine-parseable for server correlation.
- Reboot recovery continuity works without managed PoE dependencies.