infoscreen-dev/implementation-plans/reboot-implementation-handoff-client-team.md
RobbStarkAustria 0cd0d95612 feat: remote commands, systemd units, process observability, broker auth split
- Command intake (reboot/shutdown) on infoscreen/{uuid}/commands with ack lifecycle
- MQTT_USER/MQTT_PASSWORD_BROKER split from identity vars; configure_mqtt_security() updated
- infoscreen-simclient.service: Type=notify, WatchdogSec=60, Restart=on-failure
- infoscreen-notify-failure@.service + script: retained MQTT alert when systemd gives up (Gap 3)
- _sd_notify() watchdog keepalive in simclient main loop (Gap 1)
- broker_connection block in health payload: reconnect_count, last_disconnect_at (Gap 2)
- COMMAND_MOCK_REBOOT_IMMEDIATE_COMPLETE canary flag with safety guard
- SERVER_TEAM_ACTIONS.md: server-side integration action items
- Docs: README, CHANGELOG, src/README, copilot-instructions updated
- 43 tests passing
2026-04-05 08:36:50 +02:00

Client Team Implementation Spec (Raspberry Pi 5)

Mission

Implement client-side command handling for reliable restart and shutdown with strict validation, idempotency, acknowledgements, and reboot recovery continuity.

Ownership Boundaries

  1. Client team owns command intake, execution, acknowledgement emission, and post-reboot continuity.
  2. Platform team owns command issuance, lifecycle aggregation, and server-side escalation logic.
  3. Client implementation must not assume managed PoE availability.

Required Client Behaviors

Frozen MQTT Topics and Schemas (v1)

  1. Canonical command topic: infoscreen/{client_uuid}/commands.
  2. Canonical ack topic: infoscreen/{client_uuid}/commands/ack.
  3. Transitional compatibility topics during migration:
  • infoscreen/{client_uuid}/command
  • infoscreen/{client_uuid}/command/ack
  4. QoS policy: command QoS 1, ack QoS 1 recommended.
  5. Retain policy: commands and acks are non-retained.
  6. Client migration behavior: subscribe to both command topics and publish to both ack topics during migration.
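
A minimal sketch of how the topic set above could be derived per client. The function names and the `migrating` flag are illustrative assumptions, not part of the frozen contract; only the topic templates, QoS, and retain values come from this spec.

```python
# Topic templates copied from the frozen v1 contract.
CANONICAL_COMMAND = "infoscreen/{client_uuid}/commands"
CANONICAL_ACK = "infoscreen/{client_uuid}/commands/ack"
TRANSITIONAL_COMMAND = "infoscreen/{client_uuid}/command"
TRANSITIONAL_ACK = "infoscreen/{client_uuid}/command/ack"

COMMAND_QOS = 1   # command QoS 1
ACK_QOS = 1       # ack QoS 1 (recommended)
RETAIN = False    # commands and acks are non-retained


def command_topics(client_uuid: str, migrating: bool = True) -> list[str]:
    """Topics to subscribe to; includes the transitional topic while migrating."""
    topics = [CANONICAL_COMMAND.format(client_uuid=client_uuid)]
    if migrating:
        topics.append(TRANSITIONAL_COMMAND.format(client_uuid=client_uuid))
    return topics


def ack_topics(client_uuid: str, migrating: bool = True) -> list[str]:
    """Topics to publish acks to; includes the transitional topic while migrating."""
    topics = [CANONICAL_ACK.format(client_uuid=client_uuid)]
    if migrating:
        topics.append(TRANSITIONAL_ACK.format(client_uuid=client_uuid))
    return topics
```

Once migration ends, passing `migrating=False` drops the transitional topics without touching call sites.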

Frozen command payload schema:

{
	"schema_version": "1.0",
	"command_id": "5d1f8b4b-7e85-44fb-8f38-3f5d5da5e2e4",
	"client_uuid": "9b8d1856-ff34-4864-a726-12de072d0f77",
	"action": "reboot_host",
	"issued_at": "2026-04-03T12:48:10Z",
	"expires_at": "2026-04-03T12:52:10Z",
	"requested_by": 1,
	"reason": "operator_request"
}
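
The payload above could be parsed along the following lines. This is a sketch under stated assumptions: `CommandRejected`, `Command`, and `parse_command` are hypothetical names, and mapping an unsupported `schema_version` or `action` to `invalid_schema` is a choice made here, not something the spec fixes. The machine-validated JSON Schema remains the authority.

```python
import json
from dataclasses import dataclass
from datetime import datetime, timezone

REQUIRED_FIELDS = ("schema_version", "command_id", "client_uuid", "action",
                   "issued_at", "expires_at", "requested_by", "reason")
V1_ACTIONS = ("reboot_host", "shutdown_host")


class CommandRejected(Exception):
    """Carries the error_code / error_message for the failed ack."""
    def __init__(self, error_code: str, error_message: str):
        super().__init__(error_message)
        self.error_code = error_code
        self.error_message = error_message


@dataclass(frozen=True)
class Command:
    command_id: str
    client_uuid: str
    action: str
    issued_at: datetime
    expires_at: datetime


def parse_utc(value) -> datetime:
    """Parse an ISO 8601 timestamp such as 2026-04-03T12:48:10Z into aware UTC."""
    if not isinstance(value, str):
        raise ValueError("timestamp must be a string")
    parsed = datetime.fromisoformat(value.replace("Z", "+00:00"))
    if parsed.tzinfo is None:
        raise ValueError("timestamp has no UTC offset")
    return parsed.astimezone(timezone.utc)


def parse_command(raw) -> Command:
    try:
        data = json.loads(raw)
    except (ValueError, TypeError) as exc:
        raise CommandRejected("invalid_schema", f"not valid JSON: {exc}")
    if not isinstance(data, dict):
        raise CommandRejected("invalid_schema", "payload must be a JSON object")
    missing = [name for name in REQUIRED_FIELDS if name not in data]
    if missing:
        raise CommandRejected("missing_field", "missing: " + ", ".join(missing))
    # Assumption: anything other than schema 1.0 or a v1 action is invalid_schema.
    if data["schema_version"] != "1.0":
        raise CommandRejected("invalid_schema", "unsupported schema_version")
    if data["action"] not in V1_ACTIONS:
        raise CommandRejected("invalid_schema", f"unsupported action: {data['action']!r}")
    if not isinstance(data["command_id"], str) or not data["command_id"]:
        raise CommandRejected("invalid_schema", "command_id must be a non-empty string")
    try:
        issued_at = parse_utc(data["issued_at"])
        expires_at = parse_utc(data["expires_at"])
    except ValueError as exc:
        raise CommandRejected("invalid_schema", f"bad timestamp: {exc}")
    return Command(data["command_id"], data["client_uuid"], data["action"],
                   issued_at, expires_at)
```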

Frozen ack payload schema:

{
	"command_id": "5d1f8b4b-7e85-44fb-8f38-3f5d5da5e2e4",
	"status": "execution_started",
	"error_code": null,
	"error_message": null
}

Allowed ack status values:

  1. accepted
  2. execution_started
  3. completed
  4. failed

Frozen command action values for v1:

  1. reboot_host
  2. shutdown_host

Reserved but not emitted by server in v1:

  1. restart_service

Client Decision Defaults (v1)

  1. Privileged helper invocation: sudoers + local helper script (sudo /usr/local/bin/infoscreen-cmd-helper.sh).
  2. Dedupe retention: keep processed command IDs for 24 hours and cap store size to 5000 newest entries.
  3. Ack retry schedule while broker unavailable: 0.5s, 1s, 2s, 4s, then 5s cap until expires_at.
  4. Boot-loop handling: server remains authority for safety lockout; client enforces idempotency by command_id and reports local execution outcomes.

MQTT Auth Hardening (Current Priority)

  1. Client must support authenticated MQTT connections for both command and event intake.
  2. Client must remain compatible with broker ACLs that restrict publish/subscribe rights per topic.
  3. Client should support TLS broker connections from environment configuration when certificates are provided.
  4. URL/domain allowlisting for web and webuntis events is explicitly deferred and tracked separately in TODO.md.
  5. Client credentials are loaded from the local /.env, not from tracked docs or templates.
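
As a hedged sketch of environment-driven broker configuration: `MQTT_USER` and `MQTT_PASSWORD_BROKER` are the variable names mentioned in the commit summary, while `MQTT_BROKER_HOST`, `MQTT_BROKER_PORT`, `MQTT_TLS_CA_CERT`, and the `BrokerSettings` type are hypothetical names introduced here. The actual `configure_mqtt_security()` may be organised differently.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class BrokerSettings:
    host: str
    port: int
    username: Optional[str]
    password: Optional[str]
    ca_cert: Optional[str]

    @property
    def use_tls(self) -> bool:
        # TLS is enabled only when a CA certificate path is provided.
        return self.ca_cert is not None


def load_broker_settings(env: Mapping[str, str] = os.environ) -> BrokerSettings:
    """Build broker settings from environment variables (e.g. loaded from .env)."""
    username = env.get("MQTT_USER") or None
    password = env.get("MQTT_PASSWORD_BROKER") or None
    if (username is None) != (password is None):
        raise ValueError("MQTT_USER and MQTT_PASSWORD_BROKER must be set together")
    ca_cert = env.get("MQTT_TLS_CA_CERT") or None      # hypothetical variable name
    default_port = 8883 if ca_cert else 1883            # standard MQTT TLS / plain ports
    port = int(env.get("MQTT_BROKER_PORT", default_port))  # hypothetical variable name
    host = env.get("MQTT_BROKER_HOST", "localhost")     # hypothetical variable name
    return BrokerSettings(host, port, username, password, ca_cert)
```

With paho-mqtt, for example, these values would typically feed `username_pw_set()` and `tls_set()` before connecting.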

Server-side prerequisites for this client work:

  1. Broker credentials must be provisioned for clients.
  2. Broker ACLs must allow each client to subscribe only to its own command topics and assigned event topics.
  3. Broker ACLs must allow each client to publish only its own ack, heartbeat, health, dashboard, and telemetry topics.
  4. Server-side publishers must move to authenticated broker access before production rollout.

Validation snippets for helper scripts:

  1. Human-readable snippets: implementation-plans/reboot-command-payload-schemas.md
  2. Machine-validated JSON Schema: implementation-plans/reboot-command-payload-schemas.json

1. Command Intake

  1. Subscribe to canonical and transitional command topics with QoS 1.
  2. Parse required fields exactly: schema_version, command_id, client_uuid, action, issued_at, expires_at, requested_by, reason.
  3. Reject invalid payloads with failed acknowledgement including error_code and diagnostic message.
  4. Reject stale commands when current time exceeds expires_at.
  5. Reject already-processed command_id values without re-execution.
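
Steps 4 and 5 could be expressed as a small gate that runs after schema validation. This is a sketch: `intake_error` is a hypothetical name, and checking duplicates before staleness is a choice made here so that a redelivered, already-executed command is reported as `duplicate_command` rather than `stale_command`.

```python
from datetime import datetime, timezone
from typing import Optional, Set


def intake_error(command: dict, processed_ids: Set[str],
                 now: Optional[datetime] = None) -> Optional[str]:
    """Return an error code, or None when the command may proceed.

    Assumes `command` already passed schema validation and that
    command["expires_at"] is a timezone-aware UTC datetime.
    """
    now = now or datetime.now(timezone.utc)
    if command["command_id"] in processed_ids:
        return "duplicate_command"   # never re-execute a known command_id
    if now > command["expires_at"]:
        return "stale_command"       # current time exceeds expires_at
    return None
```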

2. Idempotency And Persistence

  1. Persist processed command_id and execution result on local storage.
  2. Persistence must survive service restart and full OS reboot.
  3. On restart, reload dedupe cache before processing newly delivered commands.
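
One way to meet these persistence requirements is a small JSON file rewritten atomically. The sketch below is an assumption-laden illustration: `DedupeStore`, its file layout, and its method names are hypothetical, while the 24-hour retention and 5000-entry cap are the v1 client defaults. Wall-clock time is used for retention because a monotonic clock does not carry across a reboot.

```python
import json
import os
import tempfile
import time


class DedupeStore:
    """File-backed map of command_id -> {"result", "processed_at"}."""

    def __init__(self, path, retention_s=24 * 3600, max_entries=5000, clock=time.time):
        self.path = path
        self.retention_s = retention_s
        self.max_entries = max_entries
        self.clock = clock
        self._entries = {}
        self._load()  # reload the cache before any new command is processed

    def _load(self):
        try:
            with open(self.path, encoding="utf-8") as fh:
                self._entries = json.load(fh)
        except FileNotFoundError:
            self._entries = {}  # first start: nothing processed yet
        self._prune()

    def _prune(self):
        cutoff = self.clock() - self.retention_s
        fresh = [(cid, e) for cid, e in self._entries.items()
                 if e["processed_at"] >= cutoff]
        # Keep only the newest max_entries records.
        fresh.sort(key=lambda item: item[1]["processed_at"], reverse=True)
        self._entries = dict(fresh[:self.max_entries])

    def _flush(self):
        # Write to a temp file in the same directory, fsync, then rename,
        # so a crash or power loss never leaves a half-written store.
        directory = os.path.dirname(os.path.abspath(self.path))
        fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".dedupe-")
        with os.fdopen(fd, "w", encoding="utf-8") as fh:
            json.dump(self._entries, fh)
            fh.flush()
            os.fsync(fh.fileno())
        os.replace(tmp_path, self.path)

    def seen(self, command_id: str) -> bool:
        return command_id in self._entries

    def result(self, command_id: str):
        entry = self._entries.get(command_id)
        return entry["result"] if entry else None

    def record(self, command_id: str, result: str) -> None:
        self._entries[command_id] = {"result": result, "processed_at": self.clock()}
        self._prune()
        self._flush()
```

A corrupt store file is deliberately not handled here; whether to fail closed or start empty in that case is a policy decision this sketch leaves open.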

3. Acknowledgement Contract Behavior

  1. Emit accepted immediately after successful validation and dedupe pass.
  2. Emit execution_started immediately before invoking the command action.
  3. Emit completed only when local success condition is confirmed.
  4. Emit failed with structured error_code on validation or execution failure.
  5. If MQTT is temporarily unavailable, retry ack publish with bounded backoff until command expiry.
  6. Ack payload fields are strict: command_id, status, error_code, error_message (no additional fields).
  7. For status failed, error_code and error_message must be non-null, non-empty strings.
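
Items 6 and 7 can be enforced at the point where the ack is serialized. In this sketch `build_ack` is a hypothetical helper; forcing `error_code` and `error_message` to null for non-failed statuses is an assumption consistent with the frozen example payload, not an explicit rule of the contract.

```python
import json

ALLOWED_STATUSES = ("accepted", "execution_started", "completed", "failed")


def build_ack(command_id: str, status: str,
              error_code=None, error_message=None) -> str:
    """Serialize an ack with exactly the four contract fields."""
    if status not in ALLOWED_STATUSES:
        raise ValueError(f"unknown ack status: {status!r}")
    if status == "failed":
        for name, value in (("error_code", error_code),
                            ("error_message", error_message)):
            if not isinstance(value, str) or not value.strip():
                raise ValueError(f"{name} must be a non-empty string for failed acks")
    else:
        # Assumption: error fields are null unless the status is failed.
        error_code = None
        error_message = None
    return json.dumps({
        "command_id": command_id,
        "status": status,
        "error_code": error_code,
        "error_message": error_message,
    })
```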

4. Execution Security Model

  1. Execute via systemd-managed privileged helper.
  2. Allow only whitelisted operations:
  • reboot_host
  • shutdown_host
  3. Do not execute restart_service in v1.
  4. Disallow arbitrary shell commands and untrusted arguments.
  5. Enforce per-command execution timeout and terminate hung child processes.
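
A sketch of whitelisted execution with a timeout, assuming the sudoers helper named in the client defaults. The helper subcommands (`reboot`, `shutdown`), the 30-second timeout, and the function names are hypothetical. No shell is involved: the argument vector is built only from fixed strings, so payload content never reaches the command line.

```python
import subprocess

HELPER = ["sudo", "/usr/local/bin/infoscreen-cmd-helper.sh"]

# Hypothetical helper subcommands; restart_service is intentionally absent in v1.
HELPER_ARGS = {
    "reboot_host": "reboot",
    "shutdown_host": "shutdown",
}


def build_argv(action: str) -> list:
    """Map a whitelisted action to a fixed argv; reject everything else."""
    try:
        subcommand = HELPER_ARGS[action]
    except KeyError:
        raise ValueError(f"action not whitelisted: {action!r}") from None
    return HELPER + [subcommand]


def run_action(action: str, timeout_s: float = 30.0, runner=subprocess.run):
    """Run the helper. Returns None on success or an error code string."""
    argv = build_argv(action)
    try:
        # subprocess.run kills the child and waits for it when the timeout expires.
        completed = runner(argv, timeout=timeout_s, check=False)
    except subprocess.TimeoutExpired:
        return "execution_timeout"
    except OSError:
        return "execution_failed"   # e.g. helper or sudo binary missing
    return None if completed.returncode == 0 else "execution_failed"
```

The `runner` parameter exists so tests can substitute a fake instead of actually rebooting the device.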

5. Reboot Recovery Continuity

  1. For reboot_host action:
  • send execution_started
  • trigger reboot promptly
  2. During startup:
  • emit heartbeat early
  • emit process-health once service is ready
  3. Keep last command execution state available after reboot for reconciliation.

6. Time And Timeout Semantics

  1. Use monotonic timers for local elapsed-time checks.
  2. Use UTC wall-clock only for protocol timestamps and expiry comparisons.
  3. Target reconnect baseline on Pi 5 USB-SATA SSD: 90 seconds.
  4. Accept cold-boot and USB enumeration ceiling up to 150 seconds.
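
The split between the two clocks in items 1 and 2 might look like this in practice. `ElapsedTimer` and `utc_timestamp` are hypothetical helpers: the first measures local durations on the monotonic clock, which is immune to NTP steps, and the second formats protocol timestamps in the same UTC style as the frozen payload example.

```python
import time
from datetime import datetime, timezone


class ElapsedTimer:
    """Local elapsed-time measurement based on a monotonic clock."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._start = clock()

    def elapsed(self) -> float:
        return self._clock() - self._start

    def exceeded(self, limit_s: float) -> bool:
        return self.elapsed() >= limit_s


def utc_timestamp(now=None) -> str:
    """Format a protocol timestamp such as 2026-04-03T12:48:10Z."""
    now = now or datetime.now(timezone.utc)
    return now.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
```

Note that a monotonic clock restarts at boot, so it cannot measure the reboot-to-reconnect window from inside the client; that window has to be observed externally or derived from wall-clock timestamps.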

7. Capability Reporting

  1. Report recovery capability class:
  • software_only
  • managed_poe_available
  • manual_only
  2. Report watchdog enabled status.
  3. Report boot-source metadata for diagnostics.
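
A capability report could be modelled as below. Only the three capability class values come from this spec; the payload keys (`recovery_capability`, `watchdog_enabled`, `boot_source`) and the function name are assumptions, since the spec does not fix where or in what shape this data is published.

```python
from enum import Enum


class RecoveryCapability(str, Enum):
    SOFTWARE_ONLY = "software_only"
    MANAGED_POE_AVAILABLE = "managed_poe_available"
    MANUAL_ONLY = "manual_only"


def capability_report(capability, watchdog_enabled: bool, boot_source: str) -> dict:
    """Build a JSON-serializable capability block (hypothetical field names)."""
    return {
        # RecoveryCapability(...) raises ValueError for unknown class names.
        "recovery_capability": RecoveryCapability(capability).value,
        "watchdog_enabled": bool(watchdog_enabled),
        "boot_source": boot_source,
    }
```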

8. Error Codes Minimum Set

  1. invalid_schema
  2. missing_field
  3. stale_command
  4. duplicate_command
  5. permission_denied_local
  6. execution_timeout
  7. execution_failed
  8. broker_unavailable
  9. internal_error

Acceptance Tests (Client Team)

  1. Invalid schema payload is rejected and failed ack emitted.
  2. Expired command is rejected and not executed.
  3. Duplicate command_id is not executed twice.
  4. reboot_host emits execution_started and reconnects with heartbeat in expected window.
  5. shutdown_host is accepted and invokes the local privileged helper; non-whitelisted actions are rejected and never reach the helper.
  6. MQTT outage during ack path retries correctly without duplicate execution.
  7. Client enforces idempotency by command_id only and applies no local reboot-rate policy; safety lockout remains a server-side decision.
  8. Client connects successfully to an authenticated broker and still receives commands and event topics permitted by ACLs.

Delivery Artifacts

  1. Client protocol conformance checklist.
  2. Test evidence for all acceptance tests.
  3. Runtime logs showing full lifecycle for one shutdown and one reboot scenario.
  4. Known limitations list per image version.

Definition Of Done

  1. All acceptance tests pass on Pi 5 USB-SATA SSD test devices.
  2. No duplicate execution observed under reconnect and retained-delivery edge cases.
  3. Acknowledgement sequence is complete and machine-parseable for server correlation.
  4. Reboot recovery continuity works without managed PoE dependencies.