## Remote Reboot Reliability Handoff (Share Document)

### Purpose

This document defines the agreed implementation scope for reliable remote reboot and shutdown of Raspberry Pi 5 clients, with monitoring-first visibility and safe escalation paths.

### Scope

1. In scope: restart and shutdown command reliability.
2. In scope: full lifecycle monitoring and audit visibility.
3. In scope: capability-tier recovery model with optional managed PoE escalation.
4. Out of scope: broader maintenance module in client-management for this cycle.
5. Out of scope: mandatory dependency on customer-managed power switching.

### Agreed Operating Model

1. Command delivery is asynchronous and lifecycle-tracked, not fire-and-forget.
2. Commands use idempotent command_id semantics with stale-command rejection by expires_at.
3. Monitoring is authoritative for operational state and escalation decisions.
4. Recovery must function even when no managed power switching is available.

### Frozen Contract v1 (Effective Immediately)

1. Canonical command topic: infoscreen/{client_uuid}/commands.
2. Canonical ack topic: infoscreen/{client_uuid}/commands/ack.
3. Transitional compatibility topics accepted during migration:
   - infoscreen/{client_uuid}/command
   - infoscreen/{client_uuid}/command/ack
4. QoS policy: command QoS 1, ack QoS 1 recommended.
5. Retain policy: commands and acks are non-retained.

Command payload schema (frozen):

```json
{
  "schema_version": "1.0",
  "command_id": "5d1f8b4b-7e85-44fb-8f38-3f5d5da5e2e4",
  "client_uuid": "9b8d1856-ff34-4864-a726-12de072d0f77",
  "action": "reboot_host",
  "issued_at": "2026-04-03T12:48:10Z",
  "expires_at": "2026-04-03T12:52:10Z",
  "requested_by": 1,
  "reason": "operator_request"
}
```

Ack payload schema (frozen):

```json
{
  "command_id": "5d1f8b4b-7e85-44fb-8f38-3f5d5da5e2e4",
  "status": "execution_started",
  "error_code": null,
  "error_message": null
}
```

Allowed ack status values:

1. accepted
2. execution_started
3. completed
4. failed

Frozen command action values:

1. reboot_host
2. shutdown_host

API endpoint mapping:

1. POST /api/clients/{uuid}/restart -> action reboot_host
2. POST /api/clients/{uuid}/shutdown -> action shutdown_host

Validation snippets:

1. Human-readable snippets: implementation-plans/reboot-command-payload-schemas.md
2. Machine-validated JSON Schema: implementation-plans/reboot-command-payload-schemas.json

### Command Lifecycle States

1. queued
2. publish_in_progress
3. published
4. ack_received
5. execution_started
6. awaiting_reconnect
7. recovered
8. completed
9. failed
10. expired
11. timed_out
12. canceled
13. blocked_safety
14. manual_intervention_required

### Timeout Defaults (Pi 5, USB-SATA SSD baseline)

1. queued to publish_in_progress: immediate, timeout 5 seconds.
2. publish_in_progress to published: timeout 8 seconds.
3. published to ack_received: timeout 20 seconds.
4. ack_received to execution_started: 15 seconds for service restart, 25 seconds for host reboot.
5. execution_started to awaiting_reconnect: timeout 10 seconds.
6. awaiting_reconnect to recovered: baseline 90 seconds after validation, cold-boot ceiling 150 seconds.
7. recovered to completed: 15 to 20 seconds based on fleet stability.
8. command expires_at default: 240 seconds, bounded 180 to 360 seconds.

### Recovery Tiers

1. Tier 0 baseline, always required: watchdog, systemd auto-restart, lifecycle tracking, manual intervention fallback.
2. Tier 1 optional: managed PoE per-port power-cycle escalation where customer infrastructure supports it.
3. Tier 2 no remote power control: direct manual intervention workflow.

### Governance And Safety

1. Role access: admin and superadmin.
2. Bulk actions require reason capture.
3. Safety lockout: maximum 3 reboot commands per client in 15 minutes.
4. Escalation cooldown: 60 seconds before automatic move to manual_intervention_required.

### MQTT Auth Hardening (Phase 1, Required Before Broad Rollout)

1. Intranet-only deployment is not sufficient protection for privileged MQTT actions by itself.
2. Phase 1 hardening scope is broker authentication, authorization, and network restriction; payload URL allowlisting is deferred to a later client/server feature.
3. MQTT broker must disable anonymous publish/subscribe access in production.
4. MQTT broker must require authenticated identities for server-side publishers and client devices.
5. MQTT broker must enforce ACLs so that:
   - only server-side services can publish to `infoscreen/{client_uuid}/commands`
   - only server-side services can publish scheduler event topics
   - each client can subscribe only to its own command topics and assigned event topics
   - each client can publish only its own ack, heartbeat, health, dashboard, and telemetry topics
6. Broker port exposure must be restricted to the management network and approved hosts only.
7. TLS support is strongly recommended in this phase and should be enabled when operationally feasible.

### Server Team Actions For Auth Hardening

1. Provision broker credentials for command/event publishers and for client devices.
2. Configure Mosquitto or equivalent broker ACLs for per-topic publish and subscribe restrictions.
3. Disable anonymous access on production brokers.
4. Restrict broker network exposure with firewall rules, VLAN policy, or equivalent network controls.
5. Update server/frontend deployment to publish MQTT with authenticated credentials.
6. Validate that server-side event publishing and reboot/shutdown command publishing still work under the new ACL policy.
7. Coordinate credential distribution and rotation with the client deployment process.

### MQTT ACL Matrix (Canonical Baseline)

| Actor | Topic Pattern | Publish | Subscribe | Notes |
| --- | --- | --- | --- | --- |
| scheduler-service | infoscreen/events/+ | Yes | No | Publishes retained active event list per group. |
| api-command-publisher | infoscreen/+/commands | Yes | No | Publishes canonical reboot/shutdown commands. |
| api-command-publisher | infoscreen/+/command | Yes | No | Transitional compatibility publish only. |
| api-group-assignment | infoscreen/+/group_id | Yes | No | Publishes retained client-to-group assignment. |
| listener-service | infoscreen/+/commands/ack | No | Yes | Consumes canonical client command acknowledgements. |
| listener-service | infoscreen/+/command/ack | No | Yes | Consumes transitional compatibility acknowledgements. |
| listener-service | infoscreen/+/heartbeat | No | Yes | Consumes heartbeat telemetry. |
| listener-service | infoscreen/+/health | No | Yes | Consumes health telemetry. |
| listener-service | infoscreen/+/dashboard | No | Yes | Consumes dashboard screenshot payloads. |
| listener-service | infoscreen/+/screenshot | No | Yes | Consumes screenshot payloads (if enabled). |
| listener-service | infoscreen/+/logs/error | No | Yes | Consumes client error logs. |
| listener-service | infoscreen/+/logs/warn | No | Yes | Consumes client warn logs. |
| listener-service | infoscreen/+/logs/info | No | Yes | Consumes client info logs. |
| listener-service | infoscreen/discovery | No | Yes | Consumes discovery announcements. |
| listener-service | infoscreen/+/discovery_ack | Yes | No | Publishes discovery acknowledgements. |
| client- | infoscreen//commands | No | Yes | Canonical command intake for this client only. |
| client- | infoscreen//command | No | Yes | Transitional compatibility intake for this client only. |
| client- | infoscreen/events/ | No | Yes | Assigned group event feed only; dynamic per assignment. |
| client- | infoscreen//commands/ack | Yes | No | Canonical command acknowledgements for this client only. |
| client- | infoscreen//command/ack | Yes | No | Transitional compatibility acknowledgements for this client only. |
| client- | infoscreen//heartbeat | Yes | No | Heartbeat telemetry. |
| client- | infoscreen//health | Yes | No | Health telemetry. |
| client- | infoscreen//dashboard | Yes | No | Dashboard status and screenshot payloads. |
| client- | infoscreen//screenshot | Yes | No | Screenshot payloads (if enabled). |
| client- | infoscreen//logs/error | Yes | No | Error log stream. |
| client- | infoscreen//logs/warn | Yes | No | Warning log stream. |
| client- | infoscreen//logs/info | Yes | No | Info log stream. |
| client- | infoscreen/discovery | Yes | No | Discovery announcement. |
| client- | infoscreen//discovery_ack | No | Yes | Discovery acknowledgment from listener. |

ACL implementation notes:

1. Use per-client identities; client ACLs must be scoped to exact client UUID and must not allow wildcard access to other clients.
2. Event topic subscription (`infoscreen/events/`) should be managed via broker-side ACL provisioning that updates when group assignment changes.
3. Transitional singular command topics are temporary and should be removed after migration cutover.
4. Deny by default: any topic not explicitly listed above should be blocked for each actor.

### Credential Management Guidance

1. Real MQTT passwords must not be stored in tracked documentation or committed templates.
2. Each client device should receive a unique broker username and password, stored only in its local [/.env](.env).
3. Server-side publisher credentials should be stored in the server team's secret-management path, not in source control.
4. Recommended naming convention for client broker users: `infoscreen-client-`.
5. Client passwords should be random, at least 20 characters, and rotated through deployment tooling or broker administration procedures.
6. The server/infrastructure team owns broker-side user creation, ACL assignment, rotation, and revocation.
7. The client team owns loading credentials from local env files and validating connection behavior against the secured broker.

### Client Team Actions For Auth Hardening

1. Add MQTT username/password support in the client connection setup.
2. Add client-side TLS configuration support from environment when certificates are provided.
3. Update local test helpers to support authenticated MQTT publishing and subscription.
4. Validate command and event intake against the authenticated broker configuration before canary rollout.

### Ready For Server/Frontend Team (Auth Phase)

1. Client implementation is ready to connect with MQTT auth from local `.env` (`MQTT_USERNAME`, `MQTT_PASSWORD`, optional TLS settings).
2. Client command/event intake and client ack/telemetry publishing run over the authenticated MQTT session.
3. Server/frontend team must now complete broker-side enforcement and publisher migration.

Server/frontend done criteria:

1. Anonymous broker access is disabled in production.
2. Server-side publishers use authenticated broker credentials.
3. ACLs are active and validated for command, event, and client telemetry topics.
4. At least one canary client proves end-to-end flow under ACLs:
   - server publishes command/event with authenticated publisher
   - client receives payload
   - client sends ack/telemetry successfully
5. Revocation test passes: disabling one client credential blocks only that client without impacting others.

Operational note:

1. Client-side auth support is necessary but not sufficient by itself; broker ACL/auth enforcement is the security control that must be enabled by the server/infrastructure team.

### Rollout Plan

1. Contract freeze and sign-off.
2. Platform and client implementation against frozen schemas.
3. One-group canary.
4. Rollback if failed plus timed_out exceeds 5 percent.
5. Expand only after 7 days below intervention threshold.

### Success Criteria

1. Deterministic command lifecycle visibility from enqueue to completion.
2. No duplicate execution under reconnect or delayed-delivery conditions.
3. Stable Pi 5 SSD reconnect behavior within defined baseline.
4. Clear and actionable manual intervention states when automatic recovery is exhausted.
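
The idempotent command_id semantics and stale-command rejection by expires_at from the Agreed Operating Model could look roughly like the following client-side sketch. It is an illustration under stated assumptions, not the agreed implementation: the `CommandIntake` class, the `parse_utc` helper, and the `error_code` strings (`expired`, `unsupported_action`) are hypothetical and are not defined by the frozen contract.

```python
from datetime import datetime, timezone

# Action values taken from the frozen contract.
ALLOWED_ACTIONS = {"reboot_host", "shutdown_host"}


def parse_utc(ts: str) -> datetime:
    """Parse a contract-style timestamp such as 2026-04-03T12:52:10Z as aware UTC."""
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


class CommandIntake:
    """Hypothetical gate: act on each command_id at most once, never after expires_at."""

    def __init__(self) -> None:
        # In-memory only for illustration. A real client would have to persist
        # handled IDs across the reboot itself (assumption, not in the contract).
        self._handled: dict[str, dict] = {}

    @staticmethod
    def _ack(command_id: str, status: str, code=None, message=None) -> dict:
        # Field names follow the frozen ack payload schema.
        return {
            "command_id": command_id,
            "status": status,
            "error_code": code,
            "error_message": message,
        }

    def handle(self, command: dict, now: datetime) -> dict:
        command_id = command["command_id"]
        if command_id in self._handled:
            # Duplicate delivery (QoS 1 redelivery, reconnect): repeat the earlier ack.
            return self._handled[command_id]
        if command.get("action") not in ALLOWED_ACTIONS:
            ack = self._ack(command_id, "failed", "unsupported_action", "unknown action")
        elif now >= parse_utc(command["expires_at"]):
            ack = self._ack(command_id, "failed", "expired", "command arrived after expires_at")
        else:
            ack = self._ack(command_id, "accepted")
        self._handled[command_id] = ack
        return ack
```

Returning the stored ack for a repeated command_id means a redelivered message cannot trigger a second reboot, which is the behavior the second success criterion asks for.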
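
On the server side, the API endpoint mapping and the expires_at default (240 seconds, bounded 180 to 360 seconds) might be combined as in the sketch below. The function names `build_command` and `command_topic` are hypothetical, and clamping an out-of-range TTL (rather than rejecting the request) is an assumption; the contract only states the bounds.

```python
import uuid
from datetime import datetime, timedelta, timezone

# Values from the Timeout Defaults section.
DEFAULT_TTL_S, MIN_TTL_S, MAX_TTL_S = 240, 180, 360

# Last path segment of the API endpoint -> frozen action value.
ENDPOINT_ACTIONS = {"restart": "reboot_host", "shutdown": "shutdown_host"}

_TS_FORMAT = "%Y-%m-%dT%H:%M:%SZ"


def command_topic(client_uuid: str) -> str:
    """Canonical command topic for one client."""
    return f"infoscreen/{client_uuid}/commands"


def build_command(client_uuid: str, endpoint: str, requested_by: int, reason: str,
                  now: datetime, ttl_seconds: int = DEFAULT_TTL_S) -> dict:
    """Build a command payload shaped like the frozen schema (hypothetical helper)."""
    action = ENDPOINT_ACTIONS[endpoint]  # KeyError for endpoints outside the mapping
    # Assumption: out-of-range TTLs are clamped into the allowed bounds.
    ttl = max(MIN_TTL_S, min(MAX_TTL_S, ttl_seconds))
    issued = now.astimezone(timezone.utc)  # `now` must be timezone-aware
    return {
        "schema_version": "1.0",
        "command_id": str(uuid.uuid4()),
        "client_uuid": client_uuid,
        "action": action,
        "issued_at": issued.strftime(_TS_FORMAT),
        "expires_at": (issued + timedelta(seconds=ttl)).strftime(_TS_FORMAT),
        "requested_by": requested_by,
        "reason": reason,
    }
```

With the default TTL, a command issued at 12:48:10Z expires at 12:52:10Z, matching the example payload in the frozen contract.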
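
The safety lockout under Governance And Safety (maximum 3 reboot commands per client in 15 minutes) can be read as a sliding-window limit. The sketch below assumes that reading; the `RebootLockout` class is hypothetical, and the document does not say whether the window is sliding or fixed, or whether shutdown commands count toward the limit.

```python
from collections import defaultdict, deque


class RebootLockout:
    """Hypothetical sliding-window limiter for reboot commands, tracked per client."""

    def __init__(self, max_commands: int = 3, window_seconds: float = 15 * 60) -> None:
        self.max_commands = max_commands
        self.window_seconds = window_seconds
        # client_uuid -> timestamps (in seconds) of recently issued commands
        self._issued = defaultdict(deque)

    def try_issue(self, client_uuid: str, now: float) -> bool:
        """Record the command and return True, or return False if the client is locked out."""
        recent = self._issued[client_uuid]
        # Drop timestamps that have aged out of the window.
        while recent and now - recent[0] >= self.window_seconds:
            recent.popleft()
        if len(recent) >= self.max_commands:
            return False  # caller would move the command to blocked_safety
        recent.append(now)
        return True
```

A caller would pass a monotonic or epoch timestamp in seconds as `now` and mark the command blocked_safety when `try_issue` returns False.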