- Command intake (reboot/shutdown) on infoscreen/{uuid}/commands with ack lifecycle
- MQTT_USER/MQTT_PASSWORD_BROKER split from identity vars; configure_mqtt_security() updated
- infoscreen-simclient.service: Type=notify, WatchdogSec=60, Restart=on-failure
- infoscreen-notify-failure@.service + script: retained MQTT alert when systemd gives up (Gap 3)
- _sd_notify() watchdog keepalive in simclient main loop (Gap 1)
- broker_connection block in health payload: reconnect_count, last_disconnect_at (Gap 2)
- COMMAND_MOCK_REBOOT_IMMEDIATE_COMPLETE canary flag with safety guard
- SERVER_TEAM_ACTIONS.md: server-side integration action items
- Docs: README, CHANGELOG, src/README, copilot-instructions updated
- 43 tests passing
## Remote Reboot Reliability Handoff (Share Document)

### Purpose

This document defines the agreed implementation scope for reliable remote reboot and shutdown of Raspberry Pi 5 clients, with monitoring-first visibility and safe escalation paths.

### Scope

1. In scope: restart and shutdown command reliability.
2. In scope: full lifecycle monitoring and audit visibility.
3. In scope: capability-tier recovery model with optional managed PoE escalation.
4. Out of scope: broader maintenance module in client-management for this cycle.
5. Out of scope: mandatory dependency on customer-managed power switching.

### Agreed Operating Model

1. Command delivery is asynchronous and lifecycle-tracked, not fire-and-forget.
2. Commands use idempotent command_id semantics with stale-command rejection by expires_at.
3. Monitoring is authoritative for operational state and escalation decisions.
4. Recovery must function even when no managed power switching is available.

### Frozen Contract v1 (Effective Immediately)

1. Canonical command topic: `infoscreen/{client_uuid}/commands`.
2. Canonical ack topic: `infoscreen/{client_uuid}/commands/ack`.
3. Transitional compatibility topics accepted during migration:
   - `infoscreen/{client_uuid}/command`
   - `infoscreen/{client_uuid}/command/ack`
4. QoS policy: command QoS 1; ack QoS 1 recommended.
5. Retain policy: commands and acks are non-retained.

Command payload schema (frozen):

```json
{
  "schema_version": "1.0",
  "command_id": "5d1f8b4b-7e85-44fb-8f38-3f5d5da5e2e4",
  "client_uuid": "9b8d1856-ff34-4864-a726-12de072d0f77",
  "action": "reboot_host",
  "issued_at": "2026-04-03T12:48:10Z",
  "expires_at": "2026-04-03T12:52:10Z",
  "requested_by": 1,
  "reason": "operator_request"
}
```

Ack payload schema (frozen):

```json
{
  "command_id": "5d1f8b4b-7e85-44fb-8f38-3f5d5da5e2e4",
  "status": "execution_started",
  "error_code": null,
  "error_message": null
}
```

|
|
|
|
Allowed ack status values:
|
|
1. accepted
|
|
2. execution_started
|
|
3. completed
|
|
4. failed
|
|
|
|
Frozen command action values:
|
|
1. reboot_host
|
|
2. shutdown_host
|
|
|
|
API endpoint mapping:
|
|
1. POST /api/clients/{uuid}/restart -> action reboot_host
|
|
2. POST /api/clients/{uuid}/shutdown -> action shutdown_host
|
|
|
|
Validation snippets:
|
|
1. Human-readable snippets: implementation-plans/reboot-command-payload-schemas.md
|
|
2. Machine-validated JSON Schema: implementation-plans/reboot-command-payload-schemas.json
|
|
|
|
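A minimal stdlib intake check mirroring the frozen command schema might look like the sketch below. The authoritative machine-readable schema is the `.json` file above (a validator such as the `jsonschema` package can enforce it directly); the function and constant names here are illustrative:

```python
import uuid
from datetime import datetime

# Required fields and allowed actions, per the frozen contract above.
REQUIRED_FIELDS = {
    "schema_version", "command_id", "client_uuid", "action",
    "issued_at", "expires_at", "requested_by", "reason",
}
ALLOWED_ACTIONS = {"reboot_host", "shutdown_host"}

def validate_command(payload: dict) -> list[str]:
    """Return a list of validation errors; empty means the payload passes."""
    errors = [f"missing field: {f}"
              for f in sorted(REQUIRED_FIELDS - payload.keys())]
    if payload.get("action") not in ALLOWED_ACTIONS:
        errors.append(f"unknown action: {payload.get('action')!r}")
    for field in ("command_id", "client_uuid"):
        try:
            uuid.UUID(str(payload.get(field)))
        except ValueError:
            errors.append(f"{field} is not a valid UUID")
    for field in ("issued_at", "expires_at"):
        try:
            datetime.fromisoformat(str(payload.get(field)).replace("Z", "+00:00"))
        except ValueError:
            errors.append(f"{field} is not an ISO 8601 timestamp")
    return errors
```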
### Command Lifecycle States

1. queued
2. publish_in_progress
3. published
4. ack_received
5. execution_started
6. awaiting_reconnect
7. recovered
8. completed
9. failed
10. expired
11. timed_out
12. canceled
13. blocked_safety
14. manual_intervention_required

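One way to enforce the happy path and its exits is an explicit transition map. The map below is an interpretation of the state list and the timeout table, not a frozen artifact; terminal states have no outgoing transitions:

```python
# Allowed transitions for the command lifecycle (interpretive sketch).
ALLOWED_TRANSITIONS = {
    "queued": {"publish_in_progress", "canceled", "expired", "blocked_safety"},
    "publish_in_progress": {"published", "failed", "timed_out"},
    "published": {"ack_received", "timed_out", "expired"},
    "ack_received": {"execution_started", "failed", "timed_out"},
    "execution_started": {"awaiting_reconnect", "failed", "timed_out"},
    "awaiting_reconnect": {"recovered", "timed_out",
                           "manual_intervention_required"},
    "recovered": {"completed", "failed"},
}

def transition(current: str, target: str) -> str:
    """Move to the target state, rejecting transitions outside the map."""
    if target not in ALLOWED_TRANSITIONS.get(current, set()):
        raise ValueError(f"illegal transition {current} -> {target}")
    return target
```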
### Timeout Defaults (Pi 5, USB-SATA SSD baseline)

1. queued to publish_in_progress: immediate, timeout 5 seconds.
2. publish_in_progress to published: timeout 8 seconds.
3. published to ack_received: timeout 20 seconds.
4. ack_received to execution_started: 15 seconds for service restart, 25 seconds for host reboot.
5. execution_started to awaiting_reconnect: timeout 10 seconds.
6. awaiting_reconnect to recovered: baseline 90 seconds after validation, cold-boot ceiling 150 seconds.
7. recovered to completed: 15 to 20 seconds based on fleet stability.
8. command expires_at default: 240 seconds, bounded 180 to 360 seconds.

### Recovery Tiers

1. Tier 0, baseline, always required: watchdog, systemd auto-restart, lifecycle tracking, manual intervention fallback.
2. Tier 1, optional: managed PoE per-port power-cycle escalation where customer infrastructure supports it.
3. Tier 2, no remote power control: direct manual intervention workflow.

### Governance And Safety

1. Role access: admin and superadmin.
2. Bulk actions require reason capture.
3. Safety lockout: maximum 3 reboot commands per client in 15 minutes.
4. Escalation cooldown: 60 seconds before automatic move to manual_intervention_required.

### MQTT Auth Hardening (Phase 1, Required Before Broad Rollout)

1. Intranet-only deployment is not sufficient protection for privileged MQTT actions by itself.
2. Phase 1 hardening scope is broker authentication, authorization, and network restriction; payload URL allowlisting is deferred to a later client/server feature.
3. The MQTT broker must disable anonymous publish/subscribe access in production.
4. The MQTT broker must require authenticated identities for server-side publishers and client devices.
5. The MQTT broker must enforce ACLs so that:
   - only server-side services can publish to `infoscreen/{client_uuid}/commands`
   - only server-side services can publish scheduler event topics
   - each client can subscribe only to its own command topics and assigned event topics
   - each client can publish only its own ack, heartbeat, health, dashboard, and telemetry topics
6. Broker port exposure must be restricted to the management network and approved hosts only.
7. TLS support is strongly recommended in this phase and should be enabled when operationally feasible.

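For Mosquitto, the rules above could be sketched roughly as below. The usernames, file paths, and the `infoscreen/events/#` pattern are illustrative assumptions (the scheduler event topic names are not specified in this document), and real grants must match the deployed topic tree:

```
# mosquitto.conf (production): disable anonymous access, enable auth
allow_anonymous false
password_file /etc/mosquitto/passwd
acl_file /etc/mosquitto/acl

# /etc/mosquitto/acl (illustrative usernames and topics)

# Server-side publisher: command topics and scheduler events only.
user infoscreen-server
topic write infoscreen/+/commands
topic write infoscreen/events/#

# One block per client user, pinned to that client's UUID.
user infoscreen-client-9b8d1856
topic read infoscreen/9b8d1856-ff34-4864-a726-12de072d0f77/commands
topic write infoscreen/9b8d1856-ff34-4864-a726-12de072d0f77/commands/ack
topic write infoscreen/9b8d1856-ff34-4864-a726-12de072d0f77/health
topic write infoscreen/9b8d1856-ff34-4864-a726-12de072d0f77/heartbeat
```

If client usernames contained the full client UUID, Mosquitto `pattern` rules with the `%u` placeholder could replace the per-client blocks; with the prefix-based naming convention below, explicit per-user blocks (or generated ACL files) are needed.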
### Server Team Actions For Auth Hardening

1. Provision broker credentials for command/event publishers and for client devices.
2. Configure Mosquitto or equivalent broker ACLs for per-topic publish and subscribe restrictions.
3. Disable anonymous access on production brokers.
4. Restrict broker network exposure with firewall rules, VLAN policy, or equivalent network controls.
5. Update server/frontend deployment to publish MQTT with authenticated credentials.
6. Validate that server-side event publishing and reboot/shutdown command publishing still work under the new ACL policy.
7. Coordinate credential distribution and rotation with the client deployment process.

### Credential Management Guidance

1. Real MQTT passwords must not be stored in tracked documentation or committed templates.
2. Each client device should receive a unique broker username and password, stored only in its local `.env` file.
3. Server-side publisher credentials should be stored in the server team's secret-management path, not in source control.
4. Recommended naming convention for client broker users: `infoscreen-client-<client-uuid-prefix>`.
5. Client passwords should be random, at least 20 characters, and rotated through deployment tooling or broker administration procedures.
6. The server/infrastructure team owns broker-side user creation, ACL assignment, rotation, and revocation.
7. The client team owns loading credentials from local env files and validating connection behavior against the secured broker.

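The naming and password rules above can be sketched as a small provisioning helper; the function name is illustrative, and the UUID prefix is taken to mean the first hyphen-delimited group:

```python
import secrets

def make_client_credentials(client_uuid: str) -> tuple[str, str]:
    """Derive a broker username from the client UUID prefix and generate
    a random password of at least 20 characters."""
    username = f"infoscreen-client-{client_uuid.split('-')[0]}"
    password = secrets.token_urlsafe(24)  # 32 URL-safe characters
    return username, password
```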
### Client Team Actions For Auth Hardening

1. Add MQTT username/password support in the client connection setup.
2. Add client-side TLS configuration support from environment when certificates are provided.
3. Update local test helpers to support authenticated MQTT publishing and subscription.
4. Validate command and event intake against the authenticated broker configuration before canary rollout.

### Ready For Server/Frontend Team (Auth Phase)

1. Client implementation is ready to connect with MQTT auth from local `.env` (`MQTT_USERNAME`, `MQTT_PASSWORD`, optional TLS settings).
2. Client command/event intake and client ack/telemetry publishing run over the authenticated MQTT session.
3. Server/frontend team must now complete broker-side enforcement and publisher migration.

Server/frontend done criteria:

1. Anonymous broker access is disabled in production.
2. Server-side publishers use authenticated broker credentials.
3. ACLs are active and validated for command, event, and client telemetry topics.
4. At least one canary client proves the end-to-end flow under ACLs:
   - server publishes command/event with an authenticated publisher
   - client receives the payload
   - client sends ack/telemetry successfully
5. Revocation test passes: disabling one client credential blocks only that client without impacting others.

Operational note: client-side auth support is necessary but not sufficient by itself; broker ACL/auth enforcement is the security control that must be enabled by the server/infrastructure team.

### Rollout Plan

1. Contract freeze and sign-off.
2. Platform and client implementation against frozen schemas.
3. One-group canary.
4. Rollback if failed plus timed_out exceeds 5 percent.
5. Expand only after 7 days below intervention threshold.

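The rollback gate in step 4 is a simple ratio check; the function name and the zero-data behavior (no rollback before any commands are issued) are illustrative choices:

```python
ROLLBACK_THRESHOLD = 0.05  # roll back above 5 percent failed + timed_out

def should_roll_back(total: int, failed: int, timed_out: int) -> bool:
    """Return True when the canary failure rate exceeds the 5% gate."""
    if total == 0:
        return False  # no data yet; do not trigger rollback
    return (failed + timed_out) / total > ROLLBACK_THRESHOLD
```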
### Success Criteria

1. Deterministic command lifecycle visibility from enqueue to completion.
2. No duplicate execution under reconnect or delayed-delivery conditions.
3. Stable Pi 5 SSD reconnect behavior within defined baseline.
4. Clear and actionable manual intervention states when automatic recovery is exhausted.