- Add GET /api/clients/crashed endpoint (process_status=crashed or stale heartbeat)
- Add restart_app command action with same lifecycle + lockout as reboot_host
- Scheduler: crash auto-recovery loop (CRASH_RECOVERY_ENABLED flag, lockout, MQTT publish)
- Scheduler: unconditional command expiry sweep per poll cycle (sweep_expired_commands)
- Listener: subscribe to infoscreen/+/service_failed; persist service_failed_at + unit
- Listener: extract broker_connection block from health payload; persist reconnect_count + last_disconnect_at
- DB migration b1c2d3e4f5a6: service_failed_at, service_failed_unit, mqtt_reconnect_count, mqtt_last_disconnect_at on clients
- Add GET /api/clients/service_failed and POST /api/clients/<uuid>/clear_service_failed
- Monitoring overview API: include mqtt_reconnect_count + mqtt_last_disconnect_at per client
- Frontend: orange service-failed alert panel (hidden when empty, auto-refresh, acknowledge (quittieren) action)
- Frontend: MQTT reconnect count + last disconnect in client detail panel
- MQTT auth hardening: listener/scheduler/server use env credentials; broker enforces allow_anonymous false
- Client command lifecycle foundation: ClientCommand model, reboot_host/shutdown_host, full ACK lifecycle
- Docs: TECH-CHANGELOG, DEV-CHANGELOG, MQTT_EVENT_PAYLOAD_GUIDE, copilot-instructions updated
- Add implementation-plans/, RESTART_VALIDATION_CHECKLIST.md, TODO.md
Remote Reboot Reliability Handoff (Share Document)
Purpose
This document defines the agreed implementation scope for reliable remote reboot and shutdown of Raspberry Pi 5 clients, with monitoring-first visibility and safe escalation paths.
Scope
- In scope: restart and shutdown command reliability.
- In scope: full lifecycle monitoring and audit visibility.
- In scope: capability-tier recovery model with optional managed PoE escalation.
- Out of scope: broader maintenance module in client-management for this cycle.
- Out of scope: mandatory dependency on customer-managed power switching.
Agreed Operating Model
- Command delivery is asynchronous and lifecycle-tracked, not fire-and-forget.
- Commands use idempotent command_id semantics with stale-command rejection by expires_at.
- Monitoring is authoritative for operational state and escalation decisions.
- Recovery must function even when no managed power switching is available.
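The idempotency and stale-command rules above can be sketched as a small client-side gate. This is a minimal illustration, not the agreed client implementation: the `CommandGate` class, its in-memory set, and the `parse_utc` helper are hypothetical names introduced here.

```python
from datetime import datetime, timezone


def parse_utc(ts: str) -> datetime:
    """Parse an ISO 8601 timestamp with a trailing 'Z' into an aware datetime."""
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))


class CommandGate:
    """Hypothetical gate: lets each command_id through at most once."""

    def __init__(self) -> None:
        # Assumption: an in-memory set. A real client would likely need to
        # persist seen IDs, because a reboot clears process memory.
        self._seen = set()

    def should_execute(self, command: dict, now: datetime):
        command_id = command["command_id"]
        if command_id in self._seen:
            return False, "duplicate"
        # Stale-command rejection: nothing runs at or after expires_at.
        if now >= parse_utc(command["expires_at"]):
            return False, "expired"
        self._seen.add(command_id)
        return True, "ok"
```

A redelivered message with the same command_id is rejected as a duplicate, and a command that arrives after its expires_at is rejected without being executed.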
Frozen Contract v1 (Effective Immediately)
- Canonical command topic: infoscreen/{client_uuid}/commands.
- Canonical ack topic: infoscreen/{client_uuid}/commands/ack.
- Transitional compatibility topics accepted during migration:
- infoscreen/{client_uuid}/command
- infoscreen/{client_uuid}/command/ack
- QoS policy: command QoS 1, ack QoS 1 recommended.
- Retain policy: commands and acks are non-retained.
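As a rough sketch of how the topic, QoS, and retain rules above could be captured in one place, the helpers below build the canonical and transitional topic names and the publish arguments. The function names are hypothetical. The keyword names `topic`, `payload`, `qos`, and `retain` are chosen to line up with a typical MQTT client publish call, but no specific client library is assumed.

```python
QOS_COMMAND = 1   # command QoS per the frozen contract
QOS_ACK = 1       # ack QoS (recommended value)
RETAIN = False    # commands and acks are non-retained


def command_topics(client_uuid: str) -> list:
    """Canonical topic first, then the transitional compatibility topic."""
    return [
        f"infoscreen/{client_uuid}/commands",
        f"infoscreen/{client_uuid}/command",
    ]


def ack_topics(client_uuid: str) -> list:
    """Canonical ack topic first, then the transitional compatibility topic."""
    return [
        f"infoscreen/{client_uuid}/commands/ack",
        f"infoscreen/{client_uuid}/command/ack",
    ]


def command_publish_args(client_uuid: str, payload: str) -> dict:
    """Arguments for publishing a command on the canonical topic."""
    return {
        "topic": command_topics(client_uuid)[0],
        "payload": payload,
        "qos": QOS_COMMAND,
        "retain": RETAIN,
    }
```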
Command payload schema (frozen):
{
  "schema_version": "1.0",
  "command_id": "5d1f8b4b-7e85-44fb-8f38-3f5d5da5e2e4",
  "client_uuid": "9b8d1856-ff34-4864-a726-12de072d0f77",
  "action": "reboot_host",
  "issued_at": "2026-04-03T12:48:10Z",
  "expires_at": "2026-04-03T12:52:10Z",
  "requested_by": 1,
  "reason": "operator_request"
}
Ack payload schema (frozen):
{
  "command_id": "5d1f8b4b-7e85-44fb-8f38-3f5d5da5e2e4",
  "status": "execution_started",
  "error_code": null,
  "error_message": null
}
Allowed ack status values:
- accepted
- execution_started
- completed
- failed
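A listener could check incoming acks against the frozen status list roughly as follows. This is a sketch only: `validate_ack` is a hypothetical helper, and treating `command_id` and `status` as the only required fields is an assumption. The authoritative rules are in the JSON Schema file referenced in this document.

```python
import json

ALLOWED_ACK_STATUSES = {"accepted", "execution_started", "completed", "failed"}
# Assumption: only these two fields are mandatory; error_code and
# error_message are treated as optional and nullable.
REQUIRED_ACK_FIELDS = {"command_id", "status"}


def validate_ack(raw: str) -> dict:
    """Parse an ack payload and reject anything outside the frozen contract."""
    ack = json.loads(raw)
    if not isinstance(ack, dict):
        raise ValueError("ack payload must be a JSON object")
    missing = REQUIRED_ACK_FIELDS - ack.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    if ack["status"] not in ALLOWED_ACK_STATUSES:
        raise ValueError(f"unknown ack status: {ack['status']!r}")
    return ack
```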
Frozen command action values:
- reboot_host
- shutdown_host
API endpoint mapping:
- POST /api/clients/{uuid}/restart -> action reboot_host
- POST /api/clients/{uuid}/shutdown -> action shutdown_host
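To illustrate the endpoint mapping, the sketch below turns an endpoint name into a command payload in the frozen schema. `build_command` and `ENDPOINT_ACTIONS` are hypothetical names, and the 240-second default lifetime is taken from the timeout defaults in this document. How the real API layer wires this up is not specified here.

```python
import uuid
from datetime import datetime, timedelta, timezone

# Last path segment of the API endpoint mapped to the frozen action value.
ENDPOINT_ACTIONS = {"restart": "reboot_host", "shutdown": "shutdown_host"}

TS_FORMAT = "%Y-%m-%dT%H:%M:%SZ"


def build_command(endpoint, client_uuid, requested_by, reason,
                  now=None, ttl_seconds=240):
    """Build a command payload dict; raises KeyError for unknown endpoints."""
    action = ENDPOINT_ACTIONS[endpoint]
    issued = now or datetime.now(timezone.utc)
    expires = issued + timedelta(seconds=ttl_seconds)
    return {
        "schema_version": "1.0",
        "command_id": str(uuid.uuid4()),
        "client_uuid": client_uuid,
        "action": action,
        "issued_at": issued.strftime(TS_FORMAT),
        "expires_at": expires.strftime(TS_FORMAT),
        "requested_by": requested_by,
        "reason": reason,
    }
```

With `now` set to 2026-04-03T12:48:10Z and the default lifetime, the resulting `expires_at` is 2026-04-03T12:52:10Z, matching the sample payload above.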
Validation snippets:
- Human-readable snippets: implementation-plans/reboot-command-payload-schemas.md
- Machine-validated JSON Schema: implementation-plans/reboot-command-payload-schemas.json
Command Lifecycle States
- queued
- publish_in_progress
- published
- ack_received
- execution_started
- awaiting_reconnect
- recovered
- completed
- failed
- expired
- timed_out
- canceled
- blocked_safety
- manual_intervention_required
Timeout Defaults (Pi 5, USB-SATA SSD baseline)
- queued to publish_in_progress: immediate, timeout 5 seconds.
- publish_in_progress to published: timeout 8 seconds.
- published to ack_received: timeout 20 seconds.
- ack_received to execution_started: 15 seconds for service restart, 25 seconds for host reboot.
- execution_started to awaiting_reconnect: timeout 10 seconds.
- awaiting_reconnect to recovered: baseline 90 seconds after validation, cold-boot ceiling 150 seconds.
- recovered to completed: 15 to 20 seconds based on fleet stability.
- command expires_at default: 240 seconds, bounded 180 to 360 seconds.
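The defaults above lend themselves to a lookup table plus a clamp for the command lifetime. The following is a minimal sketch. The table layout and function names are hypothetical, and where the document gives a range or two values, the choice made is noted in a comment.

```python
# Per-transition timeouts in seconds, keyed by (from_state, to_state).
TRANSITION_TIMEOUTS = {
    ("queued", "publish_in_progress"): 5,
    ("publish_in_progress", "published"): 8,
    ("published", "ack_received"): 20,
    ("ack_received", "execution_started"): 25,   # host reboot; 15 for service restart
    ("execution_started", "awaiting_reconnect"): 10,
    ("awaiting_reconnect", "recovered"): 90,     # baseline; cold-boot ceiling is 150
    ("recovered", "completed"): 20,              # upper end of the 15 to 20 range
}

EXPIRY_DEFAULT = 240
EXPIRY_MIN = 180
EXPIRY_MAX = 360


def clamp_expiry_seconds(requested=None):
    """Return a command lifetime within the documented 180 to 360 second bounds."""
    if requested is None:
        return EXPIRY_DEFAULT
    return max(EXPIRY_MIN, min(EXPIRY_MAX, requested))


def is_timed_out(state, next_state, entered_at, now):
    """True if the command has waited in `state` longer than the transition allows.

    `entered_at` and `now` are seconds on the same clock, e.g. time.monotonic().
    """
    return (now - entered_at) > TRANSITION_TIMEOUTS[(state, next_state)]
```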
Recovery Tiers
- Tier 0 baseline, always required: watchdog, systemd auto-restart, lifecycle tracking, manual intervention fallback.
- Tier 1 optional: managed PoE per-port power-cycle escalation where customer infrastructure supports it.
- Tier 2 no remote power control: direct manual intervention workflow.
Governance And Safety
- Role access: admin and superadmin.
- Bulk actions require reason capture.
- Safety lockout: maximum 3 reboot commands per client in 15 minutes.
- Escalation cooldown: 60 seconds before automatic move to manual_intervention_required.
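The safety lockout rule can be sketched as a sliding-window counter per client. `RebootLockout` is a hypothetical class, and keeping the history in memory is an assumption made for brevity; the actual enforcement point and storage are not specified in this document.

```python
from collections import defaultdict, deque


class RebootLockout:
    """Allow at most `max_commands` reboot commands per client per window."""

    def __init__(self, max_commands=3, window_seconds=15 * 60):
        self.max_commands = max_commands
        self.window_seconds = window_seconds
        self._history = defaultdict(deque)  # client_uuid -> issue timestamps

    def try_acquire(self, client_uuid, now):
        """Record a command at time `now` (seconds) if the limit allows it."""
        history = self._history[client_uuid]
        # Drop timestamps that have aged out of the window.
        while history and now - history[0] >= self.window_seconds:
            history.popleft()
        if len(history) >= self.max_commands:
            return False  # caller could move the command to blocked_safety
        history.append(now)
        return True
```

The fourth reboot request within 15 minutes is refused, and requests are accepted again once the oldest recorded command falls out of the window.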
MQTT Auth Hardening (Phase 1, Required Before Broad Rollout)
- Intranet-only deployment is not sufficient protection for privileged MQTT actions by itself.
- Phase 1 hardening scope is broker authentication, authorization, and network restriction; payload URL allowlisting is deferred to a later client/server feature.
- MQTT broker must disable anonymous publish/subscribe access in production.
- MQTT broker must require authenticated identities for server-side publishers and client devices.
- MQTT broker must enforce ACLs so that:
- only server-side services can publish to infoscreen/{client_uuid}/commands
- only server-side services can publish scheduler event topics
- each client can subscribe only to its own command topics and assigned event topics
- each client can publish only its own ack, heartbeat, health, dashboard, and telemetry topics
- Broker port exposure must be restricted to the management network and approved hosts only.
- TLS support is strongly recommended in this phase and should be enabled when operationally feasible.
Server Team Actions For Auth Hardening
- Provision broker credentials for command/event publishers and for client devices.
- Configure Mosquitto or equivalent broker ACLs for per-topic publish and subscribe restrictions.
- Disable anonymous access on production brokers.
- Restrict broker network exposure with firewall rules, VLAN policy, or equivalent network controls.
- Update server/frontend deployment to publish MQTT with authenticated credentials.
- Validate that server-side event publishing and reboot/shutdown command publishing still work under the new ACL policy.
- Coordinate credential distribution and rotation with the client deployment process.
MQTT ACL Matrix (Canonical Baseline)
| Actor | Topic Pattern | Publish | Subscribe | Notes |
|---|---|---|---|---|
| scheduler-service | infoscreen/events/+ | Yes | No | Publishes retained active event list per group. |
| api-command-publisher | infoscreen/+/commands | Yes | No | Publishes canonical reboot/shutdown commands. |
| api-command-publisher | infoscreen/+/command | Yes | No | Transitional compatibility publish only. |
| api-group-assignment | infoscreen/+/group_id | Yes | No | Publishes retained client-to-group assignment. |
| listener-service | infoscreen/+/commands/ack | No | Yes | Consumes canonical client command acknowledgements. |
| listener-service | infoscreen/+/command/ack | No | Yes | Consumes transitional compatibility acknowledgements. |
| listener-service | infoscreen/+/heartbeat | No | Yes | Consumes heartbeat telemetry. |
| listener-service | infoscreen/+/health | No | Yes | Consumes health telemetry. |
| listener-service | infoscreen/+/dashboard | No | Yes | Consumes dashboard screenshot payloads. |
| listener-service | infoscreen/+/screenshot | No | Yes | Consumes screenshot payloads (if enabled). |
| listener-service | infoscreen/+/logs/error | No | Yes | Consumes client error logs. |
| listener-service | infoscreen/+/logs/warn | No | Yes | Consumes client warn logs. |
| listener-service | infoscreen/+/logs/info | No | Yes | Consumes client info logs. |
| listener-service | infoscreen/discovery | No | Yes | Consumes discovery announcements. |
| listener-service | infoscreen/+/discovery_ack | Yes | No | Publishes discovery acknowledgements. |
| client-<client_uuid> | infoscreen/<client_uuid>/commands | No | Yes | Canonical command intake for this client only. |
| client-<client_uuid> | infoscreen/<client_uuid>/command | No | Yes | Transitional compatibility intake for this client only. |
| client-<client_uuid> | infoscreen/events/<group_id> | No | Yes | Assigned group event feed only; dynamic per assignment. |
| client-<client_uuid> | infoscreen/<client_uuid>/commands/ack | Yes | No | Canonical command acknowledgements for this client only. |
| client-<client_uuid> | infoscreen/<client_uuid>/command/ack | Yes | No | Transitional compatibility acknowledgements for this client only. |
| client-<client_uuid> | infoscreen/<client_uuid>/heartbeat | Yes | No | Heartbeat telemetry. |
| client-<client_uuid> | infoscreen/<client_uuid>/health | Yes | No | Health telemetry. |
| client-<client_uuid> | infoscreen/<client_uuid>/dashboard | Yes | No | Dashboard status and screenshot payloads. |
| client-<client_uuid> | infoscreen/<client_uuid>/screenshot | Yes | No | Screenshot payloads (if enabled). |
| client-<client_uuid> | infoscreen/<client_uuid>/logs/error | Yes | No | Error log stream. |
| client-<client_uuid> | infoscreen/<client_uuid>/logs/warn | Yes | No | Warning log stream. |
| client-<client_uuid> | infoscreen/<client_uuid>/logs/info | Yes | No | Info log stream. |
| client-<client_uuid> | infoscreen/discovery | Yes | No | Discovery announcement. |
| client-<client_uuid> | infoscreen/<client_uuid>/discovery_ack | No | Yes | Discovery acknowledgement from listener. |
ACL implementation notes:
- Use per-client identities; client ACLs must be scoped to exact client UUID and must not allow wildcard access to other clients.
- Event topic subscription (infoscreen/events/<group_id>) should be managed via broker-side ACL provisioning that updates when group assignment changes.
- Transitional singular command topics are temporary and should be removed after migration cutover.
- Deny by default: any topic not explicitly listed above should be blocked for each actor.
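The per-client scoping and deny-by-default rules can be checked with a small topic matcher, for example in a test that audits generated ACLs. This sketch assumes standard MQTT wildcard semantics (`+` for one level, `#` for the remainder). `client_acl` and `is_allowed` are hypothetical helpers that mirror the client rows of the matrix. They are not the broker's own ACL mechanism.

```python
def topic_matches(pattern: str, topic: str) -> bool:
    """Match an MQTT topic against a filter with '+' and '#' wildcards."""
    p_parts = pattern.split("/")
    t_parts = topic.split("/")
    for i, part in enumerate(p_parts):
        if part == "#":
            return True  # multi-level wildcard covers everything that remains
        if i >= len(t_parts):
            return False
        if part != "+" and part != t_parts[i]:
            return False
    return len(p_parts) == len(t_parts)


def client_acl(client_uuid: str, group_id: str) -> dict:
    """Exact-UUID topic lists for one client; no wildcards across clients."""
    base = f"infoscreen/{client_uuid}"
    return {
        "subscribe": [
            f"{base}/commands", f"{base}/command",
            f"infoscreen/events/{group_id}", f"{base}/discovery_ack",
        ],
        "publish": [
            f"{base}/commands/ack", f"{base}/command/ack",
            f"{base}/heartbeat", f"{base}/health",
            f"{base}/dashboard", f"{base}/screenshot",
            f"{base}/logs/error", f"{base}/logs/warn", f"{base}/logs/info",
            "infoscreen/discovery",
        ],
    }


def is_allowed(acl: dict, action: str, topic: str) -> bool:
    """Deny by default: allowed only if some listed pattern matches."""
    return any(topic_matches(p, topic) for p in acl.get(action, []))
```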
Credential Management Guidance
- Real MQTT passwords must not be stored in tracked documentation or committed templates.
- Each client device should receive a unique broker username and password, stored only in its local /.env.
- Server-side publisher credentials should be stored in the server team's secret-management path, not in source control.
- Recommended naming convention for client broker users: infoscreen-client-<client-uuid-prefix>.
- Client passwords should be random, at least 20 characters, and rotated through deployment tooling or broker administration procedures.
- The server/infrastructure team owns broker-side user creation, ACL assignment, rotation, and revocation.
- The client team owns loading credentials from local env files and validating connection behavior against the secured broker.
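As an illustration of the naming and password guidance, the sketch below derives a broker username and generates a random password with Python's `secrets` module. The function names are hypothetical, and the 8-character UUID prefix and 24-character default length are assumptions. The document only fixes the naming pattern and the 20-character minimum.

```python
import secrets
import string

MIN_PASSWORD_LENGTH = 20


def client_username(client_uuid: str, prefix_len: int = 8) -> str:
    """Build the broker username from a UUID prefix (prefix length is assumed)."""
    return f"infoscreen-client-{client_uuid[:prefix_len]}"


def generate_password(length: int = 24) -> str:
    """Generate a random alphanumeric password of at least 20 characters."""
    if length < MIN_PASSWORD_LENGTH:
        raise ValueError(f"password must be at least {MIN_PASSWORD_LENGTH} characters")
    alphabet = string.ascii_letters + string.digits
    return "".join(secrets.choice(alphabet) for _ in range(length))
```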
Client Team Actions For Auth Hardening
- Add MQTT username/password support in the client connection setup.
- Add client-side TLS configuration support from environment when certificates are provided.
- Update local test helpers to support authenticated MQTT publishing and subscription.
- Validate command and event intake against the authenticated broker configuration before canary rollout.
Ready For Server/Frontend Team (Auth Phase)
- Client implementation is ready to connect with MQTT auth from local .env (MQTT_USERNAME, MQTT_PASSWORD, optional TLS settings).
- Client command/event intake and client ack/telemetry publishing run over the authenticated MQTT session.
- Server/frontend team must now complete broker-side enforcement and publisher migration.
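Loading the credentials from the environment might look roughly like this. `MQTT_USERNAME` and `MQTT_PASSWORD` come from this document; the `MQTT_TLS_CA_CERT` variable name, the `MqttAuthConfig` dataclass, and `load_mqtt_auth` are hypothetical and stand in for whatever the client actually uses.

```python
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class MqttAuthConfig:
    username: str
    password: str
    ca_cert: Optional[str] = None  # path to a CA bundle, if TLS is configured


def load_mqtt_auth(env: Mapping[str, str]) -> MqttAuthConfig:
    """Read MQTT credentials from an environment mapping such as os.environ."""
    username = env.get("MQTT_USERNAME", "")
    password = env.get("MQTT_PASSWORD", "")
    if not username or not password:
        # Fail fast: anonymous connections are rejected by the secured broker.
        raise ValueError("MQTT_USERNAME and MQTT_PASSWORD must both be set")
    # MQTT_TLS_CA_CERT is an assumed variable name for the optional TLS setting.
    return MqttAuthConfig(username, password, env.get("MQTT_TLS_CA_CERT") or None)
```

The resulting values would then be passed to the MQTT client library's authentication and TLS setup before connecting.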
Server/frontend done criteria:
- Anonymous broker access is disabled in production.
- Server-side publishers use authenticated broker credentials.
- ACLs are active and validated for command, event, and client telemetry topics.
- At least one canary client proves end-to-end flow under ACLs:
- server publishes command/event with authenticated publisher
- client receives payload
- client sends ack/telemetry successfully
- Revocation test passes: disabling one client credential blocks only that client without impacting others.
Operational note:
- Client-side auth support is necessary but not sufficient by itself; broker ACL/auth enforcement is the security control that must be enabled by the server/infrastructure team.
Rollout Plan
- Contract freeze and sign-off.
- Platform and client implementation against frozen schemas.
- One-group canary.
- Roll back if the combined share of failed and timed_out commands exceeds 5 percent.
- Expand only after 7 days below intervention threshold.
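The rollback criterion can be expressed as a simple rate check over the final lifecycle states of canary commands. This is a sketch under one assumption: the denominator is every canary command that reached a final state, which the document does not spell out. The function names are hypothetical.

```python
from collections import Counter

ROLLBACK_THRESHOLD = 0.05  # 5 percent


def failure_rate(final_states) -> float:
    """Share of commands that ended in failed or timed_out."""
    counts = Counter(final_states)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return (counts["failed"] + counts["timed_out"]) / total


def should_roll_back(final_states, threshold=ROLLBACK_THRESHOLD) -> bool:
    """True only when the combined rate strictly exceeds the threshold."""
    return failure_rate(final_states) > threshold
```

For example, 3 failed and 3 timed_out commands out of 100 give a 6 percent rate and trigger a rollback, while exactly 5 percent does not.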
Success Criteria
- Deterministic command lifecycle visibility from enqueue to completion.
- No duplicate execution under reconnect or delayed-delivery conditions.
- Stable Pi 5 SSD reconnect behavior within defined baseline.
- Clear and actionable manual intervention states when automatic recovery is exhausted.