infoscreen/implementation-plans/reboot-implementation-handoff-share.md
Olaf 03e3c11e90 feat: crash recovery, service_failed monitoring, broker health fields, command expiry sweep
- Add GET /api/clients/crashed endpoint (process_status=crashed or stale heartbeat)
- Add restart_app command action with same lifecycle + lockout as reboot_host
- Scheduler: crash auto-recovery loop (CRASH_RECOVERY_ENABLED flag, lockout, MQTT publish)
- Scheduler: unconditional command expiry sweep per poll cycle (sweep_expired_commands)
- Listener: subscribe to infoscreen/+/service_failed; persist service_failed_at + unit
- Listener: extract broker_connection block from health payload; persist reconnect_count + last_disconnect_at
- DB migration b1c2d3e4f5a6: service_failed_at, service_failed_unit, mqtt_reconnect_count, mqtt_last_disconnect_at on clients
- Add GET /api/clients/service_failed and POST /api/clients/<uuid>/clear_service_failed
- Monitoring overview API: include mqtt_reconnect_count + mqtt_last_disconnect_at per client
- Frontend: orange service-failed alert panel (hidden when empty, auto-refresh, quittieren action)
- Frontend: MQTT reconnect count + last disconnect in client detail panel
- MQTT auth hardening: listener/scheduler/server use env credentials; broker enforces allow_anonymous false
- Client command lifecycle foundation: ClientCommand model, reboot_host/shutdown_host, full ACK lifecycle
- Docs: TECH-CHANGELOG, DEV-CHANGELOG, MQTT_EVENT_PAYLOAD_GUIDE, copilot-instructions updated
- Add implementation-plans/, RESTART_VALIDATION_CHECKLIST.md, TODO.md
2026-04-05 10:17:56 +00:00

Remote Reboot Reliability Handoff (Share Document)

Purpose

This document defines the agreed implementation scope for reliable remote reboot and shutdown of Raspberry Pi 5 clients, with monitoring-first visibility and safe escalation paths.

Scope

  1. In scope: restart and shutdown command reliability.
  2. In scope: full lifecycle monitoring and audit visibility.
  3. In scope: capability-tier recovery model with optional managed PoE escalation.
  4. Out of scope: broader maintenance module in client-management for this cycle.
  5. Out of scope: mandatory dependency on customer-managed power switching.

Agreed Operating Model

  1. Command delivery is asynchronous and lifecycle-tracked, not fire-and-forget.
  2. Commands use idempotent command_id semantics with stale-command rejection by expires_at.
  3. Monitoring is authoritative for operational state and escalation decisions.
  4. Recovery must function even when no managed power switching is available.
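The idempotency and stale-command rules in point 2 can be illustrated with a small client-side gate. This is a minimal sketch, not the client's actual implementation: the class name `CommandGate`, its return strings, and the in-memory set of seen IDs are all assumptions made for illustration.

```python
from datetime import datetime, timezone


def parse_utc(ts: str) -> datetime:
    """Parse an ISO 8601 timestamp with a trailing 'Z' into an aware datetime."""
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))


class CommandGate:
    """Hypothetical helper: decides whether a received command may run."""

    def __init__(self):
        self._seen = set()  # command_ids already accepted (in-memory only)

    def check(self, payload: dict, now: datetime = None) -> str:
        now = now or datetime.now(timezone.utc)
        command_id = payload["command_id"]
        if command_id in self._seen:
            return "duplicate"   # same command_id delivered again: do not re-run
        if now >= parse_utc(payload["expires_at"]):
            return "expired"     # stale command: reject by expires_at
        self._seen.add(command_id)
        return "execute"
```

Because a reboot clears process memory, a real client would presumably need to persist the seen IDs (for example on disk) so that a redelivered command is still recognized after the host comes back up.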

Frozen Contract v1 (Effective Immediately)

  1. Canonical command topic: infoscreen/{client_uuid}/commands.
  2. Canonical ack topic: infoscreen/{client_uuid}/commands/ack.
  3. Transitional compatibility topics accepted during migration:
  • infoscreen/{client_uuid}/command
  • infoscreen/{client_uuid}/command/ack
  4. QoS policy: command QoS 1, ack QoS 1 recommended.
  5. Retain policy: commands and acks are non-retained.
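The topic and delivery rules above could be captured in a few constants and helpers. The sketch below is illustrative only; the function names `command_topic` and `ack_topic` and the `transitional` flag are hypothetical, while the topic strings, QoS values, and retain setting are taken from the contract.

```python
import uuid

COMMAND_QOS = 1   # contract: commands are published with QoS 1
ACK_QOS = 1       # contract: QoS 1 recommended for acks
RETAIN = False    # contract: commands and acks are non-retained


def command_topic(client_uuid: str, transitional: bool = False) -> str:
    """Return the canonical command topic, or the transitional singular form."""
    # Normalizing through uuid.UUID rejects wildcards and other stray characters.
    normalized = str(uuid.UUID(client_uuid))
    leaf = "command" if transitional else "commands"
    return f"infoscreen/{normalized}/{leaf}"


def ack_topic(client_uuid: str, transitional: bool = False) -> str:
    """Return the ack topic that belongs to the given command topic."""
    return command_topic(client_uuid, transitional) + "/ack"
```

Validating the UUID before building the topic keeps characters such as `+` or `#` out of publish paths.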

Command payload schema (frozen):

{
	"schema_version": "1.0",
	"command_id": "5d1f8b4b-7e85-44fb-8f38-3f5d5da5e2e4",
	"client_uuid": "9b8d1856-ff34-4864-a726-12de072d0f77",
	"action": "reboot_host",
	"issued_at": "2026-04-03T12:48:10Z",
	"expires_at": "2026-04-03T12:52:10Z",
	"requested_by": 1,
	"reason": "operator_request"
}
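A server-side builder for this payload might look like the following. It is a sketch under stated assumptions: the function name `build_command` and its parameters are hypothetical, the field names and allowed actions come from this document, and the 240-second default with its 180 to 360 second bounds is the expires_at rule from the timeout defaults.

```python
import uuid
from datetime import datetime, timedelta, timezone

ALLOWED_ACTIONS = {"reboot_host", "shutdown_host"}
TIMESTAMP_FORMAT = "%Y-%m-%dT%H:%M:%SZ"


def build_command(client_uuid, action, requested_by, reason,
                  ttl_seconds=240, now=None):
    """Build a schema_version 1.0 command payload as a plain dict."""
    if action not in ALLOWED_ACTIONS:
        raise ValueError(f"unsupported action: {action!r}")
    if not 180 <= ttl_seconds <= 360:
        raise ValueError("ttl_seconds must be between 180 and 360")
    # Always express timestamps in UTC so the trailing 'Z' is truthful.
    now = (now or datetime.now(timezone.utc)).astimezone(timezone.utc)
    return {
        "schema_version": "1.0",
        "command_id": str(uuid.uuid4()),
        "client_uuid": client_uuid,
        "action": action,
        "issued_at": now.strftime(TIMESTAMP_FORMAT),
        "expires_at": (now + timedelta(seconds=ttl_seconds)).strftime(TIMESTAMP_FORMAT),
        "requested_by": requested_by,
        "reason": reason,
    }
```

With `now` set to 2026-04-03T12:48:10Z and the default TTL, the builder reproduces the issued_at and expires_at values of the example payload.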

Ack payload schema (frozen):

{
	"command_id": "5d1f8b4b-7e85-44fb-8f38-3f5d5da5e2e4",
	"status": "execution_started",
	"error_code": null,
	"error_message": null
}

Allowed ack status values:

  1. accepted
  2. execution_started
  3. completed
  4. failed
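On the listener side, an incoming ack could be checked against these values before it is persisted. The sketch below is hypothetical: the function name `parse_ack` is invented, and treating command_id and status as required while leaving the two error fields optional is an assumption, since the JSON Schema file itself is not reproduced here.

```python
import json

ALLOWED_ACK_STATUSES = frozenset(
    {"accepted", "execution_started", "completed", "failed"}
)


def parse_ack(raw):
    """Parse an ack message body (str or bytes); raise ValueError if invalid."""
    ack = json.loads(raw)  # JSONDecodeError is a subclass of ValueError
    if not isinstance(ack, dict):
        raise ValueError("ack must be a JSON object")
    command_id = ack.get("command_id")
    status = ack.get("status")
    if not isinstance(command_id, str) or not command_id:
        raise ValueError("ack is missing command_id")
    if not isinstance(status, str) or status not in ALLOWED_ACK_STATUSES:
        raise ValueError(f"unsupported ack status: {status!r}")
    return {
        "command_id": command_id,
        "status": status,
        "error_code": ack.get("error_code"),        # assumed optional
        "error_message": ack.get("error_message"),  # assumed optional
    }
```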

Frozen command action values:

  1. reboot_host
  2. shutdown_host

API endpoint mapping:

  1. POST /api/clients/{uuid}/restart -> action reboot_host
  2. POST /api/clients/{uuid}/shutdown -> action shutdown_host

Validation snippets:

  1. Human-readable snippets: implementation-plans/reboot-command-payload-schemas.md
  2. Machine-validated JSON Schema: implementation-plans/reboot-command-payload-schemas.json

Command Lifecycle States

  1. queued
  2. publish_in_progress
  3. published
  4. ack_received
  5. execution_started
  6. awaiting_reconnect
  7. recovered
  8. completed
  9. failed
  10. expired
  11. timed_out
  12. canceled
  13. blocked_safety
  14. manual_intervention_required

Timeout Defaults (Pi 5, USB-SATA SSD baseline)

  1. queued to publish_in_progress: immediate, timeout 5 seconds.
  2. publish_in_progress to published: timeout 8 seconds.
  3. published to ack_received: timeout 20 seconds.
  4. ack_received to execution_started: 15 seconds for service restart, 25 seconds for host reboot.
  5. execution_started to awaiting_reconnect: timeout 10 seconds.
  6. awaiting_reconnect to recovered: baseline 90 seconds after validation, cold-boot ceiling 150 seconds.
  7. recovered to completed: 15 to 20 seconds based on fleet stability.
  8. command expires_at default: 240 seconds, bounded 180 to 360 seconds.
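These defaults can be expressed as a per-state lookup that a scheduler sweep might consult. The sketch is illustrative: the names `STATE_TIMEOUTS_S` and `is_overdue` are hypothetical, and where the list gives a range or two values the table below uses the upper bound (host reboot, cold-boot ceiling, 20 seconds for recovered).

```python
from datetime import datetime, timedelta, timezone

# Maximum seconds a command may stay in each state before the next
# transition must have been observed (upper bounds from the list above).
STATE_TIMEOUTS_S = {
    "queued": 5,
    "publish_in_progress": 8,
    "published": 20,
    "ack_received": 25,         # 15 for a service restart
    "execution_started": 10,
    "awaiting_reconnect": 150,  # baseline 90, cold-boot ceiling 150
    "recovered": 20,            # 15 to 20 depending on fleet stability
}
DEFAULT_EXPIRY_S = 240

# Worst-case time through all timed states, for comparison with expires_at.
WORST_CASE_S = sum(STATE_TIMEOUTS_S.values())


def is_overdue(state: str, entered_at: datetime, now: datetime) -> bool:
    """Return True if the command has stayed in `state` longer than allowed."""
    limit = STATE_TIMEOUTS_S.get(state)
    if limit is None:
        return False  # state without a timeout in this table
    return now - entered_at > timedelta(seconds=limit)
```

Summing the upper bounds gives 5 + 8 + 20 + 25 + 10 + 150 + 20 = 238 seconds, which fits inside the 240-second default for expires_at.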

Recovery Tiers

  1. Tier 0 baseline, always required: watchdog, systemd auto-restart, lifecycle tracking, manual intervention fallback.
  2. Tier 1 optional: managed PoE per-port power-cycle escalation where customer infrastructure supports it.
  3. Tier 2 no remote power control: direct manual intervention workflow.

Governance And Safety

  1. Role access: admin and superadmin.
  2. Bulk actions require reason capture.
  3. Safety lockout: maximum 3 reboot commands per client in 15 minutes.
  4. Escalation cooldown: 60 seconds before automatic move to manual_intervention_required.
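The safety lockout in point 3 amounts to a sliding-window limit per client. The following is a minimal in-memory sketch; the class name `RebootLockout` is hypothetical, and a production implementation would more likely derive the count from persisted command records than from process memory.

```python
from collections import defaultdict, deque


class RebootLockout:
    """Allow at most `max_commands` per client within `window_s` seconds."""

    def __init__(self, max_commands: int = 3, window_s: int = 15 * 60):
        self.max_commands = max_commands
        self.window_s = window_s
        self._history = defaultdict(deque)  # client_uuid -> issue timestamps

    def allow(self, client_uuid: str, now: float) -> bool:
        """Record and permit the command, or return False (blocked_safety)."""
        history = self._history[client_uuid]
        # Forget commands that have left the 15-minute window.
        while history and now - history[0] >= self.window_s:
            history.popleft()
        if len(history) >= self.max_commands:
            return False
        history.append(now)
        return True
```

A caller would pass a monotonic or epoch timestamp in seconds; a `False` result corresponds to the blocked_safety lifecycle state.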

MQTT Auth Hardening (Phase 1, Required Before Broad Rollout)

  1. Intranet-only deployment is not sufficient protection for privileged MQTT actions by itself.
  2. Phase 1 hardening scope is broker authentication, authorization, and network restriction; payload URL allowlisting is deferred to a later client/server feature.
  3. MQTT broker must disable anonymous publish/subscribe access in production.
  4. MQTT broker must require authenticated identities for server-side publishers and client devices.
  5. MQTT broker must enforce ACLs so that:
  • only server-side services can publish to infoscreen/{client_uuid}/commands
  • only server-side services can publish scheduler event topics
  • each client can subscribe only to its own command topics and assigned event topics
  • each client can publish only its own ack, heartbeat, health, dashboard, and telemetry topics
  6. Broker port exposure must be restricted to the management network and approved hosts only.
  7. TLS support is strongly recommended in this phase and should be enabled when operationally feasible.

Server Team Actions For Auth Hardening

  1. Provision broker credentials for command/event publishers and for client devices.
  2. Configure Mosquitto or equivalent broker ACLs for per-topic publish and subscribe restrictions.
  3. Disable anonymous access on production brokers.
  4. Restrict broker network exposure with firewall rules, VLAN policy, or equivalent network controls.
  5. Update server/frontend deployment to publish MQTT with authenticated credentials.
  6. Validate that server-side event publishing and reboot/shutdown command publishing still work under the new ACL policy.
  7. Coordinate credential distribution and rotation with the client deployment process.

MQTT ACL Matrix (Canonical Baseline)

| Actor | Topic Pattern | Publish | Subscribe | Notes |
| --- | --- | --- | --- | --- |
| scheduler-service | infoscreen/events/+ | Yes | No | Publishes retained active event list per group. |
| api-command-publisher | infoscreen/+/commands | Yes | No | Publishes canonical reboot/shutdown commands. |
| api-command-publisher | infoscreen/+/command | Yes | No | Transitional compatibility publish only. |
| api-group-assignment | infoscreen/+/group_id | Yes | No | Publishes retained client-to-group assignment. |
| listener-service | infoscreen/+/commands/ack | No | Yes | Consumes canonical client command acknowledgements. |
| listener-service | infoscreen/+/command/ack | No | Yes | Consumes transitional compatibility acknowledgements. |
| listener-service | infoscreen/+/heartbeat | No | Yes | Consumes heartbeat telemetry. |
| listener-service | infoscreen/+/health | No | Yes | Consumes health telemetry. |
| listener-service | infoscreen/+/dashboard | No | Yes | Consumes dashboard screenshot payloads. |
| listener-service | infoscreen/+/screenshot | No | Yes | Consumes screenshot payloads (if enabled). |
| listener-service | infoscreen/+/logs/error | No | Yes | Consumes client error logs. |
| listener-service | infoscreen/+/logs/warn | No | Yes | Consumes client warn logs. |
| listener-service | infoscreen/+/logs/info | No | Yes | Consumes client info logs. |
| listener-service | infoscreen/discovery | No | Yes | Consumes discovery announcements. |
| listener-service | infoscreen/+/discovery_ack | Yes | No | Publishes discovery acknowledgements. |
| client-{client_uuid} | infoscreen/{client_uuid}/commands | No | Yes | Canonical command intake for this client only. |
| client-{client_uuid} | infoscreen/{client_uuid}/command | No | Yes | Transitional compatibility intake for this client only. |
| client-{client_uuid} | infoscreen/events/<group_id> | No | Yes | Assigned group event feed only; dynamic per assignment. |
| client-{client_uuid} | infoscreen/{client_uuid}/commands/ack | Yes | No | Canonical command acknowledgements for this client only. |
| client-{client_uuid} | infoscreen/{client_uuid}/command/ack | Yes | No | Transitional compatibility acknowledgements for this client only. |
| client-{client_uuid} | infoscreen/{client_uuid}/heartbeat | Yes | No | Heartbeat telemetry. |
| client-{client_uuid} | infoscreen/{client_uuid}/health | Yes | No | Health telemetry. |
| client-{client_uuid} | infoscreen/{client_uuid}/dashboard | Yes | No | Dashboard status and screenshot payloads. |
| client-{client_uuid} | infoscreen/{client_uuid}/screenshot | Yes | No | Screenshot payloads (if enabled). |
| client-{client_uuid} | infoscreen/{client_uuid}/logs/error | Yes | No | Error log stream. |
| client-{client_uuid} | infoscreen/{client_uuid}/logs/warn | Yes | No | Warning log stream. |
| client-{client_uuid} | infoscreen/{client_uuid}/logs/info | Yes | No | Info log stream. |
| client-{client_uuid} | infoscreen/discovery | Yes | No | Discovery announcement. |
| client-{client_uuid} | infoscreen/{client_uuid}/discovery_ack | No | Yes | Discovery acknowledgement from listener. |

ACL implementation notes:

  1. Use per-client identities; client ACLs must be scoped to exact client UUID and must not allow wildcard access to other clients.
  2. Event topic subscription (infoscreen/events/<group_id>) should be managed via broker-side ACL provisioning that updates when group assignment changes.
  3. Transitional singular command topics are temporary and should be removed after migration cutover.
  4. Deny by default: any topic not explicitly listed above should be blocked for each actor.
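For a Mosquitto broker, the per-client rows of the matrix could be rendered into `acl_file` entries using that file's `user` and `topic read|write` directives. The generator below is a sketch, not the team's provisioning tooling: the function name `client_acl_block` and its parameters are hypothetical, and the topic list is transcribed from the client rows of the matrix.

```python
import uuid


def client_acl_block(username, client_uuid, group_id=None,
                     transitional=True, screenshots=False):
    """Render Mosquitto acl_file lines for one client (exact UUID, no wildcards)."""
    base = f"infoscreen/{uuid.UUID(client_uuid)}"  # ValueError on a bad UUID
    read = [f"{base}/commands", f"{base}/discovery_ack"]
    write = [
        f"{base}/commands/ack", f"{base}/heartbeat", f"{base}/health",
        f"{base}/dashboard", f"{base}/logs/error", f"{base}/logs/warn",
        f"{base}/logs/info", "infoscreen/discovery",
    ]
    if transitional:          # remove after migration cutover
        read.append(f"{base}/command")
        write.append(f"{base}/command/ack")
    if screenshots:           # only if screenshot payloads are enabled
        write.append(f"{base}/screenshot")
    if group_id is not None:  # regenerate when group assignment changes
        read.append(f"infoscreen/events/{group_id}")
    lines = [f"user {username}"]
    lines += [f"topic read {t}" for t in read]
    lines += [f"topic write {t}" for t in write]
    return "\n".join(lines) + "\n"
```

Only explicitly granted topics are listed, so the deny-by-default rule then depends on the broker rejecting everything that has no matching entry.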

Credential Management Guidance

  1. Real MQTT passwords must not be stored in tracked documentation or committed templates.
  2. Each client device should receive a unique broker username and password, stored only in its local /.env.
  3. Server-side publisher credentials should be stored in the server team's secret-management path, not in source control.
  4. Recommended naming convention for client broker users: infoscreen-client-<client-uuid-prefix>.
  5. Client passwords should be random, at least 20 characters, and rotated through deployment tooling or broker administration procedures.
  6. The server/infrastructure team owns broker-side user creation, ACL assignment, rotation, and revocation.
  7. The client team owns loading credentials from local env files and validating connection behavior against the secured broker.
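Points 4 and 5 could be supported by a small provisioning helper such as the one below. It is a sketch only: the function names are hypothetical, and taking the first 8 characters of the UUID as the "prefix" is an assumption, since the guidance does not fix a prefix length.

```python
import secrets
import string

PASSWORD_ALPHABET = string.ascii_letters + string.digits
MIN_PASSWORD_LENGTH = 20  # from the guidance above


def client_username(client_uuid: str, prefix_len: int = 8) -> str:
    """Build a broker username following infoscreen-client-<client-uuid-prefix>."""
    return f"infoscreen-client-{client_uuid[:prefix_len]}"


def generate_password(length: int = 24) -> str:
    """Return a random password drawn from a cryptographically secure source."""
    if length < MIN_PASSWORD_LENGTH:
        raise ValueError(f"password must be at least {MIN_PASSWORD_LENGTH} characters")
    return "".join(secrets.choice(PASSWORD_ALPHABET) for _ in range(length))
```

The generated password would then be registered with the broker and written to the device's local env file by deployment tooling, never to tracked files.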

Client Team Actions For Auth Hardening

  1. Add MQTT username/password support in the client connection setup.
  2. Add client-side TLS configuration support from environment when certificates are provided.
  3. Update local test helpers to support authenticated MQTT publishing and subscription.
  4. Validate command and event intake against the authenticated broker configuration before canary rollout.

Ready For Server/Frontend Team (Auth Phase)

  1. Client implementation is ready to connect with MQTT auth from local .env (MQTT_USERNAME, MQTT_PASSWORD, optional TLS settings).
  2. Client command/event intake and client ack/telemetry publishing run over the authenticated MQTT session.
  3. Server/frontend team must now complete broker-side enforcement and publisher migration.
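How a client or server process might read these settings is sketched below. Only MQTT_USERNAME and MQTT_PASSWORD are named in this document; the variables MQTT_HOST, MQTT_PORT, and MQTT_TLS_CA_FILE, the `MqttSettings` type, and the assumption that the .env file has already been loaded into the process environment are all hypothetical.

```python
import os
from dataclasses import dataclass, field
from typing import Mapping, Optional


@dataclass(frozen=True)
class MqttSettings:
    host: str
    port: int
    username: str
    password: str = field(repr=False)  # keep the secret out of logs and reprs
    ca_file: Optional[str] = None      # set when TLS certificates are provided


def load_mqtt_settings(env: Mapping[str, str] = os.environ) -> MqttSettings:
    """Read broker settings from the environment and fail fast without credentials."""
    username = env.get("MQTT_USERNAME", "")
    password = env.get("MQTT_PASSWORD", "")
    if not username or not password:
        raise RuntimeError("MQTT_USERNAME and MQTT_PASSWORD must be set")
    ca_file = env.get("MQTT_TLS_CA_FILE") or None
    default_port = 8883 if ca_file else 1883  # conventional MQTT ports
    return MqttSettings(
        host=env.get("MQTT_HOST", "localhost"),
        port=int(env.get("MQTT_PORT", default_port)),
        username=username,
        password=password,
        ca_file=ca_file,
    )
```

With the Paho Python client, such settings would typically be applied through `username_pw_set` and, when a CA file is present, `tls_set` before connecting. Refusing to start without credentials avoids a silent fallback to anonymous access.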

Server/frontend done criteria:

  1. Anonymous broker access is disabled in production.
  2. Server-side publishers use authenticated broker credentials.
  3. ACLs are active and validated for command, event, and client telemetry topics.
  4. At least one canary client proves end-to-end flow under ACLs:
  • server publishes command/event with authenticated publisher
  • client receives payload
  • client sends ack/telemetry successfully
  5. Revocation test passes: disabling one client credential blocks only that client without impacting others.

Operational note:

  1. Client-side auth support is necessary but not sufficient by itself; broker ACL/auth enforcement is the security control that must be enabled by the server/infrastructure team.

Rollout Plan

  1. Contract freeze and sign-off.
  2. Platform and client implementation against frozen schemas.
  3. One-group canary.
  4. Roll back if the combined share of failed and timed_out commands exceeds 5 percent.
  5. Expand only after 7 days below intervention threshold.
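The rollback rule in point 4 can be checked mechanically from canary results. The sketch below assumes the percentage is taken over all commands counted in the canary window, which this plan does not state explicitly; the function name `should_roll_back` and the shape of the `outcomes` mapping are hypothetical.

```python
def should_roll_back(outcomes: dict, threshold_percent: int = 5) -> bool:
    """Return True if failed + timed_out exceeds the threshold share of all commands.

    `outcomes` maps a final lifecycle state to the number of commands in it.
    """
    total = sum(outcomes.values())
    if total == 0:
        return False  # no data yet: nothing to judge
    bad = outcomes.get("failed", 0) + outcomes.get("timed_out", 0)
    # Integer comparison avoids floating-point edge cases at exactly 5 percent.
    return bad * 100 > total * threshold_percent
```

For example, 3 failed and 2 timed_out commands out of 100 are exactly 5 percent and do not trigger a rollback, while one more failure does.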

Success Criteria

  1. Deterministic command lifecycle visibility from enqueue to completion.
  2. No duplicate execution under reconnect or delayed-delivery conditions.
  3. Stable Pi 5 SSD reconnect behavior within defined baseline.
  4. Clear and actionable manual intervention states when automatic recovery is exhausted.