- Command intake (reboot/shutdown) on infoscreen/{uuid}/commands with ack lifecycle
- MQTT_USER/MQTT_PASSWORD_BROKER split from identity vars; configure_mqtt_security() updated
- infoscreen-simclient.service: Type=notify, WatchdogSec=60, Restart=on-failure
- infoscreen-notify-failure@.service + script: retained MQTT alert when systemd gives up (Gap 3)
- _sd_notify() watchdog keepalive in simclient main loop (Gap 1)
- broker_connection block in health payload: reconnect_count, last_disconnect_at (Gap 2)
- COMMAND_MOCK_REBOOT_IMMEDIATE_COMPLETE canary flag with safety guard
- SERVER_TEAM_ACTIONS.md: server-side integration action items
- Docs: README, CHANGELOG, src/README, copilot-instructions updated
- 43 tests passing
2.9 KiB
2.9 KiB
Reboot Reliability Kickoff Summary
Objective
Ship a reliable, observable restart and shutdown workflow for Raspberry Pi 5 clients, with safe escalation and clear operator outcomes.
What Is Included
- Asynchronous command lifecycle with idempotent command_id handling.
- Monitoring-first state visibility from queued to terminal outcomes.
- Client acknowledgements for accepted, execution_started, completed, and failed.
- Pi 5 USB-SATA SSD timeout baseline and tuning rules.
- Capability-tier recovery with optional managed PoE escalation.
What Is Not Included
- Full maintenance module in client-management.
- Required managed power-switch integration.
- Production Wake-on-LAN rollout.
Team Split
- Platform team: API command lifecycle, safety controls, listener ack ingestion.
- Web team: lifecycle-aware UX and command status display.
- Client team: strict validation, dedupe, ack sequence, secure execution helper, reboot continuity.
Ownership Matrix
| Team | Primary Plan File | Main Deliverables |
|---|---|---|
| Platform team | implementation-plans/reboot-implementation-handoff-share.md | Command lifecycle backend, policy enforcement, listener ack mapping, safety lockout and escalation |
| Web team | implementation-plans/reboot-implementation-handoff-share.md | Lifecycle UI states, bulk safety UX, capability visibility, command status polling |
| Client team | implementation-plans/reboot-implementation-handoff-client-team.md | Command validation, dedupe persistence, ack sequence, secure execution helper, reboot continuity |
| Project coordination | implementation-plans/reboot-kickoff-summary.md | Phase sequencing, canary gates, rollback thresholds, cross-team sign-off tracking |
Baseline Operational Defaults
- Safety lockout: 3 reboot commands per client in rolling 15 minutes.
- Escalation cooldown: 60 seconds.
- Reconnect target on Pi 5 SSD: 90 seconds baseline, 150 seconds cold-boot ceiling.
- Rollback canary trigger: failed plus timed_out above 5 percent.
Frozen Contract Snapshot
- Canonical command topic: infoscreen/{client_uuid}/commands.
- Canonical ack topic: infoscreen/{client_uuid}/commands/ack.
- Transitional compatibility topics during migration:
- infoscreen/{client_uuid}/command
- infoscreen/{client_uuid}/command/ack
- Command schema version: 1.0.
- Allowed command actions: reboot_host, shutdown_host.
- Allowed ack status values: accepted, execution_started, completed, failed.
- Validation snippets:
- implementation-plans/reboot-command-payload-schemas.md
- implementation-plans/reboot-command-payload-schemas.json
Immediate Next Steps
- Continue implementation in parallel by team against frozen contract.
- Client team validates dedupe and expiry handling on canonical topics.
- Platform team verifies ack-state transitions for accepted, execution_started, completed, failed.
- Execute one-group canary and validate timing plus failure drills.