feat: crash recovery, service_failed monitoring, broker health fields, command expiry sweep
- Add GET /api/clients/crashed endpoint (process_status=crashed or stale heartbeat) - Add restart_app command action with same lifecycle + lockout as reboot_host - Scheduler: crash auto-recovery loop (CRASH_RECOVERY_ENABLED flag, lockout, MQTT publish) - Scheduler: unconditional command expiry sweep per poll cycle (sweep_expired_commands) - Listener: subscribe to infoscreen/+/service_failed; persist service_failed_at + unit - Listener: extract broker_connection block from health payload; persist reconnect_count + last_disconnect_at - DB migration b1c2d3e4f5a6: service_failed_at, service_failed_unit, mqtt_reconnect_count, mqtt_last_disconnect_at on clients - Add GET /api/clients/service_failed and POST /api/clients/<uuid>/clear_service_failed - Monitoring overview API: include mqtt_reconnect_count + mqtt_last_disconnect_at per client - Frontend: orange service-failed alert panel (hidden when empty, auto-refresh, quittieren action) - Frontend: MQTT reconnect count + last disconnect in client detail panel - MQTT auth hardening: listener/scheduler/server use env credentials; broker enforces allow_anonymous false - Client command lifecycle foundation: ClientCommand model, reboot_host/shutdown_host, full ACK lifecycle - Docs: TECH-CHANGELOG, DEV-CHANGELOG, MQTT_EVENT_PAYLOAD_GUIDE, copilot-instructions updated - Add implementation-plans/, RESTART_VALIDATION_CHECKLIST.md, TODO.md
This commit is contained in:
54
implementation-plans/reboot-kickoff-summary.md
Normal file
54
implementation-plans/reboot-kickoff-summary.md
Normal file
@@ -0,0 +1,54 @@
|
||||
## Reboot Reliability Kickoff Summary
|
||||
|
||||
### Objective
|
||||
Ship a reliable, observable restart and shutdown workflow for Raspberry Pi 5 clients, with safe escalation and clear operator outcomes.
|
||||
|
||||
### What Is Included
|
||||
1. Asynchronous command lifecycle with idempotent command_id handling.
|
||||
2. Monitoring-first state visibility from queued to terminal outcomes.
|
||||
3. Client acknowledgements for accepted, execution_started, completed, and failed.
|
||||
4. Pi 5 USB-SATA SSD timeout baseline and tuning rules.
|
||||
5. Capability-tier recovery with optional managed PoE escalation.
|
||||
|
||||
### What Is Not Included
|
||||
1. Full maintenance module in client-management.
|
||||
2. Required managed power-switch integration.
|
||||
3. Production Wake-on-LAN rollout.
|
||||
|
||||
### Team Split
|
||||
1. Platform team: API command lifecycle, safety controls, listener ack ingestion.
|
||||
2. Web team: lifecycle-aware UX and command status display.
|
||||
3. Client team: strict validation, dedupe, ack sequence, secure execution helper, reboot continuity.
|
||||
|
||||
### Ownership Matrix
|
||||
| Team | Primary Plan File | Main Deliverables |
|
||||
| --- | --- | --- |
|
||||
| Platform team | implementation-plans/reboot-implementation-handoff-share.md | Command lifecycle backend, policy enforcement, listener ack mapping, safety lockout and escalation |
|
||||
| Web team | implementation-plans/reboot-implementation-handoff-share.md | Lifecycle UI states, bulk safety UX, capability visibility, command status polling |
|
||||
| Client team | implementation-plans/reboot-implementation-handoff-client-team.md | Command validation, dedupe persistence, ack sequence, secure execution helper, reboot continuity |
|
||||
| Project coordination | implementation-plans/reboot-kickoff-summary.md | Phase sequencing, canary gates, rollback thresholds, cross-team sign-off tracking |
|
||||
|
||||
### Baseline Operational Defaults
|
||||
1. Safety lockout: 3 reboot commands per client in rolling 15 minutes.
|
||||
2. Escalation cooldown: 60 seconds.
|
||||
3. Reconnect target on Pi 5 SSD: 90 seconds baseline, 150 seconds cold-boot ceiling.
|
||||
4. Rollback canary trigger: failed plus timed_out above 5 percent.
|
||||
|
||||
### Frozen Contract Snapshot
|
||||
1. Canonical command topic: infoscreen/{client_uuid}/commands.
|
||||
2. Canonical ack topic: infoscreen/{client_uuid}/commands/ack.
|
||||
3. Transitional compatibility topics during migration:
|
||||
- infoscreen/{client_uuid}/command
|
||||
- infoscreen/{client_uuid}/command/ack
|
||||
4. Command schema version: 1.0.
|
||||
5. Allowed command actions: reboot_host, shutdown_host.
|
||||
6. Allowed ack status values: accepted, execution_started, completed, failed.
|
||||
7. Validation snippets:
|
||||
- implementation-plans/reboot-command-payload-schemas.md
|
||||
- implementation-plans/reboot-command-payload-schemas.json
|
||||
|
||||
### Immediate Next Steps
|
||||
1. Continue implementation in parallel by team against frozen contract.
|
||||
2. Client team validates dedupe and expiry handling on canonical topics.
|
||||
3. Platform team verifies ack-state transitions for accepted, execution_started, completed, failed.
|
||||
4. Execute one-group canary and validate timing plus failure drills.
|
||||
Reference in New Issue
Block a user