feat: crash recovery, service_failed monitoring, broker health fields, command expiry sweep

- Add GET /api/clients/crashed endpoint (process_status=crashed or stale heartbeat)
- Add restart_app command action with same lifecycle + lockout as reboot_host
- Scheduler: crash auto-recovery loop (CRASH_RECOVERY_ENABLED flag, lockout, MQTT publish)
- Scheduler: unconditional command expiry sweep per poll cycle (sweep_expired_commands)
- Listener: subscribe to infoscreen/+/service_failed; persist service_failed_at + unit
- Listener: extract broker_connection block from health payload; persist reconnect_count + last_disconnect_at
- DB migration b1c2d3e4f5a6: service_failed_at, service_failed_unit, mqtt_reconnect_count, mqtt_last_disconnect_at on clients
- Add GET /api/clients/service_failed and POST /api/clients/<uuid>/clear_service_failed
- Monitoring overview API: include mqtt_reconnect_count + mqtt_last_disconnect_at per client
- Frontend: orange service-failed alert panel (hidden when empty, auto-refresh, acknowledge action)
- Frontend: MQTT reconnect count + last disconnect in client detail panel
- MQTT auth hardening: listener/scheduler/server use env credentials; broker enforces allow_anonymous false
- Client command lifecycle foundation: ClientCommand model, reboot_host/shutdown_host, full ACK lifecycle
- Docs: TECH-CHANGELOG, DEV-CHANGELOG, MQTT_EVENT_PAYLOAD_GUIDE, copilot-instructions updated
- Add implementation-plans/, RESTART_VALIDATION_CHECKLIST.md, TODO.md
2026-04-05 10:17:56 +00:00
parent 4d652f0554
commit 03e3c11e90
35 changed files with 2511 additions and 80 deletions

@@ -13,15 +13,16 @@ It is not a changelog and not a full architecture handbook.
- Keep changes minimal, match existing patterns, and update docs in the same commit when behavior changes.
## Fast file map
-- `scheduler/scheduler.py` - scheduler loop, MQTT event publishing, TV power intent publishing
-- `scheduler/db_utils.py` - event formatting and power-intent helper logic
-- `listener/listener.py` - discovery/heartbeat/log/screenshot MQTT consumption
+- `scheduler/scheduler.py` - scheduler loop, MQTT event publishing, TV power intent publishing, crash auto-recovery, command expiry sweep
+- `scheduler/db_utils.py` - event formatting, power-intent helpers, crash recovery helpers, command expiry sweep
+- `listener/listener.py` - discovery/heartbeat/log/screenshot/service_failed MQTT consumption
- `server/init_academic_periods.py` - idempotent academic-period seeding + auto-activation for current date
- `server/initialize_database.py` - migration + bootstrap orchestration for local/manual setup
- `server/routes/events.py` - event CRUD, recurrence handling, UTC normalization
- `server/routes/eventmedia.py` - file manager, media upload/stream endpoints
- `server/routes/groups.py` - group lifecycle, alive status, order persistence
- `server/routes/system_settings.py` - system settings CRUD and supplement-table endpoint
+- `server/routes/clients.py` - client metadata, restart/shutdown/restart_app command issuing, command status, crashed/service_failed alert endpoints
- `dashboard/src/settings.tsx` - settings UX and system-defaults integration
- `dashboard/src/components/CustomEventModal.tsx` - event creation/editing UX
- `dashboard/src/monitoring.tsx` - superadmin monitoring page
@@ -54,6 +55,9 @@ It is not a changelog and not a full architecture handbook.
- Logs topic family: `infoscreen/{uuid}/logs/{error|warn|info}`
- Health topic: `infoscreen/{uuid}/health`
- Dashboard screenshot topic: `infoscreen/{uuid}/dashboard`
+- Client command topic (QoS1, non-retained): `infoscreen/{uuid}/commands` (compat alias: `infoscreen/{uuid}/command`)
+- Client command ack topic (QoS1, non-retained): `infoscreen/{uuid}/commands/ack` (compat alias: `infoscreen/{uuid}/command/ack`)
+- Service-failed topic (retained, client→server): `infoscreen/{uuid}/service_failed`
- TV power intent Phase 1 topic (retained, QoS1): `infoscreen/groups/{group_id}/power/intent`
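The service-failed topic above is consumed through the single-level wildcard subscription `infoscreen/+/service_failed`. A minimal sketch of how the client UUID could be recovered from an incoming topic string follows; the function name `client_uuid_from_service_failed` is hypothetical and is not taken from `listener/listener.py`.

```python
from typing import Optional

# Wildcard subscription named in the commit message; `+` matches exactly one topic level.
SERVICE_FAILED_SUBSCRIPTION = "infoscreen/+/service_failed"


def client_uuid_from_service_failed(topic: str) -> Optional[str]:
    """Return the {uuid} segment of infoscreen/{uuid}/service_failed, else None.

    Hypothetical helper for illustration only; the real listener may parse
    topics differently.
    """
    parts = topic.split("/")
    # The topic has exactly three levels: prefix, client uuid, suffix.
    if len(parts) != 3:
        return None
    prefix, client_uuid, suffix = parts
    if prefix != "infoscreen" or suffix != "service_failed" or not client_uuid:
        return None
    return client_uuid
```

Because the check requires exactly three levels, longer topics such as `infoscreen/groups/{group_id}/power/intent` are rejected rather than misread as a client UUID.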
TV power intent Phase 1 rules:
@@ -82,7 +86,9 @@ TV power intent Phase 1 rules:
- Scheduler: `POLL_INTERVAL_SECONDS`, `REFRESH_SECONDS`
- Power intent: `POWER_INTENT_PUBLISH_ENABLED`, `POWER_INTENT_HEARTBEAT_ENABLED`, `POWER_INTENT_EXPIRY_MULTIPLIER`, `POWER_INTENT_MIN_EXPIRY_SECONDS`
- Monitoring: `PRIORITY_SCREENSHOT_TTL_SECONDS`
+- Crash recovery: `CRASH_RECOVERY_ENABLED`, `CRASH_RECOVERY_GRACE_SECONDS`, `CRASH_RECOVERY_LOCKOUT_MINUTES`, `CRASH_RECOVERY_COMMAND_EXPIRY_SECONDS`
- Core: `DB_CONN`, `DB_USER`, `DB_PASSWORD`, `DB_HOST`, `DB_NAME`, `ENV`
+- MQTT auth/connectivity: `MQTT_BROKER_HOST`, `MQTT_BROKER_PORT`, `MQTT_USER`, `MQTT_PASSWORD` (listener/scheduler/server should use authenticated broker access)
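As a rough illustration of how the crash recovery variables might fit together, the sketch below loads them into a settings object and applies a grace-period and lockout check. The variable names come from the list above. The default values, the helper names (`load_crash_recovery_settings`, `may_issue_recovery`), and the exact grace/lockout semantics are assumptions and do not reflect the actual code in `scheduler/db_utils.py`.

```python
import os
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Mapping, Optional


@dataclass(frozen=True)
class CrashRecoverySettings:
    enabled: bool
    grace_seconds: int
    lockout_minutes: int
    command_expiry_seconds: int


def _as_bool(value: str) -> bool:
    # Accept a few common truthy spellings; anything else counts as false.
    return value.strip().lower() in {"1", "true", "yes", "on"}


def load_crash_recovery_settings(env: Mapping[str, str] = os.environ) -> CrashRecoverySettings:
    # NOTE: the fallback values are placeholders, not the project's real defaults.
    return CrashRecoverySettings(
        enabled=_as_bool(env.get("CRASH_RECOVERY_ENABLED", "false")),
        grace_seconds=int(env.get("CRASH_RECOVERY_GRACE_SECONDS", "180")),
        lockout_minutes=int(env.get("CRASH_RECOVERY_LOCKOUT_MINUTES", "15")),
        command_expiry_seconds=int(env.get("CRASH_RECOVERY_COMMAND_EXPIRY_SECONDS", "240")),
    )


def may_issue_recovery(
    settings: CrashRecoverySettings,
    crashed_since: datetime,
    last_command_at: Optional[datetime],
    now: datetime,
) -> bool:
    """Assumed policy: wait out the grace period, then respect the lockout window."""
    if not settings.enabled:
        return False
    if now - crashed_since < timedelta(seconds=settings.grace_seconds):
        return False  # client may still recover on its own
    if last_command_at is not None and now - last_command_at < timedelta(
        minutes=settings.lockout_minutes
    ):
        return False  # a recovery command was issued too recently
    return True
```

Passing the environment as a mapping keeps the loader easy to exercise in tests without touching the process environment.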
## Edit guardrails
- Do not edit generated assets in `dashboard/dist/`.