feat: crash recovery, service_failed monitoring, broker health fields, command expiry sweep
- Add GET /api/clients/crashed endpoint (process_status=crashed or stale heartbeat)
- Add restart_app command action with same lifecycle + lockout as reboot_host
- Scheduler: crash auto-recovery loop (CRASH_RECOVERY_ENABLED flag, lockout, MQTT publish)
- Scheduler: unconditional command expiry sweep per poll cycle (sweep_expired_commands)
- Listener: subscribe to infoscreen/+/service_failed; persist service_failed_at + unit
- Listener: extract broker_connection block from health payload; persist reconnect_count + last_disconnect_at
- DB migration b1c2d3e4f5a6: service_failed_at, service_failed_unit, mqtt_reconnect_count, mqtt_last_disconnect_at on clients
- Add GET /api/clients/service_failed and POST /api/clients/<uuid>/clear_service_failed
- Monitoring overview API: include mqtt_reconnect_count + mqtt_last_disconnect_at per client
- Frontend: orange service-failed alert panel (hidden when empty, auto-refresh, "Quittieren" (acknowledge) action)
- Frontend: MQTT reconnect count + last disconnect in client detail panel
- MQTT auth hardening: listener/scheduler/server use env credentials; broker enforces allow_anonymous false
- Client command lifecycle foundation: ClientCommand model, reboot_host/shutdown_host, full ACK lifecycle
- Docs: TECH-CHANGELOG, DEV-CHANGELOG, MQTT_EVENT_PAYLOAD_GUIDE, copilot-instructions updated
- Add implementation-plans/, RESTART_VALIDATION_CHECKLIST.md, TODO.md
@@ -5,6 +5,42 @@
This changelog documents technical and developer-relevant changes included in public releases. For development workspace changes, see DEV-CHANGELOG.md. Not all changes here are reflected in the user-facing changelog (`program-info.json`), and not all UI/feature changes are repeated here. Some changes (e.g., backend refactoring, API adjustments, infrastructure, developer tooling, or internal logic) may only appear in TECH-CHANGELOG.md. For UI/feature changes, see `dashboard/public/program-info.json`.
## Unreleased
- **Crash detection, auto-recovery, and service_failed monitoring (2026-04-05)**:
- Added `GET /api/clients/crashed` endpoint: returns active clients whose `process_status` is `crashed` or whose heartbeat is stale beyond the grace period; each entry includes a `crash_reason` field.
- Added `restart_app` command action alongside existing `reboot_host`/`shutdown_host`; registered in `server/routes/clients.py` with same safety lockout.
- Scheduler: Added crash auto-recovery loop (feature-flagged via `CRASH_RECOVERY_ENABLED`): scans candidates via `get_crash_recovery_candidates()`, issues `reboot_host` command per client, publishes to primary + compat MQTT topics, updates command lifecycle.
- Scheduler: Added unconditional command expiry sweep each poll cycle via `sweep_expired_commands()` in `scheduler/db_utils.py`: marks non-terminal `ClientCommand` rows with `expires_at < now` as `expired`.
- Added `service_failed` topic ingestion in `listener/listener.py`: subscribes to `infoscreen/+/service_failed` on every connect and persists `service_failed_at` and `service_failed_unit` on Client; an empty payload (a retained-message clear) is ignored.
- Added `broker_connection` block extraction in health payload handler: persists `mqtt_reconnect_count` and `mqtt_last_disconnect_at` from `infoscreen/{uuid}/health`.
- Added four new DB columns to `clients` table via migration `b1c2d3e4f5a6`: `service_failed_at`, `service_failed_unit`, `mqtt_reconnect_count`, `mqtt_last_disconnect_at`.
- Added `GET /api/clients/service_failed` endpoint: lists clients with `service_failed_at` set, ordered by event time desc.
- Added `POST /api/clients/<uuid>/clear_service_failed` endpoint: clears DB flag and publishes empty retained MQTT message to clear `infoscreen/{uuid}/service_failed`.
- Monitoring overview API (`GET /api/client-logs/monitoring-overview`) now includes `mqtt_reconnect_count` and `mqtt_last_disconnect_at` per client.
- Frontend: Added orange service-failed alert panel to monitoring page (hidden when empty, auto-refresh every 15s, per-row "Quittieren" (acknowledge) button with loading/success/error states).
- Frontend: Client detail panel in monitoring now shows MQTT reconnect count and last disconnect timestamp.
- Frontend: Added `ServiceFailedClient`, `ServiceFailedClientsResponse` types; `fetchServiceFailedClients()` and `clearServiceFailed()` API helpers in `dashboard/src/apiClients.ts`.
- Added `service_failed` topic contract to `MQTT_EVENT_PAYLOAD_GUIDE.md`.
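
The expiry sweep described above can be sketched as follows. This is a minimal in-memory illustration, not the code in `scheduler/db_utils.py`: `CommandRow` and `sweep_expired` are hypothetical stand-ins for the `ClientCommand` model and `sweep_expired_commands()`, and the set of terminal statuses is an assumption, since this changelog does not enumerate them.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

# Assumption: the changelog does not state which statuses are terminal.
TERMINAL_STATUSES = {"completed", "failed", "expired", "blocked_safety"}


@dataclass
class CommandRow:
    """Hypothetical in-memory stand-in for a ClientCommand row."""
    command_id: str
    status: str
    expires_at: datetime


def sweep_expired(rows: list[CommandRow], now: datetime) -> int:
    """Mark non-terminal rows with expires_at < now as expired; return the count."""
    swept = 0
    for row in rows:
        if row.status not in TERMINAL_STATUSES and row.expires_at < now:
            row.status = "expired"
            swept += 1
    return swept


now = datetime(2026, 4, 5, 12, 0, tzinfo=timezone.utc)
rows = [
    CommandRow("a", "published", now - timedelta(minutes=1)),  # overdue, non-terminal
    CommandRow("b", "completed", now - timedelta(minutes=1)),  # overdue, but terminal
    CommandRow("c", "queued", now + timedelta(minutes=5)),     # not yet due
]
swept_count = sweep_expired(rows, now)
```

Only row `a` changes state here. Running the sweep unconditionally each poll cycle means overdue commands are expired even when the crash auto-recovery loop is disabled.
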
- 🔐 **MQTT auth hardening for server-side services (2026-04-03)**:
- `listener/listener.py` now reads broker host, port, and credentials from the environment (`MQTT_BROKER_HOST`, `MQTT_BROKER_PORT`, `MQTT_USER`, `MQTT_PASSWORD`) instead of connecting anonymously to the hard-coded `mqtt:1883`.
- `scheduler/scheduler.py` now uses the same env-based MQTT auth path and optional TLS toggles (`MQTT_TLS_ENABLED`, `MQTT_TLS_CA_CERT`, `MQTT_TLS_CERTFILE`, `MQTT_TLS_KEYFILE`, `MQTT_TLS_INSECURE`).
- `docker-compose.yml` and `docker-compose.override.yml` now pass MQTT credentials into listener and scheduler containers for consistent authenticated connections.
- Mosquitto is now configured for authenticated access (`allow_anonymous false`, `password_file /mosquitto/config/passwd`) and bootstraps credentials from env at container startup.
- MQTT healthcheck publish now authenticates with configured broker credentials.
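
As a rough illustration of the env-based connectivity above, the following sketch collects broker settings from the listed variables. `MqttSettings` and `load_mqtt_settings` are hypothetical names, and both the fallback defaults (`mqtt`, `1883`) and the truthy parsing of `MQTT_TLS_ENABLED` are assumptions rather than the services' actual behavior.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class MqttSettings:
    """Hypothetical container for broker connection settings."""
    host: str
    port: int
    username: Optional[str]
    password: Optional[str]
    tls_enabled: bool


def load_mqtt_settings(env: Mapping[str, str] = os.environ) -> MqttSettings:
    """Build broker settings from environment-style variables."""
    return MqttSettings(
        # Assumption: fall back to the previously hard-coded host and port.
        host=env.get("MQTT_BROKER_HOST", "mqtt"),
        port=int(env.get("MQTT_BROKER_PORT", "1883")),
        # Treat empty strings as "not configured".
        username=env.get("MQTT_USER") or None,
        password=env.get("MQTT_PASSWORD") or None,
        # Assumption: which values count as "enabled" is not documented here.
        tls_enabled=env.get("MQTT_TLS_ENABLED", "").strip().lower() in {"1", "true", "yes"},
    )


settings = load_mqtt_settings({
    "MQTT_BROKER_HOST": "broker.internal",
    "MQTT_BROKER_PORT": "8883",
    "MQTT_USER": "listener",
    "MQTT_PASSWORD": "example-secret",
    "MQTT_TLS_ENABLED": "true",
})
```

Passing the environment as a mapping keeps the function testable without touching the real process environment. With `allow_anonymous false` on the broker, a missing username or password would presumably surface as a refused connection.
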
- 🔁 **Client command lifecycle foundation (restart/shutdown) (2026-04-03)**:
- Added persistent command tracking model `ClientCommand` in `models/models.py` and Alembic migration `aa12bb34cc56_add_client_commands_table.py`.
- Upgraded `POST /api/clients/<uuid>/restart` from fire-and-forget publish to lifecycle-aware command issuance with command metadata (`command_id`, `issued_at`, `expires_at`, `reason`, `requested_by`).
- Added `POST /api/clients/<uuid>/shutdown` endpoint with the same lifecycle contract.
- Added `GET /api/clients/commands/<command_id>` status endpoint for command-state polling.
- Added restart safety lockout in API path: at most 3 restart commands per client within a rolling 15-minute window, returning `blocked_safety` when the threshold is exceeded.
- Added command MQTT publish to `infoscreen/{uuid}/commands` (QoS1, non-retained) and temporary legacy restart compatibility publish to `clients/{uuid}/restart`.
- Added temporary topic compatibility publish to `infoscreen/{uuid}/command` and listener acceptance of `infoscreen/{uuid}/command/ack` to bridge singular/plural naming assumptions.
- Canonicalized command payload action values to host-level semantics: `reboot_host` and `shutdown_host` (API routes remain `/restart` and `/shutdown` for operator UX compatibility).
- Added frozen payload validation snippets for integration/client tooling in `implementation-plans/reboot-command-payload-schemas.md` and `implementation-plans/reboot-command-payload-schemas.json`.
- Listener now subscribes to `infoscreen/{uuid}/commands/ack` and maps client acknowledgements into command lifecycle states (`ack_received`, `execution_started`, `completed`, `failed`).
- Initial lifecycle statuses implemented server-side: `queued`, `publish_in_progress`, `published`, `failed`, and `blocked_safety`.
- Frontend API helpers in `dashboard/src/apiClients.ts` extended with `ClientCommand` typing plus command APIs for shutdown and, in preparation, status polling.
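
The restart safety lockout (3 commands per client in a rolling 15 minutes) can be sketched as a sliding-window limiter. This is an in-memory illustration only: `RestartLockout` and `try_issue` are hypothetical names, the real API path presumably counts persisted `ClientCommand` rows, and whether blocked attempts count toward the threshold is an assumption (here they do not).

```python
from collections import defaultdict, deque
from datetime import datetime, timedelta, timezone

MAX_RESTARTS = 3
WINDOW = timedelta(minutes=15)


class RestartLockout:
    """Hypothetical in-memory rolling-window limiter, keyed by client UUID."""

    def __init__(self) -> None:
        self._issued: dict[str, deque] = defaultdict(deque)

    def try_issue(self, client_uuid: str, now: datetime) -> str:
        history = self._issued[client_uuid]
        # Drop timestamps that have left the rolling window.
        while history and now - history[0] >= WINDOW:
            history.popleft()
        if len(history) >= MAX_RESTARTS:
            # Assumption: a blocked attempt is not recorded in the window.
            return "blocked_safety"
        history.append(now)
        return "queued"


lockout = RestartLockout()
t0 = datetime(2026, 4, 3, 10, 0, tzinfo=timezone.utc)
# Four attempts within three minutes: the fourth exceeds the threshold.
results = [lockout.try_issue("client-1", t0 + timedelta(minutes=m)) for m in (0, 1, 2, 3)]
# Sixteen minutes after t0, the two oldest commands have left the window.
later = lockout.try_issue("client-1", t0 + timedelta(minutes=16))
```

Because the window rolls rather than resetting at fixed intervals, a client that hits the limit becomes eligible again as soon as its oldest command ages out.
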
## 2026.1.0-alpha.16 (2026-04-02)
- 🐛 **Dashboard holiday banner refactoring and state fix (`dashboard/src/dashboard.tsx`)**:
- **Motivation — unstable fetch function:** `loadHolidayStatus` had `location.pathname` in its `useCallback` dependency array, causing a new function reference to be created on every navigation event. The `useEffect` depending on that reference then re-fired, producing overlapping API calls at mount that cancelled each other via the request-sequence guard, leaving the banner unresolved.