
Olaf 03e3c11e90 feat: crash recovery, service_failed monitoring, broker health fields, command expiry sweep

- Add GET /api/clients/crashed endpoint (process_status=crashed or stale heartbeat)
- Add restart_app command action with same lifecycle + lockout as reboot_host
- Scheduler: crash auto-recovery loop (CRASH_RECOVERY_ENABLED flag, lockout, MQTT publish)
- Scheduler: unconditional command expiry sweep per poll cycle (sweep_expired_commands)
- Listener: subscribe to infoscreen/+/service_failed; persist service_failed_at + unit
- Listener: extract broker_connection block from health payload; persist reconnect_count + last_disconnect_at
- DB migration b1c2d3e4f5a6: service_failed_at, service_failed_unit, mqtt_reconnect_count, mqtt_last_disconnect_at on clients
- Add GET /api/clients/service_failed and POST /api/clients/<uuid>/clear_service_failed
- Monitoring overview API: include mqtt_reconnect_count + mqtt_last_disconnect_at per client
- Frontend: orange service-failed alert panel (hidden when empty, auto-refresh, quittieren action)
- Frontend: MQTT reconnect count + last disconnect in client detail panel
- MQTT auth hardening: listener/scheduler/server use env credentials; broker enforces allow_anonymous false
- Client command lifecycle foundation: ClientCommand model, reboot_host/shutdown_host, full ACK lifecycle
- Docs: TECH-CHANGELOG, DEV-CHANGELOG, MQTT_EVENT_PAYLOAD_GUIDE, copilot-instructions updated
- Add implementation-plans/, RESTART_VALIDATION_CHECKLIST.md, TODO.md

2026-04-05 10:17:56 +00:00


Server Team Action Items — Infoscreen Client

This document lists everything the server/infrastructure/frontend team must implement to complete the client integration. The client-side code is production-ready for all items listed here.

1. MQTT Broker Hardening (prerequisite for everything else)

- Disable anonymous access on the broker.
- Create one broker account per client device:
  - Username convention: infoscreen-client-<uuid-prefix> (e.g. infoscreen-client-9b8d1856)
  - Provision the password to the device .env as MQTT_PASSWORD_BROKER=
- Create a server/publisher account (e.g. infoscreen-server) for all server-side publishes.
- Enforce ACLs:

Topic	Publisher
`infoscreen/{uuid}/commands`	server only
`infoscreen/{uuid}/command` (alias)	server only
`infoscreen/{uuid}/group_id`	server only
`infoscreen/events/{group_id}`	server only
`infoscreen/groups/+/power/intent`	server only
`infoscreen/{uuid}/commands/ack`	client only
`infoscreen/{uuid}/command/ack`	client only
`infoscreen/{uuid}/heartbeat`	client only
`infoscreen/{uuid}/health`	client only
`infoscreen/{uuid}/logs/#`	client only
`infoscreen/{uuid}/service_failed`	client only (the server additionally needs publish access to clear the retained message, see section 4)
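
The publisher rules in the table can be checked mechanically, for example in a test that guards the broker ACL configuration. The following is a minimal sketch only: `ACL_RULES`, `topic_matches`, and `expected_publisher` are hypothetical names, the `{uuid}` and `{group_id}` placeholders are written as the MQTT single-level wildcard `+`, and the rules mirror the table's publisher column as written. Subscribe permissions are not covered by the table and are not modelled here.

```python
# Publisher rules copied from the ACL table above.
ACL_RULES = [
    ("infoscreen/+/commands", "server"),
    ("infoscreen/+/command", "server"),
    ("infoscreen/+/group_id", "server"),
    ("infoscreen/events/+", "server"),
    ("infoscreen/groups/+/power/intent", "server"),
    ("infoscreen/+/commands/ack", "client"),
    ("infoscreen/+/command/ack", "client"),
    ("infoscreen/+/heartbeat", "client"),
    ("infoscreen/+/health", "client"),
    ("infoscreen/+/logs/#", "client"),
    ("infoscreen/+/service_failed", "client"),
]


def topic_matches(pattern: str, topic: str) -> bool:
    """Match a concrete topic against a filter using MQTT '+' and '#' rules."""
    p_parts = pattern.split("/")
    t_parts = topic.split("/")
    for i, part in enumerate(p_parts):
        if part == "#":
            # Multi-level wildcard: matches the parent level and everything below it.
            return True
        if i >= len(t_parts):
            return False
        if part != "+" and part != t_parts[i]:
            return False
    return len(p_parts) == len(t_parts)


def expected_publisher(topic: str):
    """Return 'server' or 'client' for a known topic, or None if no rule matches."""
    for pattern, role in ACL_RULES:
        if topic_matches(pattern, topic):
            return role
    return None
```

A topic that returns `None` is outside the table and would normally be denied for both roles.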

2. Reboot / Shutdown Command — Ack Lifecycle

The client publishes every ack status update to two topics per command (the canonical topic and a transitional alias):

infoscreen/{uuid}/commands/ack
infoscreen/{uuid}/command/ack

Ack payload schema (v1, frozen):

{
  "command_id": "07aab032-53c2-45ef-a5a3-6aa58e9d9fae",
  "status": "accepted | execution_started | completed | failed",
  "error_code": null,
  "error_message": null
}
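
A server-side listener would typically decode and validate this payload before touching the database. The sketch below assumes nothing beyond the schema shown above; `CommandAck` and `parse_ack` are hypothetical names, and the type of `error_code` is an assumption because the schema only shows `null`.

```python
import json
from dataclasses import dataclass
from typing import Optional

VALID_STATUSES = {"accepted", "execution_started", "completed", "failed"}


@dataclass(frozen=True)
class CommandAck:
    command_id: str
    status: str
    error_code: Optional[str]      # type assumed; the schema only shows null
    error_message: Optional[str]


def parse_ack(raw: bytes) -> CommandAck:
    """Decode one ack payload; raise ValueError if it does not fit the v1 schema."""
    try:
        data = json.loads(raw)
    except ValueError as exc:  # covers JSONDecodeError and UnicodeDecodeError
        raise ValueError(f"ack is not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("ack must be a JSON object")
    command_id = data.get("command_id")
    status = data.get("status")
    if not isinstance(command_id, str) or not command_id:
        raise ValueError("ack is missing command_id")
    if not isinstance(status, str) or status not in VALID_STATUSES:
        raise ValueError(f"unknown ack status: {status!r}")
    return CommandAck(
        command_id=command_id,
        status=status,
        error_code=data.get("error_code"),
        error_message=data.get("error_message"),
    )
```

Because the same ack arrives on both the canonical and the alias topic, the caller should expect to see each `CommandAck` twice.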

Status lifecycle:

Status	When	Notes
`accepted`	Command received and validated	Immediate
`execution_started`	Helper invoked	Immediate after accepted
`completed`	Execution confirmed	For `reboot_host`: arrives after reconnect (10–90 s after `execution_started`)
`failed`	Helper returned error	`error_code` and `error_message` will be set

Server must:

- Track command_id through the full lifecycle and update status in DB/UI.
- Surface failed + error_code to the operator UI.
- Expect reboot_host completed to arrive after a reconnect delay; do not treat the gap as a timeout.
- Use expires_at from the original command to determine when to abandon waiting.
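
The requirements above can be sketched as a small forward-only state tracker. This is an illustration under assumptions, not the server's actual implementation: `advance` and `should_abandon` are hypothetical helpers, and `expires_at` is assumed to be an ISO 8601 timestamp with a UTC offset or a trailing `Z`.

```python
from datetime import datetime, timezone
from typing import Optional

# Forward-only ranking; "completed" and "failed" are both terminal.
RANK = {None: 0, "accepted": 1, "execution_started": 2, "completed": 3, "failed": 3}
TERMINAL_RANK = 3


def advance(current: Optional[str], incoming: str) -> Optional[str]:
    """Return the status to store after receiving `incoming`.

    Duplicate acks (canonical + alias topic) and out-of-order acks never move
    the stored status backwards, and a terminal status is never overwritten.
    """
    if incoming is None or incoming not in RANK:
        raise ValueError(f"unknown ack status: {incoming!r}")
    return incoming if RANK[incoming] > RANK[current] else current


def should_abandon(current: Optional[str], expires_at: str, now: datetime) -> bool:
    """True once expires_at has passed without a terminal ack."""
    if RANK[current] == TERMINAL_RANK:
        return False
    # Accept a trailing "Z" on Python versions where fromisoformat rejects it.
    deadline = datetime.fromisoformat(expires_at.replace("Z", "+00:00"))
    return now >= deadline
```

Note that the reconnect gap after `execution_started` needs no special handling here: the command simply stays in `execution_started` until either `completed` arrives or `expires_at` passes.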

3. Health Dashboard — Broker Connection Fields (Gap 2)

Every infoscreen/{uuid}/health payload now includes a broker_connection block:

{
  "timestamp": "2026-04-05T08:00:00.000000+00:00",
  "expected_state": { "event_id": 42 },
  "actual_state": {
    "process": "display_manager",
    "pid": 1234,
    "status": "running"
  },
  "broker_connection": {
    "broker_reachable": true,
    "reconnect_count": 2,
    "last_disconnect_at": "2026-04-04T10:30:00Z"
  }
}
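
To persist these fields, a listener can pull the block out of the health payload and tolerate clients that do not send it yet. The sketch below is hypothetical: `extract_broker_fields` is an illustrative name, and the output keys `mqtt_reconnect_count` and `mqtt_last_disconnect_at` are assumed to match the column names on the clients table.

```python
import json
from datetime import datetime
from typing import Optional


def _parse_ts(value: str) -> datetime:
    # Accept a trailing "Z" on Python versions where fromisoformat rejects it.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def extract_broker_fields(raw: bytes) -> Optional[dict]:
    """Return the broker_connection values to persist, or None if the block is absent."""
    health = json.loads(raw)
    if not isinstance(health, dict):
        return None
    block = health.get("broker_connection")
    if not isinstance(block, dict):
        return None  # e.g. an older client that does not send the block
    count = block.get("reconnect_count")
    last = block.get("last_disconnect_at")
    # bool is a subclass of int in Python, so exclude it explicitly.
    valid_count = isinstance(count, int) and not isinstance(count, bool)
    return {
        "broker_reachable": block.get("broker_reachable"),
        "mqtt_reconnect_count": count if valid_count else None,
        "mqtt_last_disconnect_at": _parse_ts(last) if isinstance(last, str) else None,
    }
```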

Server must:

- Display reconnect_count and last_disconnect_at per device in the health dashboard.
- Implement alerting heuristic:
  - All clients go silent simultaneously → likely broker outage, not device crash.
  - Single client goes silent → device crash, network failure, or process hang.
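
One possible shape for this heuristic is shown below. It is a sketch under stated assumptions: `classify_silence` is a hypothetical helper, the 180-second silence threshold is an arbitrary placeholder, and the case where several but not all clients are silent (which the heuristic above does not define) is treated the same as a single silent client.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, List, Tuple


def classify_silence(
    last_seen: Dict[str, datetime],
    now: datetime,
    silence_after: timedelta = timedelta(seconds=180),  # placeholder threshold
) -> Tuple[str, List[str]]:
    """Classify the fleet from per-client last-heartbeat times.

    Returns one of:
      ("ok", [])                               no client is silent
      ("suspected_broker_outage", [uuids])     every known client is silent
      ("client_silent", [uuids])               only some clients are silent
    """
    silent = sorted(
        uuid for uuid, seen in last_seen.items() if now - seen > silence_after
    )
    if not silent:
        return ("ok", [])
    # With a single registered client, "all silent" carries no extra signal,
    # so it is reported as an individual client problem.
    if len(last_seen) > 1 and len(silent) == len(last_seen):
        return ("suspected_broker_outage", silent)
    return ("client_silent", silent)
```

In practice the `suspected_broker_outage` result would usually be cross-checked against the server's own broker connection before alerting.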

4. Service-Failed MQTT Notification (Gap 3)

When systemd gives up restarting a service after repeated crashes (StartLimitBurst exceeded), the client automatically publishes a retained message:

Topic: infoscreen/{uuid}/service_failed

Payload:

{
  "event": "service_failed",
  "unit": "infoscreen-simclient.service",
  "client_uuid": "9b8d1856-ff34-4864-a726-12de072d0f77",
  "failed_at": "2026-04-05T08:00:00Z"
}

Server must:

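- Subscribe to infoscreen/+/service_failed on startup. The message is retained, so the server receives it on subscribing even if it was published while the server was offline; with broker persistence enabled it also survives a broker restart.
- Alert the operator immediately when this topic receives a non-empty payload (clearing the retained message delivers an empty one).
- Clear the retained message once the device is acknowledged or recovered: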
```
mosquitto_pub -t "infoscreen/{uuid}/service_failed" -n --retain
```
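
Because clearing a retained message is itself a publish with an empty payload, a subscriber on this topic sees both the failure notice and the later clear. The sketch below shows one way a message handler could distinguish the two. `handle_service_failed` is a hypothetical name, taking the client UUID from the topic rather than the payload is a design assumption, and the output keys `service_failed_unit` and `service_failed_at` are assumed column names.

```python
import json
from typing import Optional


def handle_service_failed(topic: str, payload: bytes) -> Optional[dict]:
    """Return the fields to persist and alert on, or None for a clear message."""
    parts = topic.split("/")
    if len(parts) != 3 or parts[0] != "infoscreen" or parts[2] != "service_failed":
        raise ValueError(f"unexpected topic: {topic!r}")
    if not payload:
        # Zero-length payload: the retained message was cleared, not a new failure.
        return None
    data = json.loads(payload)
    if not isinstance(data, dict) or data.get("event") != "service_failed":
        raise ValueError("payload is not a service_failed event")
    return {
        # The topic is ACL-protected per client, so it is used as the identity.
        "client_uuid": parts[1],
        "service_failed_unit": data.get("unit"),
        "service_failed_at": data.get("failed_at"),
    }
```

A `None` result is the natural place to drop any pending alert state for the client, while a dict result triggers the operator alert.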

5. No Server Action Required

These items are fully implemented client-side and require no server changes:

- systemd watchdog (WatchdogSec=60): hangs are detected and the process is restarted automatically.
- Command deduplication: command_id is deduplicated with a 24-hour TTL.
- Ack retry backoff: the client retries the ack publish after a broker disconnect until expires_at.
- Mock helper / test mode (COMMAND_MOCK_REBOOT_IMMEDIATE_COMPLETE): development only.
