feat: remote commands, systemd units, process observability, broker auth split

- Command intake (reboot/shutdown) on infoscreen/{uuid}/commands with ack lifecycle
- MQTT_USER/MQTT_PASSWORD_BROKER split from identity vars; configure_mqtt_security() updated
- infoscreen-simclient.service: Type=notify, WatchdogSec=60, Restart=on-failure
- infoscreen-notify-failure@.service + script: retained MQTT alert when systemd gives up (Gap 3)
- _sd_notify() watchdog keepalive in simclient main loop (Gap 1)
- broker_connection block in health payload: reconnect_count, last_disconnect_at (Gap 2)
- COMMAND_MOCK_REBOOT_IMMEDIATE_COMPLETE canary flag with safety guard
- SERVER_TEAM_ACTIONS.md: server-side integration action items
- Docs: README, CHANGELOG, src/README, copilot-instructions updated
- 43 tests passing
This commit is contained in:
RobbStarkAustria
2026-04-05 08:36:50 +02:00
parent 82f43f75ba
commit 0cd0d95612
28 changed files with 2487 additions and 36 deletions

88
TODO.md
View File

@@ -25,10 +25,98 @@ This file tracks higher-level todos and design notes for the infoscreen client.
- `set_volume()` issues appropriate CEC commands and returns success/failure.
- Document any platform limitations (some TVs don't support absolute volume via CEC).
## Systemd crash recovery (server team recommendation)
Reliable restart-on-crash for both processes must be handled by **systemd**, not by in-process watchdogs or ad-hoc shell scripts.
### What needs to be done
- `display_manager`: already has `scripts/infoscreen-display.service` with `Restart=on-failure` / `RestartSec=10`.
- Review `RestartSec` — may want a short backoff (e.g. 515 s) and `StartLimitIntervalSec` + `StartLimitBurst` to prevent thrash loops.
- `simclient`: **no service unit exists yet**.
- Create `scripts/infoscreen-simclient.service` modelled on the display service.
- Use `Restart=on-failure` and `RestartSec=10`.
- Wire `EnvironmentFile=/home/olafn/infoscreen-dev/.env` so the unit picks up `.env` variables automatically.
- Set `After=network-online.target` so MQTT connection is not attempted before the network is ready.
- Both units should be installed and enabled via `src/pi-setup.sh` (`systemctl enable --now`).
- After enabling, verify crash recovery with `kill -9 <pid>` and confirm systemd restarts the process within `RestartSec`.
### Acceptance criteria
- Both `simclient` and `display_manager` restart automatically within 15 s of any non-intentional exit.
- `systemctl status` shows `active (running)` after a crash-induced restart.
- `journalctl -u infoscreen-simclient` captures all process output (stdout + stderr).
- `pi-setup.sh` idempotently installs and enables both units.
### Notes
- Use `Restart=on-failure` — restarts on crashes and signals but not on clean `systemctl stop`, preserving operator control during deployments.
- The reboot/shutdown command flow publishes `execution_started` and then exits intentionally; systemd will restart simclient, and the recovery logic in the heartbeat loop will emit `completed` on reconnect. This is the intended lifecycle.
## Process health observability gaps
Two scenarios are currently undetected or ambiguous from the server/frontend perspective.
### Gap 1: Hung / deadlocked process ✅ implemented
**Solution implemented:** Zero-dependency `_sd_notify()` helper writes directly to `NOTIFY_SOCKET` (raw Unix socket, no extra package). `READY=1` is sent when the heartbeat loop starts; `WATCHDOG=1` is sent every 5 s in the main loop iteration. The service unit uses `Type=notify` + `WatchdogSec=60` — if the main loop freezes for 60 s, systemd kills and restarts the process automatically.
**To apply on device:**
```bash
sudo cp ~/infoscreen-dev/scripts/infoscreen-simclient.service /etc/systemd/system/
sudo systemctl daemon-reload
sudo systemctl restart infoscreen-simclient
```
### Gap 2: MQTT broker unreachable vs. simclient dead ✅ implemented (client side)
**Solution implemented:** `connection_state` dict expanded with `reconnect_count` and `connect_count`. `publish_health_message()` now accepts `connection_state` and appends a `broker_connection` block to every health payload:
```json
"broker_connection": {
"broker_reachable": true,
"reconnect_count": 2,
"last_disconnect_at": "2026-04-04T10:00:00Z"
}
```
`broker_reachable` = `true` when MQTT is connected at publish time.
`reconnect_count` increments on every reconnection (first connect does not count).
`last_disconnect_at` is the UTC timestamp of the most recent disconnect.
**Server-side action still needed:**
- Display `reconnect_count` and `last_disconnect_at` in the frontend health dashboard.
- Alert heuristic: if **all** clients go silent simultaneously → likely broker issue; if only one → likely device issue.
### Gap 3: systemd gives up (StartLimitBurst exceeded) ✅ implemented
**Solution implemented:** `scripts/infoscreen-notify-failure@.service` (template unit) + `scripts/infoscreen-notify-failure.sh`. Both main units have `OnFailure=infoscreen-notify-failure@%n.service`. When systemd marks a service `failed`, the notifier runs once, reads broker credentials from `.env`, reads `client_uuid.txt`, and publishes a retained JSON payload to `infoscreen/{uuid}/service_failed` via `mosquitto_pub`.
**To apply on device:**
```bash
sudo cp ~/infoscreen-dev/scripts/infoscreen-notify-failure@.service /etc/systemd/system/
sudo cp ~/infoscreen-dev/scripts/infoscreen-simclient.service /etc/systemd/system/
sudo cp ~/infoscreen-dev/scripts/infoscreen-display.service /etc/systemd/system/
sudo systemctl daemon-reload
sudo systemctl restart infoscreen-simclient infoscreen-display
```
**Topic:** `infoscreen/{client_uuid}/service_failed` (retained)
**Payload:** `{"event":"service_failed","unit":"infoscreen-simclient.service","client_uuid":"...","failed_at":"2026-..."}`
## Next-high-level items
- Add environment-controlled libVLC hw-accel toggle (`VLC_HW_ACCEL=1|0`) to `display_manager.py` so software decode can be forced when necessary.
- Add automated tests for video start/stop lifecycle (mock python-vlc) to ensure resources are released on event end.
- Add allowlist validation for `website` / `webuntis` event URLs
- Goal: restrict browser-based events to approved hosts and schemes even if an authenticated publisher sends an unsafe URL.
- Ideas / approaches:
- Add env-configurable allowlists for general website hosts and WebUntis hosts.
- Allow only `https` by default and reject `file:`, `data:`, `javascript:`, loopback, and private-address URLs unless explicitly allowed.
- Enforce the same validation on both server-side payload generation and client-side execution in `display_manager.py`.
- Acceptance criteria:
- Unsafe or unapproved URLs are rejected before Chromium is launched.
- WebUntis and approved website events still work with explicit allowlist configuration.
## Notes