feat: crash recovery, service_failed monitoring, broker health fields, command expiry sweep
- Add GET /api/clients/crashed endpoint (process_status=crashed or stale heartbeat)
- Add restart_app command action with same lifecycle + lockout as reboot_host
- Scheduler: crash auto-recovery loop (CRASH_RECOVERY_ENABLED flag, lockout, MQTT publish)
- Scheduler: unconditional command expiry sweep per poll cycle (sweep_expired_commands)
- Listener: subscribe to infoscreen/+/service_failed; persist service_failed_at + unit
- Listener: extract broker_connection block from health payload; persist reconnect_count + last_disconnect_at
- DB migration b1c2d3e4f5a6: service_failed_at, service_failed_unit, mqtt_reconnect_count, mqtt_last_disconnect_at on clients
- Add GET /api/clients/service_failed and POST /api/clients/<uuid>/clear_service_failed
- Monitoring overview API: include mqtt_reconnect_count + mqtt_last_disconnect_at per client
- Frontend: orange service-failed alert panel (hidden when empty, auto-refresh, acknowledge ("quittieren") action)
- Frontend: MQTT reconnect count + last disconnect in client detail panel
- MQTT auth hardening: listener/scheduler/server use env credentials; broker enforces allow_anonymous false
- Client command lifecycle foundation: ClientCommand model, reboot_host/shutdown_host, full ACK lifecycle
- Docs: TECH-CHANGELOG, DEV-CHANGELOG, MQTT_EVENT_PAYLOAD_GUIDE, copilot-instructions updated
- Add implementation-plans/, RESTART_VALIDATION_CHECKLIST.md, TODO.md
149
RESTART_VALIDATION_CHECKLIST.md
Normal file
@@ -0,0 +1,149 @@
# Restart Validation Checklist

Purpose: Validate end-to-end restart command flow after MQTT auth hardening.

## Scope

- API command issue route: `POST /api/clients/{uuid}/restart`
- MQTT command topic: `infoscreen/{uuid}/commands` (compat: `infoscreen/{uuid}/command`)
- MQTT ACK topic: `infoscreen/{uuid}/commands/ack` (compat: `infoscreen/{uuid}/command/ack`)
- Status API: `GET /api/clients/commands/{command_id}`

## Preconditions

- Stack is up and healthy (`db`, `mqtt`, `server`, `listener`, `scheduler`).
- You have an `admin` or `superadmin` account.
- At least one canary client is online and can process restart commands.
- `.env` has valid `MQTT_USER` / `MQTT_PASSWORD`.

## 1) Open Monitoring Session (MQTT)

On host/server:

```bash
set -a
. ./.env
set +a

mosquitto_sub -h 127.0.0.1 -p 1883 \
  -u "$MQTT_USER" -P "$MQTT_PASSWORD" \
  -t "infoscreen/+/commands" \
  -t "infoscreen/+/commands/ack" \
  -t "infoscreen/+/command" \
  -t "infoscreen/+/command/ack" \
  -v
```

Expected:
- Command publish appears on `infoscreen/{uuid}/commands`.
- ACK(s) appear on `infoscreen/{uuid}/commands/ack`.
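
With several clients online, the `-v` output interleaves topics from all of them. The following is a minimal sketch of two helpers that split a topic into its client UUID and message kind, so the monitor output can be filtered to the canary. The function names `topic_uuid` and `topic_kind` are hypothetical, and the sketch assumes the topic layout listed under Scope.

```bash
# Hypothetical helpers (not part of the repository).

# Print the UUID segment of "infoscreen/<uuid>/...".
# A topic without the "infoscreen/" prefix yields its first segment unchanged.
topic_uuid() {
  _rest=${1#infoscreen/}
  printf '%s\n' "${_rest%%/*}"
}

# Classify a topic as "ack", "command", or "other".
# Both the current and the compat topic names are accepted.
topic_kind() {
  case "$1" in
    infoscreen/*/commands/ack|infoscreen/*/command/ack) echo ack ;;
    infoscreen/*/commands|infoscreen/*/command) echo command ;;
    *) echo other ;;
  esac
}

# Possible usage with the subscription above (CLIENT_UUID set by you):
#   mosquitto_sub ... -v | while read -r topic payload; do
#     [ "$(topic_uuid "$topic")" = "$CLIENT_UUID" ] &&
#       echo "$(topic_kind "$topic"): $payload"
#   done
```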

## 2) Login and Keep Session Cookie

```bash
API_BASE="http://127.0.0.1:8000"
# LOGIN_USER / LOGIN_PASS are used instead of USER / PASS so the shell's
# standard USER environment variable is not overwritten.
LOGIN_USER="<admin_or_superadmin_username>"
LOGIN_PASS="<password>"

curl -sS -X POST "$API_BASE/api/auth/login" \
  -H "Content-Type: application/json" \
  -d "{\"username\":\"$LOGIN_USER\",\"password\":\"$LOGIN_PASS\"}" \
  -c /tmp/infoscreen-cookies.txt
```

Expected:
- Login success response.
- Cookie jar file created at `/tmp/infoscreen-cookies.txt`.
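
The login body above is built by plain string interpolation, so a password containing a double quote or a backslash produces invalid JSON. As a hedged sketch, the hypothetical helpers `json_escape` and `login_body` below escape those two characters before building the body. Control characters are not handled, so this is not a general JSON encoder.

```bash
# Hypothetical helpers (not part of the repository).

# Escape backslashes first, then double quotes, for use inside a JSON string.
json_escape() {
  printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g'
}

# Build the login request body from a username and a password.
login_body() {
  printf '{"username":"%s","password":"%s"}\n' \
    "$(json_escape "$1")" "$(json_escape "$2")"
}
```

The result can be passed to `curl` as `-d "$(login_body "<username>" "<password>")"`.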

## 3) Pick Target Client UUID

Option A: Use known canary UUID.

Option B: Query alive clients:

```bash
curl -sS "$API_BASE/api/clients/with_alive_status" -b /tmp/infoscreen-cookies.txt
```

Choose one `uuid` where `is_alive` is `true`.

## 4) Issue Restart Command

```bash
CLIENT_UUID="<target_uuid>"

curl -sS -X POST "$API_BASE/api/clients/$CLIENT_UUID/restart" \
  -H "Content-Type: application/json" \
  -b /tmp/infoscreen-cookies.txt \
  -d '{"reason":"canary_restart_validation"}'
```

Expected:
- HTTP `202` on success.
- The JSON response includes `command.commandId` and an initial status at or near `published`.
- In the MQTT monitor, a command payload with:
  - `schema_version: "1.0"`
  - `action: "reboot_host"`
  - matching `command_id`.

## 5) Poll Command Lifecycle Until Terminal

```bash
COMMAND_ID="<command_id_from_previous_step>"

for i in $(seq 1 20); do
  curl -sS "$API_BASE/api/clients/commands/$COMMAND_ID" -b /tmp/infoscreen-cookies.txt
  echo
  sleep 3
done
```

Expected status progression (typical):
- `queued` -> `publish_in_progress` -> `published` -> `ack_received` -> `execution_started` -> `completed`

Failure/alternate terminal states:
- `failed` (check `errorCode` / `errorMessage`)
- `blocked_safety` (reboot lockout triggered)
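
The loop above always runs all 20 iterations. A possible variant that stops as soon as a terminal state is reached is sketched below. It treats only the three terminal states named in this checklist (`completed`, `failed`, `blocked_safety`) as terminal; any other terminal state the server defines would have to be added. The names `is_terminal_status` and `poll_until_terminal` are hypothetical, and the caller must supply a command that prints the bare status string, because the shape of the status response is not specified here.

```bash
# Hypothetical helpers (not part of the repository).

# Succeed if the given status is one of the terminal states from this checklist.
is_terminal_status() {
  case "$1" in
    completed|failed|blocked_safety) return 0 ;;
    *) return 1 ;;
  esac
}

# poll_until_terminal FETCH_CMD [MAX_ATTEMPTS] [DELAY_SECONDS]
# FETCH_CMD is the name of a command or function that prints the current status.
# Returns 0 once a terminal status is seen, 1 if the attempts run out.
poll_until_terminal() {
  _fetch_cmd=$1
  _max_attempts=${2:-20}
  _delay=${3:-3}
  _attempt=1
  while [ "$_attempt" -le "$_max_attempts" ]; do
    _status=$("$_fetch_cmd") || _status=""
    echo "attempt $_attempt: ${_status:-<no status>}"
    if is_terminal_status "$_status"; then
      return 0
    fi
    _attempt=$((_attempt + 1))
    sleep "$_delay"
  done
  return 1
}
```

The defaults of 20 attempts and a 3 second delay mirror the loop above. A terminal status is not necessarily a success, so the final status still needs to be checked against the expected outcome.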

## 6) Validate Offline/Timeout Behavior

- Repeat step 4 for an offline client (or stop client process first).
- Confirm command does not falsely end as `completed`.
- Confirm status remains non-success and has usable failure diagnostics.

## 7) Validate Safety Lockout

Current lockout in API route:
- Threshold: 3 reboot commands
- Window: 15 minutes

Test:
- Send 4 restart commands in quick succession for the same `uuid`.

Expected:
- One of the requests returns HTTP `429`.
- The corresponding command entry has state `blocked_safety` with lockout error details.
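
To make the expected outcome concrete, the sliding-window rule can be sketched as follows. This is an illustration only, not the API route's code: it assumes a new command is blocked once 3 earlier reboot commands fall within the last 15 minutes, which matches "4 requests, one rejected". How the real route treats the window boundary and which command states it counts may differ. `is_locked_out` is a hypothetical name.

```bash
# Hypothetical sketch of the lockout rule (not the server implementation).
LOCKOUT_THRESHOLD=3          # reboot commands
LOCKOUT_WINDOW_SECONDS=900   # 15 minutes

# is_locked_out NOW TS...
# NOW and each TS are Unix timestamps of earlier reboot commands for one client.
# Succeeds (exit 0) when at least LOCKOUT_THRESHOLD of them lie inside the window.
is_locked_out() {
  _now=$1
  shift
  _count=0
  for _ts in "$@"; do
    if [ "$_ts" -le "$_now" ] && [ $((_now - _ts)) -le "$LOCKOUT_WINDOW_SECONDS" ]; then
      _count=$((_count + 1))
    fi
  done
  [ "$_count" -ge "$LOCKOUT_THRESHOLD" ]
}
```

Under these assumptions, three commands sent within a few seconds of each other leave the fourth locked out, while a command that is more than 15 minutes old no longer counts toward the threshold.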

## 8) Service Log Spot Check

```bash
docker compose logs --tail=150 server listener mqtt
```

Expected:
- No MQTT auth errors (`Not authorized`, `Connection Refused: not authorised`).
- Listener logs show ACK processing for `command_id`.
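
Scanning 150 log lines per service by eye is error-prone, so the auth check can be scripted. The sketch below looks only for the two error strings quoted above, case-insensitively and in both spellings. The actual wording in broker or client logs may vary, so a clean result is a hint rather than proof. `has_mqtt_auth_error` is a hypothetical name.

```bash
# Hypothetical helper (not part of the repository).
# Reads log text on stdin; succeeds if an MQTT authorization error is present.
# Matches "Not authorized" and "not authorised" regardless of case.
has_mqtt_auth_error() {
  grep -E -i -q 'not authori[sz]ed'
}

# Possible usage:
#   if docker compose logs --tail=150 server listener mqtt 2>&1 | has_mqtt_auth_error; then
#     echo "MQTT auth errors found"
#   fi
```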

## 9) Acceptance Criteria

- Restart command publish is visible on MQTT.
- ACK is received and mapped by listener.
- Status endpoint reaches correct terminal state.
- Safety lockout works under repeated restart attempts.
- No auth regression in broker/service logs.

## Cleanup

```bash
rm -f /tmp/infoscreen-cookies.txt
```