infoscreen/RESTART_VALIDATION_CHECKLIST.md
Olaf 03e3c11e90 feat: crash recovery, service_failed monitoring, broker health fields, command expiry sweep
- Add GET /api/clients/crashed endpoint (process_status=crashed or stale heartbeat)
- Add restart_app command action with same lifecycle + lockout as reboot_host
- Scheduler: crash auto-recovery loop (CRASH_RECOVERY_ENABLED flag, lockout, MQTT publish)
- Scheduler: unconditional command expiry sweep per poll cycle (sweep_expired_commands)
- Listener: subscribe to infoscreen/+/service_failed; persist service_failed_at + unit
- Listener: extract broker_connection block from health payload; persist reconnect_count + last_disconnect_at
- DB migration b1c2d3e4f5a6: service_failed_at, service_failed_unit, mqtt_reconnect_count, mqtt_last_disconnect_at on clients
- Add GET /api/clients/service_failed and POST /api/clients/<uuid>/clear_service_failed
- Monitoring overview API: include mqtt_reconnect_count + mqtt_last_disconnect_at per client
- Frontend: orange service-failed alert panel (hidden when empty, auto-refresh, acknowledge ("quittieren") action)
- Frontend: MQTT reconnect count + last disconnect in client detail panel
- MQTT auth hardening: listener/scheduler/server use env credentials; broker enforces allow_anonymous false
- Client command lifecycle foundation: ClientCommand model, reboot_host/shutdown_host, full ACK lifecycle
- Docs: TECH-CHANGELOG, DEV-CHANGELOG, MQTT_EVENT_PAYLOAD_GUIDE, copilot-instructions updated
- Add implementation-plans/, RESTART_VALIDATION_CHECKLIST.md, TODO.md
2026-04-05 10:17:56 +00:00

Restart Validation Checklist

Purpose: Validate end-to-end restart command flow after MQTT auth hardening.

Scope

  • API command issue route: POST /api/clients/{uuid}/restart
  • MQTT command topic: infoscreen/{uuid}/commands (compat: infoscreen/{uuid}/command)
  • MQTT ACK topic: infoscreen/{uuid}/commands/ack (compat: infoscreen/{uuid}/command/ack)
  • Status API: GET /api/clients/commands/{command_id}

Preconditions

  • Stack is up and healthy (db, mqtt, server, listener, scheduler).
  • You have an admin or superadmin account.
  • At least one canary client is online and can process restart commands.
  • .env has valid MQTT_USER / MQTT_PASSWORD.
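
The last precondition can be checked before opening any MQTT session. A minimal sketch, assuming .env uses plain KEY=value lines without an "export" prefix; env_has_key is a hypothetical helper, not part of the project:

```shell
# Hypothetical helper: succeeds if KEY has a non-empty value in an env file.
# Assumes plain KEY=value lines; a value of empty quotes ("" or '') counts as empty.
env_has_key() {
  # $1 = env file, $2 = key name
  grep -Eq "^$2=[\"']?[^\"']+" "$1"
}

# Usage (run from the directory that holds .env):
# for key in MQTT_USER MQTT_PASSWORD; do
#   env_has_key .env "$key" || echo "missing or empty: $key"
# done
```

This only confirms that the variables are present; whether the broker accepts them is what step 1 verifies.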

1) Open Monitoring Session (MQTT)

On host/server:

set -a
. ./.env
set +a

mosquitto_sub -h 127.0.0.1 -p 1883 \
  -u "$MQTT_USER" -P "$MQTT_PASSWORD" \
  -t "infoscreen/+/commands" \
  -t "infoscreen/+/commands/ack" \
  -t "infoscreen/+/command" \
  -t "infoscreen/+/command/ack" \
  -v

Expected:

  • Command publish appears on infoscreen/{uuid}/commands.
  • ACK(s) appear on infoscreen/{uuid}/commands/ack.

2) Log In and Save Session Cookie

# API_USER / API_PASS instead of USER: $USER is already set by the login shell.
API_BASE="http://127.0.0.1:8000"
API_USER="<admin_or_superadmin_username>"
API_PASS="<password>"

curl -sS -X POST "$API_BASE/api/auth/login" \
  -H "Content-Type: application/json" \
  -d "{\"username\":\"$API_USER\",\"password\":\"$API_PASS\"}" \
  -c /tmp/infoscreen-cookies.txt

Expected:

  • Login success response.
  • Cookie jar file created at /tmp/infoscreen-cookies.txt.
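
The cookie jar expectation can be checked mechanically, since curl may create the file even when no cookie was set. A minimal sketch; cookie_jar_has_cookie is a hypothetical helper, and it relies only on curl's Netscape cookie-file format:

```shell
# Hypothetical helper: succeeds if a curl cookie jar holds at least one cookie.
# In curl's Netscape-format jar, lines starting with "# " are comments, while
# the "#HttpOnly_" prefix marks a real (HttpOnly) cookie entry.
cookie_jar_has_cookie() {
  # $1 = cookie jar path
  [ -s "$1" ] && grep -Eq '^(#HttpOnly_|[^#[:space:]])' "$1"
}

# Usage:
# cookie_jar_has_cookie /tmp/infoscreen-cookies.txt || echo "login did not set a cookie"
```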

3) Pick Target Client UUID

Option A: Use a known canary UUID.

Option B: Query the alive clients:

curl -sS "$API_BASE/api/clients/with_alive_status" -b /tmp/infoscreen-cookies.txt

Choose one uuid where is_alive is true.

4) Issue Restart Command

CLIENT_UUID="<target_uuid>"

curl -sS -X POST "$API_BASE/api/clients/$CLIENT_UUID/restart" \
  -H "Content-Type: application/json" \
  -b /tmp/infoscreen-cookies.txt \
  -d '{"reason":"canary_restart_validation"}'

Expected:

  • HTTP 202 on success.
  • JSON includes command.commandId and an initial status at or near published (an earlier state such as queued or publish_in_progress is also possible).
  • In MQTT monitor, a command payload with:
    • schema_version: "1.0"
    • action: "reboot_host"
    • matching command_id.
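
The three payload checks above can be applied to a line copied from the MQTT monitor. A rough sketch, assuming mosquitto_sub -v prints "topic payload" on one line and the payload is single-line JSON; both helper names are hypothetical:

```shell
# has_json_string LINE KEY VALUE: true if LINE contains "KEY": "VALUE".
# VALUE is used as a regular expression, so escape dots where needed.
has_json_string() {
  printf '%s\n' "$1" | grep -Eq "\"$2\"[[:space:]]*:[[:space:]]*\"$3\""
}

# check_command_payload LINE COMMAND_ID: true if the monitor line carries the
# schema version, action, and command id listed in this step.
check_command_payload() {
  has_json_string "$1" schema_version '1\.0' &&
    has_json_string "$1" action reboot_host &&
    has_json_string "$1" command_id "$2"
}

# Usage:
# check_command_payload "$LINE_FROM_MONITOR" "$COMMAND_ID" && echo "payload OK"
```

This is plain text matching rather than JSON parsing, so it suits a quick manual check and not automated validation.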

5) Poll Command Lifecycle Until Terminal

COMMAND_ID="<command_id_from_previous_step>"

for i in $(seq 1 20); do
  curl -sS "$API_BASE/api/clients/commands/$COMMAND_ID" -b /tmp/infoscreen-cookies.txt
  echo
  sleep 3
done

Expected status progression (typical):

  • queued -> publish_in_progress -> published -> ack_received -> execution_started -> completed

Failure/alternate terminal states:

  • failed (check errorCode / errorMessage)
  • blocked_safety (reboot lockout triggered)
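
The polling loop can stop as soon as a terminal state is reached instead of always running 20 iterations. A minimal sketch, assuming the status response contains a string field named "status" (the field name is an assumption) and treating only the states named in this checklist as terminal; all three function names are hypothetical:

```shell
# Terminal states as named in this checklist; the API may define others.
is_terminal_status() {
  case "$1" in
    completed|failed|blocked_safety) return 0 ;;
    *) return 1 ;;
  esac
}

# Reads JSON on stdin and prints the value of the first "status" string field.
extract_status() {
  grep -Eo '"status"[[:space:]]*:[[:space:]]*"[^"]*"' | head -n 1 |
    sed 's/.*:[[:space:]]*"\(.*\)"/\1/'
}

# Polls up to 20 times, 3 seconds apart; returns 0 once a terminal state is seen.
poll_command() {
  # $1 = command id; relies on API_BASE and the cookie jar from the earlier steps
  for i in $(seq 1 20); do
    status=$(curl -sS "$API_BASE/api/clients/commands/$1" \
      -b /tmp/infoscreen-cookies.txt | extract_status)
    echo "poll $i: ${status:-unknown}"
    if is_terminal_status "$status"; then return 0; fi
    sleep 3
  done
  return 1
}

# Usage:
# poll_command "$COMMAND_ID" || echo "no terminal state after 20 polls"
```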

6) Validate Offline/Timeout Behavior

  • Repeat step 4 for an offline client (or stop the client process first).
  • Confirm the command does not falsely end as completed.
  • Confirm the status stays non-successful and carries usable failure diagnostics (errorCode / errorMessage).

7) Validate Safety Lockout

Current lockout in API route:

  • Threshold: 3 reboot commands
  • Window: 15 minutes

Test:

  • Send 4 restart commands in quick succession for the same uuid.

Expected:

  • One request returns HTTP 429.
  • Command entry state blocked_safety with lockout error details.
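
The lockout test can be scripted by collecting only the HTTP status codes of the four requests. A minimal sketch using the same route and cookie jar as step 4; send_restart and count_code are hypothetical helpers, and the reason string is arbitrary:

```shell
# Sends one restart request and prints only the HTTP status code.
send_restart() {
  # $1 = client uuid; relies on API_BASE and the cookie jar from the earlier steps
  curl -sS -o /dev/null -w '%{http_code}\n' -X POST \
    "$API_BASE/api/clients/$1/restart" \
    -H "Content-Type: application/json" \
    -b /tmp/infoscreen-cookies.txt \
    -d '{"reason":"lockout_validation"}'
}

# Counts how many lines on stdin equal the given status code exactly.
count_code() {
  grep -c -x "$1" || true
}

# Usage: prints the number of requests rejected with HTTP 429.
# for n in 1 2 3 4; do send_restart "$CLIENT_UUID"; done | count_code 429
```

Restart commands issued earlier in the same 15-minute window (for example in step 4) may count toward the threshold, so the 429 can arrive before the fourth request.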

8) Service Log Spot Check

docker compose logs --tail=150 server listener mqtt

Expected:

  • No MQTT auth errors (Not authorized, Connection Refused: not authorised).
  • Listener logs show ACK processing for command_id.
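
The auth-error check can be done with a filter instead of reading the logs by eye. A minimal sketch that matches both spellings quoted above, case-insensitively; find_auth_errors is a hypothetical helper:

```shell
# Prints log lines that look like MQTT auth failures.
# Exit status is 0 if at least one such line was found, 1 otherwise.
find_auth_errors() {
  grep -Ei 'not authori[sz]ed'
}

# Usage:
# docker compose logs --no-color --tail=150 server listener mqtt | find_auth_errors \
#   && echo "auth errors found" || echo "no auth errors in the last 150 lines"
```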

9) Acceptance Criteria

  • Restart command publish is visible on MQTT.
  • ACK is received and mapped by listener.
  • Status endpoint reaches correct terminal state.
  • Safety lockout works under repeated restart attempts.
  • No auth regression in broker/service logs.

Cleanup

rm -f /tmp/infoscreen-cookies.txt