feat: crash recovery, service_failed monitoring, broker health fields, command expiry sweep

- Add GET /api/clients/crashed endpoint (process_status=crashed or stale heartbeat)
- Add restart_app command action with same lifecycle + lockout as reboot_host
- Scheduler: crash auto-recovery loop (CRASH_RECOVERY_ENABLED flag, lockout, MQTT publish)
- Scheduler: unconditional command expiry sweep per poll cycle (sweep_expired_commands)
- Listener: subscribe to infoscreen/+/service_failed; persist service_failed_at + unit
- Listener: extract broker_connection block from health payload; persist reconnect_count + last_disconnect_at
- DB migration b1c2d3e4f5a6: service_failed_at, service_failed_unit, mqtt_reconnect_count, mqtt_last_disconnect_at on clients
- Add GET /api/clients/service_failed and POST /api/clients/<uuid>/clear_service_failed
- Monitoring overview API: include mqtt_reconnect_count + mqtt_last_disconnect_at per client
- Frontend: orange service-failed alert panel (hidden when empty, auto-refresh, acknowledge action)
- Frontend: MQTT reconnect count + last disconnect in client detail panel
- MQTT auth hardening: listener/scheduler/server use env credentials; broker enforces allow_anonymous false
- Client command lifecycle foundation: ClientCommand model, reboot_host/shutdown_host, full ACK lifecycle
- Docs: TECH-CHANGELOG, DEV-CHANGELOG, MQTT_EVENT_PAYLOAD_GUIDE, copilot-instructions updated
- Add implementation-plans/, RESTART_VALIDATION_CHECKLIST.md, TODO.md
2026-04-05 10:17:56 +00:00
parent 4d652f0554
commit 03e3c11e90
35 changed files with 2511 additions and 80 deletions


@@ -0,0 +1,149 @@
{
"$schema": "https://json-schema.org/draft/2020-12/schema",
"$id": "https://infoscreen.local/schemas/reboot-command-payload-schemas.json",
"title": "Infoscreen Reboot Command Payload Schemas",
"description": "Frozen v1 schemas for per-client command and command acknowledgement payloads.",
"$defs": {
"commandPayloadV1": {
"type": "object",
"additionalProperties": false,
"required": [
"schema_version",
"command_id",
"client_uuid",
"action",
"issued_at",
"expires_at",
"requested_by",
"reason"
],
"properties": {
"schema_version": {
"type": "string",
"const": "1.0"
},
"command_id": {
"type": "string",
"format": "uuid"
},
"client_uuid": {
"type": "string",
"format": "uuid"
},
"action": {
"type": "string",
"enum": [
"reboot_host",
"shutdown_host"
]
},
"issued_at": {
"type": "string",
"format": "date-time"
},
"expires_at": {
"type": "string",
"format": "date-time"
},
"requested_by": {
"type": [
"integer",
"null"
],
"minimum": 1
},
"reason": {
"type": [
"string",
"null"
],
"maxLength": 2000
}
}
},
"commandAckPayloadV1": {
"type": "object",
"additionalProperties": false,
"required": [
"command_id",
"status",
"error_code",
"error_message"
],
"properties": {
"command_id": {
"type": "string",
"format": "uuid"
},
"status": {
"type": "string",
"enum": [
"accepted",
"execution_started",
"completed",
"failed"
]
},
"error_code": {
"type": [
"string",
"null"
],
"maxLength": 128
},
"error_message": {
"type": [
"string",
"null"
],
"maxLength": 4000
}
},
"allOf": [
{
"if": {
"properties": {
"status": {
"const": "failed"
}
}
},
"then": {
"properties": {
"error_code": {
"type": "string",
"minLength": 1
},
"error_message": {
"type": "string",
"minLength": 1
}
}
}
}
]
}
},
"examples": [
{
"commandPayloadV1": {
"schema_version": "1.0",
"command_id": "5d1f8b4b-7e85-44fb-8f38-3f5d5da5e2e4",
"client_uuid": "9b8d1856-ff34-4864-a726-12de072d0f77",
"action": "reboot_host",
"issued_at": "2026-04-03T12:48:10Z",
"expires_at": "2026-04-03T12:52:10Z",
"requested_by": 1,
"reason": "operator_request"
}
},
{
"commandAckPayloadV1": {
"command_id": "5d1f8b4b-7e85-44fb-8f38-3f5d5da5e2e4",
"status": "execution_started",
"error_code": null,
"error_message": null
}
}
]
}


@@ -0,0 +1,59 @@
## Reboot Command Payload Schema Snippets
This file provides copy-ready validation snippets for client and integration test helpers.
### Canonical Topics (v1)
1. Command topic: infoscreen/{client_uuid}/commands
2. Ack topic: infoscreen/{client_uuid}/commands/ack
### Transitional Compatibility Topics
1. Command topic alias: infoscreen/{client_uuid}/command
2. Ack topic alias: infoscreen/{client_uuid}/command/ack
### Canonical Action Values
1. reboot_host
2. shutdown_host
### Ack Status Values
1. accepted
2. execution_started
3. completed
4. failed
### JSON Schema Source
Use this file for machine validation:
1. implementation-plans/reboot-command-payload-schemas.json
### Minimal Command Schema Snippet
```json
{
"type": "object",
"additionalProperties": false,
"required": ["schema_version", "command_id", "client_uuid", "action", "issued_at", "expires_at", "requested_by", "reason"],
"properties": {
"schema_version": { "const": "1.0" },
"command_id": { "type": "string", "format": "uuid" },
"client_uuid": { "type": "string", "format": "uuid" },
"action": { "enum": ["reboot_host", "shutdown_host"] },
"issued_at": { "type": "string", "format": "date-time" },
"expires_at": { "type": "string", "format": "date-time" },
"requested_by": { "type": ["integer", "null"] },
"reason": { "type": ["string", "null"] }
}
}
```
### Minimal Ack Schema Snippet
```json
{
"type": "object",
"additionalProperties": false,
"required": ["command_id", "status", "error_code", "error_message"],
"properties": {
"command_id": { "type": "string", "format": "uuid" },
"status": { "enum": ["accepted", "execution_started", "completed", "failed"] },
"error_code": { "type": ["string", "null"] },
"error_message": { "type": ["string", "null"] }
}
}
```
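For test helpers that cannot pull in a JSON Schema library, the ack rules can be approximated with the Python standard library. This is a minimal sketch, not the project's validator: the function name `validate_ack` is hypothetical, and `uuid.UUID` is slightly more lenient than the `uuid` format keyword. It also applies the rule from the full schema file that `failed` acks must carry a non-empty `error_code` and `error_message`, which the minimal snippet above omits.

```python
import uuid

ACK_STATUSES = ("accepted", "execution_started", "completed", "failed")
ACK_FIELDS = {"command_id", "status", "error_code", "error_message"}


def validate_ack(payload):
    """Return a list of problems; an empty list means the ack looks valid."""
    if not isinstance(payload, dict):
        return ["payload must be an object"]
    problems = []
    missing = ACK_FIELDS - payload.keys()
    extra = payload.keys() - ACK_FIELDS
    if missing:
        problems.append(f"missing fields: {sorted(missing)}")
    if extra:
        # additionalProperties is false in the schema.
        problems.append(f"unexpected fields: {sorted(extra)}")
    try:
        # Assumption: uuid.UUID is an acceptable stand-in for format "uuid".
        uuid.UUID(str(payload.get("command_id")))
    except ValueError:
        problems.append("command_id is not a UUID")
    status = payload.get("status")
    if status not in ACK_STATUSES:
        problems.append(f"unknown status: {status!r}")
    for field in ("error_code", "error_message"):
        value = payload.get(field)
        if value is not None and not isinstance(value, str):
            problems.append(f"{field} must be a string or null")
        # Mirrors the if/then rule in the full schema file.
        if status == "failed" and not value:
            problems.append(f"{field} is required when status is failed")
    return problems
```

An empty result means the payload passed every check; each string in a non-empty result names one violated rule, which makes the function convenient for assertion messages in integration tests.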


@@ -0,0 +1,146 @@
## Client Team Implementation Spec (Raspberry Pi 5)
### Mission
Implement client-side command handling for reliable restart and shutdown with strict validation, idempotency, acknowledgements, and reboot recovery continuity.
### Ownership Boundaries
1. Client team owns command intake, execution, acknowledgement emission, and post-reboot continuity.
2. Platform team owns command issuance, lifecycle aggregation, and server-side escalation logic.
3. Client implementation must not assume managed PoE availability.
### Required Client Behaviors
### Frozen MQTT Topics and Schemas (v1)
1. Canonical command topic: infoscreen/{client_uuid}/commands.
2. Canonical ack topic: infoscreen/{client_uuid}/commands/ack.
3. Transitional compatibility topics during migration:
- infoscreen/{client_uuid}/command
- infoscreen/{client_uuid}/command/ack
4. QoS policy: command QoS 1, ack QoS 1 recommended.
5. Retain policy: commands and acks are non-retained.
Frozen command payload schema:
```json
{
"schema_version": "1.0",
"command_id": "5d1f8b4b-7e85-44fb-8f38-3f5d5da5e2e4",
"client_uuid": "9b8d1856-ff34-4864-a726-12de072d0f77",
"action": "reboot_host",
"issued_at": "2026-04-03T12:48:10Z",
"expires_at": "2026-04-03T12:52:10Z",
"requested_by": 1,
"reason": "operator_request"
}
```
Frozen ack payload schema:
```json
{
"command_id": "5d1f8b4b-7e85-44fb-8f38-3f5d5da5e2e4",
"status": "execution_started",
"error_code": null,
"error_message": null
}
```
Allowed ack status values:
1. accepted
2. execution_started
3. completed
4. failed
Frozen command action values for v1:
1. reboot_host
2. shutdown_host
Reserved but not emitted by server in v1:
1. restart_service
Validation snippets for helper scripts:
1. Human-readable snippets: implementation-plans/reboot-command-payload-schemas.md
2. Machine-validated JSON Schema: implementation-plans/reboot-command-payload-schemas.json
### 1. Command Intake
1. Subscribe to the canonical command topic with QoS 1.
2. Parse required fields: schema_version, command_id, client_uuid, action, issued_at, expires_at, requested_by, reason.
3. Reject invalid payloads with failed acknowledgement including error_code and diagnostic message.
4. Reject stale commands when current time exceeds expires_at.
5. Ignore already-processed command_id values.
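The intake order above (validate, then expiry, then dedupe) could be sketched as follows. This is an illustration under assumptions: `intake_decision` and `parse_utc` are hypothetical names, the mapping of each rejection to an error code is a guess based on the minimum error-code set in this spec, and only a subset of the schema is checked.

```python
from datetime import datetime, timezone

REQUIRED = ("schema_version", "command_id", "client_uuid", "action",
            "issued_at", "expires_at", "requested_by", "reason")
ACTIONS = ("reboot_host", "shutdown_host")


def parse_utc(value):
    """Parse a timestamp such as 2026-04-03T12:52:10Z into an aware datetime."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def intake_decision(payload, seen_ids, now=None):
    """Return (decision, error_code) for one decoded command payload."""
    now = now or datetime.now(timezone.utc)
    if any(field not in payload for field in REQUIRED):
        return "reject", "missing_field"
    if payload["schema_version"] != "1.0" or payload["action"] not in ACTIONS:
        return "reject", "invalid_schema"
    if not isinstance(payload["command_id"], str):
        return "reject", "invalid_schema"
    try:
        expires_at = parse_utc(payload["expires_at"])
    except (AttributeError, ValueError):
        return "reject", "invalid_schema"
    if expires_at.tzinfo is None:
        # A timestamp without an offset cannot be compared against UTC.
        return "reject", "invalid_schema"
    if now > expires_at:
        return "reject", "stale_command"
    if payload["command_id"] in seen_ids:
        # Duplicates are ignored rather than executed a second time.
        return "ignore", "duplicate_command"
    return "accept", None
```

Passing `now` explicitly keeps the function deterministic in tests; in production the default of the current UTC time applies.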
### 2. Idempotency And Persistence
1. Persist processed command_id and execution result on local storage.
2. Persistence must survive service restart and full OS reboot.
3. On restart, reload dedupe cache before processing newly delivered commands.
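One way to meet the persistence requirement is a small SQLite table keyed by command_id. The sketch below is an assumption, not the client's actual storage layer: the class name `ProcessedCommands`, the table layout, and the `pending` placeholder result are all hypothetical.

```python
import sqlite3


class ProcessedCommands:
    """Durable dedupe store; pass a file path so state survives reboots."""

    def __init__(self, path):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS processed ("
            "command_id TEXT PRIMARY KEY, result TEXT NOT NULL)"
        )
        self.db.commit()

    def claim(self, command_id):
        """Record the id; return False if it was already recorded."""
        cur = self.db.execute(
            "INSERT OR IGNORE INTO processed VALUES (?, 'pending')",
            (command_id,),
        )
        self.db.commit()
        return cur.rowcount == 1

    def finish(self, command_id, result):
        """Store the final execution result for later reconciliation."""
        self.db.execute(
            "UPDATE processed SET result = ? WHERE command_id = ?",
            (result, command_id),
        )
        self.db.commit()

    def result(self, command_id):
        """Return the stored result, or None for an unknown command_id."""
        row = self.db.execute(
            "SELECT result FROM processed WHERE command_id = ?",
            (command_id,),
        ).fetchone()
        return row[0] if row else None
```

Because `claim` commits before the action runs, a command that triggers a reboot is already marked as seen when the service comes back up, so a redelivered QoS 1 message is not executed again.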
### 3. Acknowledgement Contract Behavior
1. Emit accepted immediately after successful validation and dedupe pass.
2. Emit execution_started immediately before invoking the command action.
3. Emit completed only when local success condition is confirmed.
4. Emit failed with structured error_code on validation or execution failure.
5. If MQTT is temporarily unavailable, retry ack publish with bounded backoff until command expiry.
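The retry rule in item 5 could look like the sketch below. The names `backoff_delays` and `publish_ack_with_retry`, the 1 second base, and the 30 second cap are assumptions for illustration; `publish` stands for any callable that returns True once the broker has accepted the ack. The deadline is tracked with a monotonic clock so wall-clock adjustments during the outage do not extend or shorten the retry window.

```python
import time


def backoff_delays(base=1.0, cap=30.0):
    """Yield exponentially growing delays, bounded by cap."""
    delay = base
    while True:
        yield delay
        delay = min(delay * 2, cap)


def publish_ack_with_retry(publish, seconds_until_expiry,
                           sleep=time.sleep, clock=time.monotonic):
    """Call publish() until it succeeds or the command has expired."""
    deadline = clock() + seconds_until_expiry
    for delay in backoff_delays():
        if publish():
            return True
        remaining = deadline - clock()
        if remaining <= 0:
            return False
        # Never sleep past the expiry deadline.
        sleep(min(delay, remaining))
```

The `sleep` and `clock` parameters exist only so tests can substitute fakes; retrying the publish never re-runs the command itself, which keeps the retry path free of duplicate execution.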
### 4. Execution Security Model
1. Execute via systemd-managed privileged helper.
2. Allow only whitelisted operations:
- reboot_host
- shutdown_host
3. Optionally keep restart_service handler as reserved path, but do not require it for v1 conformance.
4. Disallow arbitrary shell commands and untrusted arguments.
5. Enforce per-command execution timeout and terminate hung child processes.
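The whitelist and timeout rules can be illustrated with a fixed action-to-argv table. This is a sketch under stated assumptions: the `systemctl` command lines stand in for whatever the privileged helper actually invokes, `run_action` is a hypothetical name, and returning `permission_denied_local` for a non-whitelisted action is one possible reading of the error-code list.

```python
import subprocess

# Assumption: the real helper may use different commands; no part of the
# argv is ever taken from the MQTT payload.
ALLOWED_ACTIONS = {
    "reboot_host": ["systemctl", "reboot"],
    "shutdown_host": ["systemctl", "poweroff"],
}


def run_action(action, timeout_s=10, argv_table=ALLOWED_ACTIONS):
    """Run a whitelisted action; return None on success or an error code."""
    argv = argv_table.get(action)
    if argv is None:
        return "permission_denied_local"
    try:
        # No shell is involved; subprocess.run kills the child on timeout.
        subprocess.run(argv, check=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return "execution_timeout"
    except (subprocess.CalledProcessError, OSError):
        return "execution_failed"
    return None
```

Looking the action up in a table, instead of formatting it into a command string, means an unexpected `action` value can only ever produce an error code.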
### 5. Reboot Recovery Continuity
1. For reboot_host action:
- send execution_started
- trigger reboot promptly
2. During startup:
- emit heartbeat early
- emit process-health once service is ready
3. Keep last command execution state available after reboot for reconciliation.
### 6. Time And Timeout Semantics
1. Use monotonic timers for local elapsed-time checks.
2. Use UTC wall-clock only for protocol timestamps and expiry comparisons.
3. Target reconnect baseline on Pi 5 USB-SATA SSD: 90 seconds.
4. Accept cold-boot and USB enumeration ceiling up to 150 seconds.
### 7. Capability Reporting
1. Report recovery capability class:
- software_only
- managed_poe_available
- manual_only
2. Report watchdog enabled status.
3. Report boot-source metadata for diagnostics.
### 8. Error Codes Minimum Set
1. invalid_schema
2. missing_field
3. stale_command
4. duplicate_command
5. permission_denied_local
6. execution_timeout
7. execution_failed
8. broker_unavailable
9. internal_error
### Acceptance Tests (Client Team)
1. Invalid schema payload is rejected and failed ack emitted.
2. Expired command is rejected and not executed.
3. Duplicate command_id is not executed twice.
4. reboot_host emits execution_started and reconnects with heartbeat in expected window.
5. restart_service action (only where the optional reserved handler is implemented; not required for v1 conformance) completes without host reboot and emits completed.
6. MQTT outage during ack path retries correctly without duplicate execution.
7. Boot-loop protection cooperates with server-side lockout semantics.
### Delivery Artifacts
1. Client protocol conformance checklist.
2. Test evidence for all acceptance tests.
3. Runtime logs showing full lifecycle for one restart and one reboot scenario.
4. Known limitations list per image version.
### Definition Of Done
1. All acceptance tests pass on Pi 5 USB-SATA SSD test devices.
2. No duplicate execution observed under reconnect and retained-delivery edge cases.
3. Acknowledgement sequence is complete and machine-parseable for server correlation.
4. Reboot recovery continuity works without managed PoE dependencies.


@@ -0,0 +1,214 @@
## Remote Reboot Reliability Handoff (Share Document)
### Purpose
This document defines the agreed implementation scope for reliable remote reboot and shutdown of Raspberry Pi 5 clients, with monitoring-first visibility and safe escalation paths.
### Scope
1. In scope: restart and shutdown command reliability.
2. In scope: full lifecycle monitoring and audit visibility.
3. In scope: capability-tier recovery model with optional managed PoE escalation.
4. Out of scope: broader maintenance module in client-management for this cycle.
5. Out of scope: mandatory dependency on customer-managed power switching.
### Agreed Operating Model
1. Command delivery is asynchronous and lifecycle-tracked, not fire-and-forget.
2. Commands use idempotent command_id semantics with stale-command rejection by expires_at.
3. Monitoring is authoritative for operational state and escalation decisions.
4. Recovery must function even when no managed power switching is available.
### Frozen Contract v1 (Effective Immediately)
1. Canonical command topic: infoscreen/{client_uuid}/commands.
2. Canonical ack topic: infoscreen/{client_uuid}/commands/ack.
3. Transitional compatibility topics accepted during migration:
- infoscreen/{client_uuid}/command
- infoscreen/{client_uuid}/command/ack
4. QoS policy: command QoS 1, ack QoS 1 recommended.
5. Retain policy: commands and acks are non-retained.
Command payload schema (frozen):
```json
{
"schema_version": "1.0",
"command_id": "5d1f8b4b-7e85-44fb-8f38-3f5d5da5e2e4",
"client_uuid": "9b8d1856-ff34-4864-a726-12de072d0f77",
"action": "reboot_host",
"issued_at": "2026-04-03T12:48:10Z",
"expires_at": "2026-04-03T12:52:10Z",
"requested_by": 1,
"reason": "operator_request"
}
```
Ack payload schema (frozen):
```json
{
"command_id": "5d1f8b4b-7e85-44fb-8f38-3f5d5da5e2e4",
"status": "execution_started",
"error_code": null,
"error_message": null
}
```
Allowed ack status values:
1. accepted
2. execution_started
3. completed
4. failed
Frozen command action values:
1. reboot_host
2. shutdown_host
API endpoint mapping:
1. POST /api/clients/{uuid}/restart -> action reboot_host
2. POST /api/clients/{uuid}/shutdown -> action shutdown_host
Validation snippets:
1. Human-readable snippets: implementation-plans/reboot-command-payload-schemas.md
2. Machine-validated JSON Schema: implementation-plans/reboot-command-payload-schemas.json
### Command Lifecycle States
1. queued
2. publish_in_progress
3. published
4. ack_received
5. execution_started
6. awaiting_reconnect
7. recovered
8. completed
9. failed
10. expired
11. timed_out
12. canceled
13. blocked_safety
14. manual_intervention_required
### Timeout Defaults (Pi 5, USB-SATA SSD baseline)
1. queued to publish_in_progress: immediate, timeout 5 seconds.
2. publish_in_progress to published: timeout 8 seconds.
3. published to ack_received: timeout 20 seconds.
4. ack_received to execution_started: 15 seconds for service restart, 25 seconds for host reboot.
5. execution_started to awaiting_reconnect: timeout 10 seconds.
6. awaiting_reconnect to recovered: baseline 90 seconds after validation, cold-boot ceiling 150 seconds.
7. recovered to completed: 15 to 20 seconds based on fleet stability.
8. command expires_at default: 240 seconds, bounded 180 to 360 seconds.
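As a worked example of item 8, the expiry timestamp can be derived from `issued_at` with the default of 240 seconds clamped to the 180 to 360 second bounds. The helper names `compute_expires_at` and `to_wire` are hypothetical.

```python
from datetime import datetime, timedelta, timezone

MIN_TTL_S, DEFAULT_TTL_S, MAX_TTL_S = 180, 240, 360


def compute_expires_at(issued_at, ttl_seconds=DEFAULT_TTL_S):
    """Return issued_at plus a TTL clamped to the allowed bounds."""
    ttl = max(MIN_TTL_S, min(MAX_TTL_S, ttl_seconds))
    return issued_at + timedelta(seconds=ttl)


def to_wire(moment):
    """Format an aware datetime as a UTC protocol timestamp."""
    return moment.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
```

With the default TTL, a command issued at 2026-04-03T12:48:10Z expires at 2026-04-03T12:52:10Z, which matches the frozen example payload.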
### Recovery Tiers
1. Tier 0 baseline, always required: watchdog, systemd auto-restart, lifecycle tracking, manual intervention fallback.
2. Tier 1 optional: managed PoE per-port power-cycle escalation where customer infrastructure supports it.
3. Tier 2 no remote power control: direct manual intervention workflow.
### Governance And Safety
1. Role access: admin and superadmin.
2. Bulk actions require reason capture.
3. Safety lockout: maximum 3 reboot commands per client in 15 minutes.
4. Escalation cooldown: 60 seconds before automatic move to manual_intervention_required.
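The safety lockout in item 3 amounts to a per-client rolling window. The following in-memory sketch is an assumption about how it might be structured (class name `RebootLockout` is hypothetical); a real server would back this with the command table so the limit survives restarts.

```python
from collections import defaultdict, deque


class RebootLockout:
    """Allow at most max_commands per client within a rolling window."""

    def __init__(self, max_commands=3, window_s=15 * 60):
        self.max_commands = max_commands
        self.window_s = window_s
        self._history = defaultdict(deque)

    def allow(self, client_uuid, now_s):
        """Return True and record the command, or False if locked out."""
        history = self._history[client_uuid]
        # Drop timestamps that have left the rolling window.
        while history and now_s - history[0] >= self.window_s:
            history.popleft()
        if len(history) >= self.max_commands:
            return False
        history.append(now_s)
        return True
```

A rejected request is not recorded, so repeated attempts during a lockout do not extend it; whether that matches the intended blocked_safety semantics is a policy decision for the platform team.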
### MQTT Auth Hardening (Phase 1, Required Before Broad Rollout)
1. Intranet-only deployment is not sufficient protection for privileged MQTT actions by itself.
2. Phase 1 hardening scope is broker authentication, authorization, and network restriction; payload URL allowlisting is deferred to a later client/server feature.
3. MQTT broker must disable anonymous publish/subscribe access in production.
4. MQTT broker must require authenticated identities for server-side publishers and client devices.
5. MQTT broker must enforce ACLs so that:
- only server-side services can publish to `infoscreen/{client_uuid}/commands`
- only server-side services can publish scheduler event topics
- each client can subscribe only to its own command topics and assigned event topics
- each client can publish only its own ack, heartbeat, health, dashboard, and telemetry topics
6. Broker port exposure must be restricted to the management network and approved hosts only.
7. TLS support is strongly recommended in this phase and should be enabled when operationally feasible.
### Server Team Actions For Auth Hardening
1. Provision broker credentials for command/event publishers and for client devices.
2. Configure Mosquitto or equivalent broker ACLs for per-topic publish and subscribe restrictions.
3. Disable anonymous access on production brokers.
4. Restrict broker network exposure with firewall rules, VLAN policy, or equivalent network controls.
5. Update server/frontend deployment to publish MQTT with authenticated credentials.
6. Validate that server-side event publishing and reboot/shutdown command publishing still work under the new ACL policy.
7. Coordinate credential distribution and rotation with the client deployment process.
### MQTT ACL Matrix (Canonical Baseline)
| Actor | Topic Pattern | Publish | Subscribe | Notes |
| --- | --- | --- | --- | --- |
| scheduler-service | infoscreen/events/+ | Yes | No | Publishes retained active event list per group. |
| api-command-publisher | infoscreen/+/commands | Yes | No | Publishes canonical reboot/shutdown commands. |
| api-command-publisher | infoscreen/+/command | Yes | No | Transitional compatibility publish only. |
| api-group-assignment | infoscreen/+/group_id | Yes | No | Publishes retained client-to-group assignment. |
| listener-service | infoscreen/+/commands/ack | No | Yes | Consumes canonical client command acknowledgements. |
| listener-service | infoscreen/+/command/ack | No | Yes | Consumes transitional compatibility acknowledgements. |
| listener-service | infoscreen/+/heartbeat | No | Yes | Consumes heartbeat telemetry. |
| listener-service | infoscreen/+/health | No | Yes | Consumes health telemetry. |
| listener-service | infoscreen/+/dashboard | No | Yes | Consumes dashboard screenshot payloads. |
| listener-service | infoscreen/+/screenshot | No | Yes | Consumes screenshot payloads (if enabled). |
| listener-service | infoscreen/+/logs/error | No | Yes | Consumes client error logs. |
| listener-service | infoscreen/+/logs/warn | No | Yes | Consumes client warn logs. |
| listener-service | infoscreen/+/logs/info | No | Yes | Consumes client info logs. |
| listener-service | infoscreen/discovery | No | Yes | Consumes discovery announcements. |
| listener-service | infoscreen/+/discovery_ack | Yes | No | Publishes discovery acknowledgements. |
| client-<uuid> | infoscreen/<uuid>/commands | No | Yes | Canonical command intake for this client only. |
| client-<uuid> | infoscreen/<uuid>/command | No | Yes | Transitional compatibility intake for this client only. |
| client-<uuid> | infoscreen/events/<group_id> | No | Yes | Assigned group event feed only; dynamic per assignment. |
| client-<uuid> | infoscreen/<uuid>/commands/ack | Yes | No | Canonical command acknowledgements for this client only. |
| client-<uuid> | infoscreen/<uuid>/command/ack | Yes | No | Transitional compatibility acknowledgements for this client only. |
| client-<uuid> | infoscreen/<uuid>/heartbeat | Yes | No | Heartbeat telemetry. |
| client-<uuid> | infoscreen/<uuid>/health | Yes | No | Health telemetry. |
| client-<uuid> | infoscreen/<uuid>/dashboard | Yes | No | Dashboard status and screenshot payloads. |
| client-<uuid> | infoscreen/<uuid>/screenshot | Yes | No | Screenshot payloads (if enabled). |
| client-<uuid> | infoscreen/<uuid>/logs/error | Yes | No | Error log stream. |
| client-<uuid> | infoscreen/<uuid>/logs/warn | Yes | No | Warning log stream. |
| client-<uuid> | infoscreen/<uuid>/logs/info | Yes | No | Info log stream. |
| client-<uuid> | infoscreen/discovery | Yes | No | Discovery announcement. |
| client-<uuid> | infoscreen/<uuid>/discovery_ack | No | Yes | Discovery acknowledgment from listener. |
ACL implementation notes:
1. Use per-client identities; client ACLs must be scoped to exact client UUID and must not allow wildcard access to other clients.
2. Event topic subscription (`infoscreen/events/<group_id>`) should be managed via broker-side ACL provisioning that updates when group assignment changes.
3. Transitional singular command topics are temporary and should be removed after migration cutover.
4. Deny by default: any topic not explicitly listed above should be blocked for each actor.
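For brokers that use a Mosquitto-style ACL file, the client rows of the matrix could be rendered per device as sketched below. The function name `client_acl` is hypothetical, and the output assumes the `user <name>` / `topic read|write <topic>` ACL file syntax; other brokers need a different renderer.

```python
def client_acl(client_uuid, group_id, username):
    """Render ACL file lines for one client, scoped to its own UUID."""
    base = f"infoscreen/{client_uuid}"
    read_topics = [
        f"{base}/commands",
        f"{base}/command",  # transitional alias, remove after cutover
        f"{base}/discovery_ack",
        f"infoscreen/events/{group_id}",
    ]
    write_topics = [
        f"{base}/commands/ack",
        f"{base}/command/ack",  # transitional alias, remove after cutover
        f"{base}/heartbeat",
        f"{base}/health",
        f"{base}/dashboard",
        f"{base}/screenshot",
        f"{base}/logs/error",
        f"{base}/logs/warn",
        f"{base}/logs/info",
        "infoscreen/discovery",
    ]
    lines = [f"user {username}"]
    lines += [f"topic read {topic}" for topic in read_topics]
    lines += [f"topic write {topic}" for topic in write_topics]
    return "\n".join(lines) + "\n"
```

No wildcard appears in the generated entries, in line with note 1, and the event topic entry has to be regenerated whenever the client's group assignment changes, in line with note 2.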
### Credential Management Guidance
1. Real MQTT passwords must not be stored in tracked documentation or committed templates.
2. Each client device should receive a unique broker username and password, stored only in its local `.env`.
3. Server-side publisher credentials should be stored in the server team's secret-management path, not in source control.
4. Recommended naming convention for client broker users: `infoscreen-client-<client-uuid-prefix>`.
5. Client passwords should be random, at least 20 characters, and rotated through deployment tooling or broker administration procedures.
6. The server/infrastructure team owns broker-side user creation, ACL assignment, rotation, and revocation.
7. The client team owns loading credentials from local env files and validating connection behavior against the secured broker.
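Credential generation along these lines could be scripted as follows. This is a sketch: `new_client_credentials` is a hypothetical helper, and taking the first hyphen-separated group of the UUID as the prefix is an assumption about what `<client-uuid-prefix>` means.

```python
import secrets


def new_client_credentials(client_uuid):
    """Return (username, password) for one client device."""
    # Assumption: the prefix is the first group of the UUID.
    prefix = client_uuid.split("-")[0]
    username = f"infoscreen-client-{prefix}"
    # 24 random bytes encode to 32 URL-safe characters, above the
    # 20-character minimum.
    password = secrets.token_urlsafe(24)
    return username, password
```

The returned password should go straight into the broker's user database and the device's local `.env`, never into logs or tracked files.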
### Client Team Actions For Auth Hardening
1. Add MQTT username/password support in the client connection setup.
2. Add client-side TLS configuration support from environment when certificates are provided.
3. Update local test helpers to support authenticated MQTT publishing and subscription.
4. Validate command and event intake against the authenticated broker configuration before canary rollout.
### Ready For Server/Frontend Team (Auth Phase)
1. Client implementation is ready to connect with MQTT auth from local `.env` (`MQTT_USERNAME`, `MQTT_PASSWORD`, optional TLS settings).
2. Client command/event intake and client ack/telemetry publishing run over the authenticated MQTT session.
3. Server/frontend team must now complete broker-side enforcement and publisher migration.
Server/frontend done criteria:
1. Anonymous broker access is disabled in production.
2. Server-side publishers use authenticated broker credentials.
3. ACLs are active and validated for command, event, and client telemetry topics.
4. At least one canary client proves end-to-end flow under ACLs:
- server publishes command/event with authenticated publisher
- client receives payload
- client sends ack/telemetry successfully
5. Revocation test passes: disabling one client credential blocks only that client without impacting others.
Operational note:
1. Client-side auth support is necessary but not sufficient by itself; broker ACL/auth enforcement is the security control that must be enabled by the server/infrastructure team.
### Rollout Plan
1. Contract freeze and sign-off.
2. Platform and client implementation against frozen schemas.
3. One-group canary.
4. Rollback if failed plus timed_out exceeds 5 percent.
5. Expand only after 7 days below intervention threshold.
### Success Criteria
1. Deterministic command lifecycle visibility from enqueue to completion.
2. No duplicate execution under reconnect or delayed-delivery conditions.
3. Stable Pi 5 SSD reconnect behavior within defined baseline.
4. Clear and actionable manual intervention states when automatic recovery is exhausted.


@@ -0,0 +1,54 @@
## Reboot Reliability Kickoff Summary
### Objective
Ship a reliable, observable restart and shutdown workflow for Raspberry Pi 5 clients, with safe escalation and clear operator outcomes.
### What Is Included
1. Asynchronous command lifecycle with idempotent command_id handling.
2. Monitoring-first state visibility from queued to terminal outcomes.
3. Client acknowledgements for accepted, execution_started, completed, and failed.
4. Pi 5 USB-SATA SSD timeout baseline and tuning rules.
5. Capability-tier recovery with optional managed PoE escalation.
### What Is Not Included
1. Full maintenance module in client-management.
2. Required managed power-switch integration.
3. Production Wake-on-LAN rollout.
### Team Split
1. Platform team: API command lifecycle, safety controls, listener ack ingestion.
2. Web team: lifecycle-aware UX and command status display.
3. Client team: strict validation, dedupe, ack sequence, secure execution helper, reboot continuity.
### Ownership Matrix
| Team | Primary Plan File | Main Deliverables |
| --- | --- | --- |
| Platform team | implementation-plans/reboot-implementation-handoff-share.md | Command lifecycle backend, policy enforcement, listener ack mapping, safety lockout and escalation |
| Web team | implementation-plans/reboot-implementation-handoff-share.md | Lifecycle UI states, bulk safety UX, capability visibility, command status polling |
| Client team | implementation-plans/reboot-implementation-handoff-client-team.md | Command validation, dedupe persistence, ack sequence, secure execution helper, reboot continuity |
| Project coordination | implementation-plans/reboot-kickoff-summary.md | Phase sequencing, canary gates, rollback thresholds, cross-team sign-off tracking |
### Baseline Operational Defaults
1. Safety lockout: 3 reboot commands per client in rolling 15 minutes.
2. Escalation cooldown: 60 seconds.
3. Reconnect target on Pi 5 SSD: 90 seconds baseline, 150 seconds cold-boot ceiling.
4. Rollback canary trigger: failed plus timed_out above 5 percent.
### Frozen Contract Snapshot
1. Canonical command topic: infoscreen/{client_uuid}/commands.
2. Canonical ack topic: infoscreen/{client_uuid}/commands/ack.
3. Transitional compatibility topics during migration:
- infoscreen/{client_uuid}/command
- infoscreen/{client_uuid}/command/ack
4. Command schema version: 1.0.
5. Allowed command actions: reboot_host, shutdown_host.
6. Allowed ack status values: accepted, execution_started, completed, failed.
7. Validation snippets:
- implementation-plans/reboot-command-payload-schemas.md
- implementation-plans/reboot-command-payload-schemas.json
### Immediate Next Steps
1. Continue implementation in parallel by team against frozen contract.
2. Client team validates dedupe and expiry handling on canonical topics.
3. Platform team verifies ack-state transitions for accepted, execution_started, completed, failed.
4. Execute one-group canary and validate timing plus failure drills.


@@ -0,0 +1,127 @@
# Server Team Action Items — Infoscreen Client
This document lists everything the server/infrastructure/frontend team must implement to complete the client integration. The client-side code is production-ready for all items listed here.
---
## 1. MQTT Broker Hardening (prerequisite for everything else)
- Disable anonymous access on the broker.
- Create one broker account **per client device**:
- Username convention: `infoscreen-client-<uuid-prefix>` (e.g. `infoscreen-client-9b8d1856`)
- Provision the password to the device `.env` as `MQTT_PASSWORD_BROKER=`
- Create a **server/publisher account** (e.g. `infoscreen-server`) for all server-side publishes.
- Enforce ACLs:
| Topic | Publisher |
|---|---|
| `infoscreen/{uuid}/commands` | server only |
| `infoscreen/{uuid}/command` (alias) | server only |
| `infoscreen/{uuid}/group_id` | server only |
| `infoscreen/events/{group_id}` | server only |
| `infoscreen/groups/+/power/intent` | server only |
| `infoscreen/{uuid}/commands/ack` | client only |
| `infoscreen/{uuid}/command/ack` | client only |
| `infoscreen/{uuid}/heartbeat` | client only |
| `infoscreen/{uuid}/health` | client only |
| `infoscreen/{uuid}/logs/#` | client only |
| `infoscreen/{uuid}/service_failed` | client only |
---
## 2. Reboot / Shutdown Command — Ack Lifecycle
Client publishes ack status updates to two topics per command (canonical + transitional alias):
- `infoscreen/{uuid}/commands/ack`
- `infoscreen/{uuid}/command/ack`
**Ack payload schema (v1, frozen):**
```json
{
"command_id": "07aab032-53c2-45ef-a5a3-6aa58e9d9fae",
"status": "accepted | execution_started | completed | failed",
"error_code": null,
"error_message": null
}
```
**Status lifecycle:**
| Status | When | Notes |
|---|---|---|
| `accepted` | Command received and validated | Immediate |
| `execution_started` | Helper invoked | Immediate after accepted |
| `completed` | Execution confirmed | For `reboot_host`: arrives after reconnect (10 to 90 s after `execution_started`) |
| `failed` | Helper returned error | `error_code` and `error_message` will be set |
**Server must:**
- Track `command_id` through the full lifecycle and update status in DB/UI.
- Surface `failed` + `error_code` to the operator UI.
- Expect `reboot_host` `completed` to arrive after a reconnect delay — do not treat the gap as a timeout.
- Use `expires_at` from the original command to determine when to abandon waiting.
---
## 3. Health Dashboard — Broker Connection Fields (Gap 2)
Every `infoscreen/{uuid}/health` payload now includes a `broker_connection` block:
```json
{
"timestamp": "2026-04-05T08:00:00.000000+00:00",
"expected_state": { "event_id": 42 },
"actual_state": {
"process": "display_manager",
"pid": 1234,
"status": "running"
},
"broker_connection": {
"broker_reachable": true,
"reconnect_count": 2,
"last_disconnect_at": "2026-04-04T10:30:00Z"
}
}
```
**Server must:**
- Display `reconnect_count` and `last_disconnect_at` per device in the health dashboard.
- Implement alerting heuristic:
- **All** clients go silent simultaneously → likely broker outage, not device crash.
- **Single** client goes silent → device crash, network failure, or process hang.
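The heuristic could be expressed as a small classifier over last-seen timestamps. Everything named here is an assumption for illustration: the function `classify_silence`, the 180 second staleness threshold, and the choice to treat "several but not all clients silent" as individual client failures, a case the heuristic above does not cover.

```python
def classify_silence(last_seen, now_s, stale_after_s=180):
    """Return (verdict, silent_client_uuids) for a fleet snapshot.

    last_seen maps client UUID to the time of its last heartbeat, in
    seconds on the same clock as now_s.
    """
    silent = sorted(
        uuid for uuid, seen in last_seen.items()
        if now_s - seen > stale_after_s
    )
    if not silent:
        return "ok", []
    if len(silent) == len(last_seen) and len(last_seen) > 1:
        # Every client went quiet at once: suspect the broker first.
        return "suspect_broker_outage", silent
    return "suspect_client_failure", silent
```

With a fleet of one, a silent client is always reported as a client failure, because a single device cannot distinguish the two cases.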
---
## 4. Service-Failed MQTT Notification (Gap 3)
When systemd gives up restarting a service after repeated crashes (`StartLimitBurst` exceeded), the client automatically publishes a **retained** message:
**Topic:** `infoscreen/{uuid}/service_failed`
**Payload:**
```json
{
"event": "service_failed",
"unit": "infoscreen-simclient.service",
"client_uuid": "9b8d1856-ff34-4864-a726-12de072d0f77",
"failed_at": "2026-04-05T08:00:00Z"
}
```
**Server must:**
- Subscribe to `infoscreen/+/service_failed` on startup (retained — message survives broker restart).
- Alert the operator immediately when this topic receives a payload.
- **Clear the retained message** once the device is acknowledged or recovered:
```
mosquitto_pub -t "infoscreen/{uuid}/service_failed" -n --retain
```
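Clearing a retained message publishes a zero-length payload on the same topic, and subscribers that are connected at that moment receive it, so the listener should treat an empty payload as "cleared" and not as a new failure. A minimal parsing sketch follows; the function name `parse_service_failed` is hypothetical, and the result keys assume the `service_failed_at` and `service_failed_unit` columns named in the commit message.

```python
import json


def parse_service_failed(raw):
    """Map a service_failed payload to column values, or None to skip."""
    if not raw:
        # Zero-length payload: the retained message was cleared.
        return None
    try:
        data = json.loads(raw)
    except ValueError:
        # Covers malformed JSON and undecodable bytes.
        return None
    if not isinstance(data, dict) or data.get("event") != "service_failed":
        return None
    return {
        "client_uuid": data.get("client_uuid"),
        "service_failed_unit": data.get("unit"),
        "service_failed_at": data.get("failed_at"),
    }
```

Returning None for anything unexpected lets the listener log and skip the message without touching the stored failure state.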
---
## 5. No Server Action Required
These items are fully implemented client-side and require no server changes:
- systemd watchdog (`WatchdogSec=60`) — hangs detected and process restarted automatically.
- Command deduplication — `command_id` deduplicated with 24-hour TTL.
- Ack retry backoff — client retries ack publish on broker disconnect until `expires_at`.
- Mock helper / test mode (`COMMAND_MOCK_REBOOT_IMMEDIATE_COMPLETE`) — development only.