feat: crash recovery, service_failed monitoring, broker health fields, command expiry sweep
- Add GET /api/clients/crashed endpoint (process_status=crashed or stale heartbeat)
- Add restart_app command action with same lifecycle + lockout as reboot_host
- Scheduler: crash auto-recovery loop (CRASH_RECOVERY_ENABLED flag, lockout, MQTT publish)
- Scheduler: unconditional command expiry sweep per poll cycle (sweep_expired_commands)
- Listener: subscribe to infoscreen/+/service_failed; persist service_failed_at + unit
- Listener: extract broker_connection block from health payload; persist reconnect_count + last_disconnect_at
- DB migration b1c2d3e4f5a6: service_failed_at, service_failed_unit, mqtt_reconnect_count, mqtt_last_disconnect_at on clients
- Add GET /api/clients/service_failed and POST /api/clients/<uuid>/clear_service_failed
- Monitoring overview API: include mqtt_reconnect_count + mqtt_last_disconnect_at per client
- Frontend: orange service-failed alert panel (hidden when empty, auto-refresh, acknowledge ("quittieren") action)
- Frontend: MQTT reconnect count + last disconnect in client detail panel
- MQTT auth hardening: listener/scheduler/server use env credentials; broker enforces allow_anonymous false
- Client command lifecycle foundation: ClientCommand model, reboot_host/shutdown_host, full ACK lifecycle
- Docs: TECH-CHANGELOG, DEV-CHANGELOG, MQTT_EVENT_PAYLOAD_GUIDE, copilot-instructions updated
- Add implementation-plans/, RESTART_VALIDATION_CHECKLIST.md, TODO.md
149
implementation-plans/reboot-command-payload-schemas.json
Normal file
@@ -0,0 +1,149 @@
{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "$id": "https://infoscreen.local/schemas/reboot-command-payload-schemas.json",
  "title": "Infoscreen Reboot Command Payload Schemas",
  "description": "Frozen v1 schemas for per-client command and command acknowledgement payloads.",
  "$defs": {
    "commandPayloadV1": {
      "type": "object",
      "additionalProperties": false,
      "required": [
        "schema_version",
        "command_id",
        "client_uuid",
        "action",
        "issued_at",
        "expires_at",
        "requested_by",
        "reason"
      ],
      "properties": {
        "schema_version": {
          "type": "string",
          "const": "1.0"
        },
        "command_id": {
          "type": "string",
          "format": "uuid"
        },
        "client_uuid": {
          "type": "string",
          "format": "uuid"
        },
        "action": {
          "type": "string",
          "enum": [
            "reboot_host",
            "shutdown_host"
          ]
        },
        "issued_at": {
          "type": "string",
          "format": "date-time"
        },
        "expires_at": {
          "type": "string",
          "format": "date-time"
        },
        "requested_by": {
          "type": [
            "integer",
            "null"
          ],
          "minimum": 1
        },
        "reason": {
          "type": [
            "string",
            "null"
          ],
          "maxLength": 2000
        }
      }
    },
    "commandAckPayloadV1": {
      "type": "object",
      "additionalProperties": false,
      "required": [
        "command_id",
        "status",
        "error_code",
        "error_message"
      ],
      "properties": {
        "command_id": {
          "type": "string",
          "format": "uuid"
        },
        "status": {
          "type": "string",
          "enum": [
            "accepted",
            "execution_started",
            "completed",
            "failed"
          ]
        },
        "error_code": {
          "type": [
            "string",
            "null"
          ],
          "maxLength": 128
        },
        "error_message": {
          "type": [
            "string",
            "null"
          ],
          "maxLength": 4000
        }
      },
      "allOf": [
        {
          "if": {
            "properties": {
              "status": {
                "const": "failed"
              }
            }
          },
          "then": {
            "properties": {
              "error_code": {
                "type": "string",
                "minLength": 1
              },
              "error_message": {
                "type": "string",
                "minLength": 1
              }
            }
          }
        }
      ]
    }
  },
  "examples": [
    {
      "commandPayloadV1": {
        "schema_version": "1.0",
        "command_id": "5d1f8b4b-7e85-44fb-8f38-3f5d5da5e2e4",
        "client_uuid": "9b8d1856-ff34-4864-a726-12de072d0f77",
        "action": "reboot_host",
        "issued_at": "2026-04-03T12:48:10Z",
        "expires_at": "2026-04-03T12:52:10Z",
        "requested_by": 1,
        "reason": "operator_request"
      }
    },
    {
      "commandAckPayloadV1": {
        "command_id": "5d1f8b4b-7e85-44fb-8f38-3f5d5da5e2e4",
        "status": "execution_started",
        "error_code": null,
        "error_message": null
      }
    }
  ]
}
59
implementation-plans/reboot-command-payload-schemas.md
Normal file
@@ -0,0 +1,59 @@
## Reboot Command Payload Schema Snippets

This file provides copy-ready validation snippets for client and integration test helpers.

### Canonical Topics (v1)
1. Command topic: infoscreen/{client_uuid}/commands
2. Ack topic: infoscreen/{client_uuid}/commands/ack

### Transitional Compatibility Topics
1. Command topic alias: infoscreen/{client_uuid}/command
2. Ack topic alias: infoscreen/{client_uuid}/command/ack

### Canonical Action Values
1. reboot_host
2. shutdown_host

### Ack Status Values
1. accepted
2. execution_started
3. completed
4. failed

### JSON Schema Source
Use this file for machine validation:
1. implementation-plans/reboot-command-payload-schemas.json

### Minimal Command Schema Snippet
```json
{
  "type": "object",
  "additionalProperties": false,
  "required": ["schema_version", "command_id", "client_uuid", "action", "issued_at", "expires_at", "requested_by", "reason"],
  "properties": {
    "schema_version": { "const": "1.0" },
    "command_id": { "type": "string", "format": "uuid" },
    "client_uuid": { "type": "string", "format": "uuid" },
    "action": { "enum": ["reboot_host", "shutdown_host"] },
    "issued_at": { "type": "string", "format": "date-time" },
    "expires_at": { "type": "string", "format": "date-time" },
    "requested_by": { "type": ["integer", "null"] },
    "reason": { "type": ["string", "null"] }
  }
}
```

### Minimal Ack Schema Snippet
```json
{
  "type": "object",
  "additionalProperties": false,
  "required": ["command_id", "status", "error_code", "error_message"],
  "properties": {
    "command_id": { "type": "string", "format": "uuid" },
    "status": { "enum": ["accepted", "execution_started", "completed", "failed"] },
    "error_code": { "type": ["string", "null"] },
    "error_message": { "type": ["string", "null"] }
  }
}
```
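### Illustrative Validation Helper (Sketch)
The command schema above can also be checked without a JSON Schema library. The following is a minimal sketch using only the Python standard library; it mirrors the constraints of the full JSON file (including `minimum: 1` and `maxLength: 2000`). The names `validate_command`, `REQUIRED_FIELDS`, and `ALLOWED_ACTIONS` are hypothetical and not part of the frozen contract, and `uuid.UUID` is slightly more lenient than the `uuid` format keyword.

```python
import uuid
from datetime import datetime

REQUIRED_FIELDS = (
    "schema_version", "command_id", "client_uuid", "action",
    "issued_at", "expires_at", "requested_by", "reason",
)
ALLOWED_ACTIONS = ("reboot_host", "shutdown_host")


def _is_uuid(value):
    if not isinstance(value, str):
        return False
    try:
        uuid.UUID(value)  # also accepts some non-canonical spellings
    except ValueError:
        return False
    return True


def _is_timestamp(value):
    if not isinstance(value, str):
        return False
    try:
        # A trailing "Z" is only accepted natively on Python 3.11+, so normalise it.
        parsed = datetime.fromisoformat(value.replace("Z", "+00:00"))
    except ValueError:
        return False
    return parsed.tzinfo is not None


def validate_command(payload):
    """Return a list of problems; an empty list means the payload passed."""
    if not isinstance(payload, dict):
        return ["payload is not an object"]
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in payload]
    problems += [f"unexpected field: {name}" for name in payload if name not in REQUIRED_FIELDS]
    if problems:
        return problems
    if payload["schema_version"] != "1.0":
        problems.append("schema_version must be '1.0'")
    for name in ("command_id", "client_uuid"):
        if not _is_uuid(payload[name]):
            problems.append(f"{name} is not a UUID")
    if payload["action"] not in ALLOWED_ACTIONS:
        problems.append(f"action not allowed: {payload['action']}")
    for name in ("issued_at", "expires_at"):
        if not _is_timestamp(payload[name]):
            problems.append(f"{name} is not a date-time")
    requested_by = payload["requested_by"]
    if requested_by is not None:
        # bool is a subclass of int in Python, but JSON true/false are not integers.
        if isinstance(requested_by, bool) or not isinstance(requested_by, int) or requested_by < 1:
            problems.append("requested_by must be null or an integer >= 1")
    reason = payload["reason"]
    if reason is not None and (not isinstance(reason, str) or len(reason) > 2000):
        problems.append("reason must be null or a string of at most 2000 characters")
    return problems
```

A client could map a non-empty result to a `failed` ack with `error_code` set to `invalid_schema` or `missing_field`.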
@@ -0,0 +1,146 @@
## Client Team Implementation Spec (Raspberry Pi 5)

### Mission
Implement client-side command handling for reliable restart and shutdown with strict validation, idempotency, acknowledgements, and reboot recovery continuity.

### Ownership Boundaries
1. Client team owns command intake, execution, acknowledgement emission, and post-reboot continuity.
2. Platform team owns command issuance, lifecycle aggregation, and server-side escalation logic.
3. Client implementation must not assume managed PoE availability.

### Required Client Behaviors

### Frozen MQTT Topics and Schemas (v1)
1. Canonical command topic: infoscreen/{client_uuid}/commands.
2. Canonical ack topic: infoscreen/{client_uuid}/commands/ack.
3. Transitional compatibility topics during migration:
   - infoscreen/{client_uuid}/command
   - infoscreen/{client_uuid}/command/ack
4. QoS policy: command QoS 1, ack QoS 1 recommended.
5. Retain policy: commands and acks are non-retained.

Frozen command payload schema:

```json
{
  "schema_version": "1.0",
  "command_id": "5d1f8b4b-7e85-44fb-8f38-3f5d5da5e2e4",
  "client_uuid": "9b8d1856-ff34-4864-a726-12de072d0f77",
  "action": "reboot_host",
  "issued_at": "2026-04-03T12:48:10Z",
  "expires_at": "2026-04-03T12:52:10Z",
  "requested_by": 1,
  "reason": "operator_request"
}
```

Frozen ack payload schema:

```json
{
  "command_id": "5d1f8b4b-7e85-44fb-8f38-3f5d5da5e2e4",
  "status": "execution_started",
  "error_code": null,
  "error_message": null
}
```

Allowed ack status values:
1. accepted
2. execution_started
3. completed
4. failed

Frozen command action values for v1:
1. reboot_host
2. shutdown_host

Reserved but not emitted by server in v1:
1. restart_service

Validation snippets for helper scripts:
1. Human-readable snippets: implementation-plans/reboot-command-payload-schemas.md
2. Machine-validated JSON Schema: implementation-plans/reboot-command-payload-schemas.json

### 1. Command Intake
1. Subscribe to the canonical command topic with QoS 1.
2. Parse required fields: schema_version, command_id, client_uuid, action, issued_at, expires_at, requested_by, reason.
3. Reject invalid payloads with failed acknowledgement including error_code and diagnostic message.
4. Reject stale commands when current time exceeds expires_at.
5. Ignore already-processed command_id values.

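The expiry and dedupe checks in the intake steps can be sketched as a single decision function. This is an illustration only: `intake_decision` and its return convention are hypothetical, the payload is assumed to have passed schema validation already, and checking duplicates before expiry is a design choice (so a redelivered, already handled command does not trigger a spurious `failed` ack), not something the contract prescribes.

```python
from datetime import datetime, timezone


def parse_utc(timestamp):
    """Parse a protocol timestamp such as 2026-04-03T12:52:10Z."""
    return datetime.fromisoformat(timestamp.replace("Z", "+00:00"))


def intake_decision(command, seen_ids, now=None):
    """Return ("execute", None), ("ignore", None) or ("reject", error_code)."""
    now = now or datetime.now(timezone.utc)
    if command["command_id"] in seen_ids:
        # Already processed: do nothing, in particular do not execute again.
        return ("ignore", None)
    if now > parse_utc(command["expires_at"]):
        # Stale: answer with a failed ack carrying this error_code.
        return ("reject", "stale_command")
    return ("execute", None)
```
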
### 2. Idempotency And Persistence
1. Persist processed command_id and execution result on local storage.
2. Persistence must survive service restart and full OS reboot.
3. On restart, reload dedupe cache before processing newly delivered commands.

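One way to meet the persistence requirements is a small SQLite file, since committed rows survive both a service restart and a reboot. The sketch below assumes that choice; the class `ProcessedCommands`, its table layout, and the method names are hypothetical and not taken from the client code.

```python
import sqlite3


class ProcessedCommands:
    """Durable record of command_id values and their last known result."""

    def __init__(self, path):
        self._db = sqlite3.connect(path)
        self._db.execute(
            "CREATE TABLE IF NOT EXISTS processed ("
            "command_id TEXT PRIMARY KEY, result TEXT NOT NULL)"
        )
        self._db.commit()

    def claim(self, command_id):
        """Register a command_id; return False if it was already known."""
        with self._db:  # commits on success, rolls back on error
            cursor = self._db.execute(
                "INSERT OR IGNORE INTO processed (command_id, result) VALUES (?, ?)",
                (command_id, "accepted"),
            )
        return cursor.rowcount == 1

    def set_result(self, command_id, result):
        with self._db:
            self._db.execute(
                "UPDATE processed SET result = ? WHERE command_id = ?",
                (result, command_id),
            )

    def result_for(self, command_id):
        row = self._db.execute(
            "SELECT result FROM processed WHERE command_id = ?", (command_id,)
        ).fetchone()
        return row[0] if row else None

    def close(self):
        self._db.close()
```

Calling `claim` before execution and `set_result` afterwards keeps the last execution state readable after a reboot, which the reconciliation step in section 5 relies on.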
### 3. Acknowledgement Contract Behavior
1. Emit accepted immediately after successful validation and dedupe pass.
2. Emit execution_started immediately before invoking the command action.
3. Emit completed only when local success condition is confirmed.
4. Emit failed with structured error_code on validation or execution failure.
5. If MQTT is temporarily unavailable, retry ack publish with bounded backoff until command expiry.

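The ack payload and the bounded retry schedule can be sketched as follows. The helper names `build_ack` and `backoff_delays`, and the backoff parameters (1 s start, doubling, 30 s cap), are assumptions chosen for illustration; only the payload fields and status values come from the frozen schema.

```python
import json

ACK_STATUSES = ("accepted", "execution_started", "completed", "failed")


def build_ack(command_id, status, error_code=None, error_message=None):
    """Serialise an ack; a failed ack must carry both error fields."""
    if status not in ACK_STATUSES:
        raise ValueError(f"unknown ack status: {status}")
    if status == "failed" and not (error_code and error_message):
        raise ValueError("failed acks need error_code and error_message")
    return json.dumps({
        "command_id": command_id,
        "status": status,
        "error_code": error_code,
        "error_message": error_message,
    })


def backoff_delays(seconds_until_expiry, first=1.0, factor=2.0, cap=30.0):
    """Yield sleep intervals whose sum never exceeds seconds_until_expiry."""
    elapsed = 0.0
    delay = first
    while elapsed < seconds_until_expiry:
        step = min(delay, cap, seconds_until_expiry - elapsed)
        yield step
        elapsed += step
        delay *= factor
```

A publisher loop would attempt the publish, sleep for the next yielded interval on failure, and give up once the generator is exhausted, which coincides with command expiry.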
### 4. Execution Security Model
1. Execute via systemd-managed privileged helper.
2. Allow only whitelisted operations:
   - reboot_host
   - shutdown_host
3. Optionally keep restart_service handler as reserved path, but do not require it for v1 conformance.
4. Disallow arbitrary shell commands and untrusted arguments.
5. Enforce per-command execution timeout and terminate hung child processes.

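A whitelist with fixed argument vectors is one way to satisfy the rules above, because no part of the MQTT payload ever reaches a shell. In the sketch below, the `systemctl` invocations are an assumption about what the privileged helper runs (the source does not show it), and `argv_for`, `run_action`, and the 10-second default are hypothetical. The returned strings match the error codes listed in section 8.

```python
import subprocess

# Assumed fixed commands; the real helper invocation is not shown in this spec.
ACTION_ARGV = {
    "reboot_host": ["systemctl", "reboot"],
    "shutdown_host": ["systemctl", "poweroff"],
}


def argv_for(action):
    """Map an action name to its fixed argv; anything else is refused."""
    try:
        return list(ACTION_ARGV[action])
    except KeyError:
        raise PermissionError(f"action not whitelisted: {action}") from None


def run_action(action, timeout_s=10.0, runner=subprocess.run):
    """Run a whitelisted action; return None on success or an error_code."""
    argv = argv_for(action)
    try:
        # No shell is involved; subprocess.run kills the child when the timeout expires.
        completed = runner(argv, timeout=timeout_s, check=False)
    except subprocess.TimeoutExpired:
        return "execution_timeout"
    return None if completed.returncode == 0 else "execution_failed"
```

The `runner` parameter exists only so the function can be exercised in tests without rebooting the machine.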
### 5. Reboot Recovery Continuity
1. For reboot_host action:
   - send execution_started
   - trigger reboot promptly
2. During startup:
   - emit heartbeat early
   - emit process-health once service is ready
3. Keep last command execution state available after reboot for reconciliation.

### 6. Time And Timeout Semantics
1. Use monotonic timers for local elapsed-time checks.
2. Use UTC wall-clock only for protocol timestamps and expiry comparisons.
3. Target reconnect baseline on Pi 5 USB-SATA SSD: 90 seconds.
4. Accept cold-boot and USB enumeration ceiling up to 150 seconds.

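The split between monotonic and wall-clock time can be illustrated with two small helpers. `Deadline` and `utc_now_iso` are hypothetical names; the point is that elapsed-time checks use `time.monotonic()`, which is unaffected by NTP corrections after boot, while protocol timestamps are produced from UTC.

```python
import time
from datetime import datetime, timezone


class Deadline:
    """Local timeout based on a monotonic clock."""

    def __init__(self, seconds, clock=time.monotonic):
        self._clock = clock
        self._end = clock() + seconds

    def remaining(self):
        return max(0.0, self._end - self._clock())

    def expired(self):
        return self._clock() >= self._end


def utc_now_iso():
    """UTC wall-clock timestamp in the format used by the payload examples."""
    return datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
```

For example, `Deadline(90)` would model the reconnect baseline and `Deadline(150)` the cold-boot ceiling.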
### 7. Capability Reporting
1. Report recovery capability class:
   - software_only
   - managed_poe_available
   - manual_only
2. Report watchdog enabled status.
3. Report boot-source metadata for diagnostics.

### 8. Error Codes Minimum Set
1. invalid_schema
2. missing_field
3. stale_command
4. duplicate_command
5. permission_denied_local
6. execution_timeout
7. execution_failed
8. broker_unavailable
9. internal_error

### Acceptance Tests (Client Team)
1. Invalid schema payload is rejected and failed ack emitted.
2. Expired command is rejected and not executed.
3. Duplicate command_id is not executed twice.
4. reboot_host emits execution_started and reconnects with heartbeat in expected window.
5. restart_service action completes without host reboot and emits completed (applies only where the optional, reserved restart_service handler is implemented; it is not required for v1 conformance).
6. MQTT outage during ack path retries correctly without duplicate execution.
7. Boot-loop protection cooperates with server-side lockout semantics.

### Delivery Artifacts
1. Client protocol conformance checklist.
2. Test evidence for all acceptance tests.
3. Runtime logs showing full lifecycle for one restart and one reboot scenario.
4. Known limitations list per image version.

### Definition Of Done
1. All acceptance tests pass on Pi 5 USB-SATA SSD test devices.
2. No duplicate execution observed under reconnect and retained-delivery edge cases.
3. Acknowledgement sequence is complete and machine-parseable for server correlation.
4. Reboot recovery continuity works without managed PoE dependencies.
214
implementation-plans/reboot-implementation-handoff-share.md
Normal file
@@ -0,0 +1,214 @@
## Remote Reboot Reliability Handoff (Share Document)

### Purpose
This document defines the agreed implementation scope for reliable remote reboot and shutdown of Raspberry Pi 5 clients, with monitoring-first visibility and safe escalation paths.

### Scope
1. In scope: restart and shutdown command reliability.
2. In scope: full lifecycle monitoring and audit visibility.
3. In scope: capability-tier recovery model with optional managed PoE escalation.
4. Out of scope: broader maintenance module in client-management for this cycle.
5. Out of scope: mandatory dependency on customer-managed power switching.

### Agreed Operating Model
1. Command delivery is asynchronous and lifecycle-tracked, not fire-and-forget.
2. Commands use idempotent command_id semantics with stale-command rejection by expires_at.
3. Monitoring is authoritative for operational state and escalation decisions.
4. Recovery must function even when no managed power switching is available.

### Frozen Contract v1 (Effective Immediately)
1. Canonical command topic: infoscreen/{client_uuid}/commands.
2. Canonical ack topic: infoscreen/{client_uuid}/commands/ack.
3. Transitional compatibility topics accepted during migration:
   - infoscreen/{client_uuid}/command
   - infoscreen/{client_uuid}/command/ack
4. QoS policy: command QoS 1, ack QoS 1 recommended.
5. Retain policy: commands and acks are non-retained.

Command payload schema (frozen):

```json
{
  "schema_version": "1.0",
  "command_id": "5d1f8b4b-7e85-44fb-8f38-3f5d5da5e2e4",
  "client_uuid": "9b8d1856-ff34-4864-a726-12de072d0f77",
  "action": "reboot_host",
  "issued_at": "2026-04-03T12:48:10Z",
  "expires_at": "2026-04-03T12:52:10Z",
  "requested_by": 1,
  "reason": "operator_request"
}
```

Ack payload schema (frozen):

```json
{
  "command_id": "5d1f8b4b-7e85-44fb-8f38-3f5d5da5e2e4",
  "status": "execution_started",
  "error_code": null,
  "error_message": null
}
```

Allowed ack status values:
1. accepted
2. execution_started
3. completed
4. failed

Frozen command action values:
1. reboot_host
2. shutdown_host

API endpoint mapping:
1. POST /api/clients/{uuid}/restart -> action reboot_host
2. POST /api/clients/{uuid}/shutdown -> action shutdown_host

Validation snippets:
1. Human-readable snippets: implementation-plans/reboot-command-payload-schemas.md
2. Machine-validated JSON Schema: implementation-plans/reboot-command-payload-schemas.json

### Command Lifecycle States
1. queued
2. publish_in_progress
3. published
4. ack_received
5. execution_started
6. awaiting_reconnect
7. recovered
8. completed
9. failed
10. expired
11. timed_out
12. canceled
13. blocked_safety
14. manual_intervention_required

### Timeout Defaults (Pi 5, USB-SATA SSD baseline)
1. queued to publish_in_progress: immediate, timeout 5 seconds.
2. publish_in_progress to published: timeout 8 seconds.
3. published to ack_received: timeout 20 seconds.
4. ack_received to execution_started: 15 seconds for service restart, 25 seconds for host reboot.
5. execution_started to awaiting_reconnect: timeout 10 seconds.
6. awaiting_reconnect to recovered: baseline 90 seconds after validation, cold-boot ceiling 150 seconds.
7. recovered to completed: 15 to 20 seconds based on fleet stability.
8. command expires_at default: 240 seconds, bounded 180 to 360 seconds.

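The timeout defaults can be held in one table keyed by state transition. The sketch below uses the upper values for a host reboot (25 s, 150 s cold-boot ceiling, 20 s); adding up those per-stage ceilings gives 238 seconds, just inside the 240-second default expiry. The names `STAGE_TIMEOUT_S`, `clamp_expiry`, and `is_overdue` are hypothetical.

```python
# Worst-case per-transition timeouts in seconds for a host reboot.
STAGE_TIMEOUT_S = {
    ("queued", "publish_in_progress"): 5,
    ("publish_in_progress", "published"): 8,
    ("published", "ack_received"): 20,
    ("ack_received", "execution_started"): 25,   # 15 for a service restart
    ("execution_started", "awaiting_reconnect"): 10,
    ("awaiting_reconnect", "recovered"): 150,    # cold-boot ceiling; baseline 90
    ("recovered", "completed"): 20,              # 15 to 20 depending on fleet stability
}


def clamp_expiry(seconds=240):
    """Keep a requested expiry within the 180 to 360 second bounds."""
    return max(180, min(360, seconds))


def is_overdue(state, next_state, seconds_in_state):
    """True once a command has waited longer than the transition allows."""
    return seconds_in_state > STAGE_TIMEOUT_S[(state, next_state)]
```
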
### Recovery Tiers
1. Tier 0 baseline, always required: watchdog, systemd auto-restart, lifecycle tracking, manual intervention fallback.
2. Tier 1 optional: managed PoE per-port power-cycle escalation where customer infrastructure supports it.
3. Tier 2 no remote power control: direct manual intervention workflow.

### Governance And Safety
1. Role access: admin and superadmin.
2. Bulk actions require reason capture.
3. Safety lockout: maximum 3 reboot commands per client in 15 minutes.
4. Escalation cooldown: 60 seconds before automatic move to manual_intervention_required.

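The safety lockout can be read as a rolling-window counter per client. The following in-memory sketch illustrates that reading; a real server would more likely count recent rows in the command table, and the class name `RebootLockout` and its interface are hypothetical. Timestamps are plain seconds so that any clock can be supplied.

```python
from collections import deque


class RebootLockout:
    """Allow at most `limit` reboot commands per client within `window_s` seconds."""

    def __init__(self, limit=3, window_s=15 * 60):
        self.limit = limit
        self.window_s = window_s
        self._issued = {}

    def allow(self, client_uuid, now):
        issued = self._issued.setdefault(client_uuid, deque())
        # Drop commands that have left the rolling window.
        while issued and now - issued[0] >= self.window_s:
            issued.popleft()
        if len(issued) >= self.limit:
            return False  # caller would move the command to blocked_safety
        issued.append(now)
        return True
```
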
### MQTT Auth Hardening (Phase 1, Required Before Broad Rollout)
1. Intranet-only deployment is not sufficient protection for privileged MQTT actions by itself.
2. Phase 1 hardening scope is broker authentication, authorization, and network restriction; payload URL allowlisting is deferred to a later client/server feature.
3. MQTT broker must disable anonymous publish/subscribe access in production.
4. MQTT broker must require authenticated identities for server-side publishers and client devices.
5. MQTT broker must enforce ACLs so that:
   - only server-side services can publish to `infoscreen/{client_uuid}/commands`
   - only server-side services can publish scheduler event topics
   - each client can subscribe only to its own command topics and assigned event topics
   - each client can publish only its own ack, heartbeat, health, dashboard, and telemetry topics
6. Broker port exposure must be restricted to the management network and approved hosts only.
7. TLS support is strongly recommended in this phase and should be enabled when operationally feasible.

### Server Team Actions For Auth Hardening
1. Provision broker credentials for command/event publishers and for client devices.
2. Configure Mosquitto or equivalent broker ACLs for per-topic publish and subscribe restrictions.
3. Disable anonymous access on production brokers.
4. Restrict broker network exposure with firewall rules, VLAN policy, or equivalent network controls.
5. Update server/frontend deployment to publish MQTT with authenticated credentials.
6. Validate that server-side event publishing and reboot/shutdown command publishing still work under the new ACL policy.
7. Coordinate credential distribution and rotation with the client deployment process.

### MQTT ACL Matrix (Canonical Baseline)
| Actor | Topic Pattern | Publish | Subscribe | Notes |
| --- | --- | --- | --- | --- |
| scheduler-service | infoscreen/events/+ | Yes | No | Publishes retained active event list per group. |
| api-command-publisher | infoscreen/+/commands | Yes | No | Publishes canonical reboot/shutdown commands. |
| api-command-publisher | infoscreen/+/command | Yes | No | Transitional compatibility publish only. |
| api-group-assignment | infoscreen/+/group_id | Yes | No | Publishes retained client-to-group assignment. |
| listener-service | infoscreen/+/commands/ack | No | Yes | Consumes canonical client command acknowledgements. |
| listener-service | infoscreen/+/command/ack | No | Yes | Consumes transitional compatibility acknowledgements. |
| listener-service | infoscreen/+/heartbeat | No | Yes | Consumes heartbeat telemetry. |
| listener-service | infoscreen/+/health | No | Yes | Consumes health telemetry. |
| listener-service | infoscreen/+/dashboard | No | Yes | Consumes dashboard screenshot payloads. |
| listener-service | infoscreen/+/screenshot | No | Yes | Consumes screenshot payloads (if enabled). |
| listener-service | infoscreen/+/logs/error | No | Yes | Consumes client error logs. |
| listener-service | infoscreen/+/logs/warn | No | Yes | Consumes client warn logs. |
| listener-service | infoscreen/+/logs/info | No | Yes | Consumes client info logs. |
| listener-service | infoscreen/discovery | No | Yes | Consumes discovery announcements. |
| listener-service | infoscreen/+/discovery_ack | Yes | No | Publishes discovery acknowledgements. |
| client-<uuid> | infoscreen/<uuid>/commands | No | Yes | Canonical command intake for this client only. |
| client-<uuid> | infoscreen/<uuid>/command | No | Yes | Transitional compatibility intake for this client only. |
| client-<uuid> | infoscreen/events/<group_id> | No | Yes | Assigned group event feed only; dynamic per assignment. |
| client-<uuid> | infoscreen/<uuid>/commands/ack | Yes | No | Canonical command acknowledgements for this client only. |
| client-<uuid> | infoscreen/<uuid>/command/ack | Yes | No | Transitional compatibility acknowledgements for this client only. |
| client-<uuid> | infoscreen/<uuid>/heartbeat | Yes | No | Heartbeat telemetry. |
| client-<uuid> | infoscreen/<uuid>/health | Yes | No | Health telemetry. |
| client-<uuid> | infoscreen/<uuid>/dashboard | Yes | No | Dashboard status and screenshot payloads. |
| client-<uuid> | infoscreen/<uuid>/screenshot | Yes | No | Screenshot payloads (if enabled). |
| client-<uuid> | infoscreen/<uuid>/logs/error | Yes | No | Error log stream. |
| client-<uuid> | infoscreen/<uuid>/logs/warn | Yes | No | Warning log stream. |
| client-<uuid> | infoscreen/<uuid>/logs/info | Yes | No | Info log stream. |
| client-<uuid> | infoscreen/discovery | Yes | No | Discovery announcement. |
| client-<uuid> | infoscreen/<uuid>/discovery_ack | No | Yes | Discovery acknowledgment from listener. |

ACL implementation notes:
1. Use per-client identities; client ACLs must be scoped to exact client UUID and must not allow wildcard access to other clients.
2. Event topic subscription (`infoscreen/events/<group_id>`) should be managed via broker-side ACL provisioning that updates when group assignment changes.
3. Transitional singular command topics are temporary and should be removed after migration cutover.
4. Deny by default: any topic not explicitly listed above should be blocked for each actor.

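The client rows of the matrix can be rendered into broker ACL entries by a small generator. The sketch below assumes a Mosquitto-style ACL file (`user <name>` followed by `topic read|write <topic>` lines) and the username convention `infoscreen-client-<first 8 characters of the UUID>`; the function name `client_acl_lines` is hypothetical, and other brokers will need a different output format.

```python
def client_acl_lines(client_uuid, group_id=None):
    """Build ACL file lines for one client, scoped to its exact UUID."""
    base = f"infoscreen/{client_uuid}"
    read_topics = [
        f"{base}/commands",
        f"{base}/command",        # transitional alias, remove after cutover
        f"{base}/discovery_ack",
    ]
    if group_id is not None:
        # Must be regenerated whenever the group assignment changes.
        read_topics.append(f"infoscreen/events/{group_id}")
    write_topics = [
        f"{base}/commands/ack",
        f"{base}/command/ack",    # transitional alias, remove after cutover
        f"{base}/heartbeat",
        f"{base}/health",
        f"{base}/dashboard",
        f"{base}/screenshot",
        f"{base}/logs/error",
        f"{base}/logs/warn",
        f"{base}/logs/info",
        "infoscreen/discovery",
    ]
    lines = [f"user infoscreen-client-{client_uuid[:8]}"]
    lines += [f"topic read {topic}" for topic in read_topics]
    lines += [f"topic write {topic}" for topic in write_topics]
    return lines
```

Because every topic is spelled out with the client's own UUID and no wildcard is emitted, a client identity generated this way cannot reach another client's topics.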
### Credential Management Guidance
1. Real MQTT passwords must not be stored in tracked documentation or committed templates.
2. Each client device should receive a unique broker username and password, stored only in its local `.env` file.
3. Server-side publisher credentials should be stored in the server team's secret-management path, not in source control.
4. Recommended naming convention for client broker users: `infoscreen-client-<client-uuid-prefix>`.
5. Client passwords should be random, at least 20 characters, and rotated through deployment tooling or broker administration procedures.
6. The server/infrastructure team owns broker-side user creation, ACL assignment, rotation, and revocation.
7. The client team owns loading credentials from local env files and validating connection behavior against the secured broker.

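Credential generation following the guidance above could look like the sketch below. It assumes the UUID prefix is the first hyphen-separated group (8 characters) and uses an alphanumeric alphabet to avoid quoting problems in `.env` files; both choices, and the names `client_username` and `new_client_password`, are illustrative.

```python
import secrets
import string


def client_username(client_uuid):
    """infoscreen-client-<uuid-prefix>, using the first UUID group as prefix."""
    return f"infoscreen-client-{client_uuid.split('-')[0]}"


def new_client_password(length=24):
    """Random password from a cryptographically secure source."""
    if length < 20:
        raise ValueError("client passwords must be at least 20 characters")
    alphabet = string.ascii_letters + string.digits
    return "".join(secrets.choice(alphabet) for _ in range(length))
```
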
### Client Team Actions For Auth Hardening
1. Add MQTT username/password support in the client connection setup.
2. Add client-side TLS configuration support from environment when certificates are provided.
3. Update local test helpers to support authenticated MQTT publishing and subscription.
4. Validate command and event intake against the authenticated broker configuration before canary rollout.

### Ready For Server/Frontend Team (Auth Phase)
1. Client implementation is ready to connect with MQTT auth from local `.env` (`MQTT_USERNAME`, `MQTT_PASSWORD`, optional TLS settings).
2. Client command/event intake and client ack/telemetry publishing run over the authenticated MQTT session.
3. Server/frontend team must now complete broker-side enforcement and publisher migration.

Server/frontend done criteria:
1. Anonymous broker access is disabled in production.
2. Server-side publishers use authenticated broker credentials.
3. ACLs are active and validated for command, event, and client telemetry topics.
4. At least one canary client proves end-to-end flow under ACLs:
   - server publishes command/event with authenticated publisher
   - client receives payload
   - client sends ack/telemetry successfully
5. Revocation test passes: disabling one client credential blocks only that client without impacting others.

Operational note:
1. Client-side auth support is necessary but not sufficient by itself; broker ACL/auth enforcement is the security control that must be enabled by the server/infrastructure team.

### Rollout Plan
1. Contract freeze and sign-off.
2. Platform and client implementation against frozen schemas.
3. One-group canary.
4. Rollback if failed plus timed_out exceeds 5 percent.
5. Expand only after 7 days below intervention threshold.

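The rollback rule in step 4 is a simple ratio check. A minimal sketch, assuming the canary yields a list of terminal lifecycle states and that "exceeds" means strictly greater than 5 percent; the function name `should_roll_back` is hypothetical.

```python
def should_roll_back(outcomes, threshold=0.05):
    """True when failed plus timed_out make up more than `threshold` of all outcomes."""
    if not outcomes:
        return False  # nothing measured yet, so no basis for a rollback
    bad = sum(1 for state in outcomes if state in ("failed", "timed_out"))
    return bad / len(outcomes) > threshold
```

With 100 canary commands, 5 bad outcomes sit exactly on the threshold and do not trigger a rollback, while 6 do.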
### Success Criteria
1. Deterministic command lifecycle visibility from enqueue to completion.
2. No duplicate execution under reconnect or delayed-delivery conditions.
3. Stable Pi 5 SSD reconnect behavior within defined baseline.
4. Clear and actionable manual intervention states when automatic recovery is exhausted.
54
implementation-plans/reboot-kickoff-summary.md
Normal file
@@ -0,0 +1,54 @@
## Reboot Reliability Kickoff Summary

### Objective
Ship a reliable, observable restart and shutdown workflow for Raspberry Pi 5 clients, with safe escalation and clear operator outcomes.

### What Is Included
1. Asynchronous command lifecycle with idempotent command_id handling.
2. Monitoring-first state visibility from queued to terminal outcomes.
3. Client acknowledgements for accepted, execution_started, completed, and failed.
4. Pi 5 USB-SATA SSD timeout baseline and tuning rules.
5. Capability-tier recovery with optional managed PoE escalation.

### What Is Not Included
1. Full maintenance module in client-management.
2. Required managed power-switch integration.
3. Production Wake-on-LAN rollout.

### Team Split
1. Platform team: API command lifecycle, safety controls, listener ack ingestion.
2. Web team: lifecycle-aware UX and command status display.
3. Client team: strict validation, dedupe, ack sequence, secure execution helper, reboot continuity.

### Ownership Matrix
| Team | Primary Plan File | Main Deliverables |
| --- | --- | --- |
| Platform team | implementation-plans/reboot-implementation-handoff-share.md | Command lifecycle backend, policy enforcement, listener ack mapping, safety lockout and escalation |
| Web team | implementation-plans/reboot-implementation-handoff-share.md | Lifecycle UI states, bulk safety UX, capability visibility, command status polling |
| Client team | implementation-plans/reboot-implementation-handoff-client-team.md | Command validation, dedupe persistence, ack sequence, secure execution helper, reboot continuity |
| Project coordination | implementation-plans/reboot-kickoff-summary.md | Phase sequencing, canary gates, rollback thresholds, cross-team sign-off tracking |

### Baseline Operational Defaults
1. Safety lockout: 3 reboot commands per client in rolling 15 minutes.
2. Escalation cooldown: 60 seconds.
3. Reconnect target on Pi 5 SSD: 90 seconds baseline, 150 seconds cold-boot ceiling.
4. Rollback canary trigger: failed plus timed_out above 5 percent.

### Frozen Contract Snapshot
1. Canonical command topic: infoscreen/{client_uuid}/commands.
2. Canonical ack topic: infoscreen/{client_uuid}/commands/ack.
3. Transitional compatibility topics during migration:
   - infoscreen/{client_uuid}/command
   - infoscreen/{client_uuid}/command/ack
4. Command schema version: 1.0.
5. Allowed command actions: reboot_host, shutdown_host.
6. Allowed ack status values: accepted, execution_started, completed, failed.
7. Validation snippets:
   - implementation-plans/reboot-command-payload-schemas.md
   - implementation-plans/reboot-command-payload-schemas.json

### Immediate Next Steps
1. Continue implementation in parallel by team against frozen contract.
2. Client team validates dedupe and expiry handling on canonical topics.
3. Platform team verifies ack-state transitions for accepted, execution_started, completed, failed.
4. Execute one-group canary and validate timing plus failure drills.
127
implementation-plans/server-team-actions.md
Normal file
@@ -0,0 +1,127 @@
# Server Team Action Items — Infoscreen Client
|
||||
|
||||
This document lists everything the server/infrastructure/frontend team must implement to complete the client integration. The client-side code is production-ready for all items listed here.
|
||||
|
||||
---
|
||||
|
||||
## 1. MQTT Broker Hardening (prerequisite for everything else)
|
||||
|
||||
- Disable anonymous access on the broker.
|
||||
- Create one broker account **per client device**:
|
||||
- Username convention: `infoscreen-client-<uuid-prefix>` (e.g. `infoscreen-client-9b8d1856`)
|
||||
- Provision the password to the device `.env` as `MQTT_PASSWORD_BROKER=`
|
||||
- Create a **server/publisher account** (e.g. `infoscreen-server`) for all server-side publishes.
|
||||
- Enforce ACLs:
|
||||
|
||||
| Topic | Publisher |
|---|---|
| `infoscreen/{uuid}/commands` | server only |
| `infoscreen/{uuid}/command` (alias) | server only |
| `infoscreen/{uuid}/group_id` | server only |
| `infoscreen/events/{group_id}` | server only |
| `infoscreen/groups/+/power/intent` | server only |
| `infoscreen/{uuid}/commands/ack` | client only |
| `infoscreen/{uuid}/command/ack` | client only |
| `infoscreen/{uuid}/heartbeat` | client only |
| `infoscreen/{uuid}/health` | client only |
| `infoscreen/{uuid}/logs/#` | client only |
| `infoscreen/{uuid}/service_failed` | client only |

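
The publish rules in the table can be exercised before they are translated into broker ACLs. The following is a minimal sketch, not the broker configuration itself: `PUBLISH_ACL`, `topic_matches`, and `may_publish` are hypothetical names, and pinning a client to its own UUID is an assumption about how "client only" should be enforced.

```python
from typing import Optional

# (topic pattern, role allowed to publish), mirroring the ACL table.
# "{uuid}" is a placeholder that is filled in per caller (see may_publish).
PUBLISH_ACL = [
    ("infoscreen/{uuid}/commands", "server"),
    ("infoscreen/{uuid}/command", "server"),
    ("infoscreen/{uuid}/group_id", "server"),
    ("infoscreen/events/+", "server"),
    ("infoscreen/groups/+/power/intent", "server"),
    ("infoscreen/{uuid}/commands/ack", "client"),
    ("infoscreen/{uuid}/command/ack", "client"),
    ("infoscreen/{uuid}/heartbeat", "client"),
    ("infoscreen/{uuid}/health", "client"),
    ("infoscreen/{uuid}/logs/#", "client"),
    ("infoscreen/{uuid}/service_failed", "client"),
]


def topic_matches(topic_filter: str, topic: str) -> bool:
    """MQTT-style matching: '+' is one level, '#' is the remainder."""
    f_parts = topic_filter.split("/")
    t_parts = topic.split("/")
    for i, f in enumerate(f_parts):
        if f == "#":
            return True  # matches everything below, and the parent level
        if i >= len(t_parts):
            return False
        if f != "+" and f != t_parts[i]:
            return False
    return len(f_parts) == len(t_parts)


def may_publish(role: str, topic: str, client_uuid: Optional[str] = None) -> bool:
    """Return True if `role` is allowed to publish to `topic`."""
    if role == "client" and not client_uuid:
        return False  # assumption: a client must be tied to a known UUID
    for pattern, allowed_role in PUBLISH_ACL:
        if allowed_role != role:
            continue
        # Clients are pinned to their own UUID; the server may address any device.
        level = client_uuid if role == "client" else "+"
        if topic_matches(pattern.replace("{uuid}", level), topic):
            return True
    return False
```

A check like this can run in CI against a list of topics the server and client actually use, so that a mismatch with the table shows up before the broker starts rejecting publishes.
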

---

## 2. Reboot / Shutdown Command — Ack Lifecycle

Client publishes ack status updates to two topics per command (canonical + transitional alias):
- `infoscreen/{uuid}/commands/ack`
- `infoscreen/{uuid}/command/ack`

**Ack payload schema (v1, frozen):**
```json
{
  "command_id": "07aab032-53c2-45ef-a5a3-6aa58e9d9fae",
  "status": "accepted | execution_started | completed | failed",
  "error_code": null,
  "error_message": null
}
```

**Status lifecycle:**

| Status | When | Notes |
|---|---|---|
| `accepted` | Command received and validated | Immediate |
| `execution_started` | Helper invoked | Immediate after accepted |
| `completed` | Execution confirmed | For `reboot_host`: arrives after reconnect (10–90 s after `execution_started`) |
| `failed` | Helper returned error | `error_code` and `error_message` will be set |

**Server must:**
- Track `command_id` through the full lifecycle and update status in DB/UI.
- Surface `failed` + `error_code` to the operator UI.
- Expect `reboot_host` `completed` to arrive only after the reconnect delay (10–90 s after `execution_started`); do not treat the gap as a timeout.
- Use `expires_at` from the original command to determine when to abandon waiting.

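
The tracking requirements above can be sketched as a small state machine. This is an illustration under stated assumptions, not the server implementation: the `sent` state, the `TRANSITIONS` table, and the functions `apply_ack` and `is_expired` are hypothetical names. Because each ack is published to both the canonical and the alias topic, the sketch treats a repeated status as a harmless duplicate.

```python
from datetime import datetime, timezone
from typing import Optional

# Forward transitions for the frozen v1 ack statuses.
# "sent" is an assumed server-side state meaning "published, no ack yet".
TRANSITIONS = {
    "sent": {"accepted", "failed"},
    "accepted": {"execution_started", "failed"},
    "execution_started": {"completed", "failed"},
    "completed": set(),
    "failed": set(),
}


def apply_ack(command: dict, ack: dict) -> str:
    """Apply one ack to a tracked command.

    Returns 'applied', 'duplicate', or 'rejected'. The command dict is
    only mutated when the result is 'applied'.
    """
    if ack.get("command_id") != command["command_id"]:
        return "rejected"
    new_status = ack.get("status")
    if new_status == command["status"]:
        return "duplicate"  # e.g. same ack seen on canonical and alias topic
    if new_status not in TRANSITIONS.get(command["status"], set()):
        return "rejected"
    command["status"] = new_status
    if new_status == "failed":
        # Keep the error details so the operator UI can show them.
        command["error_code"] = ack.get("error_code")
        command["error_message"] = ack.get("error_message")
    return "applied"


def is_expired(command: dict, now: Optional[datetime] = None) -> bool:
    """True once expires_at has passed without a terminal status."""
    now = now or datetime.now(timezone.utc)
    terminal = command["status"] in ("completed", "failed")
    return not terminal and now >= command["expires_at"]
```

Only `expires_at` decides when to stop waiting, so the 10–90 s gap before a `reboot_host` `completed` ack never counts as a timeout on its own.
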

---

## 3. Health Dashboard — Broker Connection Fields (Gap 2)

Every `infoscreen/{uuid}/health` payload now includes a `broker_connection` block:

```json
{
  "timestamp": "2026-04-05T08:00:00.000000+00:00",
  "expected_state": { "event_id": 42 },
  "actual_state": {
    "process": "display_manager",
    "pid": 1234,
    "status": "running"
  },
  "broker_connection": {
    "broker_reachable": true,
    "reconnect_count": 2,
    "last_disconnect_at": "2026-04-04T10:30:00Z"
  }
}
```

**Server must:**
- Display `reconnect_count` and `last_disconnect_at` per device in the health dashboard.
- Implement alerting heuristic:
  - **All** clients go silent simultaneously → likely broker outage, not device crash.
  - **Single** client goes silent → device crash, network failure, or process hang.

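
As a rough illustration of the two requirements above, the sketch below extracts the broker fields from a health payload and applies the all-versus-single silence heuristic. The function names, the 180-second staleness threshold, and the verdict strings are assumptions made for this example; the column names `mqtt_reconnect_count` and `mqtt_last_disconnect_at` follow the migration named in the commit message.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def broker_fields(health: dict) -> dict:
    """Pull the dashboard fields out of a health payload (missing block tolerated)."""
    block = health.get("broker_connection") or {}
    raw = block.get("last_disconnect_at")
    return {
        "mqtt_reconnect_count": block.get("reconnect_count"),
        # Older Python versions do not parse a trailing "Z", so normalise it.
        "mqtt_last_disconnect_at": (
            datetime.fromisoformat(raw.replace("Z", "+00:00")) if raw else None
        ),
    }


def classify_silence(
    last_seen: dict,
    now: Optional[datetime] = None,
    stale_after: timedelta = timedelta(seconds=180),  # assumed threshold
) -> dict:
    """Classify heartbeat silence across the fleet.

    last_seen maps client UUID -> timestamp of the last heartbeat.
    """
    now = now or datetime.now(timezone.utc)
    silent = sorted(u for u, ts in last_seen.items() if now - ts > stale_after)
    if not silent:
        verdict = "ok"
    elif len(silent) == len(last_seen) and len(last_seen) > 1:
        # Every client stopped at once: the broker is the likelier culprit.
        verdict = "probable_broker_outage"
    else:
        # Some clients are still reporting: look at the silent devices.
        verdict = "device_issue"
    return {"verdict": verdict, "silent": silent}
```

With a single-client fleet the two cases cannot be told apart, so the sketch falls back to `device_issue`; a rising `reconnect_count` on that device would then be the more useful signal.
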

---

## 4. Service-Failed MQTT Notification (Gap 3)

When systemd gives up restarting a service after repeated crashes (`StartLimitBurst` exceeded), the client automatically publishes a **retained** message:

**Topic:** `infoscreen/{uuid}/service_failed`

**Payload:**
```json
{
  "event": "service_failed",
  "unit": "infoscreen-simclient.service",
  "client_uuid": "9b8d1856-ff34-4864-a726-12de072d0f77",
  "failed_at": "2026-04-05T08:00:00Z"
}
```

**Server must:**
- Subscribe to `infoscreen/+/service_failed` on startup (the message is retained, so it is redelivered on every new subscription; it survives a broker restart only if broker persistence is enabled).
- Alert the operator immediately when this topic receives a payload.
- **Clear the retained message** once the device is acknowledged or recovered:
  ```
  mosquitto_pub -t "infoscreen/{uuid}/service_failed" -n --retain
  ```

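
A listener-side handler for this topic might look like the sketch below. It is a sketch under assumptions, not the listener's actual code: `parse_service_failed` and `clear_message` are hypothetical helpers, and rejecting a payload whose `client_uuid` differs from the topic is a design choice made here. The returned keys reuse the `service_failed_at` and `service_failed_unit` column names from the commit message.

```python
import json
from datetime import datetime
from typing import Optional


def parse_service_failed(topic: str, payload: bytes) -> Optional[dict]:
    """Validate a service_failed message; return a record to persist, or None."""
    if not payload:
        return None  # zero-length payload: the retained message was cleared
    parts = topic.split("/")
    if len(parts) != 3 or parts[0] != "infoscreen" or parts[2] != "service_failed":
        return None
    try:
        data = json.loads(payload)
    except ValueError:  # covers invalid JSON and undecodable bytes
        return None
    if not isinstance(data, dict) or data.get("event") != "service_failed":
        return None
    if data.get("client_uuid") != parts[1]:
        return None  # assumption: topic and payload must name the same device
    unit, failed_at = data.get("unit"), data.get("failed_at")
    if not isinstance(unit, str) or not isinstance(failed_at, str):
        return None
    try:
        ts = datetime.fromisoformat(failed_at.replace("Z", "+00:00"))
    except ValueError:
        return None
    return {
        "client_uuid": parts[1],
        "service_failed_unit": unit,
        "service_failed_at": ts,
    }


def clear_message(client_uuid: str) -> dict:
    """Build the publish that clears the retained message (empty payload, retain set)."""
    return {
        "topic": f"infoscreen/{client_uuid}/service_failed",
        "payload": b"",
        "retain": True,
    }
```

Publishing the result of `clear_message` with any MQTT client has the same effect as the `mosquitto_pub` command above, since a zero-length retained payload removes the retained message for that topic. Treating the empty payload as "nothing to do" in `parse_service_failed` keeps the listener from raising a new alert when it sees its own clear.
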

---

## 5. No Server Action Required

These items are fully implemented client-side and require no server changes:

- systemd watchdog (`WatchdogSec=60`): hangs are detected and the process is restarted automatically.
- Command deduplication: `command_id` is deduplicated with a 24-hour TTL.
- Ack retry backoff: the client retries the ack publish on broker disconnect until `expires_at`.
- Mock helper / test mode (`COMMAND_MOCK_REBOOT_IMMEDIATE_COMPLETE`): development only.
