Compare commits

feat/tv-po ... main (4 commits)

Commits: 03e3c11e90, 4d652f0554, 06411edfab, 365d8f58f3

## .env.example (20 changed lines)
@@ -20,8 +20,18 @@ DB_HOST=db

# MQTT
MQTT_BROKER_HOST=mqtt
MQTT_BROKER_PORT=1883
# MQTT_USER=your_mqtt_user
# MQTT_PASSWORD=your_mqtt_password
# Required for authenticated broker access
MQTT_USER=your_mqtt_user
MQTT_PASSWORD=replace_with_a_32plus_char_random_password
# Optional: dedicated canary client account
MQTT_CANARY_USER=your_canary_mqtt_user
MQTT_CANARY_PASSWORD=replace_with_a_different_32plus_char_random_password
# Optional TLS settings
MQTT_TLS_ENABLED=false
MQTT_TLS_CA_CERT=
MQTT_TLS_CERTFILE=
MQTT_TLS_KEYFILE=
MQTT_TLS_INSECURE=false
MQTT_KEEPALIVE=60

# Dashboard
@@ -39,6 +49,12 @@ HEARTBEAT_GRACE_PERIOD_PROD=170
# Optional: force periodic republish even without changes
# REFRESH_SECONDS=0

# Crash recovery (scheduler auto-recovery)
# CRASH_RECOVERY_ENABLED=false
# CRASH_RECOVERY_GRACE_SECONDS=180
# CRASH_RECOVERY_LOCKOUT_MINUTES=15
# CRASH_RECOVERY_COMMAND_EXPIRY_SECONDS=240

# Default superadmin bootstrap (server/init_defaults.py)
# REQUIRED: Must be set for superadmin creation
DEFAULT_SUPERADMIN_USERNAME=superadmin
## .github/copilot-instructions.md (vendored, 581 changed lines)
@@ -1,498 +1,113 @@

# Copilot instructions for infoscreen_2025

## Purpose

These instructions tell Copilot Chat how to reason about this codebase. Prefer explanations and refactors that align with these structures. Use this as shared context when proposing changes; keep edits minimal and match the existing patterns referenced below.

This file is a concise, high-signal brief for coding agents. It is not a changelog and not a full architecture handbook.

## TL;DR

Small multi-service digital signage app (Flask API, React dashboard, MQTT scheduler). Edit `server/` for API logic, `scheduler/` for event publishing, and `dashboard/` for UI. When asking Copilot for changes, prefer focused prompts that name the target file(s) and the desired behavior.

- Stack: Flask API + MariaDB, React/Vite dashboard, MQTT listener, scheduler, worker.
- Main areas:
  - API logic in `server/`
  - Scheduler logic in `scheduler/`
  - UI logic in `dashboard/src/`
- Keep changes minimal, match existing patterns, and update docs in the same commit when behavior changes.
### How to ask Copilot

- "Add a new route `GET /api/events/summary` that returns counts per event_type — implement in `server/routes/events.py`."
- "Create an Alembic migration to add `duration` and `resolution` to `event_media` and update the upload handler to populate them."
- "Refactor `scheduler/db_utils.py` to prefer precomputed EventMedia metadata and fall back to a HEAD probe."
- "Add an ffprobe-based worker that extracts duration/resolution/bitrate and stores them on `EventMedia`."
## Fast file map

- `scheduler/scheduler.py` - scheduler loop, MQTT event publishing, TV power intent publishing, crash auto-recovery, command expiry sweep
- `scheduler/db_utils.py` - event formatting, power-intent helpers, crash recovery helpers, command expiry sweep
- `listener/listener.py` - discovery/heartbeat/log/screenshot/service_failed MQTT consumption
- `server/init_academic_periods.py` - idempotent academic-period seeding + auto-activation for the current date
- `server/initialize_database.py` - migration + bootstrap orchestration for local/manual setup
- `server/routes/events.py` - event CRUD, recurrence handling, UTC normalization
- `server/routes/eventmedia.py` - file manager, media upload/stream endpoints
- `server/routes/groups.py` - group lifecycle, alive status, order persistence
- `server/routes/system_settings.py` - system settings CRUD and supplement-table endpoint
- `server/routes/clients.py` - client metadata, restart/shutdown/restart_app command issuing, command status, crashed/service_failed alert endpoints
- `dashboard/src/settings.tsx` - settings UX and system-defaults integration
- `dashboard/src/components/CustomEventModal.tsx` - event creation/editing UX
- `dashboard/src/monitoring.tsx` - superadmin monitoring page
- `TV_POWER_INTENT_SERVER_CONTRACT_V1.md` - Phase 1 TV power contract

Keep docs synced with code. When you change services/MQTT/API/UTC/env or dev/prod run steps, update this file in the same commit (see `AI-INSTRUCTIONS-MAINTENANCE.md`).
## Service picture

- API: `server/` on `:8000` (health: `/health`)
- Dashboard: `dashboard/` (dev `:5173`, proxied API calls)
- MQTT broker: Mosquitto (`mosquitto/config/mosquitto.conf`)
- Listener: MQTT consumer that updates server-side state
- Scheduler: publishes active events and group-level TV power intents
- Nginx: routes `/api/*` and `/screenshots/*` to the API, everything else to the dashboard
- Prod bootstrap: the `docker-compose.prod.yml` server command runs migrations, defaults init, and academic-period init before Gunicorn starts
### When not to change

- Avoid editing generated assets under `dashboard/dist/` and compiled bundles. Don't modify files produced by CI or Docker builds unless you are intentionally updating build outputs.
## Non-negotiable conventions

- Datetime:
  - Store/compare in UTC.
  - The API returns ISO strings without `Z` in many routes.
  - The frontend must append `Z` before parsing if needed.
- JSON naming:
  - Backend internals use snake_case.
  - API responses use camelCase (via `server/serializers.py`).
- DB host in containers: `db` (not localhost).
- Never put secrets in docs.
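A minimal Python sketch of the datetime rules above; the helper names are illustrative, not actual project symbols:

```python
from datetime import datetime, timezone

def to_utc(dt: datetime) -> datetime:
    """Normalize to aware UTC; naive values are assumed to already be UTC."""
    if dt.tzinfo is None:
        return dt.replace(tzinfo=timezone.utc)
    return dt.astimezone(timezone.utc)

def to_api_iso(dt: datetime) -> str:
    """Serialize as an ISO string without 'Z', matching the API convention."""
    return to_utc(dt).replace(tzinfo=None).isoformat(timespec="seconds")

print(to_api_iso(datetime(2025, 11, 27, 20, 3, 0)))  # 2025-11-27T20:03:00
```

Consumers must append `Z` before parsing the returned string to avoid local-time misinterpretation.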
### Contact / owner

- Primary maintainer: RobbStarkAustria (owner). For architecture questions, ping the repo owner or open an issue and tag `@RobbStarkAustria`.
## MQTT contracts

- Event list topic (retained): `infoscreen/events/{group_id}`
- Group assignment topic (retained): `infoscreen/{uuid}/group_id`
- Heartbeat topic: `infoscreen/{uuid}/heartbeat`
- Logs topic family: `infoscreen/{uuid}/logs/{error|warn|info}`
- Health topic: `infoscreen/{uuid}/health`
- Dashboard screenshot topic: `infoscreen/{uuid}/dashboard`
- Client command topic (QoS 1, non-retained): `infoscreen/{uuid}/commands` (compat alias: `infoscreen/{uuid}/command`)
- Client command ack topic (QoS 1, non-retained): `infoscreen/{uuid}/commands/ack` (compat alias: `infoscreen/{uuid}/command/ack`)
- Service-failed topic (retained, client→server): `infoscreen/{uuid}/service_failed`
- TV power intent Phase 1 topic (retained, QoS 1): `infoscreen/groups/{group_id}/power/intent`
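A small Python sketch of topic construction and retained-topic clearing; the helper names are illustrative (actual publishing lives in the scheduler and `server/mqtt_helper.py`):

```python
import json

def events_topic(group_id: int) -> str:
    """Retained per-group event list topic."""
    return f"infoscreen/events/{group_id}"

def command_topic(uuid: str) -> str:
    """QoS 1, non-retained client command topic."""
    return f"infoscreen/{uuid}/commands"

def clear_events_payload() -> str:
    """Clearing a group's retained topic means publishing an empty list, retained."""
    return json.dumps([])

print(events_topic(42))        # infoscreen/events/42
print(clear_events_payload())  # []
```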
### Important files (quick jump targets)

- `scheduler/db_utils.py` — event formatting and scheduler-facing logic
- `scheduler/scheduler.py` — scheduler main loop and MQTT publisher
- `server/routes/eventmedia.py` — file uploads, streaming endpoint
- `server/routes/events.py` — event CRUD and recurrence handling
- `server/routes/groups.py` — group management, alive status, display order persistence
- `dashboard/src/components/CustomEventModal.tsx` — event creation UI
- `dashboard/src/media.tsx` — FileManager / upload settings
- `dashboard/src/settings.tsx` — settings UI (nested tabs; system defaults for presentations and videos)
- `dashboard/src/ressourcen.tsx` — timeline view showing all groups' active events in parallel
- `dashboard/src/ressourcen.css` — timeline and resource view styling
- `dashboard/src/monitoring.tsx` — superadmin-only monitoring dashboard for client health, screenshots, and logs
TV power intent Phase 1 rules:

- Schema version is `"1.0"`.
- Group-only scope in Phase 1.
- A heartbeat publish keeps `intent_id`; a semantic transition rotates `intent_id`.
- Expiry rule: `expires_at = issued_at + max(3 * poll_interval_sec, 90s)`.
- The canonical contract is `TV_POWER_INTENT_SERVER_CONTRACT_V1.md`.
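The expiry rule above can be sketched in Python (the function name is illustrative):

```python
from datetime import datetime, timedelta, timezone

def intent_expires_at(issued_at: datetime, poll_interval_sec: int) -> datetime:
    """expires_at = issued_at + max(3 * poll_interval_sec, 90s), per the Phase 1 rule."""
    return issued_at + timedelta(seconds=max(3 * poll_interval_sec, 90))

issued = datetime(2026, 3, 1, 12, 0, tzinfo=timezone.utc)
print(intent_expires_at(issued, 60) - issued)  # 0:03:00 (180s beats the 90s floor)
print(intent_expires_at(issued, 10) - issued)  # 0:01:30 (the floor applies)
```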
## Backend patterns

- Routes live in `server/routes/*` and are registered in `server/wsgi.py`.
- Use one request-scoped DB session, commit on mutation, and always close the session.
- Keep enum/datetime serialization JSON-safe.
- Maintain UTC-safe comparisons in the scheduler and routes.
- Keep recurrence handling backend-driven and consistent with exceptions.
- The academic-periods bootstrap is idempotent and should auto-activate the period covering `date.today()` when available.
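A hedged sketch of the request-scoped session lifecycle described above, using a stand-in class rather than the project's actual SQLAlchemy `Session`:

```python
from contextlib import contextmanager

class FakeSession:
    """Stand-in for a SQLAlchemy session; tracks lifecycle calls only."""
    def __init__(self):
        self.committed = False
        self.closed = False
    def commit(self):
        self.committed = True
    def close(self):
        self.closed = True

@contextmanager
def request_session(factory):
    """One session per request: commit after a successful mutation, always close."""
    session = factory()
    try:
        yield session
        session.commit()
    finally:
        session.close()

with request_session(FakeSession) as s:
    pass  # a route body would query/mutate here
print(s.committed, s.closed)  # True True
```

The `finally` clause guarantees `close()` even when the route body raises, which is the invariant the convention asks for.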
## Frontend patterns

- Use the Syncfusion-based patterns already present in the dashboard.
- Keep API requests relative (`/api/...`) to use the Vite proxy in dev.
- Respect `FRONTEND_DESIGN_RULES.md` for component and styling conventions.
- Keep role-gated UI behavior aligned with backend authorization.
- The holiday status banner in the dashboard should render from computed state and avoid stale message reuse in third-party UI components.
## Big picture

- Multi-service app orchestrated by Docker Compose.
- API: Flask + SQLAlchemy (MariaDB) in `server/`, exposed on `:8000` (health: `/health`).
- Dashboard: React + Vite in `dashboard/`, dev on `:5173`, served via Nginx in prod.
- MQTT broker: Eclipse Mosquitto, config in `mosquitto/config/mosquitto.conf`.
- Listener: MQTT consumer handling discovery, heartbeats, and dashboard screenshot uploads in `listener/listener.py`.
- Scheduler: publishes only currently active events (per group, at "now") to MQTT retained topics in `scheduler/scheduler.py`. It queries a future window (default: 7 days) to expand recurring events using RFC 5545 rules and applies event exceptions, but only publishes events that are active at the current time. When a group has no active events, the scheduler clears its retained topic by publishing an empty list. All time comparisons are UTC; any naive timestamps are normalized. Logging is concise; conversion lookups are cached and logged only once per media.
- Nginx: reverse proxy routing `/api/*` and `/screenshots/*` to the API and everything else to the dashboard (`nginx.conf`).
## Environment variables (high-value)

- Scheduler: `POLL_INTERVAL_SECONDS`, `REFRESH_SECONDS`
- Power intent: `POWER_INTENT_PUBLISH_ENABLED`, `POWER_INTENT_HEARTBEAT_ENABLED`, `POWER_INTENT_EXPIRY_MULTIPLIER`, `POWER_INTENT_MIN_EXPIRY_SECONDS`
- Monitoring: `PRIORITY_SCREENSHOT_TTL_SECONDS`
- Crash recovery: `CRASH_RECOVERY_ENABLED`, `CRASH_RECOVERY_GRACE_SECONDS`, `CRASH_RECOVERY_LOCKOUT_MINUTES`, `CRASH_RECOVERY_COMMAND_EXPIRY_SECONDS`
- Core: `DB_CONN`, `DB_USER`, `DB_PASSWORD`, `DB_HOST`, `DB_NAME`, `ENV`
- MQTT auth/connectivity: `MQTT_BROKER_HOST`, `MQTT_BROKER_PORT`, `MQTT_USER`, `MQTT_PASSWORD` (listener/scheduler/server should use authenticated broker access)
- Dev Container (hygiene): the UI-only `Dev Containers` extension runs on the host UI via `remote.extensionKind`; do not install it in-container. Dashboard installs use `npm ci`; shell aliases in `postStartCommand` are appended idempotently.
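A small sketch of reading such variables with safe defaults; the fallback values shown are assumptions, not the services' actual defaults:

```python
import os

def env_bool(name: str, default: bool = False) -> bool:
    """Parse a "true"/"false" flag, the convention used for these env vars."""
    return os.environ.get(name, str(default)).strip().lower() == "true"

def env_int(name: str, default: int) -> int:
    raw = os.environ.get(name, "")
    return int(raw) if raw.isdigit() else default

os.environ["CRASH_RECOVERY_ENABLED"] = "true"
print(env_int("POLL_INTERVAL_SECONDS", 30))  # falls back to 30 if unset
print(env_bool("CRASH_RECOVERY_ENABLED"))    # True
```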
## Edit guardrails

- Do not edit generated assets in `dashboard/dist/`.
- Do not change CI/build outputs unless explicitly intended.
- Preserve existing API behavior unless the task explicitly requires a change.
- Prefer links to canonical docs instead of embedding long historical notes here.
### Screenshot retention

- Screenshots sent via dashboard MQTT are stored in `server/screenshots/`.
- Screenshot payloads support `screenshot_type` with values `periodic`, `event_start`, `event_stop`.
- `periodic` is the normal heartbeat/dashboard screenshot path; `event_start` and `event_stop` are high-priority screenshots for monitoring.
- For each client, the API keeps `{uuid}.jpg` as the latest screenshot plus the last 20 timestamped screenshots (`{uuid}_..._{type}.jpg`), deleting older timestamped files automatically.
- For high-priority screenshots, the API additionally maintains `{uuid}_priority.jpg` and metadata in `{uuid}_meta.json` (`latest_screenshot_type`, `last_priority_*`).
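The retention rule can be sketched as follows; the exact filename layout (`{uuid}_<timestamp>_<type>.jpg`, sortable timestamps) is an assumption from the notes above, not verified project code:

```python
import tempfile
from pathlib import Path

def prune_screenshots(folder: Path, uuid: str, keep: int = 20) -> list[str]:
    """Delete all but the newest `keep` timestamped screenshots for one client.
    `{uuid}.jpg` (latest) and `{uuid}_priority.jpg` are never touched."""
    timestamped = sorted(
        (p for p in folder.glob(f"{uuid}_*.jpg") if not p.name.endswith("_priority.jpg")),
        key=lambda p: p.name,  # lexicographic sort works for sortable timestamps
        reverse=True,
    )
    removed = [p.name for p in timestamped[keep:]]
    for p in timestamped[keep:]:
        p.unlink()
    return removed

# Tiny demo with keep=2 in a throwaway folder.
with tempfile.TemporaryDirectory() as d:
    folder = Path(d)
    for ts in ("20260301T100000", "20260301T110000", "20260301T120000"):
        (folder / f"abc_{ts}_periodic.jpg").write_bytes(b"")
    print(prune_screenshots(folder, "abc", keep=2))  # ['abc_20260301T100000_periodic.jpg']
```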
## Documentation sync rule

When services/MQTT/API/UTC/env behavior changes:

1. Update this file (concise deltas only).
2. Update the canonical docs where the details live.
3. Update changelogs separately (`TECH-CHANGELOG.md`, `DEV-CHANGELOG.md`, `dashboard/public/program-info.json` as appropriate).
## Recent changes since last commit

### Latest (March 2026)

- **Monitoring System Completion (no version bump)**:
  - End-to-end monitoring pipeline completed: MQTT logs/health → listener persistence → monitoring APIs → superadmin dashboard
  - The API now serves aggregated monitoring via `GET /api/client-logs/monitoring-overview` and system-wide recent errors via `GET /api/client-logs/recent-errors`
  - The monitoring dashboard (`dashboard/src/monitoring.tsx`) is active and displays client health states, screenshots, process metadata, and recent log activity
- **Screenshot Priority Pipeline (no version bump)**:
  - The listener forwards `screenshot_type` from MQTT screenshot/dashboard payloads to `POST /api/clients/<uuid>/screenshot`.
  - The API stores typed screenshots, tracks latest/priority metadata, and serves priority images via `GET /screenshots/<uuid>/priority`.
  - The monitoring overview exposes screenshot priority state (`latestScreenshotType`, `priorityScreenshotType`, `priorityScreenshotReceivedAt`, `hasActivePriorityScreenshot`) and `summary.activePriorityScreenshots`.
  - The monitoring UI shows screenshot type badges and switches to a faster refresh while priority screenshots are active.
- **MQTT Dashboard Payload v2 Cutover (no version bump)**:
  - Dashboard payload parsing in `listener/listener.py` is now v2-only (`message`, `content`, `runtime`, `metadata`).
  - The legacy top-level dashboard fallback was removed after the migration soak (`legacy_fallback=0`).
  - Listener observability summarizes parser health using `v2_success` and `parse_failures` counters.
- **Presentation Flags Persistence Fix**:
  - Fixed persistence for presentation `page_progress` and `auto_progress` so values are reliably stored and returned across create/update paths and detached occurrences
### Earlier (January 2026)

- **Ressourcen Page (Timeline View)**:
  - New 'Ressourcen' page with a parallel timeline view showing active events for all room groups
  - Compact timeline display with adjustable row height (65px per group)
  - Real-time view of currently running events with type, title, and time window
  - Customizable group ordering with a visual reordering panel (drag up/down buttons)
  - Group order persisted via `GET/POST /api/groups/order` endpoints
  - Color-coded event bars matching the group theme
  - Timeline modes: Day and Week views (Day view by default)
  - Dynamic height calculation based on the number of groups
  - Syncfusion ScheduleComponent with TimelineViews, Resize, and DragAndDrop support
  - Files: `dashboard/src/ressourcen.tsx` (page), `dashboard/src/ressourcen.css` (styles)
### Earlier (November 2025)

- **API Naming Convention Standardization (camelCase)**:
  - Backend: created `server/serializers.py` with a `dict_to_camel_case()` utility for consistent JSON serialization
  - Events API: `GET /api/events` and `GET /api/events/<id>` now return camelCase fields (`id`, `subject`, `startTime`, `endTime`, `type`, `groupId`, etc.) instead of PascalCase
  - Frontend: dashboard and appointments page updated to consume camelCase API responses
  - The appointments page keeps internal PascalCase for Syncfusion scheduler compatibility, with automatic mapping from API responses
  - **Breaking**: external API consumers must update field names from PascalCase to camelCase

- **UTC Time Handling**:
  - The database stores all timestamps in UTC (naive timestamps are normalized by the backend)
  - The API returns ISO strings without a 'Z' suffix, e.g. `"2025-11-27T20:03:00"`; the frontend must append 'Z' before parsing to ensure UTC
  - Frontend: dashboard and appointments automatically append 'Z' to parse as UTC and display in the user's local timezone
  - Time formatting functions use `toLocaleTimeString('de-DE')` for German locale display
  - All time comparisons use UTC; `new Date().toISOString()` sends UTC back to the API
- **Dashboard Enhancements**:
  - New card-based design for Raumgruppen (room groups) with Syncfusion components
  - Global statistics summary: total infoscreens, online/offline counts, warning groups
  - Filter buttons (All, Online, Offline, Warnings) with dynamic counts
  - Active event display per group: shows currently playing content with type icon, title, date, and time
  - Health visualization with color-coded progress bars per group
  - Expandable client details with last-alive timestamps
  - Bulk restart functionality for offline clients per group
  - Manual refresh button with toast notifications
  - 15-second auto-refresh interval
### Earlier changes

- Scheduler: when formatting video events, the scheduler now performs a best-effort HEAD probe of the streaming URL and includes basic metadata in the emitted payload (mime_type, size, accept_ranges). Placeholders for richer metadata (duration, resolution, bitrate, qualities, thumbnails, checksum) are included for later population by a background worker.
- Streaming endpoint: a range-capable streaming endpoint was added at `/api/eventmedia/stream/<media_id>/<filename>`; it supports byte-range requests (206 Partial Content) to enable seeking from clients.
- Event model & API: `Event` gained video-related fields (`event_media_id`, `autoplay`, `loop`, `volume`), and the API accepts and persists these when creating/updating video events.
- Dashboard: UI updated to allow selecting uploaded videos for events and to specify autoplay/loop/volume. File upload settings were increased (maxFileSize raised), and the client now validates video duration (max 10 minutes) before upload.
- FileManager: uploads compute basic metadata and enqueue conversions for office formats as before; video uploads now surface size and are streamable via the new endpoint.

- Event model & API (new): added `muted` (Boolean) for video events; create/update and GET endpoints accept, persist, and return `muted` alongside `autoplay`, `loop`, and `volume`.
- Dashboard — Settings: the settings page was refactored to nested tabs; added Events → Videos defaults (autoplay, loop, volume, mute) backed by the system settings keys (`video_autoplay`, `video_loop`, `video_volume`, `video_muted`).
- Dashboard — Events UI: CustomEventModal now exposes per-event video `muted` and initializes all video fields from system defaults when creating a new event.
- Dashboard — Academic Calendar: holiday management now uses a single "📥 Ferienkalender: Import/Anzeige" tab; admins select the target academic period first, and the import/list content redraws for that period.
- Dashboard — Holiday Management Hardening: the same tab now supports manual holiday CRUD in addition to CSV/TXT import. Imports and manual saves validate date ranges against the selected academic period, prevent duplicates, auto-merge overlaps with the same normalized name+region (including adjacent ranges), and report conflicting overlaps.

Note: these edits are intentionally backwards-compatible. If the probe fails, the scheduler still emits the stream URL, and the client should fall back to a direct play attempt or request richer metadata when available.

Backend rework notes (no version bump):

- Dev container hygiene: UI-only Remote Containers; reproducible dashboard installs (`npm ci`); idempotent shell aliases.
- Serialization consistency: snake_case internal → camelCase external via `server/serializers.py` for all JSON.
- UTC normalization across routes/scheduler; enums and datetimes serialize consistently.
## Service boundaries & data flow

- The database connection string is passed as `DB_CONN` (mysql+pymysql) to the Python services.
- The API builds its engine in `server/database.py` (loads `.env` only in development).
- The listener also creates its own engine for writes to `clients`.
- The scheduler queries a future window (default: 7 days) to expand recurring events using RFC 5545 rules, applies event exceptions (skipped dates, detached occurrences), and publishes only events that are active at the current time (UTC). When a group has no active events, the scheduler clears its retained topic by publishing an empty list. Time comparisons are UTC; naive timestamps are normalized. Logging is concise; conversion lookups are cached and logged only once per media.
- MQTT topics (paho-mqtt v2, use Callback API v2):
  - Discovery: `infoscreen/discovery` (JSON includes `uuid`, hw/ip data). ACK to `infoscreen/{uuid}/discovery_ack`. See `listener/listener.py`.
  - Heartbeat: `infoscreen/{uuid}/heartbeat` updates `Client.last_alive` (UTC); the enhanced payload includes `current_process`, `process_pid`, `process_status`, `current_event_id`.
  - Event lists (retained): `infoscreen/events/{group_id}` from `scheduler/scheduler.py`.
  - Per-client group assignment (retained): `infoscreen/{uuid}/group_id` via `server/mqtt_helper.py`.
  - Client logs: `infoscreen/{uuid}/logs/{error|warn|info}` with JSON payload (timestamp, message, context); QoS 1 for ERROR/WARN, QoS 0 for INFO.
  - Client health: `infoscreen/{uuid}/health` with metrics (expected_state, actual_state, health_metrics); QoS 0, published every 5 seconds.
  - Dashboard screenshots: `infoscreen/{uuid}/dashboard` uses grouped v2 payload blocks (`message`, `content`, `runtime`, `metadata`); the listener reads screenshot data from `content.screenshot` and the capture type from `metadata.capture.type`.
  - Screenshots: server-side folder `server/screenshots/`; the API serves `/screenshots/{uuid}.jpg` (latest) and `/screenshots/{uuid}/priority` (active high-priority, falling back to latest).
- Dev Container guidance: if extensions reappear inside the container, remove UI-only extensions from the `devcontainer.json` `extensions` list and map them in `remote.extensionKind` as `"ui"`.
- Presentation conversion (PPT/PPTX/ODP → PDF):
  - Trigger: on upload in `server/routes/eventmedia.py` for media types `ppt|pptx|odp` (compute sha256, upsert `Conversion`, enqueue job).
  - Worker: an RQ worker runs `server.worker.convert_event_media_to_pdf`, calls the Gotenberg LibreOffice endpoint, and writes to `server/media/converted/`.
  - Services: Redis (queue) and Gotenberg added in compose; the worker service consumes the `conversions` queue.
  - Env: `REDIS_URL` (default `redis://redis:6379/0`), `GOTENBERG_URL` (default `http://gotenberg:3000`).
  - Endpoints: `POST /api/conversions/<media_id>/pdf` (ensure/enqueue), `GET /api/conversions/<media_id>/status`, `GET /api/files/converted/<path>` (serve PDFs).
  - Storage: originals under `server/media/…`, outputs under `server/media/converted/` (prod compose mounts a shared volume for this path).
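The sha256 step in the trigger above can be sketched as a chunked file hash (the function name is illustrative, not the project's helper):

```python
import hashlib
import tempfile

def file_sha256(path: str) -> str:
    """Chunked sha256 so large uploads are not loaded fully into memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"hello")
print(file_sha256(f.name))
# 2cf24dba5fb0a30e26e83b2ac5b9e29e1b161e5c1fa7425e73043362938b9824
```

The digest doubles as the `file_hash` used in the `conversions` unique constraint, so re-uploading identical bytes does not enqueue a duplicate job.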
## Data model highlights (see `models/models.py`)

- User model: includes 7 new audit/security fields (migration: `4f0b8a3e5c20_add_user_audit_fields.py`):
  - `last_login_at`, `last_password_change_at`: TIMESTAMP (UTC) tracking for auth events
  - `failed_login_attempts`, `last_failed_login_at`: security monitoring for brute-force detection
  - `locked_until`: TIMESTAMP placeholder for account lockout (infrastructure in place, not yet enforced)
  - `deactivated_at`, `deactivated_by`: soft-delete audit trail (self-referencing FK); soft deactivation is the default, hard delete is superadmin-only
  - Role hierarchy (privilege escalation enforced): `user` < `editor` < `admin` < `superadmin`
- Client monitoring (migration: `c1d2e3f4g5h6_add_client_monitoring.py`):
  - `ClientLog` model: centralized log storage with fields (id, client_uuid, timestamp, level, message, context, created_at); FK to `clients.uuid` (CASCADE)
  - `Client` model extended with 7 health-monitoring fields (`current_event_id`, `current_process`, `process_status`, `process_pid`, `last_screenshot_analyzed`, `screen_health_status`, `last_screenshot_hash`)
  - Enums: `LogLevel` (ERROR, WARN, INFO, DEBUG), `ProcessStatus` (running, crashed, starting, stopped), `ScreenHealthStatus` (OK, BLACK, FROZEN, UNKNOWN)
  - Indexes: (client_uuid, timestamp DESC), (level, timestamp DESC), (created_at DESC) for performance
- System settings: `system_settings` key–value store via `SystemSetting` for global configuration (e.g., the WebUntis/Vertretungsplan supplement table). Managed through routes in `server/routes/system_settings.py`.
- Presentation defaults (system-wide):
  - `presentation_interval` (seconds, default "10")
  - `presentation_page_progress` ("true"/"false", default "true")
  - `presentation_auto_progress` ("true"/"false", default "true")

  Seeded in `server/init_defaults.py` if missing.

- Video defaults (system-wide):
  - `video_autoplay` ("true"/"false", default "true")
  - `video_loop` ("true"/"false", default "true")
  - `video_volume` (0.0–1.0, default "0.8")
  - `video_muted` ("true"/"false", default "false")

  Used as initial values when creating new video events; editable per event.

- Events: added `page_progress` (Boolean) and `auto_progress` (Boolean) for per-event presentation behavior.
- Event (video fields): `event_media_id`, `autoplay`, `loop`, `volume`, `muted`.
- WebUntis URL: WebUntis uses the existing Vertretungsplan/supplement-table URL (`supplement_table_url`). There is no separate `webuntis_url` setting; use `GET/POST /api/system-settings/supplement-table`.

- Conversions:
  - Enum `ConversionStatus`: `pending`, `processing`, `ready`, `failed`.
  - Table `conversions`: `id`, `source_event_media_id` (FK → `event_media.id`, ondelete CASCADE), `target_format`, `target_path`, `status`, `file_hash` (sha256), `started_at`, `completed_at`, `error_message`.
  - Indexes: `(source_event_media_id, target_format)`, `(status, target_format)`; unique: `(source_event_media_id, target_format, file_hash)`.
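The idempotent seeding described for `init_defaults` can be sketched generically; here a plain dict stands in for the `system_settings` table:

```python
VIDEO_DEFAULTS = {
    "video_autoplay": "true",
    "video_loop": "true",
    "video_volume": "0.8",
    "video_muted": "false",
}

def seed_missing(store: dict, defaults: dict) -> list[str]:
    """Insert only absent keys; existing values are never overwritten."""
    added = [k for k in defaults if k not in store]
    for k in added:
        store[k] = defaults[k]
    return added

settings = {"video_volume": "0.5"}  # an operator-tuned value survives re-seeding
print(seed_missing(settings, VIDEO_DEFAULTS))  # ['video_autoplay', 'video_loop', 'video_muted']
print(settings["video_volume"])                # 0.5
```

Running the seeder twice is a no-op, which is what makes it safe to execute on every bootstrap.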
## API patterns

- Blueprints live in `server/routes/*` and are registered in `server/wsgi.py` with `/api/...` prefixes.
- Session usage: instantiate `Session()` per request, commit when mutating, and always `session.close()` before returning.
- Examples:
  - Clients: `server/routes/clients.py` includes bulk group updates and MQTT sync (`publish_multiple_client_groups`).
  - Groups: `server/routes/groups.py` computes "alive" using a grace period that varies by `ENV`.
    - `GET /api/groups/order` — retrieve the saved group display order
    - `POST /api/groups/order` — persist the group display order (array of group IDs)
  - Events: `server/routes/events.py` serializes enum values to strings and normalizes times to UTC. Recurring events are only deactivated after their recurrence end (UNTIL); non-recurring events deactivate after their end time. Event exceptions are respected and rendered in scheduler output.
  - Holidays: `server/routes/holidays.py` supports period-scoped list/import/manual CRUD (`GET/POST /api/holidays`, `POST /api/holidays/upload`, `PUT/DELETE /api/holidays/<id>`), validates date ranges against the target period, prevents duplicates, merges overlaps with the same normalized `name+region` (including adjacent ranges), and rejects conflicting overlaps.
  - Media: `server/routes/eventmedia.py` implements a simple file-manager API rooted at `server/media/`.
  - System settings: `server/routes/system_settings.py` exposes key–value CRUD (`/api/system-settings`) and a convenience endpoint for the WebUntis/Vertretungsplan supplement table: `GET/POST /api/system-settings/supplement-table` (admin+).
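The overlap/adjacency merge rule used by the holiday import can be sketched for date ranges; the real code additionally matches on normalized name+region before merging:

```python
from datetime import date, timedelta

def merge_adjacent(ranges):
    """Merge overlapping or directly adjacent (next-day) inclusive date ranges."""
    merged = []
    for start, end in sorted(ranges):
        if merged and start <= merged[-1][1] + timedelta(days=1):
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

ranges = [(date(2026, 2, 2), date(2026, 2, 6)), (date(2026, 2, 7), date(2026, 2, 8))]
print(merge_adjacent(ranges))
# [(datetime.date(2026, 2, 2), datetime.date(2026, 2, 8))]
```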
- Academic periods: `server/routes/academic_periods.py` exposes full lifecycle management (admin+ only):
  - `GET /api/academic_periods` — list all non-archived periods ordered by start_date
  - `GET /api/academic_periods/<id>` — get a single period by ID (including archived)
  - `GET /api/academic_periods/active` — get the currently active period
  - `GET /api/academic_periods/for_date?date=YYYY-MM-DD` — period covering the given date (non-archived)
  - `GET /api/academic_periods/<id>/usage` — check linked events/media and recurrence-spillover blockers
  - `POST /api/academic_periods` — create a period (validates name uniqueness among non-archived, date range, overlaps within periodType)
  - `PUT /api/academic_periods/<id>` — update a period (cannot update archived periods)
  - `POST /api/academic_periods/<id>/activate` — activate a period (deactivates all others; cannot activate archived)
  - `POST /api/academic_periods/<id>/archive` — soft-delete a period (blocked if active or has active recurrence)
  - `POST /api/academic_periods/<id>/restore` — restore an archived period (returns it to inactive)
  - `DELETE /api/academic_periods/<id>` — hard-delete an archived, inactive period (blocked if linked events exist)
  - All responses use camelCase: `startDate`, `endDate`, `periodType`, `isActive`, `isArchived`, `archivedAt`, `archivedBy`
  - Validation: name required/trimmed/unique among non-archived; startDate ≤ endDate; periodType in {schuljahr, semester, trimester}; overlaps blocked within the same periodType
  - Recurrence-spillover detection: archive/delete is blocked if recurring master events assigned to the period still have current or future occurrences
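The periodType overlap validation reduces to an inclusive interval test; this is a sketch, not the route's actual code:

```python
from datetime import date

def ranges_overlap(a_start: date, a_end: date, b_start: date, b_end: date) -> bool:
    """Inclusive date ranges overlap iff each starts on or before the other's end."""
    return a_start <= b_end and b_start <= a_end

sj_2025 = (date(2025, 9, 1), date(2026, 6, 30))
sj_2026 = (date(2026, 7, 1), date(2027, 6, 30))
print(ranges_overlap(*sj_2025, *sj_2026))  # False (adjacent periods are allowed)
```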
- User management: `server/routes/users.py` exposes comprehensive CRUD for users (admin+):
  - `GET /api/users` — list all users (role-filtered: admin sees user/editor/admin, superadmin sees all); includes audit fields in camelCase (`lastLoginAt`, `lastPasswordChangeAt`, `failedLoginAttempts`, `deactivatedAt`, `deactivatedBy`)
  - `POST /api/users` — create a user with username, password (min 6 chars), role, and status; admin cannot create superadmin; initializes audit fields
  - `GET /api/users/<id>` — get a detailed user record with all audit fields
  - `PUT /api/users/<id>` — update a user (cannot change own role/status; admin cannot modify superadmin accounts)
  - `PUT /api/users/<id>/password` — admin password reset (a backend check rejects self-reset for consistency)
  - `DELETE /api/users/<id>` — hard delete (superadmin only, with a self-deletion check)
- Auth routes (`server/routes/auth.py`): enhanced to track login events (sets `last_login_at` and resets `failed_login_attempts` on success; increments `failed_login_attempts` and sets `last_failed_login_at` on failure). Self-service password change via `PUT /api/auth/change-password` requires current-password verification.
- Client logs (`server/routes/client_logs.py`): centralized log retrieval for monitoring:
  - `GET /api/client-logs/<uuid>/logs` — query client logs with filters (level, limit, since); admin_or_higher
  - `GET /api/client-logs/summary` — log counts by level per client (last 24h); admin_or_higher
  - `GET /api/client-logs/recent-errors` — system-wide error monitoring; admin_or_higher
  - `GET /api/client-logs/monitoring-overview` — includes screenshot priority fields per client plus `summary.activePriorityScreenshots`; superadmin_only
  - `GET /api/client-logs/test` — infrastructure validation (no auth); returns recent logs with counts

Documentation maintenance: keep this file aligned with real patterns; update it when routes/session/UTC rules change. Avoid long prose; link exact paths.
## Frontend patterns (dashboard)

- **UI design rules**: Component choices, layout structure, button variants, badge colors, dialog patterns, toast conventions, and tab structure are defined in [`FRONTEND_DESIGN_RULES.md`](./FRONTEND_DESIGN_RULES.md). Follow that file for all dashboard work.
- Vite React app; proxies `/api` and `/screenshots` to the API in dev (`vite.config.ts`).
- Uses Syncfusion components; the Vite config pre-bundles specific packages to avoid alias issues.
- Environment: `VITE_API_URL` provided at build/run; in dev compose, the proxy handles `/api` so local fetches can use relative `/api/...` paths.
- **API Response Format**: All API endpoints return camelCase JSON (e.g., `startTime`, `endTime`, `groupId`). The frontend consumes camelCase directly.
- **UTC Time Parsing**: The API returns ISO strings without a 'Z' suffix. The frontend appends 'Z' before parsing to ensure UTC interpretation: `const utcString = dateStr.endsWith('Z') ? dateStr : dateStr + 'Z'; new Date(utcString);`. Display uses `toLocaleTimeString('de-DE')` for German format.

- Dev Container: When adding frontend deps, prefer `npm ci` and, if using named volumes, recreate the dashboard `node_modules` volume so installs occur inside the container.
- Theming: All Syncfusion component CSS is imported centrally in `dashboard/src/main.tsx`. Theme conventions, component defaults, the full CSS import list, and Tailwind removal are documented in `FRONTEND_DESIGN_RULES.md`.
- Scheduler (appointments page): the top bar includes Group and Academic Period selectors (Syncfusion DropDownList). Selecting a period calls `POST /api/academic_periods/active`, moves the calendar to today’s month/day within the period year, and refreshes a right-aligned indicator row showing:
  - Holidays present in the current view (count)
  - Period label (display_name or name) with a badge indicating whether any holidays exist in that period (overlap check)

- Recurrence & holidays (latest):
  - Backend stores holiday skips in `EventException` and emits `RecurrenceException` (EXDATE) for master events in `GET /api/events`. EXDATE tokens are formatted in RFC 5545 compact form (`yyyyMMddTHHmmssZ`) and correspond to each occurrence start time (UTC). Syncfusion uses these to exclude holiday instances reliably.
  - Frontend lets Syncfusion handle all recurrence patterns natively (no client-side expansion). Scheduler field mappings include `recurrenceID`, `recurrenceRule`, and `recurrenceException` so series and edited occurrences are recognized correctly.
  - Event deletion: all event types (single, single-in-series, entire series) are handled with custom dialogs. The frontend intercepts Syncfusion's built-in RecurrenceAlert and DeleteAlert popups to provide a unified, user-friendly deletion flow:
    - Single (non-recurring) event: deleted directly after confirmation.
    - Single occurrence of a recurring series: user can delete just that instance.
    - Entire recurring series: user can delete all occurrences after a final custom confirmation dialog.
    - Detached occurrences (edited/broken out): treated as single events.
  - Single occurrence editing: users can detach individual occurrences from recurring series. The frontend hooks `actionComplete`/`onActionCompleted` with `requestType='eventChanged'` to persist changes: it calls `POST /api/events/<id>/occurrences/<date>/detach` for single-occurrence edits and `PUT /api/events/<id>` for series or single events as appropriate. The backend creates `EventException` and a standalone `Event` without modifying the master beyond EXDATEs.
  - UI: events with `SkipHolidays` render a TentTree icon next to the main event icon. The custom recurrence icon in the header was removed; rely on Syncfusion’s native lower-right recurrence badge.
  - Website & WebUntis: both event types display a website. WebUntis reads its URL from the system `supplement_table_url` and does not provide a per-event URL field.

- Program info page (`dashboard/src/programminfo.tsx`):
  - Loads data from `dashboard/public/program-info.json` (app name, version, build info, tech stack, changelog).
  - Uses Syncfusion card classes (`e-card`, `e-card-header`, `e-card-title`, `e-card-content`) for consistent styling.
  - Changelog is paginated with `PagerComponent` (from `@syncfusion/ej2-react-grids`), default page size 5; adjust `pageSize` or add a selector as needed.

- Groups page (`dashboard/src/infoscreen_groups.tsx`):
  - Migrated to Syncfusion inputs and popups: Buttons, TextBox, DropDownList, Dialog; Kanban remains for drag/drop.
  - Unified toast/dialog wording; replaced legacy alerts with toasts; spacing handled via inline styles to avoid a Tailwind dependency.

- Header user menu (top-right):
  - Shows the current username and role; click opens a menu with "Passwort ändern" (lock icon), "Profil", and "Abmelden".
  - Implemented with Syncfusion DropDownButton (`@syncfusion/ej2-react-splitbuttons`).
  - "Passwort ändern": opens the self-service password change dialog (available to all authenticated users); requires current-password verification, new password min 6 chars, must match the confirm field; calls `PUT /api/auth/change-password`.
  - "Abmelden" navigates to `/logout`; the page invokes backend logout and redirects to `/login`.

- User management page (`dashboard/src/users.tsx`):
  - Full CRUD interface for managing users (admin+ only in menu); accessible via the "Benutzer" sidebar entry.
  - Syncfusion GridComponent: 20 per page (configurable), sortable columns (ID, username, role), custom action button template with role-based visibility.
  - Statistics cards: total users, active (non-deactivated), inactive (deactivated) counts.
  - Dialogs: Create (username/password/role/status), Edit (with self-edit protections), Password Reset (admin only, no current password required), Delete (superadmin only, self-check), Details (read-only audit info with formatted timestamps).
  - Role badges: color-coded display (user: gray, editor: blue, admin: green, superadmin: red).
  - Audit information displayed: last login, password change, last failed login, deactivation timestamps and deactivating user.
  - Role-based permissions (enforced backend + frontend):
    - Admin: can manage user/editor/admin roles (not superadmin); soft-deactivate only; cannot see/edit superadmin accounts.
    - Superadmin: can manage all roles including other superadmins; can permanently hard-delete users.
    - Security rules enforced: cannot change own role, cannot deactivate own account, cannot delete self, cannot reset own password via the admin route (must use self-service).
  - API client in `dashboard/src/apiUsers.ts` for all user operations (listUsers, getUser, createUser, updateUser, resetUserPassword, deleteUser).
  - Menu visibility: the "Benutzer" menu item is only visible to admin+ (role-gated in App.tsx).

- Monitoring page (`dashboard/src/monitoring.tsx`):
  - Superadmin-only dashboard for client monitoring and diagnostics; the menu item is hidden for lower roles and the route redirects non-superadmins.
  - Uses `GET /api/client-logs/monitoring-overview` for aggregated live status, `GET /api/client-logs/recent-errors` for system-wide errors, and `GET /api/client-logs/<uuid>/logs` for per-client details.
  - Shows per-client status (`healthy`, `warning`, `critical`, `offline`) based on heartbeat freshness, process state, screen state, and recent log counts.
  - Displays the latest screenshot preview and active priority screenshot (`/screenshots/{uuid}/priority` when active), screenshot type badges, current process metadata, and recent ERROR/WARN activity.
  - Uses adaptive refresh: normal interval in steady state, faster polling while `activePriorityScreenshots > 0`.

- Settings page (`dashboard/src/settings.tsx`):
  - Structure: Syncfusion TabComponent with role-gated tabs
    - 📅 Academic Calendar (all users)
      - **🗂️ Perioden (first sub-tab)**: full period lifecycle management (admin+)
        - List non-archived periods with active/archived badges and action buttons
        - Create: dialog for name, displayName, startDate, endDate, periodType with validation
        - Edit: update name, displayName, dates, type (cannot edit archived)
        - Activate: set as active (deactivates all others)
        - Archive: soft-delete with blocker checks (blocks if active or has active recurrence)
        - Restore: restore archived periods to inactive state
        - Delete: hard-delete archived periods with blocker checks (blocks if linked events)
        - Archive visibility: toggle to show/hide archived periods
        - Blockers: display prevents the action with a clear list of reasons (linked events, active recurrence, active status)
      - **📥 Ferienkalender: Import/Anzeige (second sub-tab)**: CSV/TXT holiday import plus manual holiday create/edit/delete scoped to the selected academic period; changing the period redraws the import/list body.
        - Import summary surfaces inserted/updated/merged/skipped/conflict counts and detailed conflict lines.
        - File selection uses a Syncfusion-styled trigger button and visible selected-filename state.
        - Manual date inputs guide users with bidirectional start/end constraints and prefill behavior.
    - 🖥️ Display & Clients (admin+)
      - Default Settings: placeholders for heartbeat, screenshots, defaults
      - Client Configuration: quick links to the Clients and Groups pages
    - 🎬 Media & Files (admin+)
      - Upload Settings: placeholders for limits and types
      - Conversion Status: placeholder for a conversions overview
    - 🗓️ Events (admin+)
      - WebUntis / Vertretungsplan: system-wide supplement table URL with enable/disable, save, and preview; persists via `/api/system-settings/supplement-table`
      - Presentations: general defaults for slideshow interval, page-progress, and auto-progress; persisted via `/api/system-settings` keys (`presentation_interval`, `presentation_page_progress`, `presentation_auto_progress`). These defaults are applied when creating new presentation events (the custom event modal reads them and falls back to per-event values when editing).
      - Videos: system-wide defaults for `autoplay`, `loop`, `volume`, and `muted`; persisted via `/api/system-settings` keys (`video_autoplay`, `video_loop`, `video_volume`, `video_muted`). These defaults are applied when creating new video events (the custom event modal reads them and falls back to per-event values when editing).
      - Other event types (website, message, other): placeholders for defaults
    - ⚙️ System (superadmin)
      - Organization Info and Advanced Configuration placeholders
  - Role gating: Admin/Superadmin tabs are hidden if the user lacks permission; System is superadmin-only.
  - API clients use relative `/api/...` URLs so the Vite dev proxy handles requests without CORS issues. The settings UI calls are centralized in `dashboard/src/apiSystemSettings.ts` (system settings) and `dashboard/src/apiAcademicPeriods.ts` (periods CRUD).
  - Nested tabs: implemented as controlled components using `selectedItem` with stateful handlers to prevent sub-tab resets during updates.
  - Academic periods API client (`dashboard/src/apiAcademicPeriods.ts`): provides type-safe camelCase accessors (listAcademicPeriods, getAcademicPeriod, createAcademicPeriod, updateAcademicPeriod, setActiveAcademicPeriod, archiveAcademicPeriod, restoreAcademicPeriod, getAcademicPeriodUsage, deleteAcademicPeriod).

- Dashboard page (`dashboard/src/dashboard.tsx`):
  - Card-based overview of all Raumgruppen (room groups) with real-time status monitoring.
  - Global statistics: total infoscreens, online/offline counts, warning groups.
  - Filter buttons: All / Online / Offline / Warnings with dynamic counts.
  - Per-group cards show:
    - Currently active event (title, type, date/time in the local timezone)
    - Health bar with online/offline ratio and color-coded status
    - Expandable client list with last-alive timestamps
    - Bulk restart button for offline clients
  - Uses Syncfusion ButtonComponent, ToastComponent, and card CSS classes.
  - Auto-refresh every 15 seconds; a manual refresh button is available.
  - The "Nicht zugeordnet" group always appears last in the sorted list.

- Ressourcen page (`dashboard/src/ressourcen.tsx`):
  - Timeline view showing all groups and their active events in parallel.
  - Uses Syncfusion ScheduleComponent with TimelineViews (day/week modes).
  - Compact row display: 65px height per group, dynamically calculated total height.
  - Group ordering panel with drag up/down controls; order persisted to the backend via `/api/groups/order`.
  - Filters out the "Nicht zugeordnet" group from the timeline display.
  - Fetches events per group for the current date range; displays the first active event per group.
  - Color-coded event bars using `getGroupColor()` from `groupColors.ts`.
  - Resource-based timeline: each group is a resource row, with events mapped to `ResourceId`.
  - Real-time updates: loads events on mount and when the view/date changes.
  - Custom CSS in `dashboard/src/ressourcen.css` for timeline styling and controls.

- User dropdown technical notes:
  - Dependencies: `@syncfusion/ej2-react-splitbuttons` and `@syncfusion/ej2-splitbuttons` must be installed.
  - Vite: add both to `optimizeDeps.include` in `vite.config.ts` to avoid import-analysis errors.
  - Dev containers: when `node_modules` is a named volume, recreate the dashboard node_modules volume after adding dependencies so `npm ci` runs inside the container.

Note: Syncfusion usage in the dashboard is already documented above; if a UI for conversion status/downloads is added later, link its routes and components here.

## Local development

- Compose: development is `docker-compose.yml` + `docker-compose.override.yml`.
- API (dev): `server/Dockerfile.dev` with debugpy on 5678, Flask app `wsgi:app` on :8000.
- Dashboard (dev): `dashboard/Dockerfile.dev` exposes :5173 and waits for the API via `dashboard/wait-for-backend.sh`.
- Mosquitto: allows anonymous access in dev; WebSocket on :9001.
- Common env vars: `DB_CONN`, `DB_USER`, `DB_PASSWORD`, `DB_HOST=db`, `DB_NAME`, `ENV`, `MQTT_USER`, `MQTT_PASSWORD`.
- Alembic: prod compose runs `alembic ... upgrade head` and `server/init_defaults.py` before gunicorn.
- Local dev: prefer `python server/initialize_database.py` for one-shot setup (migrations + defaults + academic periods).
- Defaults: `server/init_defaults.py` seeds initial system settings such as `supplement_table_url` and `supplement_table_enabled` if missing.
- `server/init_academic_periods.py` remains available to (re)seed school years.

## Production

- `docker-compose.prod.yml` uses prebuilt images (`ghcr.io/robbstarkaustria/*`).
- Nginx serves the dashboard and proxies the API; TLS certs are expected in `certs/` and mounted to `/etc/nginx/certs`.

## Environment variables (reference)

- DB_CONN — Preferred DB URL for services. Example: `mysql+pymysql://${DB_USER}:${DB_PASSWORD}@db/${DB_NAME}`
- DB_USER, DB_PASSWORD, DB_NAME, DB_HOST — Used to assemble DB_CONN in dev if missing; inside containers `DB_HOST=db`.
- ENV — `development` or `production`; in development, `server/database.py` loads `.env`.
- MQTT_BROKER_HOST, MQTT_BROKER_PORT — Defaults `mqtt` and `1883`; MQTT_USER/MQTT_PASSWORD optional (dev often anonymous per the Mosquitto config).
- VITE_API_URL — Dashboard build-time base URL (prod); in dev the Vite proxy serves `/api` to `server:8000`.
- HEARTBEAT_GRACE_PERIOD_DEV / HEARTBEAT_GRACE_PERIOD_PROD — Groups' "alive" window (defaults 180s dev / 170s prod). Clients send heartbeats every ~65s; the grace periods allow 2 missed heartbeats plus a safety margin.
- REFRESH_SECONDS — Optional scheduler republish interval; `0` disables periodic refresh.
- PRIORITY_SCREENSHOT_TTL_SECONDS — Optional monitoring priority window in seconds (default `120`); controls when event screenshots are considered active priority.

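The heartbeat grace-period math above can be sketched as a small liveness check. This is an illustrative sketch, not the repo's actual implementation; the function name `is_alive` and the `env` parameter are assumptions, while the env var names and defaults come from the list above.

```python
import os
from datetime import datetime, timedelta, timezone

def is_alive(last_alive: datetime, env: str = "development") -> bool:
    """Return True if a client's last heartbeat falls within the grace window.

    Clients heartbeat roughly every 65s; the defaults (180s dev / 170s prod)
    tolerate two missed beats plus a safety margin.
    """
    key = "HEARTBEAT_GRACE_PERIOD_DEV" if env == "development" else "HEARTBEAT_GRACE_PERIOD_PROD"
    default = 180 if env == "development" else 170
    grace = int(os.environ.get(key, default))
    # Normalize naive DB timestamps to UTC before comparing (see Conventions below)
    if last_alive.tzinfo is None:
        last_alive = last_alive.replace(tzinfo=timezone.utc)
    return datetime.now(timezone.utc) - last_alive <= timedelta(seconds=grace)
```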
## Conventions & gotchas

- **Datetime Handling**:
  - Always compare datetimes in UTC; some DB values may be naive, so normalize before comparing (see `routes/events.py`).
  - The database stores timestamps in UTC (naive datetimes are normalized to UTC by the backend).
  - The API returns ISO strings **without** a 'Z' suffix: `"2025-11-27T20:03:00"`.
  - The frontend **must** append 'Z' before parsing: `const utcStr = dateStr.endsWith('Z') ? dateStr : dateStr + 'Z'; new Date(utcStr);`
  - Display in the local timezone using `toLocaleTimeString('de-DE', { hour: '2-digit', minute: '2-digit' })`.
  - When sending to the API, use `date.toISOString()`, which includes 'Z' and is UTC.
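On the backend side, the normalize-then-serialize rule above can be sketched with two helpers. Function names are illustrative (the repo's actual helpers may differ); the behavior mirrors the bullets: naive values are treated as UTC, and API output is ISO without a 'Z'.

```python
from datetime import datetime, timezone

def to_utc(dt: datetime) -> datetime:
    """Normalize a possibly-naive DB datetime to aware UTC before comparing."""
    return dt.replace(tzinfo=timezone.utc) if dt.tzinfo is None else dt.astimezone(timezone.utc)

def to_api_iso(dt: datetime) -> str:
    """Serialize for the API: ISO string without 'Z'/offset (frontend re-appends 'Z')."""
    return to_utc(dt).replace(tzinfo=None).isoformat()
```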
- **JSON Naming Convention**:
  - Backend uses snake_case internally (Python convention).
  - API returns camelCase JSON (web standard): `startTime`, `endTime`, `groupId`, etc.
  - Use `dict_to_camel_case()` from `server/serializers.py` before `jsonify()`.
  - Frontend consumes camelCase directly; the Syncfusion scheduler maintains internal PascalCase via field mappings.
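A minimal sketch of what `dict_to_camel_case()` has to do (the real implementation lives in `server/serializers.py` and may handle more cases):

```python
def _camel(key: str) -> str:
    """snake_case -> camelCase for a single key."""
    head, *rest = key.split("_")
    return head + "".join(part.capitalize() for part in rest)

def dict_to_camel_case(obj):
    """Recursively convert snake_case dict keys to camelCase (sketch)."""
    if isinstance(obj, dict):
        return {_camel(k): dict_to_camel_case(v) for k, v in obj.items()}
    if isinstance(obj, list):
        return [dict_to_camel_case(v) for v in obj]
    return obj
```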
- Scheduler: enforces UTC comparisons and normalizes naive timestamps. It publishes only currently active events and clears retained topics for groups with no active events. It also queries a future window (default: 7 days) and expands recurring events using RFC 5545 rules. Event exceptions are respected; logging is concise and conversion lookups are cached.
- Use retained MQTT messages for state that clients must recover after reconnect (events per group, client group_id).
- Clients should parse `event_type` and then read the corresponding nested payload (`presentation`, `website`, `video`, etc.). `website` and `webuntis` use the same nested `website` payload with `type: browser` and a `url`. Video events include `autoplay`, `loop`, `volume`, and `muted`.
- The in-container DB host is `db`; do not use `localhost` inside services.
- No separate dev vs prod secret conventions: use the same env var keys across environments (e.g., `DB_CONN`, `MQTT_USER`, `MQTT_PASSWORD`).
- When adding a new route:
  1) Create a Blueprint in `server/routes/...`,
  2) Register it in `server/wsgi.py`,
  3) Manage the `Session()` lifecycle,
  4) Return JSON-safe values (serialize enums and datetimes), and
  5) Use `dict_to_camel_case()` for camelCase JSON responses.
Docs maintenance guardrails (solo-friendly): update this file alongside code changes (services/MQTT/API/UTC/env). Keep it concise (20–50 lines per section). Never include secrets.

- When extending media types, update `MediaType` and any logic in `eventmedia` and the dashboard that depends on it.
- Academic periods: events/media can optionally be associated with periods for educational organization. Only one period should be active at a time (`is_active=True`).
- Initialization scripts: legacy DB init scripts were removed; use Alembic and `initialize_database.py` going forward.

### Recurrence & holidays: conventions

- Do not pre-expand recurrences on the backend. Always send master events with `RecurrenceRule` + `RecurrenceException`.
- Ensure EXDATE tokens are RFC 5545 timestamps (`yyyyMMddTHHmmssZ`) matching the occurrence start time (UTC) so Syncfusion can exclude them natively.
- School holidays are scoped by `academic_period_id`; holiday imports and queries should use the relevant academic period rather than treating holiday rows as global.
- Holiday write operations (manual/import) must validate date ranges against the selected academic period.
- Overlap policy: same normalized `name+region` overlaps (including adjacent ranges) are merged; overlaps with a different identity are conflicts (manual blocked, import skipped with details).
- When `skip_holidays` or the recurrence changes, regenerate `EventException` rows so `RecurrenceException` stays in sync, using the event's `academic_period_id` holidays (or only unassigned holidays for legacy events without a period).
- Single occurrence detach: use `POST /api/events/<id>/occurrences/<date>/detach` to create standalone events and add EXDATE entries without modifying master events. The frontend persists edits via `actionComplete` (`requestType='eventChanged'`).
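The EXDATE token format above can be sketched in a few lines. The helper names are invented for illustration; the compact `yyyyMMddTHHmmssZ` format and the UTC-occurrence-start semantics are from the conventions above, and joining tokens with commas is an assumption about how the `RecurrenceException` string is assembled.

```python
from datetime import datetime, timezone

def exdate_token(occurrence_start_utc: datetime) -> str:
    """Format one EXDATE token in RFC 5545 compact form (yyyyMMddTHHmmssZ)."""
    if occurrence_start_utc.tzinfo is not None:
        occurrence_start_utc = occurrence_start_utc.astimezone(timezone.utc).replace(tzinfo=None)
    return occurrence_start_utc.strftime("%Y%m%dT%H%M%SZ")

def recurrence_exception(skipped_starts) -> str:
    """Join tokens into a RecurrenceException string (comma-joining is an assumption)."""
    return ",".join(exdate_token(dt) for dt in skipped_starts)
```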
## Quick examples

- Adding a client description persists to the DB and publishes the group via MQTT: see `PUT /api/clients/<uuid>/description` in `routes/clients.py`.
- Bulk group assignment emits retained messages for each client: `PUT /api/clients/group`.
- Listener heartbeat path: `infoscreen/<uuid>/heartbeat` → sets `clients.last_alive` and captures process health data.
- Client monitoring flow: client publishes to `infoscreen/{uuid}/logs/error` and `infoscreen/{uuid}/health` → listener stores/updates monitoring state → API serves `/api/client-logs/monitoring-overview`, `/api/client-logs/recent-errors`, and `/api/client-logs/<uuid>/logs` → superadmin monitoring dashboard displays live status.
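The retained group-assignment publish mentioned above can be sketched as follows. The topic name and payload shape here are illustrative assumptions (see `MQTT_EVENT_PAYLOAD_GUIDE.md` for the authoritative contracts); the point is building the message and publishing it retained so a reconnecting client recovers its group immediately.

```python
import json

def group_assignment_message(uuid: str, group_id: int):
    """Build a retained per-client group message (topic/payload shape assumed)."""
    topic = f"infoscreen/{uuid}/group"
    payload = json.dumps({"group_id": group_id})
    # With paho-mqtt, publish retained so the broker replays it on reconnect:
    #   client.publish(topic, payload, qos=1, retain=True)
    return topic, payload
```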
## Scheduler payloads: presentation extras

- Presentation event payloads include `page_progress` and `auto_progress` in addition to `slide_interval` and media files. These are sourced from per-event fields in the database (with system defaults applied on event creation).
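A sketch of the presentation payload shape implied by the bullet above. The field names match the bullet; the exact nesting and the `media_files` key are assumptions (see `MQTT_EVENT_PAYLOAD_GUIDE.md` for the authoritative payload).

```python
def presentation_payload(event: dict) -> dict:
    """Assemble the nested presentation payload (illustrative shape)."""
    return {
        "event_type": "presentation",
        "presentation": {
            "slide_interval": event["slide_interval"],
            "page_progress": event["page_progress"],
            "auto_progress": event["auto_progress"],
            "media_files": event.get("media_files", []),
        },
    }
```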
## Scheduler payloads: website & webuntis

- For both `website` and `webuntis`, the scheduler emits a nested `website` object:
  - `{ "type": "browser", "url": "https://..." }`
- The `event_type` remains `website` or `webuntis`. Clients should treat both identically for rendering.
- The WebUntis URL is set at event creation by reading the system `supplement_table_url`.
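The shared shape above can be sketched as a single builder; the nested `website` object is exactly as documented, while the top-level dict structure around it is an illustrative assumption.

```python
def website_payload(event_type: str, url: str) -> dict:
    """Both 'website' and 'webuntis' carry the same nested website object;
    only event_type differs."""
    assert event_type in ("website", "webuntis")
    return {
        "event_type": event_type,
        "website": {"type": "browser", "url": url},
    }
```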
Questions or unclear areas? Tell us if you need: exact devcontainer debugging steps, a stricter Alembic workflow, or a seed dataset beyond `init_defaults.py`.

## Academic Periods System

- **Purpose**: Organize events and media by educational cycles (school years, semesters, trimesters) with full lifecycle management.
- **Design**: Fully backward compatible: existing events/media continue to work without a period assignment.
- **Lifecycle States**:
  - Active: exactly one period at a time (all others are deactivated on activation)
  - Inactive: saved period, not currently active
  - Archived: soft-deleted; hidden from the normal list; can be restored
  - Deleted: hard-deleted; permanent removal (only when no linked events exist and no active recurrence)
- **Archive Rules**: Cannot archive active periods or periods with recurring master events that have current/future occurrences.
- **Delete Rules**: Only archived inactive periods can be hard-deleted; blocked if linked events exist.
- **Validation Rules**:
  - Name: required, trimmed, unique among non-archived periods
  - Dates: startDate ≤ endDate
  - Type: schuljahr, semester, or trimester
  - Overlaps: disallowed within the same periodType (allowed across types)
- **Recurrence Spillover Detection**: Archive/delete blocked if recurring master events assigned to the period still generate current/future occurrences.
- **Model Fields**: `id`, `name`, `display_name`, `start_date`, `end_date`, `period_type`, `is_active`, `is_archived`, `archived_at`, `archived_by`, `created_at`, `updated_at`
- **Events/Media Association**: Both `Event` and `EventMedia` have an optional `academic_period_id` FK for organizational grouping.
- **UI Integration** (`dashboard/src/settings.tsx` > 🗂️ Perioden):
  - List with badges (Active/Archived)
  - Create/Edit dialogs with validation
  - Activate, Archive, Restore, Delete actions with blocker preflight checks
  - Archive visibility toggle to show/hide retired periods
  - Error dialogs showing the exact blockers (linked events, active recurrence, active status)
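The validation rules above can be sketched as one function. This is illustrative, not the repo's validator: the function signature, the error strings, and the choice to count a shared boundary day as an overlap are all assumptions; the rules themselves (name, date order, type set, same-type overlap block) come from the list above.

```python
from datetime import date

VALID_TYPES = {"schuljahr", "semester", "trimester"}

def validate_period(name, start, end, period_type, existing):
    """Check a period against the rules above.

    `existing` is a list of (start, end, period_type) tuples for non-archived
    periods; returns a list of error strings (empty means valid).
    """
    errors = []
    if not name or not name.strip():
        errors.append("name required")
    if start > end:
        errors.append("startDate must be <= endDate")
    if period_type not in VALID_TYPES:
        errors.append("invalid periodType")
    for s, e, t in existing:
        # Overlap blocks only within the same type (boundary-inclusive here)
        if t == period_type and start <= e and s <= end:
            errors.append("overlaps existing period of same type")
            break
    return errors
```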
## Changelog Style Guide (Program info)

- Source: `dashboard/public/program-info.json`; newest entry first
- Fields per release: `version`, `date` (YYYY-MM-DD), `changes` (array of short bullets)
- Tone: concise, user-facing; German wording; area prefixes allowed (e.g., “UI: …”, “API: …”)
- Categories via emoji or words: Added (🆕/✨), Changed (🛠️), Fixed (✅/🐛), Removed (🗑️), Security (🔒), Deprecated (⚠️)
- Breaking changes must be prefixed with `BREAKING:`
- Keep ≤ 8–10 bullets; summarize or group micro-changes
- JSON hygiene: valid JSON, no trailing commas; don’t edit historical entries except for typos

## Versioning Convention (Tech vs UI)

- Use one unified app version across technical and user-facing release notes.
- `dashboard/public/program-info.json` is user-facing and should list only user-visible changes.
- `TECH-CHANGELOG.md` can include deeper technical details for the same released version.
- If server/infrastructure work is implemented but not yet released or not user-visible, document it under the latest released section as:
  - `Backend technical work (post-release notes; no version bump)`
  - Do not create a new version header in `TECH-CHANGELOG.md` for internal milestones alone.
- Bump version numbers when a release is actually cut/deployed (or when user-facing release notes are published), not for intermediate backend-only steps.
- When UI integration lands later, include the user-visible part in the next release version and reference prior post-release technical groundwork when useful.

## Canonical docs map

- Repo entry: `README.md`
- Instruction governance: `AI-INSTRUCTIONS-MAINTENANCE.md`
- Technical release details: `TECH-CHANGELOG.md`
- Workspace/development notes: `DEV-CHANGELOG.md`
- MQTT payload details: `MQTT_EVENT_PAYLOAD_GUIDE.md`
- TV power contract: `TV_POWER_INTENT_SERVER_CONTRACT_V1.md`
- Frontend patterns: `FRONTEND_DESIGN_RULES.md`
- Archived historical docs: `docs/archive/`

@@ -36,6 +36,27 @@ Update the instructions in the same commit as your change whenever you:

- Include concrete examples from this repo when describing patterns (e.g., which route shows enum serialization).
- Never include secrets or real tokens; show only variable names and example formats.

## Scope boundaries (important)

To avoid turning `.github/copilot-instructions.md` into a shadow README/changelog, keep clear boundaries:

- `.github/copilot-instructions.md`: quick operational brief for agents (architecture snapshot, non-negotiable conventions, key paths, critical contracts).
- `README.md`: project entrypoint and documentation navigation.
- `TECH-CHANGELOG.md` and `DEV-CHANGELOG.md`: change history and release/development notes.
- Feature contracts/specs: dedicated files (for example `MQTT_EVENT_PAYLOAD_GUIDE.md`, `TV_POWER_INTENT_SERVER_CONTRACT_V1.md`).

Do not place long historical sections, release summaries, or full endpoint catalogs in `.github/copilot-instructions.md`.

## Size and quality guardrails

- Target size for `.github/copilot-instructions.md`: about 120–220 lines.
- If a new section exceeds ~10 lines, prefer linking to an existing canonical doc instead.
- Keep each section focused on actionability for coding agents.
- Remove duplicate rules already stated in another section.
- Use concise bullets over long prose blocks.

Quick pre-commit check:

- Is this content a rule/pattern needed during coding now?
- If it is historical context, move it to changelog/archive docs.
- If it is deep reference material, move it to the canonical feature doc and link it.

## Solo-friendly workflow

- Update docs in the same commit as your change:
  - Code changed → docs changed (copilot-instructions, `.env.example`, `deployment.md` as needed)

@@ -101,4 +122,5 @@ exit 0 # warn only; do not block commit

- Dev/Prod docs: `deployment.md`, `.env.example`

## Documentation sync log

- 2026-03-24: Synced docs for completed monitoring rollout and presentation flag persistence fix (`page_progress` / `auto_progress`). Updated `.github/copilot-instructions.md`, `README.md`, `TECH-CHANGELOG.md`, `DEV-CHANGELOG.md`, and `CLIENT_MONITORING_IMPLEMENTATION_GUIDE.md` without a user-version bump.
- 2026-03-24: Synced docs for completed monitoring rollout and presentation flag persistence fix (`page_progress` / `auto_progress`). Updated `.github/copilot-instructions.md`, `README.md`, `TECH-CHANGELOG.md`, `DEV-CHANGELOG.md`, and `docs/archive/CLIENT_MONITORING_IMPLEMENTATION_GUIDE.md` without a user-version bump.
- 2026-04-01: Synced docs for the TV power intent Phase 1 implementation and contract consistency. Updated `.github/copilot-instructions.md`, `MQTT_EVENT_PAYLOAD_GUIDE.md`, `docs/archive/TV_POWER_PHASE_1_SERVER_HANDOFF.md`, `TECH-CHANGELOG.md`, and `DEV-CHANGELOG.md` to match scheduler behavior (`infoscreen/groups/{group_id}/power/intent`, `schema_version: "1.0"`, transition + heartbeat semantics, poll-based expiry).

@@ -891,7 +891,7 @@ Reset: After 5 minutes of successful operation

- Base URL: `http://192.168.43.201:8000`
- Health check: `GET /health`
- Test logs: `GET /api/client-logs/test` (no auth)
- Full API docs: See `CLIENT_MONITORING_IMPLEMENTATION_GUIDE.md` on server
- Full API docs: See `docs/archive/CLIENT_MONITORING_IMPLEMENTATION_GUIDE.md` on server

**MQTT Broker:**

- Host: `192.168.43.201`

@@ -974,6 +974,6 @@ watchdog.monitor_loop()

**END OF SPECIFICATION**

Questions? Refer to:

- `CLIENT_MONITORING_IMPLEMENTATION_GUIDE.md` (server repo)
- `docs/archive/CLIENT_MONITORING_IMPLEMENTATION_GUIDE.md` (server repo)
- Server API: `http://192.168.43.201:8000/api/client-logs/test`
- MQTT test: `mosquitto_sub -h 192.168.43.201 -t infoscreen/#`

@@ -5,6 +5,37 @@ This changelog tracks all changes made in the development workspace, including i

---

## Unreleased (development workspace)
- Crash detection API: Added `GET /api/clients/crashed` returning clients with `process_status=crashed` or stale heartbeat; includes `crash_reason` field (`process_crashed` | `heartbeat_stale`).
- Crash auto-recovery (scheduler): Feature-flagged loop (`CRASH_RECOVERY_ENABLED`) scans crash candidates, issues `reboot_host` command, publishes to primary + compat MQTT topics; lockout window and expiry configurable via env.
- Command expiry sweep (scheduler): Unconditional per-cycle sweep in `sweep_expired_commands()` marks non-terminal `ClientCommand` rows past `expires_at` as `expired`.
- `restart_app` action registered in `server/routes/clients.py` API action map; sends same command lifecycle as `reboot_host`; safety lockout covers both actions.
- `service_failed` listener: subscribes to `infoscreen/+/service_failed` on every connect; persists `service_failed_at` + `service_failed_unit` to `Client`; empty payload (retain clear) silently ignored.
- Broker connection health: Listener health handler now extracts `broker_connection.reconnect_count` + `broker_connection.last_disconnect_at` and persists to `Client`.
- DB migration `b1c2d3e4f5a6`: adds `service_failed_at`, `service_failed_unit`, `mqtt_reconnect_count`, `mqtt_last_disconnect_at` to `clients` table.
- Model update: `models/models.py` Client class updated with all four new columns.
- `GET /api/clients/service_failed`: lists clients with `service_failed_at` set, admin-or-higher gated.
- `POST /api/clients/<uuid>/clear_service_failed`: clears DB flag and publishes empty retained MQTT to `infoscreen/{uuid}/service_failed`.
- Monitoring overview includes `mqtt_reconnect_count` + `mqtt_last_disconnect_at` per client.
- Frontend monitoring: orange service-failed alert panel (hidden when count=0), auto-refresh 15s, per-row Quittieren action.
- Frontend monitoring: client detail now shows MQTT reconnect count + last disconnect timestamp.
- Frontend types: `ServiceFailedClient`, `ServiceFailedClientsResponse`; helpers `fetchServiceFailedClients()`, `clearServiceFailed()` added to `dashboard/src/apiClients.ts`.
- `MQTT_EVENT_PAYLOAD_GUIDE.md`: added `service_failed` topic contract.
- MQTT auth hardening: Listener and scheduler now connect to broker with env-configured credentials (`MQTT_BROKER_HOST`, `MQTT_BROKER_PORT`, `MQTT_USER`, `MQTT_PASSWORD`) instead of anonymous fixed host/port defaults; optional TLS env toggles added in code path (`MQTT_TLS_*`).
- Broker auth enforcement: `mosquitto/config/mosquitto.conf` now disables anonymous access and enables password-file authentication. `docker-compose.yml` MQTT service now bootstraps/updates password entries from env (`MQTT_USER`/`MQTT_PASSWORD`, optional canary user) before starting broker.
- Compose wiring: Added MQTT credential env propagation for listener/scheduler in both base and dev override compose files and switched MQTT healthcheck publish to authenticated mode.
- Backend implementation: Introduced client command lifecycle foundation for remote control in `server/routes/clients.py` with command persistence (`ClientCommand`), schema-based MQTT publish to `infoscreen/{uuid}/commands` (QoS1, non-retained), new endpoints `POST /api/clients/<uuid>/shutdown` and `GET /api/clients/commands/<command_id>`, and restart safety lockout (`blocked_safety` after 3 restarts in 15 minutes). Added migration `server/alembic/versions/aa12bb34cc56_add_client_commands_table.py` and model updates in `models/models.py`. Restart path keeps transitional legacy MQTT publish to `clients/{uuid}/restart` for compatibility.
- Listener integration: `listener/listener.py` now subscribes to `infoscreen/+/commands/ack` and updates command lifecycle states from client ACK payloads (`accepted`, `execution_started`, `completed`, `failed`).
- Frontend API client prep: Extended `dashboard/src/apiClients.ts` with `ClientCommand` typing and helper calls for lifecycle consumption (`shutdownClient`, `fetchClientCommandStatus`), and updated `restartClient` to accept optional reason payload.
- Contract freeze clarification: implementation-plan docs now explicitly freeze canonical MQTT topics (`infoscreen/{uuid}/commands`, `infoscreen/{uuid}/commands/ack`) and JSON schemas with examples; added transitional singular-topic compatibility aliases (`infoscreen/{uuid}/command`, `infoscreen/{uuid}/command/ack`) in server publish and listener ingest.
- Action value canonicalization: command payload actions are now frozen as host-level values (`reboot_host`, `shutdown_host`). API endpoint mapping is explicit (`/restart` -> `reboot_host`, `/shutdown` -> `shutdown_host`), and docs/examples were updated to remove `restart` payload ambiguity.
- Client helper snippets: Added frozen payload validation artifacts `implementation-plans/reboot-command-payload-schemas.md` and `implementation-plans/reboot-command-payload-schemas.json` (copy-ready snippets plus machine-validated JSON Schema).
- Documentation alignment: Added active reboot implementation handoff docs under `implementation-plans/` and linked them in `README.md` for immediate cross-team access (`reboot-implementation-handoff-share.md`, `reboot-implementation-handoff-client-team.md`, `reboot-kickoff-summary.md`).
- Programminfo GUI regression/fix: `dashboard/public/program-info.json` could not be loaded in Programminfo menu due to invalid JSON in the new alpha.16 changelog line (malformed quote in a text entry). Fixed JSON entry and verified file parses correctly again.
- Dashboard holiday banner fix: `dashboard/src/dashboard.tsx` — `loadHolidayStatus` now uses a stable `useCallback` with empty deps, preventing repeated re-creation on render. `useEffect` depends only on the stable callback reference.
- Dashboard Syncfusion stale-render fix: `MessageComponent` in the holiday banner now receives `key={`${severity}:${text}`}` to force remount when severity or text changes; without this Syncfusion cached stale DOM and the banner did not update reactively.
- Dashboard German text: Replaced transliterated forms (ae/oe/ue) with correct Umlauts throughout visible dashboard UI strings — `Präsentation`, `für`, `prüfen`, `Ferienüberschneidungen`, `verfügbar`, `Vorfälle`, `Ausfälle`.
- TV power intent (Phase 1): Scheduler publishes retained QoS1 group-level intents to `infoscreen/groups/{group_id}/power/intent` with transition+heartbeat semantics, startup/reconnect republish, and poll-based expiry (`max(3 × poll_interval_sec, 90s)`).
- TV power validation: Added unit/integration/canary coverage in `scheduler/test_power_intent_utils.py`, `scheduler/test_power_intent_scheduler.py`, and `test_power_intent_canary.py`.
- Monitoring system completion: End-to-end monitoring pipeline is active (MQTT logs/health → listener persistence → monitoring APIs → superadmin dashboard).
- Monitoring API: Added/active endpoints `GET /api/client-logs/monitoring-overview` and `GET /api/client-logs/recent-errors`; per-client logs via `GET /api/client-logs/<uuid>/logs`.
- Dashboard monitoring UI: Superadmin monitoring page is integrated and displays client health status, screenshots, process metadata, and recent error activity.
@@ -31,7 +31,7 @@ Example payload:

 ```json
 {
-  "schema_version": "tv-power-intent.v1",
+  "schema_version": "1.0",
   "intent_id": "9cf26d9b-87a3-42f1-8446-e90bb6f6ce63",
   "group_id": 12,
   "desired_state": "on",
@@ -39,7 +39,9 @@ Example payload:
   "issued_at": "2026-03-31T10:15:30Z",
   "expires_at": "2026-03-31T10:17:00Z",
   "poll_interval_sec": 30,
-  "source": "scheduler"
+  "source": "scheduler",
+  "active_event_ids": [148],
+  "event_window_start": "2026-03-31T10:15:00Z",
+  "event_window_end": "2026-03-31T11:00:00Z"
 }
 ```
@@ -48,6 +50,91 @@ Contract notes:

 - Heartbeat republishes keep `intent_id` stable while refreshing `issued_at` and `expires_at`.
 - Expiry is poll-based: `max(3 × poll_interval_sec, 90s)`.
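The expiry rule above can be sketched directly from the contract. The helper name `compute_intent_expiry` is illustrative (the real computation lives in the scheduler), but the arithmetic matches the published example payload:

```python
from datetime import datetime, timedelta, timezone

def compute_intent_expiry(issued_at: datetime, poll_interval_sec: int) -> datetime:
    # Poll-based expiry: max(3 × poll_interval_sec, 90 s) after issuance, so a
    # client that misses up to two polls still sees a non-expired intent.
    ttl_sec = max(3 * poll_interval_sec, 90)
    return issued_at + timedelta(seconds=ttl_sec)

issued = datetime(2026, 3, 31, 10, 15, 30, tzinfo=timezone.utc)
print(compute_intent_expiry(issued, 30).isoformat())  # 2026-03-31T10:17:00+00:00
```

With `poll_interval_sec: 30` the TTL is exactly 90 seconds, which is why the example payload's `expires_at` is 90 seconds after `issued_at`.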
### Service-Failed Notification (client → server, retained)
- **Topic**: `infoscreen/{uuid}/service_failed`
- **QoS**: 1
- **Retained**: Yes
- **Direction**: client → server
- **Purpose**: Client signals that systemd has exhausted restart attempts (`StartLimitBurst` exceeded) — manual intervention is required.

Example payload:

```json
{
  "event": "service_failed",
  "unit": "infoscreen-simclient.service",
  "client_uuid": "9b8d1856-ff34-4864-a726-12de072d0f77",
  "failed_at": "2026-04-05T08:00:00Z"
}
```

Contract notes:
- Message is retained so the server receives it even after a broker restart.
- Server persists `service_failed_at` and `service_failed_unit` to the `clients` table.
- To clear after resolution: `POST /api/clients/<uuid>/clear_service_failed` — clears the DB flag and publishes an empty retained payload to delete the retained message from the broker.
- An empty payload (zero bytes) on this topic is a retain-clear in transit; the listener ignores it.
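A minimal sketch of the payload and the retain-clear convention. The helper name is illustrative; actual publishing would go through an authenticated MQTT client with `retain=True` on this topic:

```python
import json

def build_service_failed_payload(unit: str, client_uuid: str, failed_at: str) -> bytes:
    # Retained payload the client publishes when systemd gives up restarting.
    return json.dumps({
        "event": "service_failed",
        "unit": unit,
        "client_uuid": client_uuid,
        "failed_at": failed_at,
    }).encode()

# Clearing the retained message means publishing *empty bytes* (retain=True)
# to the same topic; the listener treats the empty payload as a retain-clear.
RETAIN_CLEAR: bytes = b""

payload = build_service_failed_payload(
    "infoscreen-simclient.service",
    "9b8d1856-ff34-4864-a726-12de072d0f77",
    "2026-04-05T08:00:00Z",
)
print(json.loads(payload)["event"])  # service_failed
```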
### Client Command Intent (Phase 1)
- **Topic**: `infoscreen/{uuid}/commands`
- **QoS**: 1
- **Retained**: No
- **Format**: JSON object
- **Purpose**: Per-client control commands (currently `reboot_host`, `shutdown_host`, and `restart_app`)

Compatibility note:
- During the restart transition, the server also publishes a legacy restart command to `clients/{uuid}/restart` with payload `{ "action": "restart" }`.
- During the topic naming transition, the server also publishes the command payload to `infoscreen/{uuid}/command`.

Example payload:

```json
{
  "schema_version": "1.0",
  "command_id": "5d1f8b4b-7e85-44fb-8f38-3f5d5da5e2e4",
  "client_uuid": "9b8d1856-ff34-4864-a726-12de072d0f77",
  "action": "reboot_host",
  "issued_at": "2026-04-03T12:48:10Z",
  "expires_at": "2026-04-03T12:52:10Z",
  "requested_by": 1,
  "reason": "operator_request"
}
```

Contract notes:
- Clients must reject stale commands where local UTC time is greater than `expires_at`.
- Clients must deduplicate by `command_id` and never execute a duplicate command twice.
- `schema_version` is required for forward compatibility.
- Allowed command action values in v1: `reboot_host`, `shutdown_host`, `restart_app`.
- `restart_app` = soft app restart (no OS reboot); `reboot_host` = full OS reboot.
- API mapping for operators: the restart endpoint emits `reboot_host`; the shutdown endpoint emits `shutdown_host`.
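The two client-side guards in the contract notes (stale rejection and `command_id` deduplication) can be sketched as one check. This is an illustrative helper, not the client team's actual handler:

```python
from datetime import datetime, timezone

def should_execute(command: dict, seen_ids: set, now: datetime) -> bool:
    # Reject stale commands: local UTC time is past expires_at.
    expires_at = datetime.fromisoformat(command["expires_at"].replace("Z", "+00:00"))
    if now > expires_at:
        return False
    # Deduplicate by command_id: never execute the same command twice.
    if command["command_id"] in seen_ids:
        return False
    seen_ids.add(command["command_id"])
    return True

cmd = {
    "command_id": "5d1f8b4b-7e85-44fb-8f38-3f5d5da5e2e4",
    "action": "reboot_host",
    "expires_at": "2026-04-03T12:52:10Z",
}
seen: set = set()
now = datetime(2026, 4, 3, 12, 49, 0, tzinfo=timezone.utc)
print(should_execute(cmd, seen, now))  # True  (fresh, unseen)
print(should_execute(cmd, seen, now))  # False (duplicate)
```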
### Client Command Acknowledgements (Phase 1)
- **Topic**: `infoscreen/{uuid}/commands/ack`
- **QoS**: 1 (recommended)
- **Retained**: No
- **Format**: JSON object
- **Purpose**: Client reports command lifecycle progression back to the server

Compatibility note:
- During the topic naming transition, the listener also accepts acknowledgements from `infoscreen/{uuid}/command/ack`.

Example payload:

```json
{
  "command_id": "5d1f8b4b-7e85-44fb-8f38-3f5d5da5e2e4",
  "status": "execution_started",
  "error_code": null,
  "error_message": null
}
```

Allowed `status` values:
- `accepted`
- `execution_started`
- `completed`
- `failed`
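On the server side, the listener maps these ACK statuses onto the lifecycle states persisted on the `ClientCommand` row. A simplified sketch of that mapping (the authoritative logic lives in `listener/listener.py`; the `accepted` → `ack_received` pairing follows the lifecycle states named in the changelogs):

```python
# ACK statuses reported by clients, mapped to server-side lifecycle states.
ACK_STATUS_TO_LIFECYCLE = {
    "accepted": "ack_received",
    "execution_started": "execution_started",
    "completed": "completed",
    "failed": "failed",
}

def lifecycle_state_for_ack(ack: dict) -> str:
    status = ack.get("status")
    if status not in ACK_STATUS_TO_LIFECYCLE:
        raise ValueError(f"unknown ack status: {status!r}")
    return ACK_STATUS_TO_LIFECYCLE[status]

print(lifecycle_state_for_ack({"status": "accepted"}))  # ack_received
```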
## Message Structure

### General Principles

@@ -150,7 +237,7 @@ Every event includes these common fields:
 }
 ```

-**Note**: WebUntis events use the same payload structure as website events. The URL is fetched from system settings (`webuntis_url`) rather than being specified per-event. Clients treat `webuntis` and `website` event types identically—both display a website.
+**Note**: WebUntis events use the same payload structure as website events. The URL is fetched from system settings (`supplement_table_url`) rather than being specified per-event. Clients treat `webuntis` and `website` event types identically—both display a website.

 #### Video Events
28
README.md
@@ -15,6 +15,13 @@ Core stack:

- Messaging: MQTT (Mosquitto)
- Background jobs: Redis + RQ + Gotenberg

## Latest Release Highlights (2026.1.0-alpha.16)

- Dashboard holiday status banner now updates reliably after hard refresh and after switching between settings and dashboard.
- Production startup now auto-initializes and auto-activates the academic period for the current date.
- Dashboard German UI wording was polished with proper Umlauts.
- User-facing changelog source: [dashboard/public/program-info.json](dashboard/public/program-info.json)

## Architecture (Short)

- Dashboard talks only to API (`/api/...` via Vite proxy in dev).

@@ -149,13 +156,11 @@ Rollout strategy (Phase 1):

- [SUPERADMIN_SETUP.md](SUPERADMIN_SETUP.md)

### Monitoring, Screenshots, Health
- [CLIENT_MONITORING_IMPLEMENTATION_GUIDE.md](CLIENT_MONITORING_IMPLEMENTATION_GUIDE.md)
- [CLIENT_MONITORING_SPECIFICATION.md](CLIENT_MONITORING_SPECIFICATION.md)
- [SCREENSHOT_IMPLEMENTATION.md](SCREENSHOT_IMPLEMENTATION.md)

### MQTT & Payloads
- [MQTT_EVENT_PAYLOAD_GUIDE.md](MQTT_EVENT_PAYLOAD_GUIDE.md)
- [MQTT_PAYLOAD_MIGRATION_GUIDE.md](MQTT_PAYLOAD_MIGRATION_GUIDE.md)

### Events, Calendar, WebUntis
- [WEBUNTIS_EVENT_IMPLEMENTATION.md](WEBUNTIS_EVENT_IMPLEMENTATION.md)

@@ -167,9 +172,17 @@ Rollout strategy (Phase 1):

- [docs/archive/CLEANUP_SUMMARY.md](docs/archive/CLEANUP_SUMMARY.md)

### Conversion / Media
- [pptx_conversion_guide.md](pptx_conversion_guide.md)
- [pptx_conversion_guide_gotenberg.md](pptx_conversion_guide_gotenberg.md)

### Historical / Archived Docs
- [docs/archive/CLIENT_MONITORING_IMPLEMENTATION_GUIDE.md](docs/archive/CLIENT_MONITORING_IMPLEMENTATION_GUIDE.md) - completed implementation plan/history
- [docs/archive/MQTT_DASHBOARD_V1_TO_V2_MIGRATION.md](docs/archive/MQTT_DASHBOARD_V1_TO_V2_MIGRATION.md) - completed MQTT payload migration notes
- [docs/archive/PPTX_CONVERSION_LEGACY_APPROACH.md](docs/archive/PPTX_CONVERSION_LEGACY_APPROACH.md) - legacy conversion approach (superseded)
- [docs/archive/TV_POWER_PHASE_1_COORDINATION.md](docs/archive/TV_POWER_PHASE_1_COORDINATION.md) - TV power Phase 1 coordination tasklist
- [docs/archive/TV_POWER_PHASE_1_SERVER_HANDOFF.md](docs/archive/TV_POWER_PHASE_1_SERVER_HANDOFF.md) - TV power Phase 1 server handoff notes
- [docs/archive/TV_POWER_PHASE_1_CANARY_VALIDATION.md](docs/archive/TV_POWER_PHASE_1_CANARY_VALIDATION.md) - TV power Phase 1 canary validation checklist
- [docs/archive/TV_POWER_PHASE_1_IMPLEMENTATION_CHECKLIST.md](docs/archive/TV_POWER_PHASE_1_IMPLEMENTATION_CHECKLIST.md) - TV power Phase 1 implementation checklist

### Frontend
- [FRONTEND_DESIGN_RULES.md](FRONTEND_DESIGN_RULES.md)
- [dashboard/README.md](dashboard/README.md)

@@ -179,9 +192,18 @@ Rollout strategy (Phase 1):

- [AI-INSTRUCTIONS-MAINTENANCE.md](AI-INSTRUCTIONS-MAINTENANCE.md)
- [DEV-CHANGELOG.md](DEV-CHANGELOG.md)

### Active Implementation Plans
- [implementation-plans/reboot-implementation-handoff-share.md](implementation-plans/reboot-implementation-handoff-share.md)
- [implementation-plans/reboot-implementation-handoff-client-team.md](implementation-plans/reboot-implementation-handoff-client-team.md)
- [implementation-plans/reboot-kickoff-summary.md](implementation-plans/reboot-kickoff-summary.md)

## API Highlights

- Core resources: clients, groups, events, academic periods
- Client command lifecycle:
  - `POST /api/clients/<uuid>/restart`
  - `POST /api/clients/<uuid>/shutdown`
  - `GET /api/clients/commands/<command_id>`
- Holidays: `GET/POST /api/holidays`, `POST /api/holidays/upload`, `PUT/DELETE /api/holidays/<id>`
- Media: upload/download/stream + conversion status
- Auth: login/logout/change-password
149
RESTART_VALIDATION_CHECKLIST.md
Normal file
@@ -0,0 +1,149 @@
# Restart Validation Checklist

Purpose: Validate the end-to-end restart command flow after MQTT auth hardening.

## Scope

- API command issue route: `POST /api/clients/{uuid}/restart`
- MQTT command topic: `infoscreen/{uuid}/commands` (compat: `infoscreen/{uuid}/command`)
- MQTT ACK topic: `infoscreen/{uuid}/commands/ack` (compat: `infoscreen/{uuid}/command/ack`)
- Status API: `GET /api/clients/commands/{command_id}`

## Preconditions

- Stack is up and healthy (`db`, `mqtt`, `server`, `listener`, `scheduler`).
- You have an `admin` or `superadmin` account.
- At least one canary client is online and can process restart commands.
- `.env` has valid `MQTT_USER` / `MQTT_PASSWORD`.

## 1) Open Monitoring Session (MQTT)

On the host/server:

```bash
set -a
. ./.env
set +a

mosquitto_sub -h 127.0.0.1 -p 1883 \
  -u "$MQTT_USER" -P "$MQTT_PASSWORD" \
  -t "infoscreen/+/commands" \
  -t "infoscreen/+/commands/ack" \
  -t "infoscreen/+/command" \
  -t "infoscreen/+/command/ack" \
  -v
```

Expected:
- Command publish appears on `infoscreen/{uuid}/commands`.
- ACK(s) appear on `infoscreen/{uuid}/commands/ack`.

## 2) Login and Keep Session Cookie

```bash
API_BASE="http://127.0.0.1:8000"
USER="<admin_or_superadmin_username>"
PASS="<password>"

curl -sS -X POST "$API_BASE/api/auth/login" \
  -H "Content-Type: application/json" \
  -d "{\"username\":\"$USER\",\"password\":\"$PASS\"}" \
  -c /tmp/infoscreen-cookies.txt
```

Expected:
- Login success response.
- Cookie jar file created at `/tmp/infoscreen-cookies.txt`.

## 3) Pick Target Client UUID

Option A: Use a known canary UUID.

Option B: Query alive clients:

```bash
curl -sS "$API_BASE/api/clients/with_alive_status" -b /tmp/infoscreen-cookies.txt
```

Choose one `uuid` where `is_alive` is `true`.

## 4) Issue Restart Command

```bash
CLIENT_UUID="<target_uuid>"

curl -sS -X POST "$API_BASE/api/clients/$CLIENT_UUID/restart" \
  -H "Content-Type: application/json" \
  -b /tmp/infoscreen-cookies.txt \
  -d '{"reason":"canary_restart_validation"}'
```

Expected:
- HTTP `202` on success.
- JSON includes `command.commandId` and an initial status at or near `published`.
- In the MQTT monitor, a command payload with:
  - `schema_version: "1.0"`
  - `action: "reboot_host"`
  - a matching `command_id`.

## 5) Poll Command Lifecycle Until Terminal

```bash
COMMAND_ID="<command_id_from_previous_step>"

for i in $(seq 1 20); do
  curl -sS "$API_BASE/api/clients/commands/$COMMAND_ID" -b /tmp/infoscreen-cookies.txt
  echo
  sleep 3
done
```

Expected status progression (typical):
- `queued` -> `publish_in_progress` -> `published` -> `ack_received` -> `execution_started` -> `completed`

Failure/alternate terminal states:
- `failed` (check `errorCode` / `errorMessage`)
- `blocked_safety` (reboot lockout triggered)
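The polling loop can stop early once a terminal state appears. A small helper, sketched in Python (state names are taken from this checklist plus the `expired` state produced by the scheduler's command expiry sweep; the usage loop is hypothetical):

```python
# Terminal command states: polling can stop as soon as one of these appears.
TERMINAL_STATES = {"completed", "failed", "blocked_safety", "expired"}

def is_terminal(status: str) -> bool:
    return status in TERMINAL_STATES

# Hypothetical usage inside a polling loop:
# for _ in range(20):
#     status = fetch_status(command_id)["status"]  # fetch_status is illustrative
#     if is_terminal(status):
#         break
#     time.sleep(3)
```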
## 6) Validate Offline/Timeout Behavior

- Repeat step 4 for an offline client (or stop the client process first).
- Confirm the command does not falsely end as `completed`.
- Confirm the status remains non-success and carries usable failure diagnostics.

## 7) Validate Safety Lockout

Current lockout in the API route:
- Threshold: 3 reboot commands
- Window: 15 minutes

Test:
- Send 4 restart commands quickly for the same `uuid`.

Expected:
- The fourth request returns HTTP `429`.
- Command entry state `blocked_safety` with lockout error details.
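The lockout behavior being tested here amounts to a rolling-window count. An illustrative sketch (the authoritative logic lives in `server/routes/clients.py`):

```python
from datetime import datetime, timedelta, timezone

LOCKOUT_THRESHOLD = 3                  # reboot commands
LOCKOUT_WINDOW = timedelta(minutes=15)

def is_locked_out(recent_issued_at: list, now: datetime) -> bool:
    # Count reboot commands issued within the rolling 15-minute window;
    # a new command is blocked once the threshold is already reached.
    window_start = now - LOCKOUT_WINDOW
    in_window = [t for t in recent_issued_at if t >= window_start]
    return len(in_window) >= LOCKOUT_THRESHOLD

now = datetime(2026, 4, 3, 13, 0, 0, tzinfo=timezone.utc)
recent = [now - timedelta(minutes=m) for m in (1, 4, 9)]
print(is_locked_out(recent, now))                         # True: 3 in window
print(is_locked_out(recent[:2], now))                     # False: only 2
print(is_locked_out([now - timedelta(minutes=30)], now))  # False: outside window
```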
## 8) Service Log Spot Check

```bash
docker compose logs --tail=150 server listener mqtt
```

Expected:
- No MQTT auth errors (`Not authorized`, `Connection Refused: not authorised`).
- Listener logs show ACK processing for the `command_id`.

## 9) Acceptance Criteria

- Restart command publish is visible on MQTT.
- ACK is received and mapped by the listener.
- The status endpoint reaches the correct terminal state.
- Safety lockout works under repeated restart attempts.
- No auth regression in broker/service logs.

## Cleanup

```bash
rm -f /tmp/infoscreen-cookies.txt
```
@@ -5,7 +5,73 @@

This changelog documents technical and developer-relevant changes included in public releases. For development workspace changes, see DEV-CHANGELOG.md. Not all changes here are reflected in the user-facing changelog (`program-info.json`), and not all UI/feature changes are repeated here. Some changes (e.g., backend refactoring, API adjustments, infrastructure, developer tooling, or internal logic) may only appear in TECH-CHANGELOG.md. For UI/feature changes, see `dashboard/public/program-info.json`.

## Unreleased
- **Crash detection, auto-recovery, and service_failed monitoring (2026-04-05)**:
  - Added `GET /api/clients/crashed` endpoint: returns active clients with `process_status=crashed` or stale heartbeat beyond grace period, with `crash_reason` field.
  - Added `restart_app` command action alongside existing `reboot_host`/`shutdown_host`; registered in `server/routes/clients.py` with same safety lockout.
  - Scheduler: Added crash auto-recovery loop (feature-flagged via `CRASH_RECOVERY_ENABLED`): scans candidates via `get_crash_recovery_candidates()`, issues `reboot_host` command per client, publishes to primary + compat MQTT topics, updates command lifecycle.
  - Scheduler: Added unconditional command expiry sweep each poll cycle via `sweep_expired_commands()` in `scheduler/db_utils.py`: marks non-terminal `ClientCommand` rows with `expires_at < now` as `expired`.
  - Added `service_failed` topic ingestion in `listener/listener.py`: subscribe to `infoscreen/+/service_failed` on every connect; persist `service_failed_at` and `service_failed_unit` on Client; empty payload (retain clear) ignored.
  - Added `broker_connection` block extraction in health payload handler: persists `mqtt_reconnect_count` and `mqtt_last_disconnect_at` from `infoscreen/{uuid}/health`.
  - Added four new DB columns to `clients` table via migration `b1c2d3e4f5a6`: `service_failed_at`, `service_failed_unit`, `mqtt_reconnect_count`, `mqtt_last_disconnect_at`.
  - Added `GET /api/clients/service_failed` endpoint: lists clients with `service_failed_at` set, ordered by event time desc.
  - Added `POST /api/clients/<uuid>/clear_service_failed` endpoint: clears DB flag and publishes empty retained MQTT message to clear `infoscreen/{uuid}/service_failed`.
  - Monitoring overview API (`GET /api/client-logs/monitoring-overview`) now includes `mqtt_reconnect_count` and `mqtt_last_disconnect_at` per client.
  - Frontend: Added orange service-failed alert panel to monitoring page (hidden when empty, auto-refresh 15s, per-row Quittieren button with loading/success/error states).
  - Frontend: Client detail panel in monitoring now shows MQTT reconnect count and last disconnect timestamp.
  - Frontend: Added `ServiceFailedClient`, `ServiceFailedClientsResponse` types; `fetchServiceFailedClients()` and `clearServiceFailed()` API helpers in `dashboard/src/apiClients.ts`.
  - Added `service_failed` topic contract to `MQTT_EVENT_PAYLOAD_GUIDE.md`.
- 🔐 **MQTT auth hardening for server-side services (2026-04-03)**:
  - `listener/listener.py` now uses env-based broker connectivity for host/port and credentials (`MQTT_BROKER_HOST`, `MQTT_BROKER_PORT`, `MQTT_USER`, `MQTT_PASSWORD`) instead of anonymous fixed `mqtt:1883`.
  - `scheduler/scheduler.py` now uses the same env-based MQTT auth path and optional TLS toggles (`MQTT_TLS_ENABLED`, `MQTT_TLS_CA_CERT`, `MQTT_TLS_CERTFILE`, `MQTT_TLS_KEYFILE`, `MQTT_TLS_INSECURE`).
  - `docker-compose.yml` and `docker-compose.override.yml` now pass MQTT credentials into listener and scheduler containers for consistent authenticated connections.
  - Mosquitto is now configured for authenticated access (`allow_anonymous false`, `password_file /mosquitto/config/passwd`) and bootstraps credentials from env at container startup.
  - MQTT healthcheck publish now authenticates with configured broker credentials.
- 🔁 **Client command lifecycle foundation (restart/shutdown) (2026-04-03)**:
  - Added persistent command tracking model `ClientCommand` in `models/models.py` and Alembic migration `aa12bb34cc56_add_client_commands_table.py`.
  - Upgraded `POST /api/clients/<uuid>/restart` from fire-and-forget publish to lifecycle-aware command issuance with command metadata (`command_id`, `issued_at`, `expires_at`, `reason`, `requested_by`).
  - Added `POST /api/clients/<uuid>/shutdown` endpoint with the same lifecycle contract.
  - Added `GET /api/clients/commands/<command_id>` status endpoint for command-state polling.
  - Added restart safety lockout in API path: max 3 restart commands per client in rolling 15 minutes, returning `blocked_safety` when threshold is exceeded.
  - Added command MQTT publish to `infoscreen/{uuid}/commands` (QoS1, non-retained) and temporary legacy restart compatibility publish to `clients/{uuid}/restart`.
  - Added temporary topic compatibility publish to `infoscreen/{uuid}/command` and listener acceptance of `infoscreen/{uuid}/command/ack` to bridge singular/plural naming assumptions.
  - Canonicalized command payload action values to host-level semantics: `reboot_host` and `shutdown_host` (API routes remain `/restart` and `/shutdown` for operator UX compatibility).
  - Added frozen payload validation snippets for integration/client tooling in `implementation-plans/reboot-command-payload-schemas.md` and `implementation-plans/reboot-command-payload-schemas.json`.
  - Listener now subscribes to `infoscreen/{uuid}/commands/ack` and maps client acknowledgements into command lifecycle states (`ack_received`, `execution_started`, `completed`, `failed`).
  - Initial lifecycle statuses implemented server-side: `queued`, `publish_in_progress`, `published`, `failed`, and `blocked_safety`.
  - Frontend API helper extended in `dashboard/src/apiClients.ts` with `ClientCommand` typing plus command APIs for shutdown and status polling preparation.
## 2026.1.0-alpha.16 (2026-04-02)
- 🐛 **Dashboard holiday banner refactoring and state fix (`dashboard/src/dashboard.tsx`)**:
  - **Motivation — unstable fetch function:** `loadHolidayStatus` had `location.pathname` in its `useCallback` dependency array, causing a new function reference to be created on every navigation event. The `useEffect` depending on that reference then re-fired, producing overlapping API calls at mount that cancelled each other via the request-sequence guard, leaving the banner unresolved.
  - **Refactoring:** Removed `location.pathname` from `useCallback` deps (it was unused inside the function body). The callback now has an empty dependency array, making its reference stable across the component lifetime. The `useEffect` is keyed only to the stable callback reference — no spurious re-fires.
  - **Motivation — Syncfusion stale render:** Syncfusion's `MessageComponent` caches its rendered DOM internally and does not reactively update when React passes new children or props. Even after React state changed, the component displayed whatever text was rendered on first mount.
  - **Fix:** Added a `key` prop derived from `${severity}:${text}` to `MessageComponent`. React unmounts and remounts the component whenever the key changes, bypassing Syncfusion's internal caching and ensuring the correct message is always visible.
  - **Result:** Active-period name and holiday overlap details now render correctly on hard refresh, initial load, and route transitions without additional API calls.
- 🗓️ **Academic period bootstrap hardening (`server/init_academic_periods.py`)**:
  - Refactored initialization into an idempotent flow:
    - seed default periods only when the table is empty,
    - on every run, activate exactly the non-archived period covering `date.today()`.
  - Enforces single-active behavior by deactivating all previously active periods before setting the period for today.
  - Emits an explicit warning if no period covers the current date (all remain inactive), improving operational diagnostics.
- 🚀 **Production startup alignment (`docker-compose.prod.yml`)**:
  - Server startup command now runs `python /app/server/init_academic_periods.py` after migrations and default settings bootstrap.
  - Removes the manual post-deploy step to set an active academic period.
- 🌐 **Dashboard UX/text refinement (`dashboard/src/dashboard.tsx`)**:
  - Converted user-facing transliterated German strings to proper Umlauts in the dashboard (for example: "für", "prüfen", "Ferienüberschneidungen", "Vorfälle", "Ausfälle").

Notes for integrators:
- On production boot, the active period is now derived from current-date coverage automatically.
- If customer calendars do not include today, startup logs a warning and the dashboard banner will still guide admins to configure periods.
## 2026.1.0-alpha.15 (2026-03-31)
- 🔌 **TV Power Intent Phase 1 (server-side)**:
  - Scheduler now publishes retained QoS1 group-level intents to `infoscreen/groups/{group_id}/power/intent`.
  - Implemented deterministic intent computation helpers in `scheduler/db_utils.py` (`compute_group_power_intent_basis`, `build_group_power_intent_body`, `compute_group_power_intent_fingerprint`).
  - Implemented transition + heartbeat semantics in `scheduler/scheduler.py`: stable `intent_id` on heartbeat; new `intent_id` only on semantic transition.
  - Added startup publish and MQTT reconnect republish behavior for retained intent continuity.
  - Added poll-based expiry rule: `expires_at = issued_at + max(3 × poll_interval_sec, 90s)`.
  - Added scheduler tests and canary validation artifacts: `scheduler/test_power_intent_utils.py`, `scheduler/test_power_intent_scheduler.py`, `test_power_intent_canary.py`, and `TV_POWER_CANARY_VALIDATION_CHECKLIST.md`.
- 🗃️ **Holiday data model scoping to academic periods**:
  - Added period scoping for holidays via `SchoolHoliday.academic_period_id` (FK to academic periods) in `models/models.py`.
  - Added Alembic migration `f3c4d5e6a7b8_scope_school_holidays_to_academic_.py` to introduce FK/index/constraint updates for period-aware holiday storage.
55
TODO.md
Normal file
@@ -0,0 +1,55 @@
# TODO

## MQTT TLS Hardening (Production)

- [ ] Enable TLS listener in `mosquitto/config/mosquitto.conf` (e.g., port 8883) while keeping 1883 only for temporary migration if needed.
- [ ] Generate and deploy server certificate + private key for Mosquitto (CA-signed or internal PKI).
- [ ] Add CA certificate distribution strategy for all clients and services (server, listener, scheduler, external monitors).
- [ ] Set strict file permissions for cert/key material (`chmod 600` for keys, least-privilege ownership).
- [ ] Update Docker Compose MQTT service to mount TLS cert/key/CA paths read-only.
- [ ] Add environment variables for TLS in `.env` / `.env.example`:
  - `MQTT_TLS_ENABLED=true`
  - `MQTT_TLS_CA_CERT=<path>`
  - `MQTT_TLS_CERTFILE=<path>` (if mutual TLS used)
  - `MQTT_TLS_KEYFILE=<path>` (if mutual TLS used)
  - `MQTT_TLS_INSECURE=false`
- [ ] Switch internal services to TLS connection settings and verify authenticated reconnect behavior.
- [ ] Decide policy: TLS-only auth (username/password over TLS) vs. mutual TLS + username/password.
- [ ] Disable the non-TLS listener (1883) after all clients have migrated.
- [ ] Restrict MQTT firewall ingress to trusted source ranges only.
- [ ] Add Mosquitto ACL file for topic-level permissions per role/client type.
- [ ] Add cert rotation process (renewal schedule, rollout, rollback steps).
- [ ] Add monitoring/alerting for certificate expiry and broker auth failures.
- [ ] Add runbook section for external monitoring clients (how to connect with CA validation).
- [ ] Perform a staged rollout (canary group first), then full migration.
- [ ] Document final TLS contract in `MQTT_EVENT_PAYLOAD_GUIDE.md` and deployment docs.
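One way to picture the "switch internal services to TLS connection settings" item: a small helper (hypothetical, not yet in the codebase) that maps the proposed `MQTT_TLS_*` variables onto the keyword arguments paho-mqtt's `Client.tls_set()` accepts. A minimal sketch, assuming the env-var names from the checklist above:

```python
def mqtt_tls_kwargs(env):
    """Translate the proposed MQTT_TLS_* variables into keyword arguments
    for a paho-mqtt ``client.tls_set(...)`` call. Returns None when TLS is
    disabled so callers can skip TLS setup entirely."""
    if env.get("MQTT_TLS_ENABLED", "false").lower() != "true":
        return None
    kwargs = {"ca_certs": env["MQTT_TLS_CA_CERT"]}
    # Client cert/key are only present when mutual TLS is in use.
    if env.get("MQTT_TLS_CERTFILE"):
        kwargs["certfile"] = env["MQTT_TLS_CERTFILE"]
        kwargs["keyfile"] = env["MQTT_TLS_KEYFILE"]
    return kwargs

if __name__ == "__main__":
    print(mqtt_tls_kwargs({"MQTT_TLS_ENABLED": "true",
                           "MQTT_TLS_CA_CERT": "/certs/ca.pem"}))
```

Keeping this mapping in one place would let the server, listener, and scheduler share identical TLS behavior during the staged rollout.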

## Client Recovery Paths

### Path 1 — Software running → restart via MQTT ✅

- Server-side fully implemented (`restart_app` action, command lifecycle, monitoring panel).
- [ ] Client team: handle the `restart_app` action in the command handler (soft app restart, no reboot).

### Path 2 — Software crashed → MQTT unavailable

- The robust solution is **systemd `Restart=always`** (or `Restart=on-failure`) on the client device — no server involvement, the OS init system restarts the process automatically.
- The server detects the crash via a missing heartbeat (`process_status=crashed`), records it, and shows it in the monitoring panel. Recovery is confirmed when heartbeats resume.
- [ ] Client team: ensure the infoscreen service unit has `Restart=always` and `RestartSec=<delay>` configured in its systemd unit file.
- [ ] Evaluate whether MQTT `clean_session=False` + a fixed `client_id` is worth adding for cases where the app crashes but the MQTT connection briefly survives (would allow QoS1 command delivery on reconnect).
- Note: the existing scheduler crash recovery (`reboot_host` via MQTT) is unreliable for a fully crashed app unless the client uses a persistent MQTT session. Revisit if the client team enables `clean_session=False`.
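For the systemd item above, a minimal unit sketch — service name, binary path, and delay are illustrative placeholders, not the real client unit:

```ini
[Unit]
Description=Infoscreen client (illustrative unit name)
After=network-online.target

[Service]
ExecStart=/opt/infoscreen/client
Restart=always
RestartSec=5

[Install]
WantedBy=multi-user.target
```

`Restart=always` covers clean exits as well as crashes; `Restart=on-failure` is the narrower alternative if intentional shutdowns should stay down.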

### Path 3 — OS crashed / hung → power cycle needed (customer-dependent)

- No software-based recovery path is possible when the OS is unresponsive.
- Recovery requires external hardware intervention; options depend on customer infrastructure:
  - Smart plug / PDU with API (e.g., Shelly, Tasmota, APC, Raritan)
  - IPMI / iDRAC / BMC (server-class hardware)
  - CEC power command from another device on the same HDMI chain
  - Wake-on-LAN after a scheduled power cut (limited applicability)
- [ ] Clarify with the customer which hardware is available / acceptable.
- [ ] If a smart plug or PDU API is chosen: design a server-side "hard power cycle" command type and integration (out of scope until hardware is confirmed).
- [ ] Document the chosen solution and integrate it into the monitoring runbook once decided.

## Optional Security Follow-ups

- [ ] Move MQTT credentials to Docker secrets or a vault-backed secret source.
- [ ] Rotate `MQTT_USER`/`MQTT_PASSWORD` on a fixed schedule.
- [ ] Add fail2ban/rate-limiting protections for exposed broker ports.
@@ -1,83 +0,0 @@
# Server Handoff: TV Power Coordination

## Purpose

Implement server-side MQTT power intent publishing so clients can keep TVs on across adjacent events and power off safely after schedules end.

## Source of Truth

- Shared full plan: TV_POWER_COORDINATION_TASKLIST.md

## Scope (Server Team)

- Scheduler-to-intent mapping
- MQTT publishing semantics (retain, QoS, expiry)
- Conflict handling (group vs. client)
- Observability for intent lifecycle

## MQTT Contract (Server Responsibilities)

### Topics

- Primary (per-client): infoscreen/{client_id}/power/intent
- Optional (group-level): infoscreen/groups/{group_id}/power/intent

### Delivery Semantics

- QoS: 1
- retained: true
- Always publish UTC timestamps (ISO 8601 with Z)

### Intent Payload (v1)

```json
{
  "schema_version": "1.0",
  "intent_id": "uuid-or-monotonic-id",
  "issued_at": "2026-03-31T12:00:00Z",
  "expires_at": "2026-03-31T12:10:00Z",
  "target": {
    "client_id": "optional-if-group-topic",
    "group_id": "optional"
  },
  "power": {
    "desired_state": "on",
    "reason": "event_window_active",
    "grace_seconds": 30
  },
  "event_window": {
    "start": "2026-03-31T12:00:00Z",
    "end": "2026-03-31T13:00:00Z"
  }
}
```

## Required Behavior

### Adjacent/Overlapping Events

- Never publish an intermediate off intent when windows are contiguous/overlapping.
- Maintain continuous desired_state=on coverage across adjacent windows.

### Reconnect/Restart

- On scheduler restart, republish the effective retained intent.
- On event edits/cancellations, replace the retained intent with a fresh intent_id.

### Conflict Policy

- If both group and client intent exist: per-client overrides group.
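The precedence rule can be sketched in a few lines, assuming intents are plain dicts shaped like the payload above (the helper name is illustrative, not an existing function):

```python
def effective_intent(group_intent, client_intent):
    """Resolve the group-vs-client conflict: a per-client intent, when
    present, always overrides the group-level intent."""
    return client_intent if client_intent is not None else group_intent
```

The point of keeping this a pure function is that the same rule can be applied identically on the server (for observability) and on the client (when both retained topics are subscribed).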

### Expiry Safety

- expires_at must be set for every intent.
- The server should avoid publishing already-expired intents.
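The scheduler's actual helpers live in `scheduler/db_utils.py`; as a rough sketch of the expiry safety rule, using the poll-based formula from the changelog (`expires_at = issued_at + max(3 × poll_interval_sec, 90s)`):

```python
from datetime import datetime, timedelta, timezone

def compute_expires_at(issued_at, poll_interval_sec):
    # Poll-based rule: expires_at = issued_at + max(3 x poll_interval, 90 s),
    # so an intent survives up to three missed polls before going stale.
    return issued_at + timedelta(seconds=max(3 * poll_interval_sec, 90))

def is_publishable(expires_at, now=None):
    # Expiry safety: never publish an intent that is already expired.
    now = now or datetime.now(timezone.utc)
    return expires_at > now
```

Clients apply the same `expires_at` check on retained messages, which bounds how long a stale retained intent can drive TV power after the scheduler disappears.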

## Implementation Tasks

1. Add a scheduler mapping layer that computes the effective desired_state per client timeline.
2. Add an intent publisher with retained QoS1 delivery.
3. Generate a unique intent_id for each semantic transition.
4. Emit issued_at/expires_at and event_window consistently in UTC.
5. Add group-vs-client precedence logic.
6. Add logs/metrics for publish success, retained payload age, and transition count.
7. Add integration tests for adjacent events and reconnect replay.

## Acceptance Criteria

1. Adjacent events do not create OFF-gap intents.
2. A fresh client receives the retained intent after reconnect and gets the correct desired state.
3. Intent payloads are schema-valid, UTC-formatted, and include expiry.
4. Publish logs and metrics allow intent timeline reconstruction.

## Operational Notes

- Keep intent publishing idempotent and deterministic.
- Preserve backward compatibility while clients run in hybrid mode.
@@ -2,9 +2,9 @@
<html lang="en">
  <head>
    <meta charset="UTF-8" />
    <link rel="icon" type="image/svg+xml" href="/vite.svg" />
    <link rel="icon" type="image/png" href="/favicon.png" />
    <meta name="viewport" content="width=device-width, initial-scale=1.0" />
    <title>Vite + React + TS</title>
    <title>Infoscreen</title>
  </head>
  <body>
    <div id="root"></div>

BIN
dashboard/public/favicon.png
Normal file
BIN
dashboard/public/favicon.png
Normal file
Binary file not shown.
After Width: | Height: | Size: 225 KiB
@@ -1,6 +1,6 @@
{
  "appName": "Infoscreen-Management",
  "version": "2026.1.0-alpha.15",
  "version": "2026.1.0-alpha.16",
  "copyright": "© 2026 Third-Age-Applications",
  "supportContact": "support@third-age-applications.com",
  "description": "Eine zentrale Verwaltungsoberfläche für digitale Informationsbildschirme.",
@@ -25,11 +25,17 @@
      { "name": "Alembic", "license": "MIT" }
    ]
  },
  "buildInfo": {
    "buildDate": "2025-12-29T12:00:00Z",
    "commitId": "9f2ae8b44c3a"
  },
  "changelog": [
    {
      "version": "2026.1.0-alpha.16",
      "date": "2026-04-02",
      "changes": [
        "✅ Dashboard: Der Ferienstatus-Banner zeigt die aktive akademische Periode jetzt zuverlässig nach Hard-Refresh und beim Wechsel zwischen Dashboard und Einstellungen.",
        "🧭 Navigation: Der Link vom Ferienstatus-Banner zu den Einstellungen bleibt stabil und funktioniert konsistent für Admin-Rollen.",
        "🚀 Deployment: Akademische Perioden werden nach Initialisierung automatisch für das aktuelle Datum aktiviert (kein manueller Aktivierungsschritt direkt nach Rollout mehr nötig).",
        "🔤 Sprache: Mehrere deutsche UI-Texte im Dashboard wurden auf korrekte Umlaute umgestellt (zum Beispiel für, prüfen, Vorfälle und Ausfälle)."
      ]
    },
    {
      "version": "2026.1.0-alpha.15",
      "date": "2026-03-31",
@@ -19,8 +19,25 @@ export type PeriodUsage = {
  blockers: string[];
};

function normalizeAcademicPeriod(period: any): AcademicPeriod {
  return {
    id: Number(period.id),
    name: period.name,
    displayName: period.displayName ?? period.display_name ?? null,
    startDate: period.startDate ?? period.start_date,
    endDate: period.endDate ?? period.end_date,
    periodType: period.periodType ?? period.period_type,
    isActive: Boolean(period.isActive ?? period.is_active),
    isArchived: Boolean(period.isArchived ?? period.is_archived),
    archivedAt: period.archivedAt ?? period.archived_at ?? null,
    archivedBy: period.archivedBy ?? period.archived_by ?? null,
    createdAt: period.createdAt ?? period.created_at,
    updatedAt: period.updatedAt ?? period.updated_at,
  };
}

async function api<T>(url: string, init?: RequestInit): Promise<T> {
  const res = await fetch(url, { credentials: 'include', ...init });
  const res = await fetch(url, { credentials: 'include', cache: 'no-store', ...init });
  if (!res.ok) {
    const text = await res.text();
    try {
@@ -35,10 +52,10 @@ async function api<T>(url: string, init?: RequestInit): Promise<T> {

export async function getAcademicPeriodForDate(date: Date): Promise<AcademicPeriod | null> {
  const iso = date.toISOString().slice(0, 10);
  const { period } = await api<{ period: AcademicPeriod | null }>(
  const { period } = await api<{ period: any | null }>(
    `/api/academic_periods/for_date?date=${iso}`
  );
  return period ?? null;
  return period ? normalizeAcademicPeriod(period) : null;
}

export async function listAcademicPeriods(options?: {
@@ -53,20 +70,20 @@ export async function listAcademicPeriods(options?: {
    params.set('archivedOnly', '1');
  }
  const query = params.toString();
  const { periods } = await api<{ periods: AcademicPeriod[] }>(
  const { periods } = await api<{ periods: any[] }>(
    `/api/academic_periods${query ? `?${query}` : ''}`
  );
  return Array.isArray(periods) ? periods : [];
  return Array.isArray(periods) ? periods.map(normalizeAcademicPeriod) : [];
}

export async function getAcademicPeriod(id: number): Promise<AcademicPeriod> {
  const { period } = await api<{ period: AcademicPeriod }>(`/api/academic_periods/${id}`);
  return period;
  const { period } = await api<{ period: any }>(`/api/academic_periods/${id}`);
  return normalizeAcademicPeriod(period);
}

export async function getActiveAcademicPeriod(): Promise<AcademicPeriod | null> {
  const { period } = await api<{ period: AcademicPeriod | null }>(`/api/academic_periods/active`);
  return period ?? null;
  const { period } = await api<{ period: any | null }>(`/api/academic_periods/active`);
  return period ? normalizeAcademicPeriod(period) : null;
}

export async function createAcademicPeriod(payload: {
@@ -76,12 +93,12 @@ export async function createAcademicPeriod(payload: {
  endDate: string;
  periodType: 'schuljahr' | 'semester' | 'trimester';
}): Promise<AcademicPeriod> {
  const { period } = await api<{ period: AcademicPeriod }>(`/api/academic_periods`, {
  const { period } = await api<{ period: any }>(`/api/academic_periods`, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify(payload),
  });
  return period;
  return normalizeAcademicPeriod(period);
}

export async function updateAcademicPeriod(
@@ -94,36 +111,36 @@ export async function updateAcademicPeriod(
    periodType: 'schuljahr' | 'semester' | 'trimester';
  }>
): Promise<AcademicPeriod> {
  const { period } = await api<{ period: AcademicPeriod }>(`/api/academic_periods/${id}`, {
  const { period } = await api<{ period: any }>(`/api/academic_periods/${id}`, {
    method: 'PUT',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify(payload),
  });
  return period;
  return normalizeAcademicPeriod(period);
}

export async function setActiveAcademicPeriod(id: number): Promise<AcademicPeriod> {
  const { period } = await api<{ period: AcademicPeriod }>(`/api/academic_periods/${id}/activate`, {
  const { period } = await api<{ period: any }>(`/api/academic_periods/${id}/activate`, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
  });
  return period;
  return normalizeAcademicPeriod(period);
}

export async function archiveAcademicPeriod(id: number): Promise<AcademicPeriod> {
  const { period } = await api<{ period: AcademicPeriod }>(`/api/academic_periods/${id}/archive`, {
  const { period } = await api<{ period: any }>(`/api/academic_periods/${id}/archive`, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
  });
  return period;
  return normalizeAcademicPeriod(period);
}

export async function restoreAcademicPeriod(id: number): Promise<AcademicPeriod> {
  const { period } = await api<{ period: AcademicPeriod }>(`/api/academic_periods/${id}/restore`, {
  const { period } = await api<{ period: any }>(`/api/academic_periods/${id}/restore`, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
  });
  return period;
  return normalizeAcademicPeriod(period);
}

export async function getAcademicPeriodUsage(id: number): Promise<PeriodUsage> {
@@ -39,6 +39,8 @@ export interface MonitoringClient {
  };
  latestLog?: MonitoringLogEntry | null;
  latestError?: MonitoringLogEntry | null;
  mqttReconnectCount?: number | null;
  mqttLastDisconnectAt?: string | null;
}

export interface MonitoringOverview {
@@ -24,6 +24,62 @@ export interface Group {
  is_active?: boolean;
  clients: Client[];
}

export interface CrashedClient {
  uuid: string;
  description?: string | null;
  hostname?: string | null;
  ip?: string | null;
  group_id?: number | null;
  is_alive: boolean;
  process_status?: string | null;
  screen_health_status?: string | null;
  last_alive?: string | null;
  crash_reason: 'process_crashed' | 'heartbeat_stale';
}

export interface CrashedClientsResponse {
  crashed_count: number;
  grace_period_seconds: number;
  clients: CrashedClient[];
}

export interface ServiceFailedClient {
  uuid: string;
  description?: string | null;
  hostname?: string | null;
  ip?: string | null;
  group_id?: number | null;
  is_alive: boolean;
  last_alive?: string | null;
  service_failed_at: string;
  service_failed_unit?: string | null;
}

export interface ServiceFailedClientsResponse {
  service_failed_count: number;
  clients: ServiceFailedClient[];
}

export interface ClientCommand {
  commandId: string;
  clientUuid: string;
  action: 'reboot_host' | 'shutdown_host' | 'restart_app';
  status: string;
  reason?: string | null;
  requestedBy?: number | null;
  issuedAt?: string | null;
  expiresAt?: string | null;
  publishedAt?: string | null;
  ackedAt?: string | null;
  executionStartedAt?: string | null;
  completedAt?: string | null;
  failedAt?: string | null;
  errorCode?: string | null;
  errorMessage?: string | null;
  createdAt?: string | null;
  updatedAt?: string | null;
}
// Liefert alle Gruppen mit zugehörigen Clients
export async function fetchGroupsWithClients(): Promise<Group[]> {
  const response = await fetch('/api/groups/with_clients');
@@ -79,9 +135,11 @@ export async function updateClient(uuid: string, data: { description?: string; m
  return await res.json();
}

export async function restartClient(uuid: string): Promise<{ success: boolean; message?: string }> {
export async function restartClient(uuid: string, reason?: string): Promise<{ success: boolean; message?: string; command?: ClientCommand }> {
  const response = await fetch(`/api/clients/${uuid}/restart`, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ reason: reason || null }),
  });
  if (!response.ok) {
    const error = await response.json();
@@ -90,6 +148,58 @@ export async function restartClient(uuid: string): Promise<{ success: boolean; m
  return await response.json();
}

export async function shutdownClient(uuid: string, reason?: string): Promise<{ success: boolean; message?: string; command?: ClientCommand }> {
  const response = await fetch(`/api/clients/${uuid}/shutdown`, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ reason: reason || null }),
  });
  if (!response.ok) {
    const error = await response.json();
    throw new Error(error.error || 'Fehler beim Herunterfahren des Clients');
  }
  return await response.json();
}

export async function fetchClientCommandStatus(commandId: string): Promise<ClientCommand> {
  const response = await fetch(`/api/clients/commands/${commandId}`);
  if (!response.ok) {
    const error = await response.json();
    throw new Error(error.error || 'Fehler beim Laden des Command-Status');
  }
  return await response.json();
}

export async function fetchCrashedClients(): Promise<CrashedClientsResponse> {
  const response = await fetch('/api/clients/crashed', { credentials: 'include' });
  if (!response.ok) {
    const err = await response.json().catch(() => ({}));
    throw new Error(err.error || 'Fehler beim Laden der abgestürzten Clients');
  }
  return await response.json();
}

export async function fetchServiceFailedClients(): Promise<ServiceFailedClientsResponse> {
  const response = await fetch('/api/clients/service_failed', { credentials: 'include' });
  if (!response.ok) {
    const err = await response.json().catch(() => ({}));
    throw new Error(err.error || 'Fehler beim Laden der service_failed Clients');
  }
  return await response.json();
}

export async function clearServiceFailed(uuid: string): Promise<{ success: boolean; message?: string }> {
  const response = await fetch(`/api/clients/${uuid}/clear_service_failed`, {
    method: 'POST',
    credentials: 'include',
  });
  if (!response.ok) {
    const err = await response.json().catch(() => ({}));
    throw new Error(err.error || 'Fehler beim Quittieren des service_failed Flags');
  }
  return await response.json();
}

export async function deleteClient(uuid: string) {
  const res = await fetch(`/api/clients/${uuid}`, {
    method: 'DELETE',

File diff suppressed because it is too large
@@ -370,4 +370,49 @@
  .monitoring-log-dialog-actions {
    padding: 0 0.2rem 0.4rem;
  }
}

/* Crash recovery panel */
.monitoring-crash-panel {
  border-left: 4px solid #dc2626;
  margin-bottom: 1.5rem;
}

.monitoring-service-failed-panel {
  border-left: 4px solid #ea580c;
  margin-bottom: 1.5rem;
}

.monitoring-crash-table {
  width: 100%;
  border-collapse: collapse;
  font-size: 0.875rem;
}

.monitoring-crash-table th {
  text-align: left;
  padding: 0.5rem 0.75rem;
  font-weight: 600;
  color: #64748b;
  border-bottom: 1px solid #e2e8f0;
  background: #f8fafc;
}

.monitoring-crash-table td {
  padding: 0.55rem 0.75rem;
  border-bottom: 1px solid #f1f5f9;
  vertical-align: middle;
}

.monitoring-crash-table tr:last-child td {
  border-bottom: none;
}

.monitoring-crash-table tr:hover td {
  background: #fef2f2;
}

.monitoring-meta-hint {
  color: #94a3b8;
  font-size: 0.8rem;
}
@@ -7,6 +7,16 @@ import {
  type MonitoringLogEntry,
  type MonitoringOverview,
} from './apiClientMonitoring';
import {
  fetchCrashedClients,
  fetchServiceFailedClients,
  clearServiceFailed,
  restartClient,
  type CrashedClient,
  type CrashedClientsResponse,
  type ServiceFailedClient,
  type ServiceFailedClientsResponse,
} from './apiClients';
import { useAuth } from './useAuth';
import { ButtonComponent } from '@syncfusion/ej2-react-buttons';
import { DropDownListComponent } from '@syncfusion/ej2-react-dropdowns';
@@ -156,6 +166,12 @@ const MonitoringDashboard: React.FC = () => {
  const [screenshotErrored, setScreenshotErrored] = React.useState<boolean>(false);
  const selectedClientUuidRef = React.useRef<string | null>(null);
  const [selectedLogEntry, setSelectedLogEntry] = React.useState<MonitoringLogEntry | null>(null);
  const [crashedClients, setCrashedClients] = React.useState<CrashedClientsResponse | null>(null);
  const [restartStates, setRestartStates] = React.useState<Record<string, 'idle' | 'loading' | 'success' | 'failed'>>({});
  const [restartErrors, setRestartErrors] = React.useState<Record<string, string>>({});
  const [serviceFailedClients, setServiceFailedClients] = React.useState<ServiceFailedClientsResponse | null>(null);
  const [clearStates, setClearStates] = React.useState<Record<string, 'idle' | 'loading' | 'success' | 'failed'>>({});
  const [clearErrors, setClearErrors] = React.useState<Record<string, string>>({});

  const selectedClient = React.useMemo<MonitoringClient | null>(() => {
    if (!overview || !selectedClientUuid) return null;
@@ -197,9 +213,37 @@ const MonitoringDashboard: React.FC = () => {
    }
  }, []);

  const loadCrashedClients = React.useCallback(async () => {
    try {
      const data = await fetchCrashedClients();
      setCrashedClients(data);
    } catch {
      // non-fatal: crashes panel just stays stale
    }
  }, []);

  const loadServiceFailedClients = React.useCallback(async () => {
    try {
      const data = await fetchServiceFailedClients();
      setServiceFailedClients(data);
    } catch {
      // non-fatal
    }
  }, []);

  React.useEffect(() => {
    loadOverview(hours, false);
  }, [hours, loadOverview]);
    loadCrashedClients();
    loadServiceFailedClients();
  }, [hours, loadOverview, loadCrashedClients, loadServiceFailedClients]);

  React.useEffect(() => {
    const id = window.setInterval(() => {
      loadCrashedClients();
      loadServiceFailedClients();
    }, REFRESH_INTERVAL_MS);
    return () => window.clearInterval(id);
  }, [loadCrashedClients, loadServiceFailedClients]);

  React.useEffect(() => {
    const hasActivePriorityScreenshots = (overview?.summary.activePriorityScreenshots || 0) > 0;
@@ -308,6 +352,194 @@ const MonitoringDashboard: React.FC = () => {
        {renderMetricCard('Fehler-Logs', overview?.summary.errorLogs || 0, 'Im gewählten Zeitraum', '#b91c1c')}
      </div>

      {crashedClients && crashedClients.crashed_count > 0 && (
        <div className="monitoring-panel monitoring-crash-panel">
          <div className="monitoring-panel-header">
            <h3 style={{ color: '#dc2626' }}>
              Abgestürzte / Nicht erreichbare Clients
            </h3>
            <span
              style={{
                background: '#fee2e2',
                color: '#991b1b',
                padding: '2px 10px',
                borderRadius: '12px',
                fontWeight: 600,
                fontSize: '0.85rem',
              }}
            >
              {crashedClients.crashed_count}
            </span>
          </div>
          <table className="monitoring-crash-table">
            <thead>
              <tr>
                <th>Client</th>
                <th>Gruppe</th>
                <th>Ursache</th>
                <th>Prozessstatus</th>
                <th>Letztes Signal</th>
                <th>Aktion</th>
              </tr>
            </thead>
            <tbody>
              {crashedClients.clients.map((c: CrashedClient) => {
                const state = restartStates[c.uuid] || 'idle';
                const errMsg = restartErrors[c.uuid];
                const displayName = c.description || c.hostname || c.uuid;
                return (
                  <tr key={c.uuid}>
                    <td>
                      <span title={c.uuid}>{displayName}</span>
                      {c.ip && <span className="monitoring-meta-hint"> ({c.ip})</span>}
                    </td>
                    <td>{c.group_id ?? '—'}</td>
                    <td>
                      <span
                        className="monitoring-status-badge"
                        style={
                          c.crash_reason === 'process_crashed'
                            ? { color: '#991b1b', backgroundColor: '#fee2e2' }
                            : { color: '#78350f', backgroundColor: '#fef3c7' }
                        }
                      >
                        {c.crash_reason === 'process_crashed' ? 'Prozess abgestürzt' : 'Heartbeat veraltet'}
                      </span>
                    </td>
                    <td>{c.process_status || '—'}</td>
                    <td>{formatRelative(c.last_alive)}</td>
                    <td>
                      {state === 'loading' && <span style={{ color: '#6b7280', fontSize: '0.85rem' }}>Wird gesendet…</span>}
                      {state === 'success' && <span style={{ color: '#15803d', fontSize: '0.85rem' }}>✓ Neustart gesendet</span>}
                      {state === 'failed' && (
                        <span style={{ color: '#dc2626', fontSize: '0.85rem' }} title={errMsg}>
                          ✗ Fehler
                        </span>
                      )}
                      {(state === 'idle' || state === 'failed') && (
                        <ButtonComponent
                          cssClass="e-small e-danger"
                          disabled={state === 'loading'}
                          onClick={async () => {
                            setRestartStates(prev => ({ ...prev, [c.uuid]: 'loading' }));
                            setRestartErrors(prev => { const n = { ...prev }; delete n[c.uuid]; return n; });
                            try {
                              await restartClient(c.uuid, c.crash_reason);
                              setRestartStates(prev => ({ ...prev, [c.uuid]: 'success' }));
                              setTimeout(() => {
                                setRestartStates(prev => ({ ...prev, [c.uuid]: 'idle' }));
                                loadCrashedClients();
                              }, 8000);
                            } catch (e) {
                              const msg = e instanceof Error ? e.message : 'Unbekannter Fehler';
                              setRestartStates(prev => ({ ...prev, [c.uuid]: 'failed' }));
                              setRestartErrors(prev => ({ ...prev, [c.uuid]: msg }));
                            }
                          }}
                        >
                          Neustart
                        </ButtonComponent>
                      )}
                    </td>
                  </tr>
                );
              })}
            </tbody>
          </table>
        </div>
      )}

      {serviceFailedClients && serviceFailedClients.service_failed_count > 0 && (
        <div className="monitoring-panel monitoring-service-failed-panel">
          <div className="monitoring-panel-header">
            <h3 style={{ color: '#7c2d12' }}>
              Service dauerhaft ausgefallen (systemd hat aufgegeben)
            </h3>
            <span
              style={{
                background: '#ffedd5',
                color: '#7c2d12',
                padding: '2px 10px',
                borderRadius: '12px',
                fontWeight: 600,
                fontSize: '0.85rem',
              }}
            >
              {serviceFailedClients.service_failed_count}
            </span>
          </div>
          <p className="monitoring-meta-hint" style={{ marginBottom: '0.75rem' }}>
            Diese Clients konnten von systemd nicht mehr automatisch neugestartet werden.
            Manuelle Intervention erforderlich. Nach Behebung bitte quittieren.
          </p>
          <table className="monitoring-crash-table">
            <thead>
              <tr>
                <th>Client</th>
                <th>Gruppe</th>
                <th>Unit</th>
                <th>Ausgefallen am</th>
                <th>Letztes Signal</th>
                <th>Aktion</th>
              </tr>
            </thead>
            <tbody>
              {serviceFailedClients.clients.map((c: ServiceFailedClient) => {
                const state = clearStates[c.uuid] || 'idle';
                const errMsg = clearErrors[c.uuid];
                const displayName = c.description || c.hostname || c.uuid;
                const failedAt = c.service_failed_at
                  ? new Date(c.service_failed_at.endsWith('Z') ? c.service_failed_at : c.service_failed_at + 'Z').toLocaleString('de-DE')
                  : '—';
                return (
                  <tr key={c.uuid}>
                    <td>
                      <span title={c.uuid}>{displayName}</span>
                      {c.ip && <span className="monitoring-meta-hint"> ({c.ip})</span>}
                    </td>
                    <td>{c.group_id ?? '—'}</td>
                    <td><code style={{ fontSize: '0.8rem' }}>{c.service_failed_unit || '—'}</code></td>
                    <td>{failedAt}</td>
                    <td>{formatRelative(c.last_alive)}</td>
                    <td>
                      {state === 'loading' && <span style={{ color: '#6b7280', fontSize: '0.85rem' }}>Wird quittiert…</span>}
                      {state === 'success' && <span style={{ color: '#15803d', fontSize: '0.85rem' }}>✓ Quittiert</span>}
                      {state === 'failed' && (
                        <span style={{ color: '#dc2626', fontSize: '0.85rem' }} title={errMsg}>✗ Fehler</span>
                      )}
                      {(state === 'idle' || state === 'failed') && (
                        <ButtonComponent
                          cssClass="e-small e-warning"
                          disabled={state === 'loading'}
                          onClick={async () => {
                            setClearStates(prev => ({ ...prev, [c.uuid]: 'loading' }));
                            setClearErrors(prev => { const n = { ...prev }; delete n[c.uuid]; return n; });
                            try {
                              await clearServiceFailed(c.uuid);
                              setClearStates(prev => ({ ...prev, [c.uuid]: 'success' }));
                              setTimeout(() => {
                                setClearStates(prev => ({ ...prev, [c.uuid]: 'idle' }));
                                loadServiceFailedClients();
                              }, 4000);
                            } catch (e) {
                              const msg = e instanceof Error ? e.message : 'Unbekannter Fehler';
                              setClearStates(prev => ({ ...prev, [c.uuid]: 'failed' }));
                              setClearErrors(prev => ({ ...prev, [c.uuid]: msg }));
                            }
                          }}
                        >
                          Quittieren
                        </ButtonComponent>
                      )}
                    </td>
                  </tr>
                );
              })}
            </tbody>
          </table>
        </div>
      )}

      {loading && !overview ? (
        <MessageComponent severity="Info" content="Monitoring-Daten werden geladen ..." />
      ) : (
@@ -393,6 +625,16 @@ const MonitoringDashboard: React.FC = () => {
        <span>Bildschirmstatus</span>
        <strong>{selectedClient.screenHealthStatus || 'UNKNOWN'}</strong>
      </div>
      <div className="monitoring-detail-row">
        <span>MQTT Reconnects</span>
        <strong>{selectedClient.mqttReconnectCount != null ? selectedClient.mqttReconnectCount : '—'}</strong>
      </div>
      {selectedClient.mqttLastDisconnectAt && (
        <div className="monitoring-detail-row">
          <span>Letzter Disconnect</span>
          <strong>{formatTimestamp(selectedClient.mqttLastDisconnectAt)}</strong>
        </div>
      )}
      <div className="monitoring-detail-row">
        <span>Letzte Analyse</span>
        <strong>{formatTimestamp(selectedClient.lastScreenshotAnalyzed)}</strong>
@@ -12,10 +12,6 @@ interface ProgramInfo {
frontend: { name: string; license: string }[];
backend: { name: string; license: string }[];
};
buildInfo: {
buildDate: string;
commitId: string;
};
changelog: {
version: string;
date: string;
@@ -85,30 +81,30 @@ const Programminfo: React.FC = () => {
</div>
</div>
<div className="e-card-content">
<div style={{ display: 'flex', flexDirection: 'column', gap: '0.5rem' }}>
<p>
<strong>Version:</strong> {info.version}
</p>
<p>
<strong>Copyright:</strong> {info.copyright}
</p>
<p>
<div style={{ display: 'flex', flexDirection: 'column', gap: '0.25rem' }}>
<div><strong>Version:</strong> {info.version}</div>
<div><strong>Copyright:</strong> {info.copyright}</div>
<div>
<strong>Support:</strong>{' '}
<a href={`mailto:${info.supportContact}`} style={{ color: '#2563eb', textDecoration: 'none' }}>
{info.supportContact}
</a>
</p>
<hr style={{ margin: '1rem 0' }} />
<h4 style={{ fontWeight: 600 }}>Build-Informationen</h4>
<p>
<strong>Build-Datum:</strong> {new Date(info.buildInfo.buildDate).toLocaleString('de-DE')}
</p>
<p>
<strong>Commit-ID:</strong>{' '}
</div>
<hr style={{ margin: '0.5rem 0' }} />
<div style={{ fontWeight: 600, fontSize: '0.875rem', marginBottom: '0.125rem' }}>Build-Informationen</div>
<div><strong>Build-Datum:</strong> {new Date(__BUILD_DATE__).toLocaleString('de-DE')}</div>
<div>
<strong>Umgebung:</strong>{' '}
<span style={{ fontFamily: monoFont, fontSize: '0.875rem', background: '#f3f4f6', padding: '0.125rem 0.25rem', borderRadius: '0.25rem' }}>
{info.buildInfo.commitId}
{__BUILD_ENV__}
</span>
</p>
</div>
<div>
<strong>Node.js:</strong>{' '}
<span style={{ fontFamily: monoFont, fontSize: '0.875rem', background: '#f3f4f6', padding: '0.125rem 0.25rem', borderRadius: '0.25rem' }}>
{__NODE_VERSION__}
</span>
</div>
</div>
</div>
</div>

@@ -21,12 +21,14 @@ import {
type PeriodUsage
} from './apiAcademicPeriods';
import { formatIsoDateForDisplay } from './dateFormatting';
import { Link } from 'react-router-dom';
import { Link, useLocation } from 'react-router-dom';

// Minimal event type for Syncfusion Tab 'selected' callback
type TabSelectedEvent = { selectedIndex?: number };

const Einstellungen: React.FC = () => {
  const location = useLocation();

  // Presentation settings state
  const [presentationInterval, setPresentationInterval] = React.useState(10);
  const [presentationPageProgress, setPresentationPageProgress] = React.useState(true);
@@ -670,6 +672,8 @@ const Einstellungen: React.FC = () => {
  const isAdmin = !!(user && ['admin', 'superadmin'].includes(user.role));
  const isSuperadmin = !!(user && user.role === 'superadmin');

  const [rootTabIndex, setRootTabIndex] = React.useState(0);

  // Preserve selected nested-tab indices to avoid resets on parent re-render
  const [academicTabIndex, setAcademicTabIndex] = React.useState(0);
  const [displayTabIndex, setDisplayTabIndex] = React.useState(0);
@@ -678,6 +682,22 @@ const Einstellungen: React.FC = () => {
  const [usersTabIndex, setUsersTabIndex] = React.useState(0);
  const [systemTabIndex, setSystemTabIndex] = React.useState(0);

  React.useEffect(() => {
    const params = new URLSearchParams(location.search);
    const focus = params.get('focus');

    if (focus === 'holidays') {
      setRootTabIndex(0);
      setAcademicTabIndex(1);
      return;
    }

    if (focus === 'academic-periods') {
      setRootTabIndex(0);
      setAcademicTabIndex(0);
    }
  }, [location.search]);

  // ---------- Leaf content functions (second-level tabs) ----------
  // Academic Calendar
  // (Old separate Import/List tab contents removed in favor of combined tab)
@@ -1695,7 +1715,11 @@ const Einstellungen: React.FC = () => {

<h2 style={{ marginBottom: 20, fontSize: '24px', fontWeight: 600 }}>Einstellungen</h2>

<TabComponent heightAdjustMode="Auto">
<TabComponent
  heightAdjustMode="Auto"
  selectedItem={rootTabIndex}
  selected={(e: TabSelectedEvent) => setRootTabIndex(e.selectedIndex ?? 0)}
>
  <TabItemsDirective>
    <TabItemDirective header={{ text: '📅 Akademischer Kalender' }} content={AcademicCalendarTabs} />
    {isAdmin && (

4
dashboard/src/vite-env.d.ts
vendored
@@ -1 +1,5 @@
/// <reference types="vite/client" />

declare const __BUILD_DATE__: string;
declare const __NODE_VERSION__: string;
declare const __BUILD_ENV__: string;

@@ -6,6 +6,11 @@ import react from '@vitejs/plugin-react';
export default defineConfig({
  cacheDir: './.vite',
  plugins: [react()],
  define: {
    __BUILD_DATE__: JSON.stringify(new Date().toISOString()),
    __NODE_VERSION__: JSON.stringify(process.version),
    __BUILD_ENV__: JSON.stringify(process.env.NODE_ENV ?? 'development'),
  },
  resolve: {
    // 🔧 FIXED: remove the problematic aliases entirely
    // They cause the "not an absolute path" problem

@@ -45,15 +45,37 @@ services:
    image: eclipse-mosquitto:2.0.21
    container_name: infoscreen-mqtt
    restart: unless-stopped
    command: >
      sh -c 'set -eu;
      : "$${MQTT_USER:?MQTT_USER not set}";
      : "$${MQTT_PASSWORD:?MQTT_PASSWORD not set}";
      touch /mosquitto/config/passwd;
      chmod 600 /mosquitto/config/passwd;
      mosquitto_passwd -b /mosquitto/config/passwd "$${MQTT_USER}" "$${MQTT_PASSWORD}";
      if [ -n "$${MQTT_CANARY_USER:-}" ] && [ -n "$${MQTT_CANARY_PASSWORD:-}" ]; then
      mosquitto_passwd -b /mosquitto/config/passwd "$${MQTT_CANARY_USER}" "$${MQTT_CANARY_PASSWORD}";
      fi;
      exec mosquitto -c /mosquitto/config/mosquitto.conf'
    volumes:
      - ./mosquitto/config/mosquitto.conf:/mosquitto/config/mosquitto.conf:ro
      - ./mosquitto/config:/mosquitto/config
      - ./mosquitto/data:/mosquitto/data
      - ./mosquitto/log:/mosquitto/log
    ports:
      - "1883:1883"
      - "9001:9001"
    environment:
      - MQTT_USER=${MQTT_USER}
      - MQTT_PASSWORD=${MQTT_PASSWORD}
      - MQTT_CANARY_USER=${MQTT_CANARY_USER:-}
      - MQTT_CANARY_PASSWORD=${MQTT_CANARY_PASSWORD:-}
    networks:
      - infoscreen-net
    healthcheck:
      test: ["CMD-SHELL", "mosquitto_pub -h localhost -t test -m 'health' || exit 1"]
      test:
        [
          "CMD-SHELL",
          "mosquitto_pub -h localhost -u $$MQTT_USER -P $$MQTT_PASSWORD -t test -m 'health' || exit 1",
        ]
      interval: 30s
      timeout: 5s
      retries: 3

@@ -94,6 +116,7 @@ services:
    command: >
      bash -c "alembic -c /app/server/alembic.ini upgrade head &&
      python /app/server/init_defaults.py &&
      python /app/server/init_academic_periods.py &&
      exec gunicorn server.wsgi:app --bind 0.0.0.0:8000"

  dashboard:
@@ -124,6 +147,11 @@ services:
      DB_PASSWORD: ${DB_PASSWORD}
      DB_NAME: ${DB_NAME}
      DB_ROOT_PASSWORD: ${DB_ROOT_PASSWORD}
      API_BASE_URL: http://server:8000
      MQTT_BROKER_HOST: ${MQTT_BROKER_HOST:-mqtt}
      MQTT_BROKER_PORT: ${MQTT_BROKER_PORT:-1883}
      MQTT_USER: ${MQTT_USER}
      MQTT_PASSWORD: ${MQTT_PASSWORD}
    networks:
      - infoscreen-net

@@ -140,7 +168,18 @@ services:
    environment:
      # ADDED: database connection string
      DB_CONN: "mysql+pymysql://${DB_USER}:${DB_PASSWORD}@db/${DB_NAME}"
      MQTT_PORT: 1883
      MQTT_BROKER_HOST: ${MQTT_BROKER_HOST:-mqtt}
      MQTT_BROKER_PORT: ${MQTT_BROKER_PORT:-1883}
      MQTT_USER: ${MQTT_USER}
      MQTT_PASSWORD: ${MQTT_PASSWORD}
      POLL_INTERVAL_SECONDS: ${POLL_INTERVAL_SECONDS:-30}
      POWER_INTENT_PUBLISH_ENABLED: ${POWER_INTENT_PUBLISH_ENABLED:-false}
      POWER_INTENT_HEARTBEAT_ENABLED: ${POWER_INTENT_HEARTBEAT_ENABLED:-true}
      POWER_INTENT_EXPIRY_MULTIPLIER: ${POWER_INTENT_EXPIRY_MULTIPLIER:-3}
      POWER_INTENT_MIN_EXPIRY_SECONDS: ${POWER_INTENT_MIN_EXPIRY_SECONDS:-90}
      CRASH_RECOVERY_ENABLED: ${CRASH_RECOVERY_ENABLED:-false}
      CRASH_RECOVERY_GRACE_SECONDS: ${CRASH_RECOVERY_GRACE_SECONDS:-180}
      CRASH_RECOVERY_LOCKOUT_MINUTES: ${CRASH_RECOVERY_LOCKOUT_MINUTES:-15}
    networks:
      - infoscreen-net

@@ -19,6 +19,10 @@ services:
      - DB_CONN=mysql+pymysql://${DB_USER}:${DB_PASSWORD}@db/${DB_NAME}
      - DB_URL=mysql+pymysql://${DB_USER}:${DB_PASSWORD}@db/${DB_NAME}
      - API_BASE_URL=http://server:8000
      - MQTT_BROKER_HOST=${MQTT_BROKER_HOST:-mqtt}
      - MQTT_BROKER_PORT=${MQTT_BROKER_PORT:-1883}
      - MQTT_USER=${MQTT_USER}
      - MQTT_PASSWORD=${MQTT_PASSWORD}
      - ENV=${ENV:-development}
      - FLASK_SECRET_KEY=${FLASK_SECRET_KEY:-dev-secret-key-change-in-production}
      - DEFAULT_SUPERADMIN_USERNAME=${DEFAULT_SUPERADMIN_USERNAME:-superadmin}
@@ -70,6 +74,17 @@ services:
    image: eclipse-mosquitto:2.0.21 # ✅ GOOD: version is already pinned
    container_name: infoscreen-mqtt
    restart: unless-stopped
    command: >
      sh -c 'set -eu;
      : "$${MQTT_USER:?MQTT_USER not set}";
      : "$${MQTT_PASSWORD:?MQTT_PASSWORD not set}";
      touch /mosquitto/config/passwd;
      chmod 600 /mosquitto/config/passwd;
      mosquitto_passwd -b /mosquitto/config/passwd "$${MQTT_USER}" "$${MQTT_PASSWORD}";
      if [ -n "$${MQTT_CANARY_USER:-}" ] && [ -n "$${MQTT_CANARY_PASSWORD:-}" ]; then
      mosquitto_passwd -b /mosquitto/config/passwd "$${MQTT_CANARY_USER}" "$${MQTT_CANARY_PASSWORD}";
      fi;
      exec mosquitto -c /mosquitto/config/mosquitto.conf'
    volumes:
      - ./mosquitto/config:/mosquitto/config
      - ./mosquitto/data:/mosquitto/data
@@ -77,13 +92,18 @@ services:
    ports:
      - "1883:1883" # Standard MQTT
      - "9001:9001" # WebSocket (if needed)
    environment:
      - MQTT_USER=${MQTT_USER}
      - MQTT_PASSWORD=${MQTT_PASSWORD}
      - MQTT_CANARY_USER=${MQTT_CANARY_USER:-}
      - MQTT_CANARY_PASSWORD=${MQTT_CANARY_PASSWORD:-}
    networks:
      - infoscreen-net
    healthcheck:
      test:
        [
          "CMD-SHELL",
          "mosquitto_pub -h localhost -t test -m 'health' || exit 1",
          "mosquitto_pub -h localhost -u $$MQTT_USER -P $$MQTT_PASSWORD -t test -m 'health' || exit 1",
        ]
      interval: 30s
      timeout: 5s
@@ -169,13 +189,18 @@ services:
    environment:
      # ADDED: database connection string
      - DB_CONN=mysql+pymysql://${DB_USER}:${DB_PASSWORD}@db/${DB_NAME}
      - MQTT_BROKER_URL=mqtt
      - MQTT_PORT=1883
      - MQTT_BROKER_HOST=${MQTT_BROKER_HOST:-mqtt}
      - MQTT_BROKER_PORT=${MQTT_BROKER_PORT:-1883}
      - MQTT_USER=${MQTT_USER}
      - MQTT_PASSWORD=${MQTT_PASSWORD}
      - POLL_INTERVAL_SECONDS=${POLL_INTERVAL_SECONDS:-30}
      - POWER_INTENT_PUBLISH_ENABLED=${POWER_INTENT_PUBLISH_ENABLED:-false}
      - POWER_INTENT_HEARTBEAT_ENABLED=${POWER_INTENT_HEARTBEAT_ENABLED:-true}
      - POWER_INTENT_EXPIRY_MULTIPLIER=${POWER_INTENT_EXPIRY_MULTIPLIER:-3}
      - POWER_INTENT_MIN_EXPIRY_SECONDS=${POWER_INTENT_MIN_EXPIRY_SECONDS:-90}
      - CRASH_RECOVERY_ENABLED=${CRASH_RECOVERY_ENABLED:-false}
      - CRASH_RECOVERY_GRACE_SECONDS=${CRASH_RECOVERY_GRACE_SECONDS:-180}
      - CRASH_RECOVERY_LOCKOUT_MINUTES=${CRASH_RECOVERY_LOCKOUT_MINUTES:-15}
    networks:
      - infoscreen-net

@@ -16,7 +16,7 @@ Manual verification checklist for Phase-1 server-side group-level power-intent p
Instructions:
1. Subscribe to `infoscreen/groups/1/power/intent` (canary group, QoS 1)
2. Verify received payload contains:
   - `schema_version: "v1"`
   - `schema_version: "1.0"`
   - `group_id: 1`
   - `desired_state: "on"` or `"off"` (string)
   - `reason: "active_event"` or `"no_active_event"` (string)
@@ -24,6 +24,9 @@ Instructions:
   - `issued_at: "2026-03-31T14:22:15Z"` (ISO 8601 with Z suffix)
   - `expires_at: "2026-03-31T14:24:00Z"` (ISO 8601 with Z suffix, always > issued_at)
   - `poll_interval_sec: 30` (integer, matches scheduler poll interval)
   - `active_event_ids: [...]` (array; empty when off)
   - `event_window_start: "...Z"` or `null`
   - `event_window_end: "...Z"` or `null`

**Pass criteria**: All fields present, correct types and formats, no extra/malformed fields.

@@ -15,7 +15,7 @@ Prevent unintended TV power-off during adjacent events while enabling coordinate

## Server PR-1 Pointer
- For the strict, agreed server-first implementation path, use:
  - `TV_POWER_SERVER_PR1_IMPLEMENTATION_CHECKLIST.md`
  - `TV_POWER_PHASE_1_IMPLEMENTATION_CHECKLIST.md`
- Treat that checklist as the execution source of truth for Phase 1.

---
@@ -172,8 +172,8 @@ All PR-1 server-side items are complete. Below is a summary of deliverables:
- **MQTT_EVENT_PAYLOAD_GUIDE.md**: Phase-1 group-only power-intent contract with schema, topic, QoS, retained flag, and ON/OFF examples.
- **README.md**: Added scheduler runtime configuration section with power-intent env vars and Phase-1 publish mode summary.
- **AI-INSTRUCTIONS-MAINTENANCE.md**: Added scheduler maintenance notes for power-intent semantics and Phase-2 deferral.
- **TV_POWER_CANARY_VALIDATION_CHECKLIST.md**: 10-scenario manual validation matrix for operators.
- **TV_POWER_SERVER_PR1_IMPLEMENTATION_CHECKLIST.md**: This file; source of truth for PR-1 scope and acceptance criteria.
- **TV_POWER_PHASE_1_CANARY_VALIDATION.md**: 10-scenario manual validation matrix for operators.
- **TV_POWER_PHASE_1_IMPLEMENTATION_CHECKLIST.md**: This file; source of truth for PR-1 scope and acceptance criteria.

### Validation Artifacts
- **test_power_intent_canary.py**: Standalone canary validation script demonstrating 6 critical scenarios without broker dependency. All scenarios pass.
@@ -195,5 +195,5 @@ All PR-1 server-side items are complete. Below is a summary of deliverables:

### Next Phase
- Phase 2 (deferred): Per-client override intent, client state acknowledgments, listener persistence of state
- Canary rollout strategy documented in `TV_POWER_CANARY_VALIDATION_CHECKLIST.md`
- Canary rollout strategy documented in `TV_POWER_PHASE_1_CANARY_VALIDATION.md`

56
docs/archive/TV_POWER_PHASE_1_SERVER_HANDOFF.md
Normal file
@@ -0,0 +1,56 @@
# Server Handoff: TV Power Coordination

## Status
Server PR-1 is implemented and merged (Phase 1).

## Source of Truth
- Contract: TV_POWER_INTENT_SERVER_CONTRACT_V1.md
- Implementation: scheduler/scheduler.py and scheduler/db_utils.py
- Validation checklist: TV_POWER_PHASE_1_CANARY_VALIDATION.md

## Active Phase 1 Scope
- Topic: infoscreen/groups/{group_id}/power/intent
- QoS: 1
- Retained: true
- Scope: group-level only
- Per-client intent/state topics: deferred to Phase 2

## Publish Semantics (Implemented)
- Semantic transition (`desired_state` or `reason` changed): new `intent_id` and immediate publish
- Heartbeat (no semantic change): same `intent_id`, refreshed `issued_at` and `expires_at`
- Scheduler startup: immediate publish before first poll wait
- MQTT reconnect: immediate retained republish of cached intents

## Payload Contract (Phase 1)
```json
{
  "schema_version": "1.0",
  "intent_id": "uuid4",
  "group_id": 12,
  "desired_state": "on",
  "reason": "active_event",
  "issued_at": "2026-04-01T06:00:03.496Z",
  "expires_at": "2026-04-01T06:01:33.496Z",
  "poll_interval_sec": 15,
  "active_event_ids": [148],
  "event_window_start": "2026-04-01T06:00:00Z",
  "event_window_end": "2026-04-01T07:00:00Z"
}
```

Expiry rule:
- expires_at = issued_at + max(3 x poll_interval_sec, 90 seconds)
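The expiry rule above can be sketched in a few lines of Python. This is an illustrative helper, not the scheduler's actual code; the function name `compute_expiry` is ours. Note that with the sample payload's 15-second poll interval, the 90-second floor dominates, which matches the 06:00:03 → 06:01:33 timestamps in the example.

```python
from datetime import datetime, timedelta, timezone

def compute_expiry(issued_at: datetime, poll_interval_sec: int,
                   multiplier: int = 3, min_expiry_sec: int = 90) -> datetime:
    """Handoff expiry rule: issued_at + max(multiplier * poll_interval, floor)."""
    ttl_sec = max(multiplier * poll_interval_sec, min_expiry_sec)
    return issued_at + timedelta(seconds=ttl_sec)

issued = datetime(2026, 4, 1, 6, 0, 3, tzinfo=timezone.utc)
print(compute_expiry(issued, 15))  # 3*15=45 < 90, so the 90 s floor applies
print(compute_expiry(issued, 60))  # 3*60=180 > 90, so the window is 180 s
```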

## Operational Notes
- Adjacent/overlapping events are merged into one active coverage window; no OFF blip at boundaries.
- Feature flag defaults are safe for rollout:
  - POWER_INTENT_PUBLISH_ENABLED=false
  - POWER_INTENT_HEARTBEAT_ENABLED=true
  - POWER_INTENT_EXPIRY_MULTIPLIER=3
  - POWER_INTENT_MIN_EXPIRY_SECONDS=90
- Keep this handoff concise and defer full details to the stable contract document.

## Phase 2 (Deferred)
- Per-client override topic: infoscreen/{client_uuid}/power/intent
- Client power state topic and acknowledgments
- Listener persistence of client-level power state
149
implementation-plans/reboot-command-payload-schemas.json
Normal file
@@ -0,0 +1,149 @@
{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "$id": "https://infoscreen.local/schemas/reboot-command-payload-schemas.json",
  "title": "Infoscreen Reboot Command Payload Schemas",
  "description": "Frozen v1 schemas for per-client command and command acknowledgement payloads.",
  "$defs": {
    "commandPayloadV1": {
      "type": "object",
      "additionalProperties": false,
      "required": [
        "schema_version",
        "command_id",
        "client_uuid",
        "action",
        "issued_at",
        "expires_at",
        "requested_by",
        "reason"
      ],
      "properties": {
        "schema_version": {
          "type": "string",
          "const": "1.0"
        },
        "command_id": {
          "type": "string",
          "format": "uuid"
        },
        "client_uuid": {
          "type": "string",
          "format": "uuid"
        },
        "action": {
          "type": "string",
          "enum": [
            "reboot_host",
            "shutdown_host"
          ]
        },
        "issued_at": {
          "type": "string",
          "format": "date-time"
        },
        "expires_at": {
          "type": "string",
          "format": "date-time"
        },
        "requested_by": {
          "type": [
            "integer",
            "null"
          ],
          "minimum": 1
        },
        "reason": {
          "type": [
            "string",
            "null"
          ],
          "maxLength": 2000
        }
      }
    },
    "commandAckPayloadV1": {
      "type": "object",
      "additionalProperties": false,
      "required": [
        "command_id",
        "status",
        "error_code",
        "error_message"
      ],
      "properties": {
        "command_id": {
          "type": "string",
          "format": "uuid"
        },
        "status": {
          "type": "string",
          "enum": [
            "accepted",
            "execution_started",
            "completed",
            "failed"
          ]
        },
        "error_code": {
          "type": [
            "string",
            "null"
          ],
          "maxLength": 128
        },
        "error_message": {
          "type": [
            "string",
            "null"
          ],
          "maxLength": 4000
        }
      },
      "allOf": [
        {
          "if": {
            "properties": {
              "status": {
                "const": "failed"
              }
            }
          },
          "then": {
            "properties": {
              "error_code": {
                "type": "string",
                "minLength": 1
              },
              "error_message": {
                "type": "string",
                "minLength": 1
              }
            }
          }
        }
      ]
    }
  },
  "examples": [
    {
      "commandPayloadV1": {
        "schema_version": "1.0",
        "command_id": "5d1f8b4b-7e85-44fb-8f38-3f5d5da5e2e4",
        "client_uuid": "9b8d1856-ff34-4864-a726-12de072d0f77",
        "action": "reboot_host",
        "issued_at": "2026-04-03T12:48:10Z",
        "expires_at": "2026-04-03T12:52:10Z",
        "requested_by": 1,
        "reason": "operator_request"
      }
    },
    {
      "commandAckPayloadV1": {
        "command_id": "5d1f8b4b-7e85-44fb-8f38-3f5d5da5e2e4",
        "status": "execution_started",
        "error_code": null,
        "error_message": null
      }
    }
  ]
}
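The schema's `allOf`/`if`/`then` clause encodes one non-obvious rule: a `failed` ack must carry non-empty `error_code` and `error_message`, while other statuses allow `null`. A minimal pure-Python stand-in (not a full JSON Schema validator; `validate_ack` and its error strings are our own naming) illustrates that conditional:

```python
ACK_STATUSES = {"accepted", "execution_started", "completed", "failed"}

def validate_ack(payload):
    """Return a list of problems; an empty list means the ack is valid."""
    errors = []
    for field in ("command_id", "status", "error_code", "error_message"):
        if field not in payload:
            errors.append(f"missing_field:{field}")
    if payload.get("status") not in ACK_STATUSES:
        errors.append("invalid_status")
    if payload.get("status") == "failed":
        # Mirror of the schema's if/then clause: failed acks need
        # non-empty string error fields.
        for field in ("error_code", "error_message"):
            if not isinstance(payload.get(field), str) or not payload[field]:
                errors.append(f"empty_on_failure:{field}")
    return errors

ok = {"command_id": "5d1f8b4b", "status": "completed",
      "error_code": None, "error_message": None}
bad = {"command_id": "5d1f8b4b", "status": "failed",
       "error_code": None, "error_message": None}
print(validate_ack(ok))    # []
print(validate_ack(bad))   # both error fields flagged as empty_on_failure
```

For real enforcement, feeding the JSON file above to a Draft 2020-12 validator is preferable; this snippet only shows the shape of the rule.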
59
implementation-plans/reboot-command-payload-schemas.md
Normal file
@@ -0,0 +1,59 @@
## Reboot Command Payload Schema Snippets

This file provides copy-ready validation snippets for client and integration test helpers.

### Canonical Topics (v1)
1. Command topic: infoscreen/{client_uuid}/commands
2. Ack topic: infoscreen/{client_uuid}/commands/ack

### Transitional Compatibility Topics
1. Command topic alias: infoscreen/{client_uuid}/command
2. Ack topic alias: infoscreen/{client_uuid}/command/ack

### Canonical Action Values
1. reboot_host
2. shutdown_host

### Ack Status Values
1. accepted
2. execution_started
3. completed
4. failed

### JSON Schema Source
Use this file for machine validation:
1. implementation-plans/reboot-command-payload-schemas.json

### Minimal Command Schema Snippet
```json
{
  "type": "object",
  "additionalProperties": false,
  "required": ["schema_version", "command_id", "client_uuid", "action", "issued_at", "expires_at", "requested_by", "reason"],
  "properties": {
    "schema_version": { "const": "1.0" },
    "command_id": { "type": "string", "format": "uuid" },
    "client_uuid": { "type": "string", "format": "uuid" },
    "action": { "enum": ["reboot_host", "shutdown_host"] },
    "issued_at": { "type": "string", "format": "date-time" },
    "expires_at": { "type": "string", "format": "date-time" },
    "requested_by": { "type": ["integer", "null"] },
    "reason": { "type": ["string", "null"] }
  }
}
```

### Minimal Ack Schema Snippet
```json
{
  "type": "object",
  "additionalProperties": false,
  "required": ["command_id", "status", "error_code", "error_message"],
  "properties": {
    "command_id": { "type": "string", "format": "uuid" },
    "status": { "enum": ["accepted", "execution_started", "completed", "failed"] },
    "error_code": { "type": ["string", "null"] },
    "error_message": { "type": ["string", "null"] }
  }
}
```
@@ -0,0 +1,146 @@
## Client Team Implementation Spec (Raspberry Pi 5)

### Mission
Implement client-side command handling for reliable restart and shutdown with strict validation, idempotency, acknowledgements, and reboot recovery continuity.

### Ownership Boundaries
1. Client team owns command intake, execution, acknowledgement emission, and post-reboot continuity.
2. Platform team owns command issuance, lifecycle aggregation, and server-side escalation logic.
3. Client implementation must not assume managed PoE availability.

### Required Client Behaviors

### Frozen MQTT Topics and Schemas (v1)
1. Canonical command topic: infoscreen/{client_uuid}/commands.
2. Canonical ack topic: infoscreen/{client_uuid}/commands/ack.
3. Transitional compatibility topics during migration:
   - infoscreen/{client_uuid}/command
   - infoscreen/{client_uuid}/command/ack
4. QoS policy: command QoS 1, ack QoS 1 recommended.
5. Retain policy: commands and acks are non-retained.

Frozen command payload schema:

```json
{
  "schema_version": "1.0",
  "command_id": "5d1f8b4b-7e85-44fb-8f38-3f5d5da5e2e4",
  "client_uuid": "9b8d1856-ff34-4864-a726-12de072d0f77",
  "action": "reboot_host",
  "issued_at": "2026-04-03T12:48:10Z",
  "expires_at": "2026-04-03T12:52:10Z",
  "requested_by": 1,
  "reason": "operator_request"
}
```

Frozen ack payload schema:

```json
{
  "command_id": "5d1f8b4b-7e85-44fb-8f38-3f5d5da5e2e4",
  "status": "execution_started",
  "error_code": null,
  "error_message": null
}
```

Allowed ack status values:
1. accepted
2. execution_started
3. completed
4. failed

Frozen command action values for v1:
1. reboot_host
2. shutdown_host

Reserved but not emitted by server in v1:
1. restart_service

Validation snippets for helper scripts:
1. Human-readable snippets: implementation-plans/reboot-command-payload-schemas.md
2. Machine-validated JSON Schema: implementation-plans/reboot-command-payload-schemas.json

### 1. Command Intake
1. Subscribe to the canonical command topic with QoS 1.
2. Parse required fields: schema_version, command_id, action, issued_at, expires_at, reason, requested_by, target metadata.
3. Reject invalid payloads with failed acknowledgement including error_code and diagnostic message.
4. Reject stale commands when current time exceeds expires_at.
5. Ignore already-processed command_id values.

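The intake steps above can be sketched as one classification function. This is an illustrative sketch, not the client's actual code: the function name `intake` and the verdict tuple are ours, while the error codes come from the minimum set in section 8 (an invalid action is mapped to invalid_schema here as an assumption).

```python
import json
from datetime import datetime, timezone

REQUIRED = ("schema_version", "command_id", "client_uuid", "action",
            "issued_at", "expires_at", "requested_by", "reason")
ALLOWED_ACTIONS = {"reboot_host", "shutdown_host"}

def intake(raw, seen_ids, now=None):
    """Classify one incoming command; returns (verdict, error_code)."""
    now = now or datetime.now(timezone.utc)
    try:
        cmd = json.loads(raw)
    except ValueError:
        return "failed", "invalid_schema"
    if any(field not in cmd for field in REQUIRED):
        return "failed", "missing_field"
    if cmd["action"] not in ALLOWED_ACTIONS:
        return "failed", "invalid_schema"
    # Stale-command check uses UTC wall clock against expires_at.
    expires = datetime.fromisoformat(cmd["expires_at"].replace("Z", "+00:00"))
    if now > expires:
        return "failed", "stale_command"
    # Dedupe: already-processed command_id values are ignored, not re-executed.
    if cmd["command_id"] in seen_ids:
        return "ignored", "duplicate_command"
    seen_ids.add(cmd["command_id"])
    return "accepted", None

seen = set()
cmd = json.dumps({
    "schema_version": "1.0", "command_id": "5d1f8b4b", "client_uuid": "9b8d1856",
    "action": "reboot_host", "issued_at": "2026-04-03T12:48:10Z",
    "expires_at": "2026-04-03T12:52:10Z", "requested_by": 1,
    "reason": "operator_request"})
now = datetime(2026, 4, 3, 12, 49, tzinfo=timezone.utc)
print(intake(cmd, seen, now))  # accepted on first delivery
print(intake(cmd, seen, now))  # duplicate on redelivery
```

In a real client the `seen_ids` set would be backed by the persistent dedupe store described in section 2.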
### 2. Idempotency And Persistence
1. Persist processed command_id and execution result on local storage.
2. Persistence must survive service restart and full OS reboot.
3. On restart, reload dedupe cache before processing newly delivered commands.

### 3. Acknowledgement Contract Behavior
1. Emit accepted immediately after successful validation and dedupe pass.
2. Emit execution_started immediately before invoking the command action.
3. Emit completed only when local success condition is confirmed.
4. Emit failed with structured error_code on validation or execution failure.
5. If MQTT is temporarily unavailable, retry ack publish with bounded backoff until command expiry.

### 4. Execution Security Model
1. Execute via systemd-managed privileged helper.
2. Allow only whitelisted operations:
   - reboot_host
   - shutdown_host
3. Optionally keep restart_service handler as reserved path, but do not require it for v1 conformance.
4. Disallow arbitrary shell commands and untrusted arguments.
5. Enforce per-command execution timeout and terminate hung child processes.

### 5. Reboot Recovery Continuity
1. For reboot_host action:
   - send execution_started
   - trigger reboot promptly
2. During startup:
   - emit heartbeat early
   - emit process-health once service is ready
3. Keep last command execution state available after reboot for reconciliation.

### 6. Time And Timeout Semantics
1. Use monotonic timers for local elapsed-time checks.
2. Use UTC wall-clock only for protocol timestamps and expiry comparisons.
3. Target reconnect baseline on Pi 5 USB-SATA SSD: 90 seconds.
4. Accept cold-boot and USB enumeration ceiling up to 150 seconds.

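The monotonic-vs-wall-clock split in section 6 can be illustrated briefly. This is a sketch under our own naming (`run_with_timeout` is hypothetical): local deadlines come from `time.monotonic()` so NTP steps cannot shorten or extend them, while protocol timestamps such as `issued_at` are formatted from UTC wall-clock time.

```python
import time
from datetime import datetime, timezone

def run_with_timeout(step, timeout_sec):
    """Poll `step` until it returns True or a monotonic deadline passes."""
    deadline = time.monotonic() + timeout_sec  # immune to wall-clock jumps
    while time.monotonic() < deadline:
        if step():
            return True
        time.sleep(0.01)
    return False

# UTC wall-clock is reserved for protocol timestamps (issued_at/expires_at):
issued_at = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
print(run_with_timeout(lambda: True, 1.0))  # succeeds immediately
print(issued_at)                            # e.g. 2026-04-03T12:48:10Z
```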
### 7. Capability Reporting
1. Report recovery capability class:
   - software_only
   - managed_poe_available
   - manual_only
2. Report watchdog enabled status.
3. Report boot-source metadata for diagnostics.

### 8. Error Codes Minimum Set
1. invalid_schema
2. missing_field
3. stale_command
4. duplicate_command
5. permission_denied_local
6. execution_timeout
7. execution_failed
8. broker_unavailable
9. internal_error

### Acceptance Tests (Client Team)
1. Invalid schema payload is rejected and failed ack emitted.
2. Expired command is rejected and not executed.
3. Duplicate command_id is not executed twice.
4. reboot_host emits execution_started and reconnects with heartbeat in expected window.
5. restart_service action completes without host reboot and emits completed.
6. MQTT outage during ack path retries correctly without duplicate execution.
7. Boot-loop protection cooperates with server-side lockout semantics.

### Delivery Artifacts
1. Client protocol conformance checklist.
2. Test evidence for all acceptance tests.
3. Runtime logs showing full lifecycle for one restart and one reboot scenario.
4. Known limitations list per image version.

### Definition Of Done
1. All acceptance tests pass on Pi 5 USB-SATA SSD test devices.
2. No duplicate execution observed under reconnect and retained-delivery edge cases.
3. Acknowledgement sequence is complete and machine-parseable for server correlation.
4. Reboot recovery continuity works without managed PoE dependencies.
214
implementation-plans/reboot-implementation-handoff-share.md
Normal file
@@ -0,0 +1,214 @@
## Remote Reboot Reliability Handoff (Share Document)

### Purpose

This document defines the agreed implementation scope for reliable remote reboot and shutdown of Raspberry Pi 5 clients, with monitoring-first visibility and safe escalation paths.

### Scope

1. In scope: restart and shutdown command reliability.
2. In scope: full lifecycle monitoring and audit visibility.
3. In scope: capability-tier recovery model with optional managed PoE escalation.
4. Out of scope: broader maintenance module in client-management for this cycle.
5. Out of scope: mandatory dependency on customer-managed power switching.

### Agreed Operating Model

1. Command delivery is asynchronous and lifecycle-tracked, not fire-and-forget.
2. Commands use idempotent command_id semantics with stale-command rejection by expires_at.
3. Monitoring is authoritative for operational state and escalation decisions.
4. Recovery must function even when no managed power switching is available.

### Frozen Contract v1 (Effective Immediately)

1. Canonical command topic: infoscreen/{client_uuid}/commands.
2. Canonical ack topic: infoscreen/{client_uuid}/commands/ack.
3. Transitional compatibility topics accepted during migration:
   - infoscreen/{client_uuid}/command
   - infoscreen/{client_uuid}/command/ack
4. QoS policy: command QoS 1, ack QoS 1 recommended.
5. Retain policy: commands and acks are non-retained.

Command payload schema (frozen):

```json
{
  "schema_version": "1.0",
  "command_id": "5d1f8b4b-7e85-44fb-8f38-3f5d5da5e2e4",
  "client_uuid": "9b8d1856-ff34-4864-a726-12de072d0f77",
  "action": "reboot_host",
  "issued_at": "2026-04-03T12:48:10Z",
  "expires_at": "2026-04-03T12:52:10Z",
  "requested_by": 1,
  "reason": "operator_request"
}
```

Ack payload schema (frozen):

```json
{
  "command_id": "5d1f8b4b-7e85-44fb-8f38-3f5d5da5e2e4",
  "status": "execution_started",
  "error_code": null,
  "error_message": null
}
```

Allowed ack status values:

1. accepted
2. execution_started
3. completed
4. failed

Frozen command action values:

1. reboot_host
2. shutdown_host

API endpoint mapping:

1. POST /api/clients/{uuid}/restart -> action reboot_host
2. POST /api/clients/{uuid}/shutdown -> action shutdown_host

Validation snippets:

1. Human-readable snippets: implementation-plans/reboot-command-payload-schemas.md
2. Machine-validated JSON Schema: implementation-plans/reboot-command-payload-schemas.json
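As a hedged illustration of the stale-command rejection and idempotent command_id rules above, a receiver might validate a frozen-contract v1 payload as follows. The function name and the in-memory `seen_ids` set are assumptions for this sketch; a production client would persist its dedupe state.

```python
import json
from datetime import datetime, timezone

# Frozen action values from the contract.
ALLOWED_ACTIONS = {"reboot_host", "shutdown_host"}

def validate_command(raw: bytes, seen_ids: set) -> dict:
    """Validate a frozen-contract v1 command payload; raise ValueError on rejection."""
    payload = json.loads(raw.decode())
    if payload.get("schema_version") != "1.0":
        raise ValueError("unsupported schema_version")
    if payload.get("action") not in ALLOWED_ACTIONS:
        raise ValueError("unknown action")
    command_id = payload["command_id"]
    if command_id in seen_ids:
        # Idempotency guard against redelivered or duplicate commands.
        raise ValueError("duplicate command_id")
    expires_at = datetime.fromisoformat(payload["expires_at"].replace("Z", "+00:00"))
    if datetime.now(timezone.utc) >= expires_at:
        raise ValueError("stale command rejected by expires_at")
    seen_ids.add(command_id)
    return payload
```

A rejected command never reaches the execution helper; only the dedupe set records accepted command IDs.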
### Command Lifecycle States

1. queued
2. publish_in_progress
3. published
4. ack_received
5. execution_started
6. awaiting_reconnect
7. recovered
8. completed
9. failed
10. expired
11. timed_out
12. canceled
13. blocked_safety
14. manual_intervention_required

### Timeout Defaults (Pi 5, USB-SATA SSD baseline)

1. queued to publish_in_progress: immediate, timeout 5 seconds.
2. publish_in_progress to published: timeout 8 seconds.
3. published to ack_received: timeout 20 seconds.
4. ack_received to execution_started: 15 seconds for service restart, 25 seconds for host reboot.
5. execution_started to awaiting_reconnect: timeout 10 seconds.
6. awaiting_reconnect to recovered: baseline 90 seconds after validation, cold-boot ceiling 150 seconds.
7. recovered to completed: 15 to 20 seconds based on fleet stability.
8. Command expires_at default: 240 seconds, bounded 180 to 360 seconds.
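The per-transition timeouts above can be encoded as a lookup table on the server side. This is an illustrative sketch; only the numbers come from the baseline, while the dict layout and the `deadline` helper are assumptions:

```python
from datetime import datetime, timedelta

# Per-transition timeout defaults in seconds (Pi 5 USB-SATA SSD baseline).
TRANSITION_TIMEOUTS = {
    ("queued", "publish_in_progress"): 5,
    ("publish_in_progress", "published"): 8,
    ("published", "ack_received"): 20,
    ("ack_received", "execution_started"): 25,   # host reboot; 15 for service restart
    ("execution_started", "awaiting_reconnect"): 10,
    ("awaiting_reconnect", "recovered"): 150,    # cold-boot ceiling
    ("recovered", "completed"): 20,
}

def deadline(state: str, next_state: str, entered_at: datetime) -> datetime:
    """Return the instant at which the pending transition is considered timed_out."""
    return entered_at + timedelta(seconds=TRANSITION_TIMEOUTS[(state, next_state)])
```

A scheduler sweep can then compare each in-flight command's deadline against the current time and move overdue commands to `timed_out`.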
### Recovery Tiers

1. Tier 0 baseline, always required: watchdog, systemd auto-restart, lifecycle tracking, manual intervention fallback.
2. Tier 1 optional: managed PoE per-port power-cycle escalation where customer infrastructure supports it.
3. Tier 2 no remote power control: direct manual intervention workflow.

### Governance And Safety

1. Role access: admin and superadmin.
2. Bulk actions require reason capture.
3. Safety lockout: maximum 3 reboot commands per client in 15 minutes.
4. Escalation cooldown: 60 seconds before automatic move to manual_intervention_required.
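The safety lockout above (at most 3 reboot commands per client in a rolling 15-minute window) can be sketched with a per-client timestamp deque. The class and method names are illustrative assumptions, not part of the contract:

```python
import time
from collections import defaultdict, deque

class RebootLockout:
    """Reject reboot commands beyond max_commands per rolling window_seconds."""

    def __init__(self, max_commands=3, window_seconds=15 * 60):
        self.max_commands = max_commands
        self.window_seconds = window_seconds
        self._history = defaultdict(deque)  # client_uuid -> issue timestamps

    def allow(self, client_uuid: str, now=None) -> bool:
        now = time.time() if now is None else now
        hist = self._history[client_uuid]
        # Drop entries that have fallen outside the rolling window.
        while hist and now - hist[0] > self.window_seconds:
            hist.popleft()
        if len(hist) >= self.max_commands:
            return False  # would map to blocked_safety
        hist.append(now)
        return True
```

Blocked attempts are not recorded, so a client cannot extend its own lockout by retrying.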
### MQTT Auth Hardening (Phase 1, Required Before Broad Rollout)

1. Intranet-only deployment is not sufficient protection for privileged MQTT actions by itself.
2. Phase 1 hardening scope is broker authentication, authorization, and network restriction; payload URL allowlisting is deferred to a later client/server feature.
3. The MQTT broker must disable anonymous publish/subscribe access in production.
4. The MQTT broker must require authenticated identities for server-side publishers and client devices.
5. The MQTT broker must enforce ACLs so that:
   - only server-side services can publish to `infoscreen/{client_uuid}/commands`
   - only server-side services can publish scheduler event topics
   - each client can subscribe only to its own command topics and assigned event topics
   - each client can publish only its own ack, heartbeat, health, dashboard, and telemetry topics
6. Broker port exposure must be restricted to the management network and approved hosts only.
7. TLS support is strongly recommended in this phase and should be enabled when operationally feasible.

### Server Team Actions For Auth Hardening

1. Provision broker credentials for command/event publishers and for client devices.
2. Configure Mosquitto or equivalent broker ACLs for per-topic publish and subscribe restrictions.
3. Disable anonymous access on production brokers.
4. Restrict broker network exposure with firewall rules, VLAN policy, or equivalent network controls.
5. Update server/frontend deployment to publish MQTT with authenticated credentials.
6. Validate that server-side event publishing and reboot/shutdown command publishing still work under the new ACL policy.
7. Coordinate credential distribution and rotation with the client deployment process.

### MQTT ACL Matrix (Canonical Baseline)

| Actor | Topic Pattern | Publish | Subscribe | Notes |
| --- | --- | --- | --- | --- |
| scheduler-service | infoscreen/events/+ | Yes | No | Publishes retained active event list per group. |
| api-command-publisher | infoscreen/+/commands | Yes | No | Publishes canonical reboot/shutdown commands. |
| api-command-publisher | infoscreen/+/command | Yes | No | Transitional compatibility publish only. |
| api-group-assignment | infoscreen/+/group_id | Yes | No | Publishes retained client-to-group assignment. |
| listener-service | infoscreen/+/commands/ack | No | Yes | Consumes canonical client command acknowledgements. |
| listener-service | infoscreen/+/command/ack | No | Yes | Consumes transitional compatibility acknowledgements. |
| listener-service | infoscreen/+/heartbeat | No | Yes | Consumes heartbeat telemetry. |
| listener-service | infoscreen/+/health | No | Yes | Consumes health telemetry. |
| listener-service | infoscreen/+/dashboard | No | Yes | Consumes dashboard screenshot payloads. |
| listener-service | infoscreen/+/screenshot | No | Yes | Consumes screenshot payloads (if enabled). |
| listener-service | infoscreen/+/logs/error | No | Yes | Consumes client error logs. |
| listener-service | infoscreen/+/logs/warn | No | Yes | Consumes client warn logs. |
| listener-service | infoscreen/+/logs/info | No | Yes | Consumes client info logs. |
| listener-service | infoscreen/discovery | No | Yes | Consumes discovery announcements. |
| listener-service | infoscreen/+/discovery_ack | Yes | No | Publishes discovery acknowledgements. |
| client-<uuid> | infoscreen/<uuid>/commands | No | Yes | Canonical command intake for this client only. |
| client-<uuid> | infoscreen/<uuid>/command | No | Yes | Transitional compatibility intake for this client only. |
| client-<uuid> | infoscreen/events/<group_id> | No | Yes | Assigned group event feed only; dynamic per assignment. |
| client-<uuid> | infoscreen/<uuid>/commands/ack | Yes | No | Canonical command acknowledgements for this client only. |
| client-<uuid> | infoscreen/<uuid>/command/ack | Yes | No | Transitional compatibility acknowledgements for this client only. |
| client-<uuid> | infoscreen/<uuid>/heartbeat | Yes | No | Heartbeat telemetry. |
| client-<uuid> | infoscreen/<uuid>/health | Yes | No | Health telemetry. |
| client-<uuid> | infoscreen/<uuid>/dashboard | Yes | No | Dashboard status and screenshot payloads. |
| client-<uuid> | infoscreen/<uuid>/screenshot | Yes | No | Screenshot payloads (if enabled). |
| client-<uuid> | infoscreen/<uuid>/logs/error | Yes | No | Error log stream. |
| client-<uuid> | infoscreen/<uuid>/logs/warn | Yes | No | Warning log stream. |
| client-<uuid> | infoscreen/<uuid>/logs/info | Yes | No | Info log stream. |
| client-<uuid> | infoscreen/discovery | Yes | No | Discovery announcement. |
| client-<uuid> | infoscreen/<uuid>/discovery_ack | No | Yes | Discovery acknowledgment from listener. |

ACL implementation notes:

1. Use per-client identities; client ACLs must be scoped to the exact client UUID and must not allow wildcard access to other clients.
2. Event topic subscription (`infoscreen/events/<group_id>`) should be managed via broker-side ACL provisioning that updates when group assignment changes.
3. Transitional singular command topics are temporary and should be removed after migration cutover.
4. Deny by default: any topic not explicitly listed above should be blocked for each actor.
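The client rows of the ACL matrix above can be rendered mechanically into broker configuration. The sketch below emits a per-client block in Mosquitto `acl_file` syntax; the function name is an assumption, the topic set is abridged to the most common rows, and the output should be checked against your broker version's ACL dialect:

```python
def client_acl_block(client_uuid: str, group_id: int) -> str:
    """Render a Mosquitto acl_file block scoped to exactly one client UUID (sketch)."""
    user = f"infoscreen-client-{client_uuid[:8]}"
    # Topics this client may subscribe to (read).
    read = [
        f"infoscreen/{client_uuid}/commands",
        f"infoscreen/{client_uuid}/command",      # transitional alias
        f"infoscreen/events/{group_id}",          # assigned group feed only
        f"infoscreen/{client_uuid}/discovery_ack",
    ]
    # Topics this client may publish to (write).
    write = [
        f"infoscreen/{client_uuid}/commands/ack",
        f"infoscreen/{client_uuid}/command/ack",
        f"infoscreen/{client_uuid}/heartbeat",
        f"infoscreen/{client_uuid}/health",
        f"infoscreen/{client_uuid}/dashboard",
        "infoscreen/discovery",
    ]
    lines = [f"user {user}"]
    lines += [f"topic read {t}" for t in read]
    lines += [f"topic write {t}" for t in write]
    return "\n".join(lines)
```

Regenerating these blocks whenever a client's group assignment changes implements note 2 above, and the deny-by-default rule (note 4) covers everything the block omits.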
### Credential Management Guidance

1. Real MQTT passwords must not be stored in tracked documentation or committed templates.
2. Each client device should receive a unique broker username and password, stored only in its local `.env` file.
3. Server-side publisher credentials should be stored in the server team's secret-management path, not in source control.
4. Recommended naming convention for client broker users: `infoscreen-client-<client-uuid-prefix>`.
5. Client passwords should be random, at least 20 characters, and rotated through deployment tooling or broker administration procedures.
6. The server/infrastructure team owns broker-side user creation, ACL assignment, rotation, and revocation.
7. The client team owns loading credentials from local env files and validating connection behavior against the secured broker.
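Per item 5 above, a random password of at least 20 characters can be generated with the standard library; this is a sketch, not the mandated provisioning tooling:

```python
import secrets

def generate_client_password(n_bytes: int = 24) -> str:
    """Return a URL-safe random password; 24 random bytes encode to 32 characters."""
    return secrets.token_urlsafe(n_bytes)
```

Using `secrets` rather than `random` matters here: it draws from the OS CSPRNG, which is the appropriate source for broker credentials.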
### Client Team Actions For Auth Hardening

1. Add MQTT username/password support in the client connection setup.
2. Add client-side TLS configuration support from environment when certificates are provided.
3. Update local test helpers to support authenticated MQTT publishing and subscription.
4. Validate command and event intake against the authenticated broker configuration before canary rollout.

### Ready For Server/Frontend Team (Auth Phase)

1. Client implementation is ready to connect with MQTT auth from local `.env` (`MQTT_USERNAME`, `MQTT_PASSWORD`, optional TLS settings).
2. Client command/event intake and client ack/telemetry publishing run over the authenticated MQTT session.
3. The server/frontend team must now complete broker-side enforcement and publisher migration.

Server/frontend done criteria:

1. Anonymous broker access is disabled in production.
2. Server-side publishers use authenticated broker credentials.
3. ACLs are active and validated for command, event, and client telemetry topics.
4. At least one canary client proves the end-to-end flow under ACLs:
   - server publishes command/event with an authenticated publisher
   - client receives the payload
   - client sends ack/telemetry successfully
5. Revocation test passes: disabling one client credential blocks only that client without impacting others.

Operational note:

1. Client-side auth support is necessary but not sufficient by itself; broker ACL/auth enforcement is the security control that must be enabled by the server/infrastructure team.

### Rollout Plan

1. Contract freeze and sign-off.
2. Platform and client implementation against frozen schemas.
3. One-group canary.
4. Rollback if failed plus timed_out exceeds 5 percent.
5. Expand only after 7 days below the intervention threshold.

### Success Criteria

1. Deterministic command lifecycle visibility from enqueue to completion.
2. No duplicate execution under reconnect or delayed-delivery conditions.
3. Stable Pi 5 SSD reconnect behavior within the defined baseline.
4. Clear and actionable manual intervention states when automatic recovery is exhausted.
54 implementation-plans/reboot-kickoff-summary.md (new file)
@@ -0,0 +1,54 @@
## Reboot Reliability Kickoff Summary

### Objective

Ship a reliable, observable restart and shutdown workflow for Raspberry Pi 5 clients, with safe escalation and clear operator outcomes.

### What Is Included

1. Asynchronous command lifecycle with idempotent command_id handling.
2. Monitoring-first state visibility from queued to terminal outcomes.
3. Client acknowledgements for accepted, execution_started, completed, and failed.
4. Pi 5 USB-SATA SSD timeout baseline and tuning rules.
5. Capability-tier recovery with optional managed PoE escalation.

### What Is Not Included

1. Full maintenance module in client-management.
2. Required managed power-switch integration.
3. Production Wake-on-LAN rollout.

### Team Split

1. Platform team: API command lifecycle, safety controls, listener ack ingestion.
2. Web team: lifecycle-aware UX and command status display.
3. Client team: strict validation, dedupe, ack sequence, secure execution helper, reboot continuity.

### Ownership Matrix

| Team | Primary Plan File | Main Deliverables |
| --- | --- | --- |
| Platform team | implementation-plans/reboot-implementation-handoff-share.md | Command lifecycle backend, policy enforcement, listener ack mapping, safety lockout and escalation |
| Web team | implementation-plans/reboot-implementation-handoff-share.md | Lifecycle UI states, bulk safety UX, capability visibility, command status polling |
| Client team | implementation-plans/reboot-implementation-handoff-client-team.md | Command validation, dedupe persistence, ack sequence, secure execution helper, reboot continuity |
| Project coordination | implementation-plans/reboot-kickoff-summary.md | Phase sequencing, canary gates, rollback thresholds, cross-team sign-off tracking |

### Baseline Operational Defaults

1. Safety lockout: 3 reboot commands per client in a rolling 15 minutes.
2. Escalation cooldown: 60 seconds.
3. Reconnect target on Pi 5 SSD: 90 seconds baseline, 150 seconds cold-boot ceiling.
4. Rollback canary trigger: failed plus timed_out above 5 percent.

### Frozen Contract Snapshot

1. Canonical command topic: infoscreen/{client_uuid}/commands.
2. Canonical ack topic: infoscreen/{client_uuid}/commands/ack.
3. Transitional compatibility topics during migration:
   - infoscreen/{client_uuid}/command
   - infoscreen/{client_uuid}/command/ack
4. Command schema version: 1.0.
5. Allowed command actions: reboot_host, shutdown_host.
6. Allowed ack status values: accepted, execution_started, completed, failed.
7. Validation snippets:
   - implementation-plans/reboot-command-payload-schemas.md
   - implementation-plans/reboot-command-payload-schemas.json

### Immediate Next Steps

1. Continue implementation in parallel by team against the frozen contract.
2. Client team validates dedupe and expiry handling on canonical topics.
3. Platform team verifies ack-state transitions for accepted, execution_started, completed, failed.
4. Execute the one-group canary and validate timing plus failure drills.
127 implementation-plans/server-team-actions.md (new file)
@@ -0,0 +1,127 @@
# Server Team Action Items — Infoscreen Client

This document lists everything the server/infrastructure/frontend team must implement to complete the client integration. The client-side code is production-ready for all items listed here.

---

## 1. MQTT Broker Hardening (prerequisite for everything else)

- Disable anonymous access on the broker.
- Create one broker account **per client device**:
  - Username convention: `infoscreen-client-<uuid-prefix>` (e.g. `infoscreen-client-9b8d1856`)
  - Provision the password to the device `.env` as `MQTT_PASSWORD_BROKER=`
- Create a **server/publisher account** (e.g. `infoscreen-server`) for all server-side publishes.
- Enforce ACLs:

| Topic | Publisher |
|---|---|
| `infoscreen/{uuid}/commands` | server only |
| `infoscreen/{uuid}/command` (alias) | server only |
| `infoscreen/{uuid}/group_id` | server only |
| `infoscreen/events/{group_id}` | server only |
| `infoscreen/groups/+/power/intent` | server only |
| `infoscreen/{uuid}/commands/ack` | client only |
| `infoscreen/{uuid}/command/ack` | client only |
| `infoscreen/{uuid}/heartbeat` | client only |
| `infoscreen/{uuid}/health` | client only |
| `infoscreen/{uuid}/logs/#` | client only |
| `infoscreen/{uuid}/service_failed` | client only |

---

## 2. Reboot / Shutdown Command — Ack Lifecycle

The client publishes ack status updates to two topics per command (canonical + transitional alias):

- `infoscreen/{uuid}/commands/ack`
- `infoscreen/{uuid}/command/ack`

**Ack payload schema (v1, frozen):**

```json
{
  "command_id": "07aab032-53c2-45ef-a5a3-6aa58e9d9fae",
  "status": "accepted | execution_started | completed | failed",
  "error_code": null,
  "error_message": null
}
```

**Status lifecycle:**

| Status | When | Notes |
|---|---|---|
| `accepted` | Command received and validated | Immediate |
| `execution_started` | Helper invoked | Immediately after accepted |
| `completed` | Execution confirmed | For `reboot_host`: arrives after reconnect (10–90 s after `execution_started`) |
| `failed` | Helper returned an error | `error_code` and `error_message` will be set |

**Server must:**

- Track `command_id` through the full lifecycle and update status in the DB/UI.
- Surface `failed` + `error_code` to the operator UI.
- Expect `reboot_host` `completed` to arrive after a reconnect delay — do not treat the gap as a timeout.
- Use `expires_at` from the original command to determine when to abandon waiting.

---

## 3. Health Dashboard — Broker Connection Fields (Gap 2)

Every `infoscreen/{uuid}/health` payload now includes a `broker_connection` block:

```json
{
  "timestamp": "2026-04-05T08:00:00.000000+00:00",
  "expected_state": { "event_id": 42 },
  "actual_state": {
    "process": "display_manager",
    "pid": 1234,
    "status": "running"
  },
  "broker_connection": {
    "broker_reachable": true,
    "reconnect_count": 2,
    "last_disconnect_at": "2026-04-04T10:30:00Z"
  }
}
```

**Server must:**

- Display `reconnect_count` and `last_disconnect_at` per device in the health dashboard.
- Implement an alerting heuristic:
  - **All** clients go silent simultaneously → likely broker outage, not device crash.
  - **Single** client goes silent → device crash, network failure, or process hang.
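The silence heuristic above can be sketched as a fleet-level classifier. The function name, the 90 percent threshold, and the return labels are illustrative assumptions:

```python
def classify_silence(silent_clients: set, all_clients: set,
                     outage_fraction: float = 0.9) -> str:
    """Distinguish a likely broker outage from individual device failures."""
    if not silent_clients:
        return "healthy"
    if len(silent_clients) >= outage_fraction * len(all_clients):
        # Everyone went silent at once: suspect the broker, not the devices.
        return "probable_broker_outage"
    # Isolated silence: device crash, network loss, or process hang.
    return "device_level_failure"
```

Running this on each heartbeat-grace sweep keeps broker outages from paging operators once per device.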
---

## 4. Service-Failed MQTT Notification (Gap 3)

When systemd gives up restarting a service after repeated crashes (`StartLimitBurst` exceeded), the client automatically publishes a **retained** message:

**Topic:** `infoscreen/{uuid}/service_failed`

**Payload:**

```json
{
  "event": "service_failed",
  "unit": "infoscreen-simclient.service",
  "client_uuid": "9b8d1856-ff34-4864-a726-12de072d0f77",
  "failed_at": "2026-04-05T08:00:00Z"
}
```

**Server must:**

- Subscribe to `infoscreen/+/service_failed` on startup (retained — the message survives broker restarts).
- Alert the operator immediately when this topic receives a payload.
- **Clear the retained message** once the device is acknowledged or recovered:

```
mosquitto_pub -t "infoscreen/{uuid}/service_failed" -n --retain
```

---

## 5. No Server Action Required

These items are fully implemented client-side and require no server changes:

- systemd watchdog (`WatchdogSec=60`) — hangs detected and the process restarted automatically.
- Command deduplication — `command_id` deduplicated with a 24-hour TTL.
- Ack retry backoff — the client retries ack publish on broker disconnect until `expires_at`.
- Mock helper / test mode (`COMMAND_MOCK_REBOOT_IMMEDIATE_COMPLETE`) — development only.
@@ -4,11 +4,12 @@ import logging
 import datetime
 import base64
 import re
+import ssl
 import requests
 import paho.mqtt.client as mqtt
 from sqlalchemy import create_engine
 from sqlalchemy.orm import sessionmaker
-from models.models import Client, ClientLog, LogLevel, ProcessStatus, ScreenHealthStatus
+from models.models import Client, ClientLog, ClientCommand, LogLevel, ProcessStatus, ScreenHealthStatus
 logging.basicConfig(level=logging.DEBUG, format='%(asctime)s [%(levelname)s] %(message)s')

 # Load .env only when not already configured by Docker (API_BASE_URL not set by compose means we're outside a container)
@@ -32,6 +33,16 @@ Session = sessionmaker(bind=engine)
 # API configuration
 API_BASE_URL = os.getenv("API_BASE_URL", "http://server:8000")

+MQTT_BROKER_HOST = os.getenv("MQTT_BROKER_HOST", "mqtt")
+MQTT_BROKER_PORT = int(os.getenv("MQTT_BROKER_PORT", os.getenv("MQTT_PORT", "1883")))
+MQTT_USERNAME = os.getenv("MQTT_USER") or os.getenv("MQTT_USERNAME")
+MQTT_PASSWORD = os.getenv("MQTT_PASSWORD")
+MQTT_TLS_ENABLED = os.getenv("MQTT_TLS_ENABLED", "false").strip().lower() in ("1", "true", "yes", "on")
+MQTT_TLS_CA_CERT = os.getenv("MQTT_TLS_CA_CERT")
+MQTT_TLS_CERTFILE = os.getenv("MQTT_TLS_CERTFILE")
+MQTT_TLS_KEYFILE = os.getenv("MQTT_TLS_KEYFILE")
+MQTT_TLS_INSECURE = os.getenv("MQTT_TLS_INSECURE", "false").strip().lower() in ("1", "true", "yes", "on")
+
 # Dashboard payload migration observability
 DASHBOARD_METRICS_LOG_EVERY = int(os.getenv("DASHBOARD_METRICS_LOG_EVERY", "5"))
 DASHBOARD_PARSE_METRICS = {
@@ -376,8 +387,11 @@ def on_connect(client, userdata, flags, reasonCode, properties):
         client.subscribe("infoscreen/+/logs/warn")
         client.subscribe("infoscreen/+/logs/info")
         client.subscribe("infoscreen/+/health")
+        client.subscribe("infoscreen/+/commands/ack")
+        client.subscribe("infoscreen/+/command/ack")
+        client.subscribe("infoscreen/+/service_failed")

-        logging.info(f"MQTT connected (reasonCode: {reasonCode}); (re)subscribed to discovery, heartbeats, screenshots, dashboards, logs, and health")
+        logging.info(f"MQTT connected (reasonCode: {reasonCode}); (re)subscribed to discovery, heartbeats, screenshots, dashboards, logs, health, and service_failed")
     except Exception as e:
         logging.error(f"Subscribe failed on connect: {e}")

@@ -387,6 +401,72 @@ def on_message(client, userdata, msg):
     logging.debug(f"Empfangene Nachricht auf Topic: {topic}")

     try:
+        # Command acknowledgement handling
+        if topic.startswith("infoscreen/") and (topic.endswith("/commands/ack") or topic.endswith("/command/ack")):
+            uuid = topic.split("/")[1]
+            try:
+                payload = json.loads(msg.payload.decode())
+            except (json.JSONDecodeError, UnicodeDecodeError):
+                logging.error(f"Ungueltiges Command-ACK Payload von {uuid}")
+                return
+
+            command_id = payload.get("command_id")
+            ack_status = str(payload.get("status", "")).strip().lower()
+            error_code = payload.get("error_code")
+            error_message = payload.get("error_message")
+
+            if not command_id:
+                logging.warning(f"Command-ACK ohne command_id von {uuid}")
+                return
+
+            status_map = {
+                "accepted": "ack_received",
+                "execution_started": "execution_started",
+                "completed": "completed",
+                "failed": "failed",
+            }
+            mapped_status = status_map.get(ack_status)
+            if not mapped_status:
+                logging.warning(f"Unbekannter Command-ACK Status '{ack_status}' von {uuid}")
+                return
+
+            db_session = Session()
+            try:
+                command_obj = db_session.query(ClientCommand).filter_by(command_id=command_id).first()
+                if not command_obj:
+                    logging.warning(f"Command-ACK fuer unbekanntes command_id={command_id} von {uuid}")
+                    return
+
+                # Ignore stale/duplicate regressions.
+                terminal_states = {"completed", "failed", "expired", "canceled", "blocked_safety"}
+                if command_obj.status in terminal_states:
+                    logging.info(
+                        f"Command-ACK ignoriert (bereits terminal): command_id={command_id}, status={command_obj.status}"
+                    )
+                    return
+
+                now_utc = datetime.datetime.now(datetime.UTC)
+                command_obj.status = mapped_status
+                if mapped_status == "ack_received":
+                    command_obj.acked_at = now_utc
+                elif mapped_status == "execution_started":
+                    command_obj.execution_started_at = now_utc
+                elif mapped_status == "completed":
+                    command_obj.completed_at = now_utc
+                elif mapped_status == "failed":
+                    command_obj.failed_at = now_utc
+                    command_obj.error_code = str(error_code) if error_code is not None else command_obj.error_code
+                    command_obj.error_message = str(error_message) if error_message is not None else command_obj.error_message
+
+                db_session.commit()
+                logging.info(f"Command-ACK verarbeitet: command_id={command_id}, status={mapped_status}, uuid={uuid}")
+            except Exception as e:
+                db_session.rollback()
+                logging.error(f"Fehler bei Command-ACK Verarbeitung ({command_id}): {e}")
+            finally:
+                db_session.close()
+            return
+
         # Dashboard-Handling (nested screenshot payload)
         if topic.startswith("infoscreen/") and topic.endswith("/dashboard"):
             uuid = topic.split("/")[1]
@@ -506,6 +586,43 @@ def on_message(client, userdata, msg):
             logging.error(f"Could not parse log payload from {uuid}: {e}")
             return

+        # Service-failed handling (systemd gave up restarting — retained message)
+        if topic.startswith("infoscreen/") and topic.endswith("/service_failed"):
+            uuid = topic.split("/")[1]
+            # Empty payload = retained message cleared; ignore it.
+            if not msg.payload:
+                logging.info(f"service_failed retained message cleared for {uuid}")
+                return
+            try:
+                payload_data = json.loads(msg.payload.decode())
+                failed_at_str = payload_data.get("failed_at")
+                unit = payload_data.get("unit", "")
+                try:
+                    failed_at = datetime.datetime.fromisoformat(failed_at_str.replace("Z", "+00:00")) if failed_at_str else datetime.datetime.now(datetime.UTC)
+                    if failed_at.tzinfo is None:
+                        failed_at = failed_at.replace(tzinfo=datetime.UTC)
+                except (ValueError, AttributeError):
+                    failed_at = datetime.datetime.now(datetime.UTC)
+
+                session = Session()
+                try:
+                    client_obj = session.query(Client).filter_by(uuid=uuid).first()
+                    if client_obj:
+                        client_obj.service_failed_at = failed_at
+                        client_obj.service_failed_unit = unit[:128] if unit else None
+                        session.commit()
+                        logging.warning(f"event=service_failed uuid={uuid} unit={unit} failed_at={failed_at.isoformat()}")
+                    else:
+                        logging.warning(f"service_failed received for unknown client uuid={uuid}")
+                except Exception as e:
+                    session.rollback()
+                    logging.error(f"Error persisting service_failed for {uuid}: {e}")
+                finally:
+                    session.close()
+            except (json.JSONDecodeError, UnicodeDecodeError) as e:
+                logging.error(f"Could not parse service_failed payload from {uuid}: {e}")
+            return
+
         # Health-Handling
         if topic.startswith("infoscreen/") and topic.endswith("/health"):
             uuid = topic.split("/")[1]
@@ -531,6 +648,26 @@ def on_message(client, userdata, msg):
                 screen_health_status=screen_health_status,
                 last_screenshot_analyzed=parse_timestamp((payload_data.get('health_metrics') or {}).get('last_frame_update')),
             )

+            # Update broker connection health fields
+            broker_conn = payload_data.get('broker_connection')
+            if isinstance(broker_conn, dict):
+                reconnect_count = broker_conn.get('reconnect_count')
+                last_disconnect_str = broker_conn.get('last_disconnect_at')
+                if reconnect_count is not None:
+                    try:
+                        client_obj.mqtt_reconnect_count = int(reconnect_count)
+                    except (ValueError, TypeError):
+                        pass
+                if last_disconnect_str:
+                    try:
+                        last_disconnect = datetime.datetime.fromisoformat(last_disconnect_str.replace('Z', '+00:00'))
+                        if last_disconnect.tzinfo is None:
+                            last_disconnect = last_disconnect.replace(tzinfo=datetime.UTC)
+                        client_obj.mqtt_last_disconnect_at = last_disconnect
+                    except (ValueError, AttributeError):
+                        pass
+
             session.commit()
             logging.debug(f"Health update from {uuid}: {actual.get('process')} ({actual.get('status')})")
             session.close()
@@ -589,9 +726,29 @@ def main():
     mqtt_client.on_connect = on_connect
     # Set an exponential reconnect delay to survive broker restarts
     mqtt_client.reconnect_delay_set(min_delay=1, max_delay=60)
-    mqtt_client.connect("mqtt", 1883)

-    logging.info("Listener gestartet; warte auf MQTT-Verbindung und Nachrichten")
+    if MQTT_USERNAME and MQTT_PASSWORD:
+        mqtt_client.username_pw_set(MQTT_USERNAME, MQTT_PASSWORD)
+
+    if MQTT_TLS_ENABLED:
+        mqtt_client.tls_set(
+            ca_certs=MQTT_TLS_CA_CERT,
+            certfile=MQTT_TLS_CERTFILE,
+            keyfile=MQTT_TLS_KEYFILE,
+            cert_reqs=ssl.CERT_REQUIRED,
+        )
+        if MQTT_TLS_INSECURE:
+            mqtt_client.tls_insecure_set(True)
+
+    mqtt_client.connect(MQTT_BROKER_HOST, MQTT_BROKER_PORT)
+
+    logging.info(
+        "Listener gestartet; warte auf MQTT-Verbindung und Nachrichten (host=%s port=%s tls=%s auth=%s)",
+        MQTT_BROKER_HOST,
+        MQTT_BROKER_PORT,
+        MQTT_TLS_ENABLED,
+        bool(MQTT_USERNAME and MQTT_PASSWORD),
+    )
     mqtt_client.loop_forever()
@@ -147,6 +147,14 @@ class Client(Base):
     screen_health_status = Column(Enum(ScreenHealthStatus), nullable=True, server_default='UNKNOWN')
     last_screenshot_hash = Column(String(32), nullable=True)

+    # Systemd service-failed tracking
+    service_failed_at = Column(TIMESTAMP(timezone=True), nullable=True)
+    service_failed_unit = Column(String(128), nullable=True)
+
+    # MQTT broker connection health
+    mqtt_reconnect_count = Column(Integer, nullable=True)
+    mqtt_last_disconnect_at = Column(TIMESTAMP(timezone=True), nullable=True)
+

 class ClientLog(Base):
     __tablename__ = 'client_logs'
@@ -164,6 +172,33 @@ class ClientLog(Base):
|
||||
)
|
||||
|
||||
|
||||
class ClientCommand(Base):
|
||||
__tablename__ = 'client_commands'
|
||||
|
||||
id = Column(Integer, primary_key=True, autoincrement=True)
|
||||
command_id = Column(String(36), nullable=False, unique=True, index=True)
|
||||
client_uuid = Column(String(36), ForeignKey('clients.uuid', ondelete='CASCADE'), nullable=False, index=True)
|
||||
action = Column(String(32), nullable=False, index=True)
|
||||
status = Column(String(40), nullable=False, index=True)
|
||||
reason = Column(Text, nullable=True)
|
||||
requested_by = Column(Integer, ForeignKey('users.id', ondelete='SET NULL'), nullable=True, index=True)
|
||||
issued_at = Column(TIMESTAMP(timezone=True), nullable=False)
|
||||
expires_at = Column(TIMESTAMP(timezone=True), nullable=False)
|
||||
published_at = Column(TIMESTAMP(timezone=True), nullable=True)
|
||||
acked_at = Column(TIMESTAMP(timezone=True), nullable=True)
|
||||
execution_started_at = Column(TIMESTAMP(timezone=True), nullable=True)
|
||||
completed_at = Column(TIMESTAMP(timezone=True), nullable=True)
|
||||
failed_at = Column(TIMESTAMP(timezone=True), nullable=True)
|
||||
error_code = Column(String(64), nullable=True)
|
||||
error_message = Column(Text, nullable=True)
|
||||
created_at = Column(TIMESTAMP(timezone=True), server_default=func.current_timestamp(), nullable=False)
|
||||
updated_at = Column(TIMESTAMP(timezone=True), server_default=func.current_timestamp(), onupdate=func.current_timestamp(), nullable=False)
|
||||
|
||||
__table_args__ = (
|
||||
Index('ix_client_commands_client_status_created', 'client_uuid', 'status', 'created_at'),
|
||||
)
|
||||
|
||||
|
||||
class EventType(enum.Enum):
|
||||
presentation = "presentation"
|
||||
website = "website"
|
||||
|
||||
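The `ClientCommand` table records a status per command; the diff uses the values `queued`, `publish_in_progress`, `published`, `blocked_safety`, `expired`, `failed`, and the terminal set also includes `completed` and `canceled`. A minimal sketch of a lifecycle check built from those values — the transition map itself is an assumption for illustration, not part of this changeset:

```python
# Status values taken from this changeset; the allowed-transition map is hypothetical.
TERMINAL = {"completed", "failed", "expired", "canceled", "blocked_safety"}

TRANSITIONS = {
    "queued": {"publish_in_progress", "expired", "canceled"},
    "publish_in_progress": {"published", "failed", "expired"},
    "published": {"acked", "failed", "expired"},
    "acked": {"execution_started", "completed", "failed", "expired"},
    "execution_started": {"completed", "failed", "expired"},
}

def can_transition(current: str, new: str) -> bool:
    """Returns True if moving from `current` to `new` is allowed by the sketch above."""
    if current in TERMINAL:
        return False  # terminal states never change
    return new in TRANSITIONS.get(current, set())
```

Such a guard would prevent, for example, a late listener update from moving an already-expired command back to `published`.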
@@ -1,13 +1,14 @@
# scheduler/db_utils.py
from dotenv import load_dotenv
import os
from datetime import datetime
from datetime import datetime, timedelta, timezone
import hashlib
import json
import logging
from sqlalchemy.orm import sessionmaker, joinedload
from sqlalchemy import create_engine, or_, and_, text
from models.models import Event, EventMedia, EventException, SystemSetting
import uuid as _uuid_mod
from models.models import Event, EventMedia, EventException, SystemSetting, Client, ClientCommand, ProcessStatus
from dateutil.rrule import rrulestr
from urllib.request import Request, urlopen
from datetime import timezone
@@ -454,3 +455,167 @@ def format_event_with_media(event):
    # Add other event types (message, etc.) here as needed...

    return event_dict


# ---------------------------------------------------------------------------
# Crash detection / auto-recovery helpers
# ---------------------------------------------------------------------------

_CRASH_RECOVERY_SCHEMA_VERSION = "1.0"
_CRASH_COMMAND_TOPIC = "infoscreen/{uuid}/commands"
_CRASH_COMMAND_COMPAT_TOPIC = "infoscreen/{uuid}/command"
_CRASH_RECOVERY_EXPIRY_SECONDS = int(os.getenv("CRASH_RECOVERY_COMMAND_EXPIRY_SECONDS", "240"))
_CRASH_RECOVERY_LOCKOUT_MINUTES = int(os.getenv("CRASH_RECOVERY_LOCKOUT_MINUTES", "15"))


def get_crash_recovery_candidates(heartbeat_grace_seconds: int) -> list:
    """
    Returns a list of dicts for active clients that are crashed (process_status=crashed)
    or heartbeat-stale and do not already have a recent recovery command in the lockout window.
    """
    session = Session()
    try:
        now = datetime.now(timezone.utc)
        stale_cutoff = now - timedelta(seconds=heartbeat_grace_seconds)
        lockout_cutoff = now - timedelta(minutes=_CRASH_RECOVERY_LOCKOUT_MINUTES)

        candidates = (
            session.query(Client)
            .filter(Client.is_active == True)
            .filter(
                or_(
                    Client.process_status == ProcessStatus.crashed,
                    Client.last_alive < stale_cutoff,
                )
            )
            .all()
        )

        result = []
        for c in candidates:
            recent = (
                session.query(ClientCommand)
                .filter(ClientCommand.client_uuid == c.uuid)
                .filter(ClientCommand.created_at >= lockout_cutoff)
                .filter(ClientCommand.action.in_(["reboot_host", "restart_app"]))
                .first()
            )
            if recent:
                continue
            crash_reason = (
                "process_crashed"
                if c.process_status == ProcessStatus.crashed
                else "heartbeat_stale"
            )
            result.append({
                "uuid": c.uuid,
                "reason": crash_reason,
                "process_status": c.process_status.value if c.process_status else None,
                "last_alive": c.last_alive,
            })
        return result
    finally:
        session.close()


def issue_crash_recovery_command(client_uuid: str, reason: str) -> tuple:
    """
    Writes a ClientCommand (reboot_host) for crash recovery to the DB.
    Returns (command_id, payload, topic, compat_topic) so the caller can
    publish the payload over MQTT on both topic variants.
    """
    session = Session()
    try:
        now = datetime.now(timezone.utc)
        expires_at = now + timedelta(seconds=_CRASH_RECOVERY_EXPIRY_SECONDS)
        command_id = str(_uuid_mod.uuid4())

        command = ClientCommand(
            command_id=command_id,
            client_uuid=client_uuid,
            action="reboot_host",
            status="queued",
            reason=reason,
            requested_by=None,
            issued_at=now,
            expires_at=expires_at,
        )
        session.add(command)
        session.commit()
        command.status = "publish_in_progress"
        session.commit()

        payload = {
            "schema_version": _CRASH_RECOVERY_SCHEMA_VERSION,
            "command_id": command_id,
            "client_uuid": client_uuid,
            "action": "reboot_host",
            "issued_at": now.isoformat().replace("+00:00", "Z"),
            "expires_at": expires_at.isoformat().replace("+00:00", "Z"),
            "requested_by": None,
            "reason": reason,
        }
        topic = _CRASH_COMMAND_TOPIC.format(uuid=client_uuid)
        compat_topic = _CRASH_COMMAND_COMPAT_TOPIC.format(uuid=client_uuid)
        return command_id, payload, topic, compat_topic
    except Exception:
        session.rollback()
        raise
    finally:
        session.close()


def finalize_crash_recovery_command(command_id: str, published: bool, error: str = None) -> None:
    """Updates command status after an MQTT publish attempt."""
    session = Session()
    try:
        cmd = session.query(ClientCommand).filter_by(command_id=command_id).first()
        if not cmd:
            return
        now = datetime.now(timezone.utc)
        if published:
            cmd.status = "published"
            cmd.published_at = now
        else:
            cmd.status = "failed"
            cmd.failed_at = now
            cmd.error_code = "mqtt_publish_failed"
            cmd.error_message = error or "Unknown publish error"
        session.commit()
    finally:
        session.close()


_TERMINAL_COMMAND_STATUSES = {"completed", "failed", "expired", "canceled", "blocked_safety"}


def sweep_expired_commands() -> int:
    """Marks non-terminal commands whose expires_at has passed as expired.

    Returns the number of commands updated.
    """
    session = Session()
    try:
        now = datetime.now(timezone.utc)
        commands = (
            session.query(ClientCommand)
            .filter(
                ClientCommand.expires_at < now,
                ClientCommand.status.notin_(_TERMINAL_COMMAND_STATUSES),
            )
            .all()
        )
        if not commands:
            return 0
        for cmd in commands:
            cmd.status = "expired"
            cmd.failed_at = now
            cmd.error_code = "expired"
            cmd.error_message = "Command expired before terminal state was reached."
        session.commit()
        return len(commands)
    except Exception:
        session.rollback()
        raise
    finally:
        session.close()
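The command payload built above carries a `schema_version`, a UUID `command_id`, and Z-suffixed UTC timestamps. A minimal sketch of how a receiving client might validate such a payload before acting on it — the helper name and rejection codes are hypothetical, not part of this changeset:

```python
import json
from datetime import datetime, timezone

def validate_command_payload(raw: str, now: datetime):
    """Hypothetical client-side check: parse, verify schema version and expiry."""
    payload = json.loads(raw)
    if payload.get("schema_version") != "1.0":
        return None, "unsupported_schema"
    # The "Z" suffix is not accepted by fromisoformat() before Python 3.11.
    expires = datetime.fromisoformat(payload["expires_at"].replace("Z", "+00:00"))
    if expires <= now:
        return None, "expired"
    return payload, None

raw = json.dumps({
    "schema_version": "1.0",
    "command_id": "00000000-0000-0000-0000-000000000000",
    "action": "reboot_host",
    "expires_at": "2099-01-01T00:00:00Z",
})
payload, err = validate_command_payload(raw, datetime.now(timezone.utc))
```

Rejecting stale payloads on the client mirrors the server-side `sweep_expired_commands` cleanup, so a command that sat in the broker past its expiry is never executed.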
@@ -8,12 +8,28 @@ from .db_utils import (
    compute_group_power_intent_basis,
    build_group_power_intent_body,
    compute_group_power_intent_fingerprint,
    get_crash_recovery_candidates,
    issue_crash_recovery_command,
    finalize_crash_recovery_command,
    sweep_expired_commands,
)
import paho.mqtt.client as mqtt
import json
import datetime
import time
import uuid
import ssl


MQTT_BROKER_HOST = os.getenv("MQTT_BROKER_HOST", os.getenv("MQTT_BROKER_URL", "mqtt"))
MQTT_BROKER_PORT = int(os.getenv("MQTT_BROKER_PORT", os.getenv("MQTT_PORT", "1883")))
MQTT_USERNAME = os.getenv("MQTT_USER") or os.getenv("MQTT_USERNAME")
MQTT_PASSWORD = os.getenv("MQTT_PASSWORD")
MQTT_TLS_ENABLED = os.getenv("MQTT_TLS_ENABLED", "false").strip().lower() in ("1", "true", "yes", "on")
MQTT_TLS_CA_CERT = os.getenv("MQTT_TLS_CA_CERT")
MQTT_TLS_CERTFILE = os.getenv("MQTT_TLS_CERTFILE")
MQTT_TLS_KEYFILE = os.getenv("MQTT_TLS_KEYFILE")
MQTT_TLS_INSECURE = os.getenv("MQTT_TLS_INSECURE", "false").strip().lower() in ("1", "true", "yes", "on")


def _to_utc_z(dt: datetime.datetime) -> str:
@@ -224,6 +240,19 @@ def main():
    client = mqtt.Client(callback_api_version=mqtt.CallbackAPIVersion.VERSION2)
    client.reconnect_delay_set(min_delay=1, max_delay=30)

    if MQTT_USERNAME and MQTT_PASSWORD:
        client.username_pw_set(MQTT_USERNAME, MQTT_PASSWORD)

    if MQTT_TLS_ENABLED:
        client.tls_set(
            ca_certs=MQTT_TLS_CA_CERT,
            certfile=MQTT_TLS_CERTFILE,
            keyfile=MQTT_TLS_KEYFILE,
            cert_reqs=ssl.CERT_REQUIRED,
        )
        if MQTT_TLS_INSECURE:
            client.tls_insecure_set(True)

    POLL_INTERVAL = int(os.getenv("POLL_INTERVAL_SECONDS", "30"))
    # 0 = off; e.g. 600 for every 10 minutes
    # initial value from DB or fallback to env
@@ -238,16 +267,21 @@ def main():
    POWER_INTENT_HEARTBEAT_ENABLED = _env_bool("POWER_INTENT_HEARTBEAT_ENABLED", True)
    POWER_INTENT_EXPIRY_MULTIPLIER = int(os.getenv("POWER_INTENT_EXPIRY_MULTIPLIER", "3"))
    POWER_INTENT_MIN_EXPIRY_SECONDS = int(os.getenv("POWER_INTENT_MIN_EXPIRY_SECONDS", "90"))
    CRASH_RECOVERY_ENABLED = _env_bool("CRASH_RECOVERY_ENABLED", False)
    CRASH_RECOVERY_GRACE_SECONDS = int(os.getenv("CRASH_RECOVERY_GRACE_SECONDS", "180"))

    logging.info(
        "Scheduler config: poll_interval=%ss refresh_seconds=%s power_intent_enabled=%s "
        "power_intent_heartbeat=%s power_intent_expiry_multiplier=%s power_intent_min_expiry=%ss",
        "power_intent_heartbeat=%s power_intent_expiry_multiplier=%s power_intent_min_expiry=%ss "
        "crash_recovery_enabled=%s crash_recovery_grace=%ss",
        POLL_INTERVAL,
        REFRESH_SECONDS,
        POWER_INTENT_PUBLISH_ENABLED,
        POWER_INTENT_HEARTBEAT_ENABLED,
        POWER_INTENT_EXPIRY_MULTIPLIER,
        POWER_INTENT_MIN_EXPIRY_SECONDS,
        CRASH_RECOVERY_ENABLED,
        CRASH_RECOVERY_GRACE_SECONDS,
    )
    # Configurable time window in days (default: 7)
    WINDOW_DAYS = int(os.getenv("EVENTS_WINDOW_DAYS", "7"))
@@ -275,8 +309,15 @@ def main():

    client.on_connect = on_connect

    client.connect("mqtt", 1883)
    client.connect(MQTT_BROKER_HOST, MQTT_BROKER_PORT)
    client.loop_start()
    logging.info(
        "MQTT connection configured host=%s port=%s tls=%s auth=%s",
        MQTT_BROKER_HOST,
        MQTT_BROKER_PORT,
        MQTT_TLS_ENABLED,
        bool(MQTT_USERNAME and MQTT_PASSWORD),
    )

    while True:
        now = datetime.datetime.now(datetime.timezone.utc)
@@ -390,6 +431,51 @@ def main():
            power_intent_metrics["retained_republish_total"],
        )

        if CRASH_RECOVERY_ENABLED:
            try:
                candidates = get_crash_recovery_candidates(CRASH_RECOVERY_GRACE_SECONDS)
                if candidates:
                    logging.info("event=crash_recovery_scan candidates=%s", len(candidates))
                for candidate in candidates:
                    cuuid = candidate["uuid"]
                    reason = candidate["reason"]
                    try:
                        command_id, payload, topic, compat_topic = issue_crash_recovery_command(
                            client_uuid=cuuid,
                            reason=reason,
                        )
                        result = client.publish(topic, json.dumps(payload), qos=1, retain=False)
                        result.wait_for_publish(timeout=5.0)
                        compat_result = client.publish(compat_topic, json.dumps(payload), qos=1, retain=False)
                        compat_result.wait_for_publish(timeout=5.0)
                        success = result.rc == mqtt.MQTT_ERR_SUCCESS
                        error = None if success else mqtt.error_string(result.rc)
                        finalize_crash_recovery_command(command_id, published=success, error=error)
                        if success:
                            logging.info(
                                "event=crash_recovery_command_issued client_uuid=%s reason=%s command_id=%s",
                                cuuid, reason, command_id,
                            )
                        else:
                            logging.error(
                                "event=crash_recovery_publish_failed client_uuid=%s reason=%s command_id=%s error=%s",
                                cuuid, reason, command_id, error,
                            )
                    except Exception as cmd_exc:
                        logging.error(
                            "event=crash_recovery_command_error client_uuid=%s reason=%s error=%s",
                            cuuid, reason, cmd_exc,
                        )
            except Exception as scan_exc:
                logging.error("event=crash_recovery_scan_error error=%s", scan_exc)

        try:
            expired_count = sweep_expired_commands()
            if expired_count:
                logging.info("event=command_expiry_sweep expired=%s", expired_count)
        except Exception as sweep_exc:
            logging.error("event=command_expiry_sweep_error error=%s", sweep_exc)

        time.sleep(POLL_INTERVAL)
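Both the scheduler and the listener parse boolean environment flags with the same `.strip().lower() in ("1", "true", "yes", "on")` pattern (the diff references an `_env_bool` helper for this). A standalone reimplementation of that pattern for illustration:

```python
import os

def env_bool(name: str, default: bool = False) -> bool:
    """Parses a truthy env flag the way the scheduler's config does."""
    raw = os.getenv(name)
    if raw is None:
        return default
    # Whitespace and case are normalized, so " True " and "yes" both count.
    return raw.strip().lower() in ("1", "true", "yes", "on")

os.environ["CRASH_RECOVERY_ENABLED"] = " True "
enabled = env_bool("CRASH_RECOVERY_ENABLED")  # → True
```

Any other value ("0", "off", an empty string) falls through to `False`, which keeps opt-in features like crash recovery disabled unless explicitly enabled.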
@@ -0,0 +1,63 @@
"""add client commands table

Revision ID: aa12bb34cc56
Revises: f3c4d5e6a7b8
Create Date: 2026-04-03 12:40:00.000000

"""
from typing import Sequence, Union

from alembic import op
import sqlalchemy as sa


# revision identifiers, used by Alembic.
revision: str = 'aa12bb34cc56'
down_revision: Union[str, None] = 'f3c4d5e6a7b8'
branch_labels: Union[str, Sequence[str], None] = None
depends_on: Union[str, Sequence[str], None] = None


def upgrade() -> None:
    op.create_table(
        'client_commands',
        sa.Column('id', sa.Integer(), autoincrement=True, nullable=False),
        sa.Column('command_id', sa.String(length=36), nullable=False),
        sa.Column('client_uuid', sa.String(length=36), nullable=False),
        sa.Column('action', sa.String(length=32), nullable=False),
        sa.Column('status', sa.String(length=40), nullable=False),
        sa.Column('reason', sa.Text(), nullable=True),
        sa.Column('requested_by', sa.Integer(), nullable=True),
        sa.Column('issued_at', sa.TIMESTAMP(timezone=True), nullable=False),
        sa.Column('expires_at', sa.TIMESTAMP(timezone=True), nullable=False),
        sa.Column('published_at', sa.TIMESTAMP(timezone=True), nullable=True),
        sa.Column('acked_at', sa.TIMESTAMP(timezone=True), nullable=True),
        sa.Column('execution_started_at', sa.TIMESTAMP(timezone=True), nullable=True),
        sa.Column('completed_at', sa.TIMESTAMP(timezone=True), nullable=True),
        sa.Column('failed_at', sa.TIMESTAMP(timezone=True), nullable=True),
        sa.Column('error_code', sa.String(length=64), nullable=True),
        sa.Column('error_message', sa.Text(), nullable=True),
        sa.Column('created_at', sa.TIMESTAMP(timezone=True), server_default=sa.func.current_timestamp(), nullable=False),
        sa.Column('updated_at', sa.TIMESTAMP(timezone=True), server_default=sa.func.current_timestamp(), nullable=False),
        sa.ForeignKeyConstraint(['client_uuid'], ['clients.uuid'], ondelete='CASCADE'),
        sa.ForeignKeyConstraint(['requested_by'], ['users.id'], ondelete='SET NULL'),
        sa.PrimaryKeyConstraint('id'),
        sa.UniqueConstraint('command_id')
    )

    op.create_index(op.f('ix_client_commands_action'), 'client_commands', ['action'], unique=False)
    op.create_index(op.f('ix_client_commands_client_uuid'), 'client_commands', ['client_uuid'], unique=False)
    op.create_index(op.f('ix_client_commands_command_id'), 'client_commands', ['command_id'], unique=False)
    op.create_index(op.f('ix_client_commands_requested_by'), 'client_commands', ['requested_by'], unique=False)
    op.create_index(op.f('ix_client_commands_status'), 'client_commands', ['status'], unique=False)
    op.create_index('ix_client_commands_client_status_created', 'client_commands', ['client_uuid', 'status', 'created_at'], unique=False)


def downgrade() -> None:
    op.drop_index('ix_client_commands_client_status_created', table_name='client_commands')
    op.drop_index(op.f('ix_client_commands_status'), table_name='client_commands')
    op.drop_index(op.f('ix_client_commands_requested_by'), table_name='client_commands')
    op.drop_index(op.f('ix_client_commands_command_id'), table_name='client_commands')
    op.drop_index(op.f('ix_client_commands_client_uuid'), table_name='client_commands')
    op.drop_index(op.f('ix_client_commands_action'), table_name='client_commands')
    op.drop_table('client_commands')
@@ -0,0 +1,43 @@
"""add service_failed and mqtt broker connection fields to clients

Revision ID: b1c2d3e4f5a6
Revises: aa12bb34cc56
Create Date: 2026-04-05 10:00:00.000000

"""
from typing import Sequence, Union

from alembic import op
import sqlalchemy as sa


# revision identifiers, used by Alembic.
revision: str = 'b1c2d3e4f5a6'
down_revision: Union[str, None] = 'aa12bb34cc56'
branch_labels: Union[str, Sequence[str], None] = None
depends_on: Union[str, Sequence[str], None] = None


def upgrade() -> None:
    bind = op.get_bind()
    inspector = sa.inspect(bind)
    existing = {c['name'] for c in inspector.get_columns('clients')}

    # Systemd service-failed tracking
    if 'service_failed_at' not in existing:
        op.add_column('clients', sa.Column('service_failed_at', sa.TIMESTAMP(timezone=True), nullable=True))
    if 'service_failed_unit' not in existing:
        op.add_column('clients', sa.Column('service_failed_unit', sa.String(128), nullable=True))

    # MQTT broker connection health
    if 'mqtt_reconnect_count' not in existing:
        op.add_column('clients', sa.Column('mqtt_reconnect_count', sa.Integer(), nullable=True))
    if 'mqtt_last_disconnect_at' not in existing:
        op.add_column('clients', sa.Column('mqtt_last_disconnect_at', sa.TIMESTAMP(timezone=True), nullable=True))


def downgrade() -> None:
    op.drop_column('clients', 'mqtt_last_disconnect_at')
    op.drop_column('clients', 'mqtt_reconnect_count')
    op.drop_column('clients', 'service_failed_unit')
    op.drop_column('clients', 'service_failed_at')
@@ -1,7 +1,11 @@
#!/usr/bin/env python3
"""
Creates default school years for Austrian schools.
Run this script after the migration to create the default periods.
Creates default school years and automatically sets the active period.

This script is idempotent:
- If no periods exist yet, the default periods are created.
- Afterwards (on every run), the non-archived period that covers
  today's date is activated.
"""

from datetime import date
@@ -11,54 +15,94 @@ import sys
sys.path.append('/workspace')


def _create_default_periods_if_missing(session):
    """Creates the default school years only if no periods exist yet."""
    existing = session.query(AcademicPeriod).first()
    if existing:
        print("Academic periods already exist. Skipping creation.")
        return False

    periods = [
        {
            'name': 'Schuljahr 2024/25',
            'display_name': 'SJ 24/25',
            'start_date': date(2024, 9, 2),
            'end_date': date(2025, 7, 4),
            'period_type': AcademicPeriodType.schuljahr,
            'is_active': False
        },
        {
            'name': 'Schuljahr 2025/26',
            'display_name': 'SJ 25/26',
            'start_date': date(2025, 9, 1),
            'end_date': date(2026, 7, 3),
            'period_type': AcademicPeriodType.schuljahr,
            'is_active': False
        },
        {
            'name': 'Schuljahr 2026/27',
            'display_name': 'SJ 26/27',
            'start_date': date(2026, 9, 7),
            'end_date': date(2027, 7, 2),
            'period_type': AcademicPeriodType.schuljahr,
            'is_active': False
        }
    ]

    for period_data in periods:
        period = AcademicPeriod(**period_data)
        session.add(period)

    session.flush()
    print(f"Successfully created {len(periods)} academic periods")
    return True


def _activate_period_for_today(session):
    """Activates exactly one period: the one that covers today."""
    today = date.today()

    period_for_today = (
        session.query(AcademicPeriod)
        .filter(
            AcademicPeriod.is_archived == False,
            AcademicPeriod.start_date <= today,
            AcademicPeriod.end_date >= today,
        )
        .order_by(AcademicPeriod.start_date.desc())
        .first()
    )

    # Always deactivate all active periods first to keep the state consistent.
    session.query(AcademicPeriod).filter(AcademicPeriod.is_active == True).update(
        {AcademicPeriod.is_active: False},
        synchronize_session=False,
    )

    if period_for_today:
        period_for_today.is_active = True
        print(
            f"Activated academic period for today ({today}): {period_for_today.name} "
            f"[{period_for_today.start_date} - {period_for_today.end_date}]"
        )
    else:
        print(
            f"WARNING: No academic period covers today ({today}). "
            "All periods remain inactive."
        )


def create_default_academic_periods():
    """Creates default school years for Austrian schools."""
    """Creates default periods (if needed) and sets the active period for today."""
    session = Session()

    try:
        # Check whether periods already exist
        existing = session.query(AcademicPeriod).first()
        if existing:
            print("Academic periods already exist. Skipping creation.")
            return

        # Create the default school years
        periods = [
            {
                'name': 'Schuljahr 2024/25',
                'display_name': 'SJ 24/25',
                'start_date': date(2024, 9, 2),
                'end_date': date(2025, 7, 4),
                'period_type': AcademicPeriodType.schuljahr,
                'is_active': True  # current school year
            },
            {
                'name': 'Schuljahr 2025/26',
                'display_name': 'SJ 25/26',
                'start_date': date(2025, 9, 1),
                'end_date': date(2026, 7, 3),
                'period_type': AcademicPeriodType.schuljahr,
                'is_active': False
            },
            {
                'name': 'Schuljahr 2026/27',
                'display_name': 'SJ 26/27',
                'start_date': date(2026, 9, 7),
                'end_date': date(2027, 7, 2),
                'period_type': AcademicPeriodType.schuljahr,
                'is_active': False
            }
        ]

        for period_data in periods:
            period = AcademicPeriod(**period_data)
            session.add(period)

        _create_default_periods_if_missing(session)
        _activate_period_for_today(session)
        session.commit()
        print(f"Successfully created {len(periods)} academic periods")

        # Show the created periods
        for period in session.query(AcademicPeriod).all():
        for period in session.query(AcademicPeriod).order_by(AcademicPeriod.start_date.asc()).all():
            status = "AKTIV" if period.is_active else "inaktiv"
            print(
                f" - {period.name} ({period.start_date} - {period.end_date}) [{status}]")
@@ -1,4 +1,4 @@
from sqlalchemy import create_engine, text
import os
from dotenv import load_dotenv
import bcrypt
@@ -421,6 +421,8 @@ def get_monitoring_overview():
            },
            "latest_log": _serialize_log_entry(latest_log),
            "latest_error": _serialize_log_entry(latest_error),
            "mqtt_reconnect_count": client.mqtt_reconnect_count,
            "mqtt_last_disconnect_at": client.mqtt_last_disconnect_at.isoformat() if client.mqtt_last_disconnect_at else None,
        })

        summary_counts["total_clients"] += 1
@@ -1,7 +1,8 @@
from server.database import Session
from models.models import Client, ClientGroup
from flask import Blueprint, request, jsonify
from models.models import Client, ClientGroup, ClientCommand, ProcessStatus
from flask import Blueprint, request, jsonify, session as flask_session
from server.permissions import admin_or_higher
from server.routes.groups import get_grace_period, is_client_alive
from server.mqtt_helper import publish_client_group, delete_client_group_message, publish_multiple_client_groups
import sys
import os
@@ -9,13 +10,196 @@ import glob
import base64
import hashlib
import json
from datetime import datetime, timezone
import uuid as uuid_lib
from datetime import datetime, timezone, timedelta
sys.path.append('/workspace')

clients_bp = Blueprint("clients", __name__, url_prefix="/api/clients")

VALID_SCREENSHOT_TYPES = {"periodic", "event_start", "event_stop"}

COMMAND_SCHEMA_VERSION = "1.0"
COMMAND_TOPIC_TEMPLATE = "infoscreen/{uuid}/commands"
COMMAND_TOPIC_COMPAT_TEMPLATE = "infoscreen/{uuid}/command"
LEGACY_RESTART_TOPIC_TEMPLATE = "clients/{uuid}/restart"
COMMAND_EXPIRY_SECONDS = 240
REBOOT_LOCKOUT_WINDOW_MINUTES = 15
REBOOT_LOCKOUT_THRESHOLD = 3
API_ACTION_TO_COMMAND_ACTION = {
    "restart": "reboot_host",
    "shutdown": "shutdown_host",
    "restart_app": "restart_app",
}
ALLOWED_COMMAND_ACTIONS = set(API_ACTION_TO_COMMAND_ACTION.keys())


def _iso_utc_z(ts: datetime) -> str:
    return ts.astimezone(timezone.utc).isoformat().replace("+00:00", "Z")


def _command_to_dict(command: ClientCommand) -> dict:
    return {
        "commandId": command.command_id,
        "clientUuid": command.client_uuid,
        "action": command.action,
        "status": command.status,
        "reason": command.reason,
        "requestedBy": command.requested_by,
        "issuedAt": command.issued_at.isoformat() if command.issued_at else None,
        "expiresAt": command.expires_at.isoformat() if command.expires_at else None,
        "publishedAt": command.published_at.isoformat() if command.published_at else None,
        "ackedAt": command.acked_at.isoformat() if command.acked_at else None,
        "executionStartedAt": command.execution_started_at.isoformat() if command.execution_started_at else None,
        "completedAt": command.completed_at.isoformat() if command.completed_at else None,
        "failedAt": command.failed_at.isoformat() if command.failed_at else None,
        "errorCode": command.error_code,
        "errorMessage": command.error_message,
        "createdAt": command.created_at.isoformat() if command.created_at else None,
        "updatedAt": command.updated_at.isoformat() if command.updated_at else None,
    }
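`_iso_utc_z` normalizes any timezone-aware datetime to UTC and renders it with a trailing `Z`, the format used in the command payloads. The same expression, isolated so its behavior is easy to see:

```python
from datetime import datetime, timezone, timedelta

def iso_utc_z(ts: datetime) -> str:
    """Converts an aware datetime to UTC and renders it with a trailing Z."""
    return ts.astimezone(timezone.utc).isoformat().replace("+00:00", "Z")

# A CET timestamp (UTC+1) comes out shifted to UTC with a Z suffix.
cet = timezone(timedelta(hours=1))
stamp = iso_utc_z(datetime(2026, 4, 5, 13, 0, tzinfo=cet))  # → "2026-04-05T12:00:00Z"
```

Note that `isoformat()` on a UTC datetime emits `+00:00`, which is why the explicit `replace` is needed to get the `Z` form.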
def _publish_client_command(client_uuid: str, action: str, payload: dict) -> None:
    import paho.mqtt.client as mqtt

    broker_host = os.getenv("MQTT_BROKER_HOST", "mqtt")
    broker_port = int(os.getenv("MQTT_BROKER_PORT", 1883))
    username = os.getenv("MQTT_USER")
    password = os.getenv("MQTT_PASSWORD")

    mqtt_client = mqtt.Client()
    if username and password:
        mqtt_client.username_pw_set(username, password)

    mqtt_client.connect(broker_host, broker_port)

    # Primary topic for contract-based command handling.
    command_topic = COMMAND_TOPIC_TEMPLATE.format(uuid=client_uuid)
    result = mqtt_client.publish(command_topic, json.dumps(payload), qos=1, retain=False)
    result.wait_for_publish(timeout=5.0)

    # Transitional compatibility for clients that still consume singular topic naming.
    compat_topic = COMMAND_TOPIC_COMPAT_TEMPLATE.format(uuid=client_uuid)
    compat_result = mqtt_client.publish(compat_topic, json.dumps(payload), qos=1, retain=False)
    compat_result.wait_for_publish(timeout=5.0)

    # Transitional compatibility for existing restart-only clients.
    if action == "restart":
        legacy_topic = LEGACY_RESTART_TOPIC_TEMPLATE.format(uuid=client_uuid)
        legacy_payload = {"action": "restart"}
        legacy_result = mqtt_client.publish(legacy_topic, json.dumps(legacy_payload), qos=1, retain=False)
        legacy_result.wait_for_publish(timeout=5.0)

    mqtt_client.disconnect()
def _issue_client_command(client_uuid: str, action: str):
|
||||
if action not in ALLOWED_COMMAND_ACTIONS:
|
||||
return jsonify({"error": f"Unsupported action '{action}'"}), 400
|
||||
|
||||
command_action = API_ACTION_TO_COMMAND_ACTION[action]
|
||||
|
||||
data = request.get_json(silent=True) or {}
|
||||
reason = str(data.get("reason", "")).strip() or None
|
||||
requested_by = flask_session.get("user_id")
|
||||
|
||||
now_utc = datetime.now(timezone.utc)
|
||||
expires_at = now_utc + timedelta(seconds=COMMAND_EXPIRY_SECONDS)
|
||||
command_id = str(uuid_lib.uuid4())
|
||||
|
||||
db = Session()
|
||||
try:
|
||||
client = db.query(Client).filter_by(uuid=client_uuid).first()
|
||||
if not client:
|
||||
return jsonify({"error": "Client nicht gefunden"}), 404
|
||||
|
||||
# Safety lockout: avoid rapid repeated reboot loops per client.
|
||||
if command_action in ("reboot_host", "restart_app"):
|
||||
    window_start = now_utc - timedelta(minutes=REBOOT_LOCKOUT_WINDOW_MINUTES)
    recent_reboots = (
        db.query(ClientCommand)
        .filter(ClientCommand.client_uuid == client_uuid)
        .filter(ClientCommand.action.in_(["reboot_host", "restart_app"]))
        .filter(ClientCommand.created_at >= window_start)
        .count()
    )
    if recent_reboots >= REBOOT_LOCKOUT_THRESHOLD:
        blocked = ClientCommand(
            command_id=command_id,
            client_uuid=client_uuid,
            action=command_action,
            status="blocked_safety",
            reason=reason,
            requested_by=requested_by,
            issued_at=now_utc,
            expires_at=expires_at,
            failed_at=now_utc,
            error_code="lockout_threshold",
            error_message="Reboot lockout active for this client",
        )
        db.add(blocked)
        db.commit()
        return jsonify({
            "success": False,
            "message": "Neustart voruebergehend blockiert (Sicherheits-Lockout)",
            "command": _command_to_dict(blocked),
        }), 429

    command = ClientCommand(
        command_id=command_id,
        client_uuid=client_uuid,
        action=command_action,
        status="queued",
        reason=reason,
        requested_by=requested_by,
        issued_at=now_utc,
        expires_at=expires_at,
    )
    db.add(command)
    db.commit()

    command.status = "publish_in_progress"
    db.commit()

    payload = {
        "schema_version": COMMAND_SCHEMA_VERSION,
        "command_id": command.command_id,
        "client_uuid": command.client_uuid,
        "action": command.action,
        "issued_at": _iso_utc_z(command.issued_at),
        "expires_at": _iso_utc_z(command.expires_at),
        "requested_by": command.requested_by,
        "reason": command.reason,
    }

    try:
        _publish_client_command(client_uuid=client_uuid, action=action, payload=payload)
        # ACK can arrive very quickly (including terminal failure) while publish is in-flight.
        # Refresh to avoid regressing a newer listener-updated state back to "published".
        db.refresh(command)
        command.published_at = command.published_at or datetime.now(timezone.utc)
        if command.status in {"queued", "publish_in_progress"}:
            command.status = "published"
        db.commit()
        return jsonify({
            "success": True,
            "message": f"Command published for client {client_uuid}",
            "command": _command_to_dict(command),
        }), 202
    except Exception as publish_error:
        command.status = "failed"
        command.failed_at = datetime.now(timezone.utc)
        command.error_code = "mqtt_publish_failed"
        command.error_message = str(publish_error)
        db.commit()
        return jsonify({
            "success": False,
            "error": f"Failed to publish command: {publish_error}",
            "command": _command_to_dict(command),
        }), 500
    finally:
        db.close()

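The handler above moves each ClientCommand through a short lifecycle: queued, publish_in_progress, published, failed, with blocked_safety assigned directly when the lockout trips. A minimal sketch of the implied transitions; the table and the "acked" state name (suggested by the ACK comment in the handler) are illustrative assumptions, not code from this repository:

```python
# Lifecycle implied by _issue_client_command in this diff.  "acked" is an
# assumed name for the listener-set success state; the server enforces
# these rules via the status checks shown above, not via a table.
TRANSITIONS = {
    "queued": {"publish_in_progress"},
    "publish_in_progress": {"published", "failed"},
    "published": {"acked", "failed"},
}

# States that never transition again; "blocked_safety" is created terminal.
TERMINAL = {"blocked_safety", "failed", "acked"}

def can_transition(current, nxt):
    """True if the sketched lifecycle allows moving from current to nxt."""
    return nxt in TRANSITIONS.get(current, set())
```

This is why the handler refreshes the row and only promotes to "published" from {"queued", "publish_in_progress"}: a fast ACK may already have moved the command to a terminal state.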
def _normalize_screenshot_type(raw_type):
    if raw_type is None:
@@ -280,45 +464,148 @@ def get_clients_with_alive_status():
            "ip": c.ip,
            "last_alive": c.last_alive.isoformat() if c.last_alive else None,
            "is_active": c.is_active,
            "is_alive": bool(c.last_alive and c.is_active),
            "is_alive": is_client_alive(c.last_alive, c.is_active),
        })
    session.close()
    return jsonify(result)


@clients_bp.route("/crashed", methods=["GET"])
@admin_or_higher
def get_crashed_clients():
    """Returns clients that are crashed (process_status=crashed) or heartbeat-stale."""
    session = Session()
    try:
        from datetime import timedelta
        grace = get_grace_period()
        from datetime import datetime, timezone
        stale_cutoff = datetime.now(timezone.utc) - timedelta(seconds=grace)
        clients = (
            session.query(Client)
            .filter(Client.is_active == True)
            .all()
        )
        result = []
        for c in clients:
            alive = is_client_alive(c.last_alive, c.is_active)
            crashed = c.process_status == ProcessStatus.crashed
            if not alive or crashed:
                result.append({
                    "uuid": c.uuid,
                    "description": c.description,
                    "hostname": c.hostname,
                    "ip": c.ip,
                    "group_id": c.group_id,
                    "is_alive": alive,
                    "process_status": c.process_status.value if c.process_status else None,
                    "screen_health_status": c.screen_health_status.value if c.screen_health_status else None,
                    "last_alive": c.last_alive.isoformat() if c.last_alive else None,
                    "crash_reason": "process_crashed" if crashed else "heartbeat_stale",
                })
        return jsonify({
            "crashed_count": len(result),
            "grace_period_seconds": grace,
            "clients": result,
        })
    finally:
        session.close()

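`is_client_alive` is imported from elsewhere in the repository and not shown in this diff; only its call signature appears here. Based on the grace-period handling in the endpoint above, a plausible sketch of its semantics (the body is a guess, and the default grace value mirrors HEARTBEAT_GRACE_PERIOD_PROD from .env.example):

```python
from datetime import datetime, timedelta, timezone

HEARTBEAT_GRACE_SECONDS = 170  # assumed default, see HEARTBEAT_GRACE_PERIOD_PROD

def is_client_alive_sketch(last_alive, is_active, grace_seconds=HEARTBEAT_GRACE_SECONDS):
    """Alive = active, has a heartbeat, and the heartbeat is within the grace window."""
    if not is_active or last_alive is None:
        return False
    cutoff = datetime.now(timezone.utc) - timedelta(seconds=grace_seconds)
    return last_alive >= cutoff
```

Under these semantics, "heartbeat_stale" in the endpoint above simply means the last heartbeat fell outside the grace window while the client was still marked active.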
@clients_bp.route("/service_failed", methods=["GET"])
@admin_or_higher
def get_service_failed_clients():
    """Returns clients that have a service_failed_at set (systemd gave up restarting)."""
    session = Session()
    try:
        clients = (
            session.query(Client)
            .filter(Client.service_failed_at.isnot(None))
            .order_by(Client.service_failed_at.desc())
            .all()
        )
        result = [
            {
                "uuid": c.uuid,
                "description": c.description,
                "hostname": c.hostname,
                "ip": c.ip,
                "group_id": c.group_id,
                "service_failed_at": c.service_failed_at.isoformat() if c.service_failed_at else None,
                "service_failed_unit": c.service_failed_unit,
                "is_alive": is_client_alive(c.last_alive, c.is_active),
                "last_alive": c.last_alive.isoformat() if c.last_alive else None,
            }
            for c in clients
        ]
        return jsonify({"service_failed_count": len(result), "clients": result})
    finally:
        session.close()


@clients_bp.route("/<client_uuid>/clear_service_failed", methods=["POST"])
@admin_or_higher
def clear_service_failed(client_uuid):
    """Clears the service_failed flag for a client and deletes the retained MQTT message."""
    import paho.mqtt.client as mqtt_lib

    session = Session()
    try:
        c = session.query(Client).filter_by(uuid=client_uuid).first()
        if not c:
            return jsonify({"error": "Client nicht gefunden"}), 404
        if c.service_failed_at is None:
            return jsonify({"success": True, "message": "Kein service_failed Flag gesetzt."}), 200

        c.service_failed_at = None
        c.service_failed_unit = None
        session.commit()
    finally:
        session.close()

    # Clear the retained MQTT message (publish empty payload, retained=True)
    try:
        broker_host = os.getenv("MQTT_BROKER_HOST", "mqtt")
        broker_port = int(os.getenv("MQTT_BROKER_PORT", 1883))
        username = os.getenv("MQTT_USER")
        password = os.getenv("MQTT_PASSWORD")
        mc = mqtt_lib.Client()
        if username and password:
            mc.username_pw_set(username, password)
        mc.connect(broker_host, broker_port)
        topic = f"infoscreen/{client_uuid}/service_failed"
        mc.publish(topic, payload=None, qos=1, retain=True)
        mc.disconnect()
    except Exception as e:
        # Log but don't fail — DB is already cleared
        import logging
        logging.warning(f"Could not clear retained service_failed MQTT message for {client_uuid}: {e}")

    return jsonify({"success": True, "message": "service_failed Flag gelöscht."})


@clients_bp.route("/<uuid>/restart", methods=["POST"])
@admin_or_higher
def restart_client(uuid):
    """
    Route to restart a specific client by UUID.
    Sends an MQTT message to the broker to trigger the restart.
    """
    import paho.mqtt.client as mqtt
    import json
    return _issue_client_command(client_uuid=uuid, action="restart")

    # MQTT broker configuration
    MQTT_BROKER = "mqtt"
    MQTT_PORT = 1883
    MQTT_TOPIC = f"clients/{uuid}/restart"

    # Connect to the database to check if the client exists
    session = Session()
    client = session.query(Client).filter_by(uuid=uuid).first()
    if not client:
        session.close()
        return jsonify({"error": "Client nicht gefunden"}), 404
    session.close()

@clients_bp.route("/<uuid>/shutdown", methods=["POST"])
@admin_or_higher
def shutdown_client(uuid):
    return _issue_client_command(client_uuid=uuid, action="shutdown")

    # Send MQTT message

@clients_bp.route("/commands/<command_id>", methods=["GET"])
@admin_or_higher
def get_client_command_status(command_id):
    db = Session()
    try:
        mqtt_client = mqtt.Client()
        mqtt_client.connect(MQTT_BROKER, MQTT_PORT)
        payload = {"action": "restart"}
        mqtt_client.publish(MQTT_TOPIC, json.dumps(payload))
        mqtt_client.disconnect()
        return jsonify({"success": True, "message": f"Restart signal sent to client {uuid}"}), 200
    except Exception as e:
        return jsonify({"error": f"Failed to send MQTT message: {str(e)}"}), 500
        command = db.query(ClientCommand).filter_by(command_id=command_id).first()
        if not command:
            return jsonify({"error": "Command nicht gefunden"}), 404
        return jsonify(_command_to_dict(command)), 200
    finally:
        db.close()

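A caller that issues a restart or shutdown through the 202 response above can poll `GET /commands/<command_id>` until the status settles. A small client-side helper sketch, assuming only the status strings visible in this diff (queued, publish_in_progress, published, failed, blocked_safety); any further server-side states would need to be added:

```python
# Status strings taken from the command handler in this diff.
PENDING_STATUSES = {"queued", "publish_in_progress", "published"}
TERMINAL_STATUSES = {"failed", "blocked_safety"}

def should_keep_polling(command):
    """Decide from a /commands/<command_id> response body whether to poll again."""
    status = command.get("status")
    if status in TERMINAL_STATUSES:
        return False
    if status in PENDING_STATUSES:
        return True
    # Unknown status (e.g. a listener-set ACK state): stop and surface it.
    return False
```

Note that "published" only means the MQTT publish succeeded; it says nothing about whether the client executed the command, which is why polling continues past it.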
@clients_bp.route("/<uuid>/screenshot", methods=["POST"])