feat(monitoring): add server-side client logging and health infrastructure

- add Alembic migration c1d2e3f4g5h6 for client monitoring:
  - create client_logs table with FK to clients.uuid and performance indexes
  - extend clients with process/health tracking fields
- extend data model with ClientLog, LogLevel, ProcessStatus, and ScreenHealthStatus
- enhance listener MQTT handling:
  - subscribe to logs and health topics
  - persist client logs from infoscreen/{uuid}/logs/{level}
  - process health payloads and enrich heartbeat-derived client state
- add monitoring API blueprint server/routes/client_logs.py:
  - GET /api/client-logs/<uuid>/logs
  - GET /api/client-logs/summary
  - GET /api/client-logs/recent-errors
  - GET /api/client-logs/test
- register client_logs blueprint in server/wsgi.py
- align compose/dev runtime for listener live-code execution
- add client-side implementation docs:
  - CLIENT_MONITORING_SPECIFICATION.md
  - CLIENT_MONITORING_IMPLEMENTATION_GUIDE.md
- update TECH-CHANGELOG.md and copilot-instructions.md:
  - document monitoring changes
  - codify post-release technical-notes/no-version-bump convention
2026-03-10 07:33:38 +00:00
parent 7746e26385
commit 3107d0f671
10 changed files with 2307 additions and 6 deletions
--- a/CLIENT_MONITORING_SPECIFICATION.md
+++ b/CLIENT_MONITORING_SPECIFICATION.md
@@ -0,0 +1,972 @@
+# Client-Side Monitoring Specification
+
+**Version:** 1.0  
+**Date:** 2026-03-10  
+**For:** Infoscreen Client Implementation  
+**Server Endpoint:** `192.168.43.201:8000` (or your production server)  
+**MQTT Broker:** `192.168.43.201:1883` (or your production MQTT broker)
+
+---
+
+## 1. Overview
+
+Each infoscreen client must implement health monitoring and logging capabilities to report status to the central server via MQTT.
+
+### 1.1 Goals
+- **Detect failures:** Process crashes, frozen screens, content mismatches
+- **Provide visibility:** Real-time health status visible on server dashboard
+- **Enable remote diagnosis:** Centralized log storage for debugging
+- **Auto-recovery:** Attempt automatic restart on failure
+
+### 1.2 Architecture
+```
+┌─────────────────────────────────────────┐
+│         Infoscreen Client               │
+│                                         │
+│  ┌──────────────┐    ┌──────────────┐  │
+│  │ Media Player │    │   Watchdog   │  │
+│  │ (VLC/Chrome) │◄───│   Monitor    │  │
+│  └──────────────┘    └──────┬───────┘  │
+│                              │          │
+│  ┌──────────────┐            │          │
+│  │  Event Mgr   │            │          │
+│  │  (receives   │            │          │
+│  │   schedule)  │◄───────────┘          │
+│  └──────┬───────┘                       │
+│         │                               │
+│  ┌──────▼───────────────────────┐      │
+│  │     MQTT Client               │      │
+│  │  - Heartbeat (every 60s)      │      │
+│  │  - Logs (error/warn/info)     │      │
+│  │  - Health metrics (every 5s)  │      │
+│  └──────┬────────────────────────┘      │
+└─────────┼──────────────────────────────┘
+          │
+          │ MQTT over TCP
+          ▼
+    ┌─────────────┐
+    │ MQTT Broker │
+    │  (server)   │
+    └─────────────┘
+```
+
+---
+
+## 2. MQTT Protocol Specification
+
+### 2.1 Connection Parameters
+```
+Broker: 192.168.43.201 (or DNS hostname)
+Port: 1883 (standard MQTT)
+Protocol: MQTT v3.1.1
+Client ID: "infoscreen-{client_uuid}"
+Clean Session: false (retain subscriptions)
+Keep Alive: 60 seconds
+Username/Password: (if configured on broker)
+```
+
+### 2.2 QoS Levels
+- **Heartbeat:** QoS 0 (fire and forget; a lost beat is superseded by the next one)
+- **Logs (ERROR/WARN):** QoS 1 (at least once delivery, important)
+- **Logs (INFO):** QoS 0 (optional, high volume)
+- **Health metrics:** QoS 0 (frequent, latest value matters)
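The QoS policy above can be kept in a single lookup so that every publish call uses the same rule. This is a minimal sketch; `QOS_BY_KIND` and `qos_for` are hypothetical names, not part of any existing client code.

```python
# Hypothetical helper: maps the topic suffix to the QoS level listed in section 2.2.
QOS_BY_KIND = {
    "heartbeat": 0,
    "logs/error": 1,
    "logs/warn": 1,
    "logs/info": 0,
    "health": 0,
}


def qos_for(topic: str) -> int:
    """Return the QoS for a topic of the form infoscreen/{uuid}/{kind}."""
    parts = topic.split("/", 2)  # ["infoscreen", uuid, kind]
    if len(parts) != 3 or parts[0] != "infoscreen":
        raise ValueError(f"unexpected topic: {topic!r}")
    if parts[2] not in QOS_BY_KIND:
        raise ValueError(f"unknown message kind: {parts[2]!r}")
    return QOS_BY_KIND[parts[2]]
```

The returned value would be passed as the `qos` argument of whatever publish call the client uses.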
+
+---
+
+## 3. Topic Structure & Payload Formats
+
+### 3.1 Log Messages
+
+#### Topic Pattern:
+```
+infoscreen/{client_uuid}/logs/{level}
+```
+
+Where `{level}` is one of: `error`, `warn`, `info`
+
+#### Payload Format (JSON):
+```json
+{
+  "timestamp": "2026-03-10T07:30:00Z",
+  "message": "Human-readable error description",
+  "context": {
+    "event_id": 42,
+    "process": "vlc",
+    "error_code": "NETWORK_TIMEOUT",
+    "additional_key": "any relevant data"
+  }
+}
+```
+
+#### Field Specifications:
+| Field | Type | Required | Description |
+|-------|------|----------|-------------|
+| `timestamp` | string (ISO 8601 UTC) | Yes | When the event occurred. Use `YYYY-MM-DDTHH:MM:SSZ` format |
+| `message` | string | Yes | Human-readable description of the event (max 1000 chars) |
+| `context` | object | No | Additional structured data (will be stored as JSON) |
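A minimal sketch of building a log payload that satisfies the field table above. `build_log_payload` and `MAX_MESSAGE_CHARS` are illustrative names, and truncating over-long messages (rather than rejecting them) is an assumption.

```python
import json
from datetime import datetime, timezone

MAX_MESSAGE_CHARS = 1000  # limit taken from the field table


def build_log_payload(message, context=None, now=None):
    """Serialize one log entry into the JSON payload format of section 3.1."""
    now = now or datetime.now(timezone.utc)
    payload = {
        # strftime yields the "Z" suffix; isoformat() would emit "+00:00" instead.
        "timestamp": now.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "message": message[:MAX_MESSAGE_CHARS],
    }
    if context:  # optional field, left out when empty
        payload["context"] = context
    return json.dumps(payload)
```

The optional `now` parameter exists only so the function can be tested with a fixed time.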
+
+#### Example Topics:
+```
+infoscreen/9b8d1856-ff34-4864-a726-12de072d0f77/logs/error
+infoscreen/9b8d1856-ff34-4864-a726-12de072d0f77/logs/warn
+infoscreen/9b8d1856-ff34-4864-a726-12de072d0f77/logs/info
+```
+
+#### When to Send Logs:
+
+**ERROR (Always send):**
+- Process crashed (VLC/Chromium/PDF viewer terminated unexpectedly)
+- Content failed to load (404, network timeout, corrupt file)
+- Hardware failure detected (display off, audio device missing)
+- Exception caught in main event loop
+- Maximum restart attempts exceeded
+
+**WARN (Always send):**
+- Process restarted automatically (after crash)
+- High resource usage (CPU >80%, RAM >90%)
+- Slow performance (frame drops, lag)
+- Non-critical failures (screenshot capture failed, cache full)
+- Fallback content displayed (primary source unavailable)
+
+**INFO (Send in development, optional in production):**
+- Process started successfully
+- Event transition (switched from video to presentation)
+- Content loaded successfully
+- Watchdog service started/stopped
+
+---
+
+### 3.2 Health Metrics
+
+#### Topic Pattern:
+```
+infoscreen/{client_uuid}/health
+```
+
+#### Payload Format (JSON):
+```json
+{
+  "timestamp": "2026-03-10T07:30:00Z",
+  "expected_state": {
+    "event_id": 42,
+    "event_type": "video",
+    "media_file": "presentation.mp4",
+    "started_at": "2026-03-10T07:15:00Z"
+  },
+  "actual_state": {
+    "process": "vlc",
+    "pid": 1234,
+    "status": "running",
+    "uptime_seconds": 900,
+    "position": 45.3,
+    "duration": 180.0
+  },
+  "health_metrics": {
+    "screen_on": true,
+    "last_frame_update": "2026-03-10T07:29:58Z",
+    "frames_dropped": 2,
+    "network_errors": 0,
+    "cpu_percent": 15.3,
+    "memory_mb": 234
+  }
+}
+```
+
+#### Field Specifications:
+
+**expected_state:**
+| Field | Type | Required | Description |
+|-------|------|----------|-------------|
+| `event_id` | integer | Yes | Current event ID from scheduler |
+| `event_type` | string | Yes | `presentation`, `video`, `website`, `webuntis`, `message` |
+| `media_file` | string | No | Filename or URL of current content |
+| `started_at` | string (ISO 8601) | Yes | When this event started playing |
+
+**actual_state:**
+| Field | Type | Required | Description |
+|-------|------|----------|-------------|
+| `process` | string | Yes | `vlc`, `chromium`, `pdf_viewer`, `none` |
+| `pid` | integer | No | Process ID (if running) |
+| `status` | string | Yes | `running`, `crashed`, `starting`, `stopped` |
+| `uptime_seconds` | integer | No | How long process has been running |
+| `position` | float | No | Current playback position (seconds, for video/audio) |
+| `duration` | float | No | Total content duration (seconds) |
+
+**health_metrics:**
+| Field | Type | Required | Description |
+|-------|------|----------|-------------|
+| `screen_on` | boolean | Yes | Is display powered on? |
+| `last_frame_update` | string (ISO 8601) | No | Last time screen content changed |
+| `frames_dropped` | integer | No | Video frames dropped (performance indicator) |
+| `network_errors` | integer | No | Count of network errors in last interval |
+| `cpu_percent` | float | No | CPU usage (0-100) |
+| `memory_mb` | integer | No | RAM usage in megabytes |
+
+#### Sending Frequency:
+- **Normal operation:** Every 5 seconds
+- **During startup/transition:** Every 1 second
+- **After error:** Immediately + every 2 seconds until recovered
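The three sending frequencies can be derived from the process status. A small sketch, assuming the status strings from the `actual_state` table; `health_interval_seconds` and its `recovering` flag are hypothetical.

```python
def health_interval_seconds(status: str, recovering: bool = False) -> float:
    """Pick the health publish interval for the current process status."""
    if status == "crashed" or recovering:
        return 2.0  # after an error: every 2 seconds until recovered
    if status == "starting":
        return 1.0  # startup or event transition
    return 5.0      # normal operation
```

The immediate publish after an error is not covered here; the caller would send one health message right away and then switch to the 2-second interval.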
+
+---
+
+### 3.3 Enhanced Heartbeat
+
+The existing heartbeat topic should be enhanced to include process status.
+
+#### Topic Pattern:
+```
+infoscreen/{client_uuid}/heartbeat
+```
+
+#### Enhanced Payload Format (JSON):
+```json
+{
+  "uuid": "9b8d1856-ff34-4864-a726-12de072d0f77",
+  "timestamp": "2026-03-10T07:30:00Z",
+  "current_process": "vlc",
+  "process_pid": 1234,
+  "process_status": "running",
+  "current_event_id": 42
+}
+```
+
+#### New Fields (add to existing heartbeat):
+| Field | Type | Required | Description |
+|-------|------|----------|-------------|
+| `current_process` | string | No | Name of active media player process |
+| `process_pid` | integer | No | Process ID |
+| `process_status` | string | No | `running`, `crashed`, `starting`, `stopped` |
+| `current_event_id` | integer | No | Event ID currently being displayed |
+
+#### Sending Frequency:
+- Keep existing: **Every 60 seconds**
+- Include new fields if available
+
+---
+
+## 4. Process Monitoring Requirements
+
+### 4.1 Processes to Monitor
+
+| Media Type | Process Name | How to Detect |
+|------------|--------------|---------------|
+| Video | `vlc` | `ps aux \| grep vlc` or `pgrep vlc` |
+| Website/WebUntis | `chromium` or `chromium-browser` | `pgrep chromium` |
+| PDF Presentation | `evince`, `okular`, or custom viewer | `pgrep {viewer_name}` |
+
+### 4.2 Monitoring Checks (Every 5 seconds)
+
+#### Check 1: Process Alive
+```
+Goal: Verify expected process is running
+Method: 
+  - Get list of running processes (psutil or `ps`)
+  - Check if expected process name exists
+  - Match PID if known
+Result:
+  - If missing → status = "crashed"
+  - If found → status = "running"
+Action on crash:
+  - Send ERROR log immediately
+  - Attempt restart (max 3 attempts)
+  - Send WARN log on each restart
+  - If max restarts exceeded → send ERROR log, display fallback
+```
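When the PID of the launched player is known, the alive check can be done with the standard library alone. This sketch assumes a POSIX system and is an alternative to the psutil or `ps` lookup named above; `pid_alive` and `process_status` are hypothetical helpers.

```python
import os


def pid_alive(pid: int) -> bool:
    """Return True if a process with this PID exists (POSIX only)."""
    try:
        os.kill(pid, 0)  # signal 0 only checks for existence, nothing is delivered
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def process_status(expected_pid):
    """Map a known PID to the status strings used in the health payload."""
    if expected_pid is None:
        return "stopped"
    return "running" if pid_alive(expected_pid) else "crashed"
```

A child that has exited but has not been waited on still counts as existing, so the launcher should reap its children for this check to be reliable.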
+
+#### Check 2: Process Responsive
+```
+Goal: Detect frozen processes
+Method:
+  - For VLC: Query HTTP interface (status.json)
+  - For Chromium: Use DevTools Protocol (CDP)
+  - For custom viewers: Check last screen update time
+Result:
+  - If same frame >30 seconds → likely frozen
+  - If playback position not advancing → frozen
+Action on freeze:
+  - Send WARN log
+  - Force refresh (reload page, seek video, next slide)
+  - If refresh fails → restart process
+```
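The "playback position not advancing" rule can be sketched as a small detector that remembers when the position last changed. `FreezeDetector` is a hypothetical class; the injectable clock is there only to make it testable.

```python
import time


class FreezeDetector:
    """Flags a player as frozen when its playback position stops advancing."""

    def __init__(self, threshold_seconds=30.0, clock=time.monotonic):
        self.threshold = threshold_seconds
        self.clock = clock
        self.last_position = None
        self.last_change = clock()

    def update(self, position):
        """Record the latest position; return True once it has been stuck too long."""
        now = self.clock()
        if position != self.last_position:
            self.last_position = position
            self.last_change = now
            return False
        return now - self.last_change > self.threshold
```

A deliberately paused player would also look frozen to this check, so the caller should consult the player state before acting on it.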
+
+#### Check 3: Content Match
+```
+Goal: Verify correct content is displayed
+Method:
+  - Compare expected event_id with actual media/URL
+  - Check scheduled time window (is event still active?)
+Result:
+  - Mismatch → content error
+Action:
+  - Send WARN log
+  - Reload correct event from scheduler
+```
+
+---
+
+## 5. Process Control Interface Requirements
+
+### 5.1 VLC Control
+
+**Requirement:** Enable VLC HTTP interface for monitoring
+
+**Launch Command:**
+```bash
+vlc --intf http --http-host 127.0.0.1 --http-port 8080 --http-password "vlc_password" \
+    --fullscreen --loop /path/to/video.mp4
+```
+
+**Status Query:**
+```bash
+curl http://127.0.0.1:8080/requests/status.json --user ":vlc_password"
+```
+
+**Response Fields to Monitor:**
+```json
+{
+  "state": "playing",     // "playing", "paused", "stopped"
+  "position": 0.25,       // 0.0-1.0 (25% through)
+  "time": 45,             // seconds into playback
+  "length": 180,          // total duration in seconds
+  "volume": 256           // 0-512
+}
+```
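The VLC response can be mapped onto the `actual_state` block of the health payload. A sketch under the assumption that the response contains the `time` and `length` fields shown above; `actual_state_from_vlc` is a hypothetical name.

```python
import json


def actual_state_from_vlc(status_json: str, pid=None) -> dict:
    """Convert a VLC status.json body into the health payload's actual_state."""
    status = json.loads(status_json)
    state = {
        "process": "vlc",
        "status": "running",  # VLC answered the HTTP query, so the process is alive
        # Use "time" (seconds), not VLC's "position", which is a 0.0-1.0 fraction.
        "position": float(status["time"]),
        "duration": float(status["length"]),
    }
    if pid is not None:
        state["pid"] = pid
    return state
```

The annotated response above is not strict JSON because of its `//` comments; a real response carries no comments and parses as-is.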
+
+---
+
+### 5.2 Chromium Control
+
+**Requirement:** Enable Chrome DevTools Protocol (CDP)
+
+**Launch Command:**
+```bash
+chromium --remote-debugging-port=9222 --kiosk --app=https://example.com
+```
+
+**Status Query:**
+```bash
+curl http://127.0.0.1:9222/json
+```
+
+**Response Fields to Monitor:**
+```json
+[
+  {
+    "url": "https://example.com",
+    "title": "Page Title",
+    "type": "page"
+  }
+]
+```
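The target list returned by the `/json` endpoint can be checked against the expected URL. A minimal sketch; `find_page` is a hypothetical helper, and matching by URL prefix is an assumption (redirects may need a looser rule).

```python
import json


def find_page(targets_json: str, expected_url: str):
    """Return the first target of type "page" whose URL starts with expected_url."""
    for target in json.loads(targets_json):
        if target.get("type") == "page" and target.get("url", "").startswith(expected_url):
            return target
    return None  # no matching tab: wrong content or page not loaded
```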
+
+**Advanced:** Use CDP WebSocket for events (page load, navigation, errors)
+
+---
+
+### 5.3 PDF Viewer (Custom or Standard)
+
+**Option A: Standard Viewer (e.g., Evince)**
+- No built-in API
+- Monitor via process check + screenshot comparison
+
+**Option B: Custom Python Viewer**
+- Implement REST API for status queries
+- Track: current page, total pages, last transition time
+
+---
+
+## 6. Watchdog Service Architecture
+
+### 6.1 Service Components
+
+**Component 1: Process Monitor Thread**
+```
+Responsibilities:
+  - Check process alive every 5 seconds
+  - Detect crashes and frozen processes
+  - Attempt automatic restart
+  - Send health metrics via MQTT
+
+State Machine:
+  IDLE → STARTING → RUNNING → (if crash) → RESTARTING → RUNNING
+                             → (if max restarts) → FAILED
+```
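The state machine above can be written as an explicit transition table so that illegal jumps are caught early. This is a sketch of one reading of the diagram; `State`, `ALLOWED`, and `transition` are hypothetical names, and the `RESTARTING` to `FAILED` edge is an added assumption for a restart that never comes back.

```python
from enum import Enum


class State(Enum):
    IDLE = "idle"
    STARTING = "starting"
    RUNNING = "running"
    RESTARTING = "restarting"
    FAILED = "failed"


ALLOWED = {
    State.IDLE: {State.STARTING},
    State.STARTING: {State.RUNNING},
    State.RUNNING: {State.RESTARTING, State.FAILED},
    State.RESTARTING: {State.RUNNING, State.FAILED},  # -> FAILED is an assumption
    State.FAILED: set(),
}


def transition(current: State, target: State) -> State:
    """Return the new state, or raise if the diagram does not allow the move."""
    if target not in ALLOWED[current]:
        raise ValueError(f"illegal transition {current.name} -> {target.name}")
    return target
```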
+
+**Component 2: MQTT Publisher Thread**
+```
+Responsibilities:
+  - Maintain MQTT connection
+  - Send heartbeat every 60 seconds
+  - Send logs on-demand (queued from other components)
+  - Send health metrics every 5 seconds
+  - Reconnect on connection loss
+```
+
+**Component 3: Event Manager Integration**
+```
+Responsibilities:
+  - Receive event schedule from server
+  - Notify watchdog of expected process/content
+  - Launch media player processes
+  - Handle event transitions
+```
+
+### 6.2 Service Lifecycle
+
+**On Startup:**
+1. Load configuration (client UUID, MQTT broker, etc.)
+2. Connect to MQTT broker
+3. Send INFO log: "Watchdog service started"
+4. Wait for first event from scheduler
+
+**During Operation:**
+1. Monitor loop runs every 5 seconds
+2. Check expected vs actual process state
+3. Send health metrics
+4. Handle failures (log + restart)
+
+**On Shutdown:**
+1. Send INFO log: "Watchdog service stopping"
+2. Gracefully stop monitored processes
+3. Disconnect from MQTT
+4. Exit cleanly
+
+---
+
+## 7. Auto-Recovery Logic
+
+### 7.1 Restart Strategy
+
+**Step 1: Detect Failure**
+```
+Trigger: Process not found in process list
+Action:
+  - Log ERROR: "Process {name} crashed"
+  - Increment restart counter
+  - Check if within retry limit (max 3)
+```
+
+**Step 2: Attempt Restart**
+```
+If restart_attempts < MAX_RESTARTS:
+  - Log WARN: "Attempting restart ({attempt}/{MAX_RESTARTS})"
+  - Kill any zombie processes
+  - Wait 2 seconds (cooldown)
+  - Launch process with same parameters
+  - Wait 5 seconds for startup
+  - Verify process is running
+  - If success: reset restart counter, log INFO
+  - If fail: increment counter, repeat
+```
+
+**Step 3: Permanent Failure**
+```
+If restart_attempts >= MAX_RESTARTS:
+  - Log ERROR: "Max restart attempts exceeded, failing over"
+  - Display fallback content (static image with error message)
+  - Send notification to server (separate alert topic, optional)
+  - Wait for manual intervention or scheduler event change
+```
+
+### 7.2 Restart Cooldown
+
+**Purpose:** Prevent rapid restart loops that waste resources
+
+**Implementation:**
+```
+After each restart attempt:
+  - Wait 2 seconds before next restart
+  - After 3 failures: wait 30 seconds before trying again
+  - Reset counter on successful run >5 minutes
+```
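The cooldown rules can be collected in one small policy object. A minimal sketch with the values from this section as defaults; `RestartPolicy` and its method names are hypothetical.

```python
class RestartPolicy:
    """Tracks restart attempts and decides how long to wait before the next one."""

    def __init__(self, max_attempts=3, cooldown=2.0, long_cooldown=30.0, stable_after=300.0):
        self.max_attempts = max_attempts
        self.cooldown = cooldown
        self.long_cooldown = long_cooldown
        self.stable_after = stable_after
        self.attempts = 0

    def next_delay(self) -> float:
        """Count one more attempt and return the wait (seconds) before it."""
        self.attempts += 1
        if self.attempts > self.max_attempts:
            return self.long_cooldown  # after 3 failures: back off for 30 seconds
        return self.cooldown

    def record_uptime(self, uptime_seconds: float) -> None:
        """Reset the counter once the process has run for more than 5 minutes."""
        if uptime_seconds > self.stable_after:
            self.attempts = 0
```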
+
+---
+
+## 8. Resource Monitoring
+
+### 8.1 System Metrics to Track
+
+**CPU Usage:**
+```
+Method: Read /proc/stat or use psutil.cpu_percent()
+Frequency: Every 5 seconds
+Threshold: Warn if >80% for >60 seconds
+```
+
+**Memory Usage:**
+```
+Method: Read /proc/meminfo or use psutil.virtual_memory()
+Frequency: Every 5 seconds
+Threshold: Warn if >90% for >30 seconds
+```
+
+**Display Status:**
+```
+Method: Check DPMS state or xset query
+Frequency: Every 30 seconds
+Threshold: Error if display off (unexpected)
+```
+
+**Network Connectivity:**
+```
+Method: Ping server or check MQTT connection
+Frequency: Every 60 seconds
+Threshold: Warn if no server connectivity
+```
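The CPU and memory thresholds above fire only when the value stays high for a sustained period. A sketch of that rule; `SustainedThreshold` is a hypothetical class and the caller supplies both the sample and the current time.

```python
class SustainedThreshold:
    """Reports True only after a value has stayed above a limit for hold_seconds."""

    def __init__(self, limit: float, hold_seconds: float):
        self.limit = limit
        self.hold = hold_seconds
        self.above_since = None  # time of the first sample above the limit

    def update(self, value: float, now: float) -> bool:
        if value <= self.limit:
            self.above_since = None  # dropped below the limit: start over
            return False
        if self.above_since is None:
            self.above_since = now
        return now - self.above_since > self.hold
```

For the CPU rule this would be `SustainedThreshold(80.0, 60.0)`, and for memory `SustainedThreshold(90.0, 30.0)`.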
+
+---
+
+## 9. Development vs Production Mode
+
+### 9.1 Development Mode
+
+**Enable via:** Environment variable `DEBUG=true` or `ENV=development`
+
+**Behavior:**
+- Send INFO level logs
+- More verbose logging to console
+- Shorter monitoring intervals (faster feedback)
+- Screenshot capture every 30 seconds
+- No rate limiting on logs
+
+### 9.2 Production Mode
+
+**Enable via:** `ENV=production`
+
+**Behavior:**
+- Send only ERROR and WARN logs
+- Minimal console output
+- Standard monitoring intervals
+- Screenshot capture every 60 seconds
+- Rate limiting: max 10 logs per minute per level
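The production rate limit (10 logs per minute per level) can be sketched as a sliding window. `LogRateLimiter` is a hypothetical name, and using a sliding window rather than fixed one-minute buckets is an assumption.

```python
from collections import defaultdict, deque


class LogRateLimiter:
    """Allows at most max_per_window log messages per level within the window."""

    def __init__(self, max_per_window=10, window_seconds=60.0):
        self.max_per_window = max_per_window
        self.window = window_seconds
        self.sent = defaultdict(deque)  # level -> timestamps of recent sends

    def allow(self, level: str, now: float) -> bool:
        recent = self.sent[level]
        while recent and now - recent[0] >= self.window:
            recent.popleft()  # forget sends that have left the window
        if len(recent) >= self.max_per_window:
            return False
        recent.append(now)
        return True
```

Messages that are refused here should still go to the local log file so that nothing is lost for later diagnosis.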
+
+---
+
+## 10. Configuration File Format
+
+### 10.1 Recommended Config: JSON
+
+**File:** `/etc/infoscreen/config.json` or `~/.config/infoscreen/config.json`
+
+```json
+{
+  "client": {
+    "uuid": "9b8d1856-ff34-4864-a726-12de072d0f77",
+    "hostname": "infoscreen-room-101"
+  },
+  "mqtt": {
+    "broker": "192.168.43.201",
+    "port": 1883,
+    "username": "",
+    "password": "",
+    "keepalive": 60
+  },
+  "monitoring": {
+    "enabled": true,
+    "health_interval_seconds": 5,
+    "heartbeat_interval_seconds": 60,
+    "max_restart_attempts": 3,
+    "restart_cooldown_seconds": 2
+  },
+  "logging": {
+    "level": "INFO",
+    "send_info_logs": false,
+    "console_output": true,
+    "local_log_file": "/var/log/infoscreen/watchdog.log"
+  },
+  "processes": {
+    "vlc": {
+      "http_port": 8080,
+      "http_password": "vlc_password"
+    },
+    "chromium": {
+      "debug_port": 9222
+    }
+  }
+}
+```
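A minimal sketch of loading this file so that missing keys fall back to defaults. `DEFAULTS`, `deep_merge`, and `load_config` are hypothetical; the default values are copied from the example above, and treating `client.uuid` as mandatory is an assumption.

```python
import json

DEFAULTS = {
    "mqtt": {"port": 1883, "keepalive": 60},
    "monitoring": {
        "enabled": True,
        "health_interval_seconds": 5,
        "heartbeat_interval_seconds": 60,
        "max_restart_attempts": 3,
        "restart_cooldown_seconds": 2,
    },
}


def deep_merge(base: dict, override: dict) -> dict:
    """Return base updated with override, merging nested dicts key by key."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged


def load_config(text: str) -> dict:
    """Parse the JSON config text and fill in defaults for absent keys."""
    config = deep_merge(DEFAULTS, json.loads(text))
    if not config.get("client", {}).get("uuid"):
        raise ValueError("client.uuid is required")
    return config
```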
+
+---
+
+## 11. Error Scenarios & Expected Behavior
+
+### Scenario 1: VLC Crashes Mid-Video
+```
+1. Watchdog detects: process_status = "crashed"
+2. Send ERROR log: "VLC process crashed"
+3. Attempt 1: Restart VLC with same video, seek to last position
+4. If success: Send INFO log "VLC restarted successfully"
+5. If fail: Repeat 2 more times
+6. After 3 failures: Send ERROR "Max restarts exceeded", show fallback
+```
+
+### Scenario 2: Network Timeout Loading Website
+```
+1. Chromium fails to load page (CDP reports error)
+2. Send WARN log: "Page load timeout"
+3. Attempt reload (Chromium refresh)
+4. If success after 10s: Continue monitoring
+5. If timeout again: Send ERROR, try restarting Chromium
+```
+
+### Scenario 3: Display Powers Off (Hardware)
+```
+1. DPMS check detects display off
+2. Send ERROR log: "Display powered off"
+3. Attempt to wake display (xset dpms force on)
+4. If success: Send INFO log
+5. If fail: Hardware issue, alert admin
+```
+
+### Scenario 4: High CPU Usage
+```
+1. CPU >80% for 60 seconds
+2. Send WARN log: "High CPU usage: 85%"
+3. Check if expected (e.g., video playback is normal)
+4. If unexpected: investigate process causing it
+5. If critical (>95%): consider restarting offending process
+```
+
+---
+
+## 12. Testing & Validation
+
+### 12.1 Manual Tests (During Development)
+
+**Test 1: Process Crash Simulation**
+```bash
+# Start video, then kill VLC manually
+killall vlc
+# Expected: ERROR log sent within 5 seconds, followed by an automatic restart attempt
+```
+
+**Test 2: MQTT Connectivity**
+```bash
+# Subscribe to all client topics on server
+mosquitto_sub -h 192.168.43.201 -t "infoscreen/{uuid}/#" -v
+# Expected: See heartbeat every 60s, health every 5s
+```
+
+**Test 3: Log Levels**
+```bash
+# Trigger error condition and verify log appears in database
+curl http://192.168.43.201:8000/api/client-logs/test
+# Expected: See new log entry with correct level/message
+```
+
+### 12.2 Acceptance Criteria
+
+✅ **Client must:**
+1. Send heartbeat every 60 seconds without gaps
+2. Send ERROR log within 5 seconds of process crash
+3. Attempt automatic restart (max 3 times)
+4. Report health metrics every 5 seconds
+5. Survive MQTT broker restart (reconnect automatically)
+6. Survive network interruption (buffer logs, send when reconnected)
+7. Use correct timestamp format (ISO 8601 UTC)
+8. Send logs only for a client UUID that already exists in the server's `clients` table (FK constraint)
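Criterion 6 (buffer logs during a network interruption) can be sketched with a bounded queue. `OfflineLogBuffer` is a hypothetical class; the `publish` callable returning `True` on success and the 500-item cap are assumptions.

```python
from collections import deque


class OfflineLogBuffer:
    """Holds log messages while offline and replays them in order on reconnect."""

    def __init__(self, publish, max_items=500):
        self.publish = publish  # callable(topic, payload) -> bool, True on success
        self.pending = deque(maxlen=max_items)  # oldest entries are dropped when full

    def enqueue(self, topic: str, payload: str) -> None:
        self.pending.append((topic, payload))

    def flush(self) -> int:
        """Try to send everything that is queued; return how many were sent."""
        sent = 0
        while self.pending:
            topic, payload = self.pending[0]
            if not self.publish(topic, payload):
                break  # still offline: keep the message for the next flush
            self.pending.popleft()
            sent += 1
        return sent
```

Calling `flush()` from the MQTT reconnect handler and after each `enqueue()` keeps delivery order intact.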
+
+---
+
+## 13. Python Libraries (Recommended)
+
+**For process monitoring:**
+- `psutil` - Cross-platform process and system utilities
+
+**For MQTT:**
+- `paho-mqtt` - Eclipse Paho MQTT client (use v2.x with Callback API v2)
+
+**For VLC control:**
+- `requests` - HTTP client for status queries
+
+**For Chromium control:**
+- `websocket-client` or `pychrome` - Chrome DevTools Protocol
+
+**For datetime:**
+- `datetime` (stdlib) - Use `datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")` to produce the required `Z` format
+
+**Example requirements.txt:**
+```
+paho-mqtt>=2.0.0
+psutil>=5.9.0
+requests>=2.31.0
+python-dateutil>=2.8.0
+```
+
+---
+
+## 14. Security Considerations
+
+### 14.1 MQTT Security
+- If broker requires auth, store credentials in config file with restricted permissions (`chmod 600`)
+- Consider TLS/SSL for MQTT (port 8883) if on untrusted network
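+- Use a unique client ID per device: the broker drops an existing session when a second client connects with the same ID. A client ID is not authentication, so use broker credentials or TLS to prevent impersonation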
+
+### 14.2 Process Control APIs
+- VLC HTTP password should be random, not default
+- Chromium debug port should bind to `127.0.0.1` only (not `0.0.0.0`)
+- Restrict file system access for media player processes
+
+### 14.3 Log Content
+- **Do not log:** Passwords, API keys, personal data
+- **Sanitize:** File paths (strip user directories), URLs (remove query params with tokens)
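A sketch of the two sanitizing rules. `sanitize_url` and `sanitize_path` are hypothetical helpers; dropping the whole query string and any embedded credentials is stricter than the rule above and is an assumption, as is the `/home/<user>` layout.

```python
import re
from urllib.parse import urlsplit, urlunsplit


def sanitize_url(url: str) -> str:
    """Strip credentials, query string, and fragment from a URL before logging it."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    if parts.port:
        host = f"{host}:{parts.port}"
    return urlunsplit((parts.scheme, host, parts.path, "", ""))


def sanitize_path(path: str) -> str:
    """Replace a leading /home/<user> with ~ so user names stay out of logs."""
    return re.sub(r"^/home/[^/]+", "~", path)
```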
+
+---
+
+## 15. Performance Targets
+
+| Metric | Target | Acceptable | Critical |
+|--------|--------|------------|----------|
+| Health check interval | 5s | 10s | 30s |
+| Crash detection time | <5s | <10s | <30s |
+| Restart time | <10s | <20s | <60s |
+| MQTT publish latency | <100ms | <500ms | <2s |
+| CPU usage (watchdog) | <2% | <5% | <10% |
+| RAM usage (watchdog) | <50MB | <100MB | <200MB |
+| Log message size | <1KB | <10KB | <100KB |
+
+---
+
+## 16. Troubleshooting Guide (For Client Development)
+
+### Issue: Logs not appearing in server database
+**Check:**
+1. Is MQTT broker reachable? (`mosquitto_pub` test from client)
+2. Is client UUID correct and exists in `clients` table?
+3. Is timestamp format correct (ISO 8601 with 'Z')?
+4. Check server listener logs for errors
+
+### Issue: Health metrics not updating
+**Check:**
+1. Is health loop running? (check watchdog service status)
+2. Is MQTT connected? (check connection status in logs)
+3. Is payload JSON valid? (use JSON validator)
+
+### Issue: Process restarts in loop
+**Check:**
+1. Is media file/URL accessible?
+2. Is process command correct? (test manually)
+3. Check process exit code (crash reason)
+4. Increase restart cooldown to avoid rapid loops
+
+---
+
+## 17. Complete Message Flow Diagram
+
+```
+┌─────────────────────────────────────────────────────────┐
+│                    Infoscreen Client                     │
+│                                                          │
+│  Event Occurs:                                           │
+│    - Process crashed                                     │
+│    - High CPU usage                                      │
+│    - Content loaded                                      │
+│                                                          │
+│  ┌────────────────┐                                     │
+│  │ Decision Logic │                                     │
+│  │  - Is it ERROR?│                                     │
+│  │  - Is it WARN? │                                     │
+│  │  - Is it INFO? │                                     │
+│  └────────┬───────┘                                     │
+│           │                                              │
+│           ▼                                              │
+│  ┌────────────────────────────────┐                    │
+│  │ Build JSON Payload              │                    │
+│  │ {                               │                    │
+│  │   "timestamp": "...",           │                    │
+│  │   "message": "...",             │                    │
+│  │   "context": {...}              │                    │
+│  │ }                               │                    │
+│  └────────┬───────────────────────┘                    │
+│           │                                              │
+│           ▼                                              │
+│  ┌────────────────────────────────┐                    │
+│  │ MQTT Publish                    │                    │
+│  │ Topic: infoscreen/{uuid}/logs/error                 │
+│  │ QoS: 1                          │                    │
+│  └────────┬───────────────────────┘                    │
+└───────────┼──────────────────────────────────────────┘
+            │
+            │ TCP/IP (MQTT Protocol)
+            │
+            ▼
+     ┌──────────────┐
+     │ MQTT Broker  │
+     │ (Mosquitto)  │
+     └──────┬───────┘
+            │
+            │ Topic: infoscreen/+/logs/#
+            │
+            ▼
+     ┌──────────────────────────────┐
+     │   Listener Service            │
+     │   (Python)                    │
+     │                               │
+     │  - Parse JSON                 │
+     │  - Validate UUID              │
+     │  - Store in database          │
+     └──────┬───────────────────────┘
+            │
+            ▼
+     ┌──────────────────────────────┐
+     │   MariaDB Database            │
+     │                               │
+     │   Table: client_logs          │
+     │   - client_uuid               │
+     │   - timestamp                 │
+     │   - level                     │
+     │   - message                   │
+     │   - context (JSON)            │
+     └──────┬───────────────────────┘
+            │
+            │ SQL Query
+            │
+            ▼
+     ┌──────────────────────────────┐
+     │   API Server (Flask)          │
+     │                               │
+     │   GET /api/client-logs/{uuid}/logs
+     │   GET /api/client-logs/summary
+     └──────┬───────────────────────┘
+            │
+            │ HTTP/JSON
+            │
+            ▼
+     ┌──────────────────────────────┐
+     │   Dashboard (React)           │
+     │                               │
+     │   - Display logs              │
+     │   - Filter by level           │
+     │   - Show health status        │
+     └───────────────────────────────┘
+```
+
+---
+
+## 18. Quick Reference Card
+
+### MQTT Topics Summary
+```
+infoscreen/{uuid}/logs/error    → Critical failures
+infoscreen/{uuid}/logs/warn     → Non-critical issues
+infoscreen/{uuid}/logs/info     → Informational (dev mode)
+infoscreen/{uuid}/health        → Health metrics (every 5s)
+infoscreen/{uuid}/heartbeat     → Enhanced heartbeat (every 60s)
+```
+
+### JSON Timestamp Format
+```python
+from datetime import datetime, timezone
+timestamp = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
+# Example output: "2026-03-10T07:30:00Z"
+# Note: isoformat() would emit microseconds and a "+00:00" offset instead of "Z"
+```
+
+### Process Status Values
+```
+"running"  - Process is alive and responding
+"crashed"  - Process terminated unexpectedly
+"starting" - Process is launching (startup phase)
+"stopped"  - Process intentionally stopped
+```
+
+### Restart Logic
+```
+Max attempts: 3
+Cooldown: 2 seconds between attempts
+Reset: After 5 minutes of successful operation
+```
+
+---
+
+## 19. Contact & Support
+
+**Server API Documentation:**
+- Base URL: `http://192.168.43.201:8000`
+- Health check: `GET /health`
+- Test logs: `GET /api/client-logs/test` (no auth)
+- Full API docs: See `CLIENT_MONITORING_IMPLEMENTATION_GUIDE.md` on server
+
+**MQTT Broker:**
+- Host: `192.168.43.201`
+- Port: `1883` (standard), `9001` (WebSocket)
+- Test tool: `mosquitto_pub` / `mosquitto_sub`
+
+**Database Schema:**
+- Table: `client_logs`
+- Foreign Key: `client_uuid` → `clients.uuid` (ON DELETE CASCADE)
+- Constraint: UUID must exist in clients table before logging
+
+**Server-Side Logs:**
+```bash
+# View listener logs (processes MQTT messages)
+docker compose logs -f listener
+
+# View server logs (API requests)
+docker compose logs -f server
+```
+
+---
+
+## 20. Appendix: Example Implementations
+
+### A. Minimal Python Watchdog (Pseudocode)
+
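+```python
+import json
+import subprocess
+import time
+from datetime import datetime, timezone
+
+import paho.mqtt.client as mqtt
+import psutil
+
+
+class MinimalWatchdog:
+    MAX_RESTARTS = 3
+    RESTART_COOLDOWN = 2  # seconds to wait before relaunching
+    CHECK_INTERVAL = 5    # seconds between process checks
+
+    def __init__(self, client_uuid, mqtt_broker):
+        self.uuid = client_uuid
+        # Client ID and persistent session as given in section 2.1
+        self.mqtt_client = mqtt.Client(
+            callback_api_version=mqtt.CallbackAPIVersion.VERSION2,
+            client_id=f"infoscreen-{client_uuid}",
+            clean_session=False,
+        )
+        self.mqtt_client.connect(mqtt_broker, 1883, 60)
+        self.mqtt_client.loop_start()  # network loop runs in a background thread
+
+        self.expected_process = None  # process name to look for, e.g. "vlc"
+        self.launch_command = None    # argv list used to (re)start the process
+        self.restart_attempts = 0
+        self.failed = False           # set once MAX_RESTARTS is exhausted
+
+    def send_log(self, level, message, context=None):
+        topic = f"infoscreen/{self.uuid}/logs/{level}"
+        payload = {
+            # strftime gives the "Z" suffix required by section 3.1
+            "timestamp": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
+            "message": message,
+            "context": context or {},
+        }
+        qos = 1 if level in ("error", "warn") else 0  # see section 2.2
+        self.mqtt_client.publish(topic, json.dumps(payload), qos=qos)
+
+    def is_process_running(self, process_name):
+        for proc in psutil.process_iter(['name']):
+            # name is None for processes we are not allowed to inspect
+            if process_name in (proc.info['name'] or ''):
+                return True
+        return False
+
+    def restart_process(self):
+        self.restart_attempts += 1
+        self.send_log("warn", f"Attempting restart ({self.restart_attempts}/{self.MAX_RESTARTS})")
+        time.sleep(self.RESTART_COOLDOWN)
+        subprocess.Popen(self.launch_command)
+
+    def monitor_loop(self):
+        # Simplification: the restart counter is never reset here; a full
+        # implementation resets it after 5 minutes of stable operation (7.2).
+        while True:
+            if self.expected_process and not self.failed:
+                if not self.is_process_running(self.expected_process):
+                    self.send_log("error", f"{self.expected_process} crashed")
+                    if self.restart_attempts < self.MAX_RESTARTS:
+                        self.restart_process()
+                    else:
+                        self.send_log("error", "Max restarts exceeded")
+                        self.failed = True  # stop retrying until the event changes
+
+            time.sleep(self.CHECK_INTERVAL)
+
+
+# Usage:
+if __name__ == "__main__":
+    watchdog = MinimalWatchdog("9b8d1856-ff34-4864-a726-12de072d0f77", "192.168.43.201")
+    watchdog.expected_process = "vlc"
+    watchdog.launch_command = ["vlc", "--fullscreen", "--loop", "/path/to/video.mp4"]
+    watchdog.monitor_loop()
+```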
+
+---
+
+**END OF SPECIFICATION**
+
+Questions? Refer to:
+- `CLIENT_MONITORING_IMPLEMENTATION_GUIDE.md` (server repo)
+- Server API: `http://192.168.43.201:8000/api/client-logs/test`
+- MQTT test: `mosquitto_sub -h 192.168.43.201 -t infoscreen/#`