feat(monitoring): add server-side client logging and health infrastructure

- add Alembic migration c1d2e3f4g5h6 for client monitoring:
  - create client_logs table with FK to clients.uuid and performance indexes
  - extend clients with process/health tracking fields
- extend data model with ClientLog, LogLevel, ProcessStatus, and ScreenHealthStatus
- enhance listener MQTT handling:
  - subscribe to logs and health topics
  - persist client logs from infoscreen/{uuid}/logs/{level}
  - process health payloads and enrich heartbeat-derived client state
- add monitoring API blueprint server/routes/client_logs.py:
  - GET /api/client-logs/<uuid>/logs
  - GET /api/client-logs/summary
  - GET /api/client-logs/recent-errors
  - GET /api/client-logs/test
- register client_logs blueprint in server/wsgi.py
- align compose/dev runtime for listener live-code execution
- add client-side implementation docs:
  - CLIENT_MONITORING_SPECIFICATION.md
  - CLIENT_MONITORING_IMPLEMENTATION_GUIDE.md
- update TECH-CHANGELOG.md and copilot-instructions.md:
  - document monitoring changes
  - codify post-release technical-notes/no-version-bump convention
972
CLIENT_MONITORING_SPECIFICATION.md
Normal file
# Client-Side Monitoring Specification
|
||||
|
||||
**Version:** 1.0
|
||||
**Date:** 2026-03-10
|
||||
**For:** Infoscreen Client Implementation
|
||||
**Server Endpoint:** `192.168.43.201:8000` (or your production server)
|
||||
**MQTT Broker:** `192.168.43.201:1883` (or your production MQTT broker)
|
||||
|
||||
---
|
||||
|
||||
## 1. Overview
|
||||
|
||||
Each infoscreen client must implement health monitoring and logging capabilities to report status to the central server via MQTT.
|
||||
|
||||
### 1.1 Goals
|
||||
- **Detect failures:** Process crashes, frozen screens, content mismatches
|
||||
- **Provide visibility:** Real-time health status visible on server dashboard
|
||||
- **Enable remote diagnosis:** Centralized log storage for debugging
|
||||
- **Auto-recovery:** Attempt automatic restart on failure
|
||||
|
||||
### 1.2 Architecture
|
||||
```
|
||||
┌─────────────────────────────────────────┐
|
||||
│ Infoscreen Client │
|
||||
│ │
|
||||
│ ┌──────────────┐ ┌──────────────┐ │
|
||||
│ │ Media Player │ │ Watchdog │ │
|
||||
│ │ (VLC/Chrome) │◄───│ Monitor │ │
|
||||
│ └──────────────┘ └──────┬───────┘ │
|
||||
│ │ │
|
||||
│ ┌──────────────┐ │ │
|
||||
│ │ Event Mgr │ │ │
|
||||
│ │ (receives │ │ │
|
||||
│ │ schedule) │◄───────────┘ │
|
||||
│ └──────┬───────┘ │
|
||||
│ │ │
|
||||
│ ┌──────▼───────────────────────┐ │
|
||||
│ │ MQTT Client │ │
|
||||
│ │ - Heartbeat (every 60s) │ │
|
||||
│ │ - Logs (error/warn/info) │ │
|
||||
│ │ - Health metrics (every 5s) │ │
|
||||
│ └──────┬────────────────────────┘ │
|
||||
└─────────┼──────────────────────────────┘
|
||||
│
|
||||
│ MQTT over TCP
|
||||
▼
|
||||
┌─────────────┐
|
||||
│ MQTT Broker │
|
||||
│ (server) │
|
||||
└─────────────┘
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## 2. MQTT Protocol Specification
|
||||
|
||||
### 2.1 Connection Parameters
|
||||
```
|
||||
Broker: 192.168.43.201 (or DNS hostname)
|
||||
Port: 1883 (standard MQTT)
|
||||
Protocol: MQTT v3.1.1
|
||||
Client ID: "infoscreen-{client_uuid}"
|
||||
Clean Session: false (broker keeps the session and its subscriptions across reconnects)
|
||||
Keep Alive: 60 seconds
|
||||
Username/Password: (if configured on broker)
|
||||
```
|
||||
|
||||
### 2.2 QoS Levels
|
||||
- **Heartbeat:** QoS 0 (fire and forget; a lost heartbeat is superseded by the next one)
|
||||
- **Logs (ERROR/WARN):** QoS 1 (at least once delivery, important)
|
||||
- **Logs (INFO):** QoS 0 (optional, high volume)
|
||||
- **Health metrics:** QoS 0 (frequent, latest value matters)
|
||||
|
||||
---
|
||||
|
||||
## 3. Topic Structure & Payload Formats
|
||||
|
||||
### 3.1 Log Messages
|
||||
|
||||
#### Topic Pattern:
|
||||
```
|
||||
infoscreen/{client_uuid}/logs/{level}
|
||||
```
|
||||
|
||||
Where `{level}` is one of: `error`, `warn`, `info`
|
||||
|
||||
#### Payload Format (JSON):
|
||||
```json
{
  "timestamp": "2026-03-10T07:30:00Z",
  "message": "Human-readable error description",
  "context": {
    "event_id": 42,
    "process": "vlc",
    "error_code": "NETWORK_TIMEOUT",
    "additional_key": "any relevant data"
  }
}
```
|
||||
|
||||
#### Field Specifications:
|
||||
| Field | Type | Required | Description |
|-------|------|----------|-------------|
| `timestamp` | string (ISO 8601 UTC) | Yes | When the event occurred. Use `YYYY-MM-DDTHH:MM:SSZ` format |
| `message` | string | Yes | Human-readable description of the event (max 1000 chars) |
| `context` | object | No | Additional structured data (will be stored as JSON) |
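
The field rules above can be sketched as a small builder. This is only an illustration: the helper name `build_log_message` is hypothetical, the QoS choice follows Section 2.2, and `now` is assumed to be a timezone-aware `datetime` when supplied.

```python
import json
from datetime import datetime, timezone

VALID_LEVELS = ("error", "warn", "info")
MAX_MESSAGE_CHARS = 1000  # limit taken from the field table


def build_log_message(client_uuid, level, message, context=None, now=None):
    """Return (topic, payload_json, qos) for one log entry."""
    if level not in VALID_LEVELS:
        raise ValueError(f"unknown log level: {level!r}")
    now = now or datetime.now(timezone.utc)
    payload = {
        # strftime yields the required "Z" form; isoformat() would emit "+00:00"
        "timestamp": now.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "message": message[:MAX_MESSAGE_CHARS],
    }
    if context:
        payload["context"] = context  # optional field, omitted when empty
    qos = 0 if level == "info" else 1  # ERROR/WARN use QoS 1, INFO uses QoS 0
    topic = f"infoscreen/{client_uuid}/logs/{level}"
    return topic, json.dumps(payload), qos
```

The returned tuple can be passed straight to an MQTT client's publish call, which keeps payload construction testable without a broker.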
|
||||
|
||||
#### Example Topics:
|
||||
```
|
||||
infoscreen/9b8d1856-ff34-4864-a726-12de072d0f77/logs/error
|
||||
infoscreen/9b8d1856-ff34-4864-a726-12de072d0f77/logs/warn
|
||||
infoscreen/9b8d1856-ff34-4864-a726-12de072d0f77/logs/info
|
||||
```
|
||||
|
||||
#### When to Send Logs:
|
||||
|
||||
**ERROR (Always send):**
|
||||
- Process crashed (VLC/Chromium/PDF viewer terminated unexpectedly)
|
||||
- Content failed to load (404, network timeout, corrupt file)
|
||||
- Hardware failure detected (display off, audio device missing)
|
||||
- Exception caught in main event loop
|
||||
- Maximum restart attempts exceeded
|
||||
|
||||
**WARN (Always send):**
|
||||
- Process restarted automatically (after crash)
|
||||
- High resource usage (CPU >80%, RAM >90%)
|
||||
- Slow performance (frame drops, lag)
|
||||
- Non-critical failures (screenshot capture failed, cache full)
|
||||
- Fallback content displayed (primary source unavailable)
|
||||
|
||||
**INFO (Send in development, optional in production):**
|
||||
- Process started successfully
|
||||
- Event transition (switched from video to presentation)
|
||||
- Content loaded successfully
|
||||
- Watchdog service started/stopped
|
||||
|
||||
---
|
||||
|
||||
### 3.2 Health Metrics
|
||||
|
||||
#### Topic Pattern:
|
||||
```
|
||||
infoscreen/{client_uuid}/health
|
||||
```
|
||||
|
||||
#### Payload Format (JSON):
|
||||
```json
{
  "timestamp": "2026-03-10T07:30:00Z",
  "expected_state": {
    "event_id": 42,
    "event_type": "video",
    "media_file": "presentation.mp4",
    "started_at": "2026-03-10T07:15:00Z"
  },
  "actual_state": {
    "process": "vlc",
    "pid": 1234,
    "status": "running",
    "uptime_seconds": 900,
    "position": 45.3,
    "duration": 180.0
  },
  "health_metrics": {
    "screen_on": true,
    "last_frame_update": "2026-03-10T07:29:58Z",
    "frames_dropped": 2,
    "network_errors": 0,
    "cpu_percent": 15.3,
    "memory_mb": 234
  }
}
```
|
||||
|
||||
#### Field Specifications:
|
||||
|
||||
**expected_state:**

| Field | Type | Required | Description |
|-------|------|----------|-------------|
| `event_id` | integer | Yes | Current event ID from scheduler |
| `event_type` | string | Yes | `presentation`, `video`, `website`, `webuntis`, `message` |
| `media_file` | string | No | Filename or URL of current content |
| `started_at` | string (ISO 8601) | Yes | When this event started playing |

**actual_state:**

| Field | Type | Required | Description |
|-------|------|----------|-------------|
| `process` | string | Yes | `vlc`, `chromium`, `pdf_viewer`, `none` |
| `pid` | integer | No | Process ID (if running) |
| `status` | string | Yes | `running`, `crashed`, `starting`, `stopped` |
| `uptime_seconds` | integer | No | How long process has been running |
| `position` | float | No | Current playback position (seconds, for video/audio) |
| `duration` | float | No | Total content duration (seconds) |

**health_metrics:**

| Field | Type | Required | Description |
|-------|------|----------|-------------|
| `screen_on` | boolean | Yes | Is display powered on? |
| `last_frame_update` | string (ISO 8601) | No | Last time screen content changed |
| `frames_dropped` | integer | No | Video frames dropped (performance indicator) |
| `network_errors` | integer | No | Count of network errors in last interval |
| `cpu_percent` | float | No | CPU usage (0-100) |
| `memory_mb` | integer | No | RAM usage in megabytes |
|
||||
|
||||
#### Sending Frequency:
|
||||
- **Normal operation:** Every 5 seconds
|
||||
- **During startup/transition:** Every 1 second
|
||||
- **After error:** Immediately + every 2 seconds until recovered
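
As a rough sketch of how a client might apply the required-field tables and the sending-frequency rules above: the function names are hypothetical, and treating a `crashed` status as the "after error" case is an assumption.

```python
# Required fields per section, copied from the field specification tables.
REQUIRED_FIELDS = {
    "expected_state": ("event_id", "event_type", "started_at"),
    "actual_state": ("process", "status"),
    "health_metrics": ("screen_on",),
}


def missing_health_fields(payload):
    """Return dotted names of required fields absent from a health payload."""
    missing = []
    for section, fields in REQUIRED_FIELDS.items():
        body = payload.get(section, {})
        missing += [f"{section}.{name}" for name in fields if name not in body]
    return missing


def health_interval_seconds(status, recovering=False):
    """Pick the publish interval from the sending-frequency rules."""
    if recovering or status == "crashed":
        return 2  # after an error, until recovered
    if status == "starting":
        return 1  # startup / transition
    return 5      # normal operation
```

Checking the payload before publishing catches schema mistakes on the client, where they are easier to debug than in the listener logs.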
|
||||
|
||||
---
|
||||
|
||||
### 3.3 Enhanced Heartbeat
|
||||
|
||||
The existing heartbeat topic should be enhanced to include process status.
|
||||
|
||||
#### Topic Pattern:
|
||||
```
|
||||
infoscreen/{client_uuid}/heartbeat
|
||||
```
|
||||
|
||||
#### Enhanced Payload Format (JSON):
|
||||
```json
{
  "uuid": "9b8d1856-ff34-4864-a726-12de072d0f77",
  "timestamp": "2026-03-10T07:30:00Z",
  "current_process": "vlc",
  "process_pid": 1234,
  "process_status": "running",
  "current_event_id": 42
}
```
|
||||
|
||||
#### New Fields (add to existing heartbeat):
|
||||
| Field | Type | Required | Description |
|-------|------|----------|-------------|
| `current_process` | string | No | Name of active media player process |
| `process_pid` | integer | No | Process ID |
| `process_status` | string | No | `running`, `crashed`, `starting`, `stopped` |
| `current_event_id` | integer | No | Event ID currently being displayed |
|
||||
|
||||
#### Sending Frequency:
|
||||
- Keep existing: **Every 60 seconds**
|
||||
- Include new fields if available
|
||||
|
||||
---
|
||||
|
||||
## 4. Process Monitoring Requirements
|
||||
|
||||
### 4.1 Processes to Monitor
|
||||
|
||||
| Media Type | Process Name | How to Detect |
|------------|--------------|---------------|
| Video | `vlc` | `ps aux \| grep vlc` or `pgrep vlc` |
| Website/WebUntis | `chromium` or `chromium-browser` | `pgrep chromium` |
| PDF Presentation | `evince`, `okular`, or custom viewer | `pgrep {viewer_name}` |
|
||||
|
||||
### 4.2 Monitoring Checks (Every 5 seconds)
|
||||
|
||||
#### Check 1: Process Alive
|
||||
```
|
||||
Goal: Verify expected process is running
|
||||
Method:
|
||||
- Get list of running processes (psutil or `ps`)
|
||||
- Check if expected process name exists
|
||||
- Match PID if known
|
||||
Result:
|
||||
- If missing → status = "crashed"
|
||||
- If found → status = "running"
|
||||
Action on crash:
|
||||
- Send ERROR log immediately
|
||||
- Attempt restart (max 3 attempts)
|
||||
- Send WARN log on each restart
|
||||
- If max restarts exceeded → send ERROR log, display fallback
|
||||
```
|
||||
|
||||
#### Check 2: Process Responsive
|
||||
```
|
||||
Goal: Detect frozen processes
|
||||
Method:
|
||||
- For VLC: Query HTTP interface (status.json)
|
||||
- For Chromium: Use DevTools Protocol (CDP)
|
||||
- For custom viewers: Check last screen update time
|
||||
Result:
|
||||
- If same frame >30 seconds → likely frozen
|
||||
- If playback position not advancing → frozen
|
||||
Action on freeze:
|
||||
- Send WARN log
|
||||
- Force refresh (reload page, seek video, next slide)
|
||||
- If refresh fails → restart process
|
||||
```
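
The "playback position not advancing" rule can be sketched as a small detector. The class name `FreezeDetector` is hypothetical, and the caller is assumed to feed it the position read from the player on every check.

```python
import time


class FreezeDetector:
    """Flags a player as frozen when its playback position stops advancing."""

    def __init__(self, threshold_seconds=30.0):
        self.threshold = threshold_seconds
        self._last_position = None
        self._last_change = None  # time of the last observed position change

    def update(self, position, now=None):
        """Record a position sample; return True if the player looks frozen."""
        now = time.monotonic() if now is None else now
        if self._last_change is None or position != self._last_position:
            self._last_position = position
            self._last_change = now
            return False
        return (now - self._last_change) > self.threshold
```

Note that intentionally paused or static content (a still image, a single slide) would also trip this check, so a real watchdog would likely apply it only while the player reports a playing state.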
|
||||
|
||||
#### Check 3: Content Match
|
||||
```
|
||||
Goal: Verify correct content is displayed
|
||||
Method:
|
||||
- Compare expected event_id with actual media/URL
|
||||
- Check scheduled time window (is event still active?)
|
||||
Result:
|
||||
- Mismatch → content error
|
||||
Action:
|
||||
- Send WARN log
|
||||
- Reload correct event from scheduler
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## 5. Process Control Interface Requirements
|
||||
|
||||
### 5.1 VLC Control
|
||||
|
||||
**Requirement:** Enable VLC HTTP interface for monitoring
|
||||
|
||||
**Launch Command:**
|
||||
```bash
vlc --intf http --http-host 127.0.0.1 --http-port 8080 --http-password "vlc_password" \
    --fullscreen --loop /path/to/video.mp4
```
|
||||
|
||||
**Status Query:**
|
||||
```bash
|
||||
curl http://127.0.0.1:8080/requests/status.json --user ":vlc_password"
|
||||
```
|
||||
|
||||
**Response Fields to Monitor:**
|
||||
```json
|
||||
{
|
||||
"state": "playing", // "playing", "paused", "stopped"
|
||||
"position": 0.25, // 0.0-1.0 (25% through)
|
||||
"time": 45, // seconds into playback
|
||||
"length": 180, // total duration in seconds
|
||||
"volume": 256 // 0-512
|
||||
}
|
||||
```
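
A minimal sketch of the same status query from Python, using only the standard library. The function names are hypothetical. Mapping VLC's `time` and `length` (seconds) onto the health payload's `position` and `duration` is an assumption; VLC's own `position` field is a 0.0-1.0 fraction and is not used here.

```python
import base64
import json
import urllib.request


def fetch_vlc_status(password, host="127.0.0.1", port=8080, timeout=2.0):
    """GET /requests/status.json using HTTP basic auth with an empty user name."""
    token = base64.b64encode(f":{password}".encode()).decode()
    request = urllib.request.Request(
        f"http://{host}:{port}/requests/status.json",
        headers={"Authorization": f"Basic {token}"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


def summarize_vlc_status(status):
    """Reduce the VLC response to the fields the watchdog reports."""
    return {
        "state": status.get("state", "unknown"),
        "position": float(status.get("time", 0)),    # seconds into playback
        "duration": float(status.get("length", 0)),  # total length in seconds
    }
```

A connection error or timeout from `fetch_vlc_status` is itself a useful signal: the process may be alive but unresponsive.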
|
||||
|
||||
---
|
||||
|
||||
### 5.2 Chromium Control
|
||||
|
||||
**Requirement:** Enable Chrome DevTools Protocol (CDP)
|
||||
|
||||
**Launch Command:**
|
||||
```bash
|
||||
chromium --remote-debugging-port=9222 --kiosk --app=https://example.com
|
||||
```
|
||||
|
||||
**Status Query:**
|
||||
```bash
|
||||
curl http://127.0.0.1:9222/json
|
||||
```
|
||||
|
||||
**Response Fields to Monitor:**
|
||||
```json
|
||||
[
|
||||
{
|
||||
"url": "https://example.com",
|
||||
"title": "Page Title",
|
||||
"type": "page"
|
||||
}
|
||||
]
|
||||
```
|
||||
|
||||
**Advanced:** Use CDP WebSocket for events (page load, navigation, errors)
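
As an illustration, the `/json` target list can be used for a simple content check. The helper names are hypothetical, and the prefix comparison is a deliberately loose assumption (redirects or trailing slashes may need stricter handling).

```python
import json
import urllib.request


def list_cdp_targets(port=9222, timeout=2.0):
    """Fetch the list of debuggable targets from a local Chromium instance."""
    url = f"http://127.0.0.1:{port}/json"
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.load(response)


def page_matches(targets, expected_url):
    """True if any target of type "page" is showing expected_url (prefix match)."""
    return any(
        target.get("type") == "page"
        and target.get("url", "").startswith(expected_url)
        for target in targets
    )
```

For example, `page_matches(list_cdp_targets(), "https://example.com")` would return `False` if Chromium had navigated to an error page or a different site.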
|
||||
|
||||
---
|
||||
|
||||
### 5.3 PDF Viewer (Custom or Standard)
|
||||
|
||||
**Option A: Standard Viewer (e.g., Evince)**
|
||||
- No built-in API
|
||||
- Monitor via process check + screenshot comparison
|
||||
|
||||
**Option B: Custom Python Viewer**
|
||||
- Implement REST API for status queries
|
||||
- Track: current page, total pages, last transition time
|
||||
|
||||
---
|
||||
|
||||
## 6. Watchdog Service Architecture
|
||||
|
||||
### 6.1 Service Components
|
||||
|
||||
**Component 1: Process Monitor Thread**
|
||||
```
|
||||
Responsibilities:
|
||||
- Check process alive every 5 seconds
|
||||
- Detect crashes and frozen processes
|
||||
- Attempt automatic restart
|
||||
- Send health metrics via MQTT
|
||||
|
||||
State Machine:
|
||||
IDLE → STARTING → RUNNING → (if crash) → RESTARTING → RUNNING
|
||||
→ (if max restarts) → FAILED
|
||||
```
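
One way to encode this state machine is a transition table. The event names (`launch`, `started`, `crash`, `max_restarts`) are illustrative, and the transition out of `FAILED` on a new launch is an assumption based on "wait for manual intervention or scheduler event change" in Section 7.1.

```python
from enum import Enum


class State(Enum):
    IDLE = "idle"
    STARTING = "starting"
    RUNNING = "running"
    RESTARTING = "restarting"
    FAILED = "failed"


# (current state, event) -> next state
TRANSITIONS = {
    (State.IDLE, "launch"): State.STARTING,
    (State.STARTING, "started"): State.RUNNING,
    (State.RUNNING, "crash"): State.RESTARTING,
    (State.RESTARTING, "started"): State.RUNNING,
    (State.RESTARTING, "max_restarts"): State.FAILED,
    (State.FAILED, "launch"): State.STARTING,  # assumption: a new event clears the failure
}


def next_state(current, event):
    """Return the next state, or the current one if the event does not apply."""
    return TRANSITIONS.get((current, event), current)
```

Keeping the transitions in data makes it easy to log every state change and to reject events that are not valid in the current state.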
|
||||
|
||||
**Component 2: MQTT Publisher Thread**
|
||||
```
|
||||
Responsibilities:
|
||||
- Maintain MQTT connection
|
||||
- Send heartbeat every 60 seconds
|
||||
- Send logs on-demand (queued from other components)
|
||||
- Send health metrics every 5 seconds
|
||||
- Reconnect on connection loss
|
||||
```
|
||||
|
||||
**Component 3: Event Manager Integration**
|
||||
```
|
||||
Responsibilities:
|
||||
- Receive event schedule from server
|
||||
- Notify watchdog of expected process/content
|
||||
- Launch media player processes
|
||||
- Handle event transitions
|
||||
```
|
||||
|
||||
### 6.2 Service Lifecycle
|
||||
|
||||
**On Startup:**
|
||||
1. Load configuration (client UUID, MQTT broker, etc.)
|
||||
2. Connect to MQTT broker
|
||||
3. Send INFO log: "Watchdog service started"
|
||||
4. Wait for first event from scheduler
|
||||
|
||||
**During Operation:**
|
||||
1. Monitor loop runs every 5 seconds
|
||||
2. Check expected vs actual process state
|
||||
3. Send health metrics
|
||||
4. Handle failures (log + restart)
|
||||
|
||||
**On Shutdown:**
|
||||
1. Send INFO log: "Watchdog service stopping"
|
||||
2. Gracefully stop monitored processes
|
||||
3. Disconnect from MQTT
|
||||
4. Exit cleanly
|
||||
|
||||
---
|
||||
|
||||
## 7. Auto-Recovery Logic
|
||||
|
||||
### 7.1 Restart Strategy
|
||||
|
||||
**Step 1: Detect Failure**
|
||||
```
|
||||
Trigger: Process not found in process list
|
||||
Action:
|
||||
- Log ERROR: "Process {name} crashed"
|
||||
- Increment restart counter
|
||||
- Check if within retry limit (max 3)
|
||||
```
|
||||
|
||||
**Step 2: Attempt Restart**
|
||||
```
|
||||
If restart_attempts < MAX_RESTARTS:
|
||||
- Log WARN: "Attempting restart ({attempt}/{MAX_RESTARTS})"
|
||||
- Kill any zombie processes
|
||||
- Wait 2 seconds (cooldown)
|
||||
- Launch process with same parameters
|
||||
- Wait 5 seconds for startup
|
||||
- Verify process is running
|
||||
- If success: reset restart counter, log INFO
|
||||
- If fail: increment counter, repeat
|
||||
```
|
||||
|
||||
**Step 3: Permanent Failure**
|
||||
```
|
||||
If restart_attempts >= MAX_RESTARTS:
|
||||
- Log ERROR: "Max restart attempts exceeded, failing over"
|
||||
- Display fallback content (static image with error message)
|
||||
- Send notification to server (separate alert topic, optional)
|
||||
- Wait for manual intervention or scheduler event change
|
||||
```
|
||||
|
||||
### 7.2 Restart Cooldown
|
||||
|
||||
**Purpose:** Prevent rapid restart loops that waste resources
|
||||
|
||||
**Implementation:**
|
||||
```
|
||||
After each restart attempt:
|
||||
- Wait 2 seconds before next restart
|
||||
- After 3 failures: wait 30 seconds before trying again
|
||||
- Reset counter on successful run >5 minutes
|
||||
```
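
The restart steps can be sketched as a single retry loop. This is not a definitive implementation: `launch`, `is_running`, and `log` are hypothetical callables supplied by the watchdog, the zombie-process cleanup step is omitted, and `sleep` is injectable only so the loop can be tested without waiting.

```python
import time


def restart_with_retries(launch, is_running, log, max_restarts=3,
                         cooldown=2.0, startup_wait=5.0, sleep=time.sleep):
    """Try to relaunch a crashed process; return True once it is running again."""
    for attempt in range(1, max_restarts + 1):
        log("warn", f"Attempting restart ({attempt}/{max_restarts})")
        sleep(cooldown)       # cooldown before relaunching
        launch()
        sleep(startup_wait)   # give the process time to come up
        if is_running():
            log("info", "Process restarted successfully")
            return True
    log("error", "Max restart attempts exceeded, failing over")
    return False
```

A `False` return is the point at which the client would display fallback content and wait for the next scheduler event.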
|
||||
|
||||
---
|
||||
|
||||
## 8. Resource Monitoring
|
||||
|
||||
### 8.1 System Metrics to Track
|
||||
|
||||
**CPU Usage:**
|
||||
```
|
||||
Method: Read /proc/stat or use psutil.cpu_percent()
|
||||
Frequency: Every 5 seconds
|
||||
Threshold: Warn if >80% for >60 seconds
|
||||
```
|
||||
|
||||
**Memory Usage:**
|
||||
```
|
||||
Method: Read /proc/meminfo or use psutil.virtual_memory()
|
||||
Frequency: Every 5 seconds
|
||||
Threshold: Warn if >90% for >30 seconds
|
||||
```
|
||||
|
||||
**Display Status:**
|
||||
```
|
||||
Method: Check DPMS state or xset query
|
||||
Frequency: Every 30 seconds
|
||||
Threshold: Error if display off (unexpected)
|
||||
```
|
||||
|
||||
**Network Connectivity:**
|
||||
```
|
||||
Method: Ping server or check MQTT connection
|
||||
Frequency: Every 60 seconds
|
||||
Threshold: Warn if no server connectivity
|
||||
```
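
The "warn if above X for longer than Y seconds" thresholds can be sketched with a small helper. The class name `SustainedThreshold` is hypothetical; the caller is assumed to pass a monotonic timestamp with each sample.

```python
class SustainedThreshold:
    """Fires once a value has stayed above `limit` for longer than `hold_seconds`."""

    def __init__(self, limit, hold_seconds):
        self.limit = limit
        self.hold_seconds = hold_seconds
        self._above_since = None  # time the value first exceeded the limit

    def update(self, value, now):
        """Record a sample; return True while the threshold has been exceeded long enough."""
        if value <= self.limit:
            self._above_since = None  # back to normal, reset the timer
            return False
        if self._above_since is None:
            self._above_since = now
        return (now - self._above_since) > self.hold_seconds
```

For the CPU rule this would be `SustainedThreshold(80, 60)` and for memory `SustainedThreshold(90, 30)`. Because `update` keeps returning `True` while the condition persists, the caller should still rate-limit the resulting WARN logs.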
|
||||
|
||||
---
|
||||
|
||||
## 9. Development vs Production Mode
|
||||
|
||||
### 9.1 Development Mode
|
||||
|
||||
**Enable via:** Environment variable `DEBUG=true` or `ENV=development`
|
||||
|
||||
**Behavior:**
|
||||
- Send INFO level logs
|
||||
- More verbose logging to console
|
||||
- Shorter monitoring intervals (faster feedback)
|
||||
- Screenshot capture every 30 seconds
|
||||
- No rate limiting on logs
|
||||
|
||||
### 9.2 Production Mode
|
||||
|
||||
**Enable via:** `ENV=production`
|
||||
|
||||
**Behavior:**
|
||||
- Send only ERROR and WARN logs
|
||||
- Minimal console output
|
||||
- Standard monitoring intervals
|
||||
- Screenshot capture every 60 seconds
|
||||
- Rate limiting: max 10 logs per minute per level
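
The production rate limit could be implemented as a sliding window per level. This is a sketch under assumptions: the class name `LogRateLimiter` is hypothetical, and the specification does not say whether the window is sliding or fixed.

```python
import time
from collections import defaultdict, deque


class LogRateLimiter:
    """Sliding-window limiter: at most `max_per_window` logs per level per window."""

    def __init__(self, max_per_window=10, window_seconds=60.0):
        self.max_per_window = max_per_window
        self.window_seconds = window_seconds
        self._sent = defaultdict(deque)  # level -> timestamps of accepted logs

    def allow(self, level, now=None):
        """Return True if a log at this level may be sent now."""
        now = time.monotonic() if now is None else now
        sent = self._sent[level]
        while sent and now - sent[0] >= self.window_seconds:
            sent.popleft()  # drop timestamps that left the window
        if len(sent) >= self.max_per_window:
            return False
        sent.append(now)
        return True
```

In development mode the watchdog would simply skip the `allow` check, matching the "no rate limiting" behavior in Section 9.1.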
|
||||
|
||||
---
|
||||
|
||||
## 10. Configuration File Format
|
||||
|
||||
### 10.1 Recommended Config: JSON
|
||||
|
||||
**File:** `/etc/infoscreen/config.json` or `~/.config/infoscreen/config.json`
|
||||
|
||||
```json
{
  "client": {
    "uuid": "9b8d1856-ff34-4864-a726-12de072d0f77",
    "hostname": "infoscreen-room-101"
  },
  "mqtt": {
    "broker": "192.168.43.201",
    "port": 1883,
    "username": "",
    "password": "",
    "keepalive": 60
  },
  "monitoring": {
    "enabled": true,
    "health_interval_seconds": 5,
    "heartbeat_interval_seconds": 60,
    "max_restart_attempts": 3,
    "restart_cooldown_seconds": 2
  },
  "logging": {
    "level": "INFO",
    "send_info_logs": false,
    "console_output": true,
    "local_log_file": "/var/log/infoscreen/watchdog.log"
  },
  "processes": {
    "vlc": {
      "http_port": 8080,
      "http_password": "vlc_password"
    },
    "chromium": {
      "debug_port": 9222
    }
  }
}
```
|
||||
|
||||
---
|
||||
|
||||
## 11. Error Scenarios & Expected Behavior
|
||||
|
||||
### Scenario 1: VLC Crashes Mid-Video
|
||||
```
|
||||
1. Watchdog detects: process_status = "crashed"
|
||||
2. Send ERROR log: "VLC process crashed"
|
||||
3. Attempt 1: Restart VLC with same video, seek to last position
|
||||
4. If success: Send INFO log "VLC restarted successfully"
|
||||
5. If fail: Repeat 2 more times
|
||||
6. After 3 failures: Send ERROR "Max restarts exceeded", show fallback
|
||||
```
|
||||
|
||||
### Scenario 2: Network Timeout Loading Website
|
||||
```
|
||||
1. Chromium fails to load page (CDP reports error)
|
||||
2. Send WARN log: "Page load timeout"
|
||||
3. Attempt reload (Chromium refresh)
|
||||
4. If success after 10s: Continue monitoring
|
||||
5. If timeout again: Send ERROR, try restarting Chromium
|
||||
```
|
||||
|
||||
### Scenario 3: Display Powers Off (Hardware)
|
||||
```
|
||||
1. DPMS check detects display off
|
||||
2. Send ERROR log: "Display powered off"
|
||||
3. Attempt to wake display (xset dpms force on)
|
||||
4. If success: Send INFO log
|
||||
5. If fail: Hardware issue, alert admin
|
||||
```
|
||||
|
||||
### Scenario 4: High CPU Usage
|
||||
```
|
||||
1. CPU >80% for 60 seconds
|
||||
2. Send WARN log: "High CPU usage: 85%"
|
||||
3. Check if expected (e.g., video playback is normal)
|
||||
4. If unexpected: investigate process causing it
|
||||
5. If critical (>95%): consider restarting offending process
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## 12. Testing & Validation
|
||||
|
||||
### 12.1 Manual Tests (During Development)
|
||||
|
||||
**Test 1: Process Crash Simulation**
|
||||
```bash
|
||||
# Start video, then kill VLC manually
|
||||
killall vlc
|
||||
# Expected: ERROR log sent within 5 seconds, followed by an automatic restart attempt (2-second cooldown, then relaunch)
|
||||
```
|
||||
|
||||
**Test 2: MQTT Connectivity**
|
||||
```bash
|
||||
# Subscribe to all client topics on server
|
||||
mosquitto_sub -h 192.168.43.201 -t "infoscreen/{uuid}/#" -v
|
||||
# Expected: See heartbeat every 60s, health every 5s
|
||||
```
|
||||
|
||||
**Test 3: Log Levels**
|
||||
```bash
|
||||
# Trigger error condition and verify log appears in database
|
||||
curl http://192.168.43.201:8000/api/client-logs/test
|
||||
# Expected: See new log entry with correct level/message
|
||||
```
|
||||
|
||||
### 12.2 Acceptance Criteria
|
||||
|
||||
✅ **Client must:**
|
||||
1. Send heartbeat every 60 seconds without gaps
|
||||
2. Send ERROR log within 5 seconds of process crash
|
||||
3. Attempt automatic restart (max 3 times)
|
||||
4. Report health metrics every 5 seconds
|
||||
5. Survive MQTT broker restart (reconnect automatically)
|
||||
6. Survive network interruption (buffer logs, send when reconnected)
|
||||
7. Use correct timestamp format (ISO 8601 UTC)
|
||||
8. Send logs only under a client UUID that exists in the server's `clients` table (the foreign-key constraint rejects unknown UUIDs)
|
||||
|
||||
---
|
||||
|
||||
## 13. Python Libraries (Recommended)
|
||||
|
||||
**For process monitoring:**
|
||||
- `psutil` - Cross-platform process and system utilities
|
||||
|
||||
**For MQTT:**
|
||||
- `paho-mqtt` - Official MQTT client (use v2.x with Callback API v2)
|
||||
|
||||
**For VLC control:**
|
||||
- `requests` - HTTP client for status queries
|
||||
|
||||
**For Chromium control:**
|
||||
- `websocket-client` or `pychrome` - Chrome DevTools Protocol
|
||||
|
||||
**For datetime:**
|
||||
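- `datetime` (stdlib) - Use `datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")`; plain `isoformat()` emits microseconds and a `+00:00` offset rather than the `Z` form required in Section 3.1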
|
||||
|
||||
**Example requirements.txt:**
|
||||
```
|
||||
paho-mqtt>=2.0.0
|
||||
psutil>=5.9.0
|
||||
requests>=2.31.0
|
||||
python-dateutil>=2.8.0
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## 14. Security Considerations
|
||||
|
||||
### 14.1 MQTT Security
|
||||
- If broker requires auth, store credentials in config file with restricted permissions (`chmod 600`)
|
||||
- Consider TLS/SSL for MQTT (port 8883) if on untrusted network
|
||||
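- Use a unique client ID per device: the broker disconnects an existing session when a second client connects with the same ID. A client ID is not an authentication mechanism, so rely on broker credentials and topic ACLs to prevent impersonation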
|
||||
|
||||
### 14.2 Process Control APIs
|
||||
- VLC HTTP password should be random, not default
|
||||
- Chromium debug port should bind to `127.0.0.1` only (not `0.0.0.0`)
|
||||
- Restrict file system access for media player processes
|
||||
|
||||
### 14.3 Log Content
|
||||
- **Do not log:** Passwords, API keys, personal data
|
||||
- **Sanitize:** File paths (strip user directories), URLs (remove query params with tokens)
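
The sanitization rules might be applied before a message is published, roughly as follows. The function names are hypothetical, and this sketch is more aggressive than the rule above: it drops the entire query string and any embedded credentials rather than only token-bearing parameters.

```python
import re
from urllib.parse import urlsplit, urlunsplit

_HOME_DIR = re.compile(r"/home/[^/\s]+")


def sanitize_url(url):
    """Drop credentials, query string, and fragment, which may carry tokens."""
    parts = urlsplit(url)
    netloc = parts.netloc.rpartition("@")[2]  # strip "user:password@" if present
    return urlunsplit((parts.scheme, netloc, parts.path, "", ""))


def sanitize_path(text):
    """Replace /home/<user> prefixes so user names do not reach the server."""
    return _HOME_DIR.sub("~", text)
```

Both helpers are pure string transformations, so they can run on every log message and on URLs placed in `context` without noticeable overhead.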
|
||||
|
||||
---
|
||||
|
||||
## 15. Performance Targets
|
||||
|
||||
| Metric | Target | Acceptable | Critical |
|--------|--------|------------|----------|
| Health check interval | 5s | 10s | 30s |
| Crash detection time | <5s | <10s | <30s |
| Restart time | <10s | <20s | <60s |
| MQTT publish latency | <100ms | <500ms | <2s |
| CPU usage (watchdog) | <2% | <5% | <10% |
| RAM usage (watchdog) | <50MB | <100MB | <200MB |
| Log message size | <1KB | <10KB | <100KB |
|
||||
|
||||
---
|
||||
|
||||
## 16. Troubleshooting Guide (For Client Development)
|
||||
|
||||
### Issue: Logs not appearing in server database
|
||||
**Check:**
|
||||
1. Is MQTT broker reachable? (`mosquitto_pub` test from client)
|
||||
2. Is client UUID correct and exists in `clients` table?
|
||||
3. Is timestamp format correct (ISO 8601 with 'Z')?
|
||||
4. Check server listener logs for errors
|
||||
|
||||
### Issue: Health metrics not updating
|
||||
**Check:**
|
||||
1. Is health loop running? (check watchdog service status)
|
||||
2. Is MQTT connected? (check connection status in logs)
|
||||
3. Is payload JSON valid? (use JSON validator)
|
||||
|
||||
### Issue: Process restarts in loop
|
||||
**Check:**
|
||||
1. Is media file/URL accessible?
|
||||
2. Is process command correct? (test manually)
|
||||
3. Check process exit code (crash reason)
|
||||
4. Increase restart cooldown to avoid rapid loops
|
||||
|
||||
---
|
||||
|
||||
## 17. Complete Message Flow Diagram
|
||||
|
||||
```
|
||||
┌─────────────────────────────────────────────────────────┐
|
||||
│ Infoscreen Client │
|
||||
│ │
|
||||
│ Event Occurs: │
|
||||
│ - Process crashed │
|
||||
│ - High CPU usage │
|
||||
│ - Content loaded │
|
||||
│ │
|
||||
│ ┌────────────────┐ │
|
||||
│ │ Decision Logic │ │
|
||||
│ │ - Is it ERROR?│ │
|
||||
│ │ - Is it WARN? │ │
|
||||
│ │ - Is it INFO? │ │
|
||||
│ └────────┬───────┘ │
|
||||
│ │ │
|
||||
│ ▼ │
|
||||
│ ┌────────────────────────────────┐ │
|
||||
│ │ Build JSON Payload │ │
|
||||
│ │ { │ │
|
||||
│ │ "timestamp": "...", │ │
|
||||
│ │ "message": "...", │ │
|
||||
│ │ "context": {...} │ │
|
||||
│ │ } │ │
|
||||
│ └────────┬───────────────────────┘ │
|
||||
│ │ │
|
||||
│ ▼ │
|
||||
│ ┌────────────────────────────────┐ │
|
||||
│ │ MQTT Publish │ │
|
||||
│ │ Topic: infoscreen/{uuid}/logs/error │
|
||||
│ │ QoS: 1 │ │
|
||||
│ └────────┬───────────────────────┘ │
|
||||
└───────────┼──────────────────────────────────────────┘
|
||||
│
|
||||
│ TCP/IP (MQTT Protocol)
|
||||
│
|
||||
▼
|
||||
┌──────────────┐
|
||||
│ MQTT Broker │
|
||||
│ (Mosquitto) │
|
||||
└──────┬───────┘
|
||||
│
|
||||
│ Topic: infoscreen/+/logs/#
|
||||
│
|
||||
▼
|
||||
┌──────────────────────────────┐
|
||||
│ Listener Service │
|
||||
│ (Python) │
|
||||
│ │
|
||||
│ - Parse JSON │
|
||||
│ - Validate UUID │
|
||||
│ - Store in database │
|
||||
└──────┬───────────────────────┘
|
||||
│
|
||||
▼
|
||||
┌──────────────────────────────┐
|
||||
│ MariaDB Database │
|
||||
│ │
|
||||
│ Table: client_logs │
|
||||
│ - client_uuid │
|
||||
│ - timestamp │
|
||||
│ - level │
|
||||
│ - message │
|
||||
│ - context (JSON) │
|
||||
└──────┬───────────────────────┘
|
||||
│
|
||||
│ SQL Query
|
||||
│
|
||||
▼
|
||||
┌──────────────────────────────┐
|
||||
│ API Server (Flask) │
|
||||
│ │
|
||||
│ GET /api/client-logs/{uuid}/logs
|
||||
│ GET /api/client-logs/summary
|
||||
└──────┬───────────────────────┘
|
||||
│
|
||||
│ HTTP/JSON
|
||||
│
|
||||
▼
|
||||
┌──────────────────────────────┐
|
||||
│ Dashboard (React) │
|
||||
│ │
|
||||
│ - Display logs │
|
||||
│ - Filter by level │
|
||||
│ - Show health status │
|
||||
└───────────────────────────────┘
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## 18. Quick Reference Card
|
||||
|
||||
### MQTT Topics Summary
|
||||
```
|
||||
infoscreen/{uuid}/logs/error → Critical failures
|
||||
infoscreen/{uuid}/logs/warn → Non-critical issues
|
||||
infoscreen/{uuid}/logs/info → Informational (dev mode)
|
||||
infoscreen/{uuid}/health → Health metrics (every 5s)
|
||||
infoscreen/{uuid}/heartbeat → Enhanced heartbeat (every 60s)
|
||||
```
|
||||
|
||||
### JSON Timestamp Format
|
||||
```python
from datetime import datetime, timezone

# isoformat() would emit microseconds and "+00:00"; strftime yields the required "Z" form.
timestamp = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
# Example value: "2026-03-10T07:30:00Z"
```
|
||||
|
||||
### Process Status Values
|
||||
```
|
||||
"running" - Process is alive and responding
|
||||
"crashed" - Process terminated unexpectedly
|
||||
"starting" - Process is launching (startup phase)
|
||||
"stopped" - Process intentionally stopped
|
||||
```
|
||||
|
||||
### Restart Logic
|
||||
```
|
||||
Max attempts: 3
|
||||
Cooldown: 2 seconds between attempts
|
||||
Reset: After 5 minutes of successful operation
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## 19. Contact & Support
|
||||
|
||||
**Server API Documentation:**
|
||||
- Base URL: `http://192.168.43.201:8000`
|
||||
- Health check: `GET /health`
|
||||
- Test logs: `GET /api/client-logs/test` (no auth)
|
||||
- Full API docs: See `CLIENT_MONITORING_IMPLEMENTATION_GUIDE.md` on server
|
||||
|
||||
**MQTT Broker:**
|
||||
- Host: `192.168.43.201`
|
||||
- Port: `1883` (standard), `9001` (WebSocket)
|
||||
- Test tool: `mosquitto_pub` / `mosquitto_sub`
|
||||
|
||||
**Database Schema:**
|
||||
- Table: `client_logs`
|
||||
- Foreign Key: `client_uuid` → `clients.uuid` (ON DELETE CASCADE)
|
||||
- Constraint: UUID must exist in clients table before logging
|
||||
|
||||
**Server-Side Logs:**
|
||||
```bash
|
||||
# View listener logs (processes MQTT messages)
|
||||
docker compose logs -f listener
|
||||
|
||||
# View server logs (API requests)
|
||||
docker compose logs -f server
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## 20. Appendix: Example Implementations
|
||||
|
||||
### A. Minimal Python Watchdog (Pseudocode)
|
||||
|
||||
```python
import json
import subprocess
import time
from datetime import datetime, timezone

import paho.mqtt.client as mqtt
import psutil


class MinimalWatchdog:
    MAX_RESTARTS = 3
    CHECK_INTERVAL = 5     # seconds between process checks
    RESTART_COOLDOWN = 2   # seconds to wait before relaunching

    def __init__(self, client_uuid, mqtt_broker):
        self.uuid = client_uuid
        # Client ID and persistent session follow Section 2.1.
        self.mqtt_client = mqtt.Client(
            callback_api_version=mqtt.CallbackAPIVersion.VERSION2,
            client_id=f"infoscreen-{client_uuid}",
            clean_session=False,
        )
        self.mqtt_client.connect(mqtt_broker, 1883, 60)
        self.mqtt_client.loop_start()

        self.expected_process = None  # process name, e.g. "vlc"
        self.launch_command = None    # argv list used to (re)start the process
        self.restart_attempts = 0
        self.failed = False           # set once MAX_RESTARTS is exhausted

    def send_log(self, level, message, context=None):
        topic = f"infoscreen/{self.uuid}/logs/{level}"
        payload = {
            # The spec requires the "Z" suffix; isoformat() would emit "+00:00".
            "timestamp": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
            "message": message[:1000],
            "context": context or {},
        }
        qos = 0 if level == "info" else 1  # QoS per Section 2.2
        self.mqtt_client.publish(topic, json.dumps(payload), qos=qos)

    def is_process_running(self, process_name):
        for proc in psutil.process_iter(["name"]):
            # The name can be None when access to the process is denied.
            if process_name in (proc.info["name"] or ""):
                return True
        return False

    def restart_process(self):
        self.restart_attempts += 1
        self.send_log("warn", f"Attempting restart ({self.restart_attempts}/{self.MAX_RESTARTS})")
        time.sleep(self.RESTART_COOLDOWN)
        subprocess.Popen(self.launch_command)

    def monitor_loop(self):
        while True:
            if self.expected_process and not self.failed:
                if not self.is_process_running(self.expected_process):
                    self.send_log("error", f"{self.expected_process} crashed")
                    if self.restart_attempts < self.MAX_RESTARTS:
                        self.restart_process()
                    else:
                        self.send_log("error", "Max restarts exceeded")
                        self.failed = True  # stop retrying until the event changes
            time.sleep(self.CHECK_INTERVAL)


# Usage:
if __name__ == "__main__":
    watchdog = MinimalWatchdog("9b8d1856-ff34-4864-a726-12de072d0f77", "192.168.43.201")
    watchdog.expected_process = "vlc"
    watchdog.launch_command = ["vlc", "--fullscreen", "--loop", "/path/to/video.mp4"]
    watchdog.monitor_loop()
```
|
||||
|
||||
---
|
||||
|
||||
**END OF SPECIFICATION**
|
||||
|
||||
Questions? Refer to:
|
||||
- `CLIENT_MONITORING_IMPLEMENTATION_GUIDE.md` (server repo)
|
||||
- Server API: `http://192.168.43.201:8000/api/client-logs/test`
|
||||
- MQTT test: `mosquitto_sub -h 192.168.43.201 -t infoscreen/#`
|
||||