# Client-Side Monitoring Specification

**Version:** 1.0  
**Date:** 2026-03-10  
**For:** Infoscreen Client Implementation  
**Server Endpoint:** `192.168.43.201:8000` (or your production server)  
**MQTT Broker:** `192.168.43.201:1883` (or your production MQTT broker)

---

## 1. Overview

Each infoscreen client must implement health monitoring and logging, and report its status to the central server via MQTT.

### 1.1 Goals
- **Detect failures:** Process crashes, frozen screens, content mismatches
- **Provide visibility:** Real-time health status visible on server dashboard
- **Enable remote diagnosis:** Centralized log storage for debugging
- **Auto-recovery:** Attempt automatic restart on failure

### 1.2 Architecture
```
┌─────────────────────────────────────────┐
│            Infoscreen Client            │
│                                         │
│  ┌──────────────┐    ┌──────────────┐   │
│  │ Media Player │    │   Watchdog   │   │
│  │ (VLC/Chrome) │◄───│   Monitor    │   │
│  └──────────────┘    └──────┬───────┘   │
│                             │           │
│  ┌──────────────┐           │           │
│  │  Event Mgr   │           │           │
│  │  (receives   │           │           │
│  │   schedule)  │◄──────────┘           │
│  └──────┬───────┘                       │
│         │                               │
│  ┌──────▼───────────────────────┐       │
│  │ MQTT Client                  │       │
│  │ - Heartbeat (every 60s)      │       │
│  │ - Logs (error/warn/info)     │       │
│  │ - Health metrics (every 5s)  │       │
│  └──────┬───────────────────────┘       │
└─────────┼───────────────────────────────┘
          │
          │ MQTT over TCP
          ▼
   ┌─────────────┐
   │ MQTT Broker │
   │  (server)   │
   └─────────────┘
```

### 1.3 Current Compatibility Notes
- The server now accepts both the original specification payloads and the currently implemented Phase 3 client payloads.
- `infoscreen/{uuid}/health` may currently contain a reduced payload with only `expected_state.event_id` and `actual_state.process|pid|status`. Additional `health_metrics` fields from this specification remain recommended.
- `event_id` is still specified as an integer. For compatibility with the current Phase 3 client, the server also tolerates string values such as `event_123` and extracts the numeric suffix where possible.
- If the client sends `process_health` inside `infoscreen/{uuid}/dashboard`, the server treats it as a fallback source for `current_process`, `process_pid`, `process_status`, and `current_event_id`.
- Long term, the preferred client payload remains the structure in this specification so the server can surface richer monitoring data such as screen state and resource metrics.

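The tolerant `event_id` handling described in the notes above can be pictured roughly as follows. This is a minimal sketch and not the server's actual code: the function name `normalize_event_id` and the exact "trailing digits" rule are assumptions made for illustration.

```python
import re
from typing import Optional, Union


def normalize_event_id(value: Union[int, str, None]) -> Optional[int]:
    """Return an integer event ID, or None if no ID can be recovered.

    Hypothetical helper: accepts the specified integer form and string
    forms such as "event_123", from which the numeric suffix is taken.
    """
    if isinstance(value, bool):  # bool is a subclass of int; reject it
        return None
    if isinstance(value, int):
        return value
    if isinstance(value, str):
        match = re.search(r"(\d+)$", value.strip())
        if match:
            return int(match.group(1))
    return None
```

Clients should still send a plain integer; the string form is a compatibility path only.
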
---

## 2. MQTT Protocol Specification

### 2.1 Connection Parameters
```
Broker:            192.168.43.201 (or DNS hostname)
Port:              1883 (standard MQTT)
Protocol:          MQTT v3.1.1
Client ID:         "infoscreen-{client_uuid}"
Clean Session:     false (retain subscriptions)
Keep Alive:        60 seconds
Username/Password: (if configured on broker)
```

### 2.2 QoS Levels
- **Heartbeat:** QoS 0 (fire and forget; a lost heartbeat is superseded by the next one)
- **Logs (ERROR/WARN):** QoS 1 (at least once delivery, important)
- **Logs (INFO):** QoS 0 (optional, high volume)
- **Health metrics:** QoS 0 (frequent, latest value matters)

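The QoS rules above can be captured in one small helper so that every publish call uses the same mapping. This is a sketch under the assumption that topics follow the patterns in this specification; the name `qos_for_topic` is hypothetical.

```python
def qos_for_topic(topic: str) -> int:
    """Pick the publish QoS for a topic, following the levels listed above."""
    parts = topic.split("/")
    # Expected shapes: infoscreen/{uuid}/logs/{level}, .../health, .../heartbeat
    if len(parts) == 4 and parts[2] == "logs" and parts[3] in ("error", "warn"):
        return 1  # at-least-once: these messages must not be lost
    return 0  # heartbeat, health metrics and INFO logs are fire-and-forget
```

For example, `qos_for_topic("infoscreen/<uuid>/logs/error")` yields 1, while the heartbeat and health topics yield 0.
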
---

## 3. Topic Structure & Payload Formats

### 3.1 Log Messages

#### Topic Pattern:
```
infoscreen/{client_uuid}/logs/{level}
```

Where `{level}` is one of: `error`, `warn`, `info`

#### Payload Format (JSON):
```json
{
  "timestamp": "2026-03-10T07:30:00Z",
  "message": "Human-readable error description",
  "context": {
    "event_id": 42,
    "process": "vlc",
    "error_code": "NETWORK_TIMEOUT",
    "additional_key": "any relevant data"
  }
}
```

#### Field Specifications:
| Field | Type | Required | Description |
|-------|------|----------|-------------|
| `timestamp` | string (ISO 8601 UTC) | Yes | When the event occurred. Use `YYYY-MM-DDTHH:MM:SSZ` format |
| `message` | string | Yes | Human-readable description of the event (max 1000 chars) |
| `context` | object | No | Additional structured data (will be stored as JSON) |

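A log payload that respects the field table above could be assembled as in the following sketch. The helper name `build_log_payload` and the choice to truncate over-long messages (rather than reject them) are assumptions, not part of the specification.

```python
import json
from datetime import datetime, timezone
from typing import Optional

MAX_MESSAGE_CHARS = 1000  # limit taken from the field table above


def build_log_payload(message: str, context: Optional[dict] = None,
                      now: Optional[datetime] = None) -> str:
    """Serialize one log message in the YYYY-MM-DDTHH:MM:SSZ timestamp format."""
    now = now or datetime.now(timezone.utc)
    payload = {
        # strftime is used because isoformat() emits "+00:00", not "Z"
        "timestamp": now.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "message": message[:MAX_MESSAGE_CHARS],  # assumption: truncate, do not fail
    }
    if context:  # `context` is optional, so omit it when empty
        payload["context"] = context
    return json.dumps(payload)
```

The resulting string is what would be published to `infoscreen/{client_uuid}/logs/{level}`.
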
#### Example Topics:
```
infoscreen/9b8d1856-ff34-4864-a726-12de072d0f77/logs/error
infoscreen/9b8d1856-ff34-4864-a726-12de072d0f77/logs/warn
infoscreen/9b8d1856-ff34-4864-a726-12de072d0f77/logs/info
```

#### When to Send Logs:

**ERROR (Always send):**
- Process crashed (VLC/Chromium/PDF viewer terminated unexpectedly)
- Content failed to load (404, network timeout, corrupt file)
- Hardware failure detected (display off, audio device missing)
- Exception caught in main event loop
- Maximum restart attempts exceeded

**WARN (Always send):**
- Process restarted automatically (after crash)
- High resource usage (CPU >80%, RAM >90%)
- Slow performance (frame drops, lag)
- Non-critical failures (screenshot capture failed, cache full)
- Fallback content displayed (primary source unavailable)

**INFO (Send in development, optional in production):**
- Process started successfully
- Event transition (switched from video to presentation)
- Content loaded successfully
- Watchdog service started/stopped

---

### 3.2 Health Metrics

#### Topic Pattern:
```
infoscreen/{client_uuid}/health
```

#### Payload Format (JSON):
```json
{
  "timestamp": "2026-03-10T07:30:00Z",
  "expected_state": {
    "event_id": 42,
    "event_type": "video",
    "media_file": "presentation.mp4",
    "started_at": "2026-03-10T07:15:00Z"
  },
  "actual_state": {
    "process": "vlc",
    "pid": 1234,
    "status": "running",
    "uptime_seconds": 900,
    "position": 45.3,
    "duration": 180.0
  },
  "health_metrics": {
    "screen_on": true,
    "last_frame_update": "2026-03-10T07:29:58Z",
    "frames_dropped": 2,
    "network_errors": 0,
    "cpu_percent": 15.3,
    "memory_mb": 234
  }
}
```

#### Field Specifications:

**expected_state:**
| Field | Type | Required | Description |
|-------|------|----------|-------------|
| `event_id` | integer | Yes | Current event ID from scheduler |
| `event_type` | string | Yes | `presentation`, `video`, `website`, `webuntis`, `message` |
| `media_file` | string | No | Filename or URL of current content |
| `started_at` | string (ISO 8601) | Yes | When this event started playing |

**actual_state:**
| Field | Type | Required | Description |
|-------|------|----------|-------------|
| `process` | string | Yes | `vlc`, `chromium`, `pdf_viewer`, `none` |
| `pid` | integer | No | Process ID (if running) |
| `status` | string | Yes | `running`, `crashed`, `starting`, `stopped` |
| `uptime_seconds` | integer | No | How long process has been running |
| `position` | float | No | Current playback position (seconds, for video/audio) |
| `duration` | float | No | Total content duration (seconds) |

**health_metrics:**
| Field | Type | Required | Description |
|-------|------|----------|-------------|
| `screen_on` | boolean | Yes | Is display powered on? |
| `last_frame_update` | string (ISO 8601) | No | Last time screen content changed |
| `frames_dropped` | integer | No | Video frames dropped (performance indicator) |
| `network_errors` | integer | No | Count of network errors in last interval |
| `cpu_percent` | float | No | CPU usage (0-100) |
| `memory_mb` | integer | No | RAM usage in megabytes |

#### Sending Frequency:
- **Normal operation:** Every 5 seconds
- **During startup/transition:** Every 1 second
- **After error:** Immediately + every 2 seconds until recovered

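One way to keep the three sending frequencies in a single place is a small lookup, as sketched below. The phase names (`normal`, `transition`, `recovering`) are hypothetical labels chosen for this example; the specification itself only defines the intervals.

```python
# Seconds between health publishes for each (assumed) watchdog phase.
HEALTH_INTERVALS = {
    "normal": 5.0,      # steady-state operation
    "transition": 1.0,  # startup or event transition
    "recovering": 2.0,  # after an error, until recovered
}


def health_interval_seconds(phase: str) -> float:
    """Return the publish interval for a phase, rejecting unknown names."""
    try:
        return HEALTH_INTERVALS[phase]
    except KeyError:
        raise ValueError(f"unknown phase: {phase!r}") from None
```

The "immediately" part of the after-error rule is not covered by the lookup: the first health message after an error should be published right away, before the 2-second cadence starts.
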
---

### 3.3 Enhanced Heartbeat

The existing heartbeat topic should be enhanced to include process status.

#### Topic Pattern:
```
infoscreen/{client_uuid}/heartbeat
```

#### Enhanced Payload Format (JSON):
```json
{
  "uuid": "9b8d1856-ff34-4864-a726-12de072d0f77",
  "timestamp": "2026-03-10T07:30:00Z",
  "current_process": "vlc",
  "process_pid": 1234,
  "process_status": "running",
  "current_event_id": 42
}
```

#### New Fields (add to existing heartbeat):
| Field | Type | Required | Description |
|-------|------|----------|-------------|
| `current_process` | string | No | Name of active media player process |
| `process_pid` | integer | No | Process ID |
| `process_status` | string | No | `running`, `crashed`, `starting`, `stopped` |
| `current_event_id` | integer | No | Event ID currently being displayed |

#### Sending Frequency:
- Keep existing: **Every 60 seconds**
- Include new fields if available

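"Include new fields if available" can be read as: omit a field entirely when its value is unknown, rather than sending `null`. The sketch below follows that reading, which is an assumption; the helper name `build_heartbeat` is hypothetical.

```python
import json
from typing import Optional


def build_heartbeat(uuid: str, timestamp: str,
                    current_process: Optional[str] = None,
                    process_pid: Optional[int] = None,
                    process_status: Optional[str] = None,
                    current_event_id: Optional[int] = None) -> str:
    """Build the enhanced heartbeat, leaving out fields that are not known."""
    payload = {"uuid": uuid, "timestamp": timestamp}
    optional_fields = {
        "current_process": current_process,
        "process_pid": process_pid,
        "process_status": process_status,
        "current_event_id": current_event_id,
    }
    payload.update({k: v for k, v in optional_fields.items() if v is not None})
    return json.dumps(payload)
```

A client with no active event would therefore send only `uuid` and `timestamp`, which matches the existing heartbeat.
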
---

## 4. Process Monitoring Requirements

### 4.1 Processes to Monitor

| Media Type | Process Name | How to Detect |
|------------|--------------|---------------|
| Video | `vlc` | `ps aux \| grep vlc` or `pgrep vlc` |
| Website/WebUntis | `chromium` or `chromium-browser` | `pgrep chromium` |
| PDF Presentation | `evince`, `okular`, or custom viewer | `pgrep {viewer_name}` |

### 4.2 Monitoring Checks (Every 5 seconds)

#### Check 1: Process Alive
```
Goal: Verify expected process is running
Method:
  - Get list of running processes (psutil or `ps`)
  - Check if expected process name exists
  - Match PID if known
Result:
  - If missing → status = "crashed"
  - If found → status = "running"
Action on crash:
  - Send ERROR log immediately
  - Attempt restart (max 3 attempts)
  - Send WARN log on each restart
  - If max restarts exceeded → send ERROR log, display fallback
```

#### Check 2: Process Responsive
```
Goal: Detect frozen processes
Method:
  - For VLC: Query HTTP interface (status.json)
  - For Chromium: Use DevTools Protocol (CDP)
  - For custom viewers: Check last screen update time
Result:
  - If same frame >30 seconds → likely frozen
  - If playback position not advancing → frozen
Action on freeze:
  - Send WARN log
  - Force refresh (reload page, seek video, next slide)
  - If refresh fails → restart process
```

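The "playback position not advancing" rule from Check 2 can be tracked with a few lines of state. The following is a sketch only: the class name `FreezeDetector` is hypothetical, and reusing the 30-second limit from the "same frame" rule for playback position is an assumption.

```python
import time
from typing import Callable, Optional


class FreezeDetector:
    """Flags playback as frozen when the reported position stops advancing."""

    def __init__(self, timeout_seconds: float = 30.0,
                 clock: Callable[[], float] = time.monotonic):
        self.timeout_seconds = timeout_seconds
        self._clock = clock  # injectable so the logic can be tested without sleeping
        self._last_position: Optional[float] = None
        self._last_change = clock()

    def update(self, position: float) -> bool:
        """Record the latest position; return True if playback looks frozen."""
        now = self._clock()
        if position != self._last_position:
            self._last_position = position
            self._last_change = now
            return False
        return (now - self._last_change) > self.timeout_seconds
```

`update()` would be called from the 5-second monitor loop with the position read from the player. Note that intentionally paused content would also trip this check, so the caller should skip it while the expected state is "paused".
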
#### Check 3: Content Match
```
Goal: Verify correct content is displayed
Method:
  - Compare expected event_id with actual media/URL
  - Check scheduled time window (is event still active?)
Result:
  - Mismatch → content error
Action:
  - Send WARN log
  - Reload correct event from scheduler
```

---

## 5. Process Control Interface Requirements

### 5.1 VLC Control

**Requirement:** Enable VLC HTTP interface for monitoring

**Launch Command:**
```bash
vlc --intf http --http-host 127.0.0.1 --http-port 8080 --http-password "vlc_password" \
    --fullscreen --loop /path/to/video.mp4
```

**Status Query:**
```bash
curl http://127.0.0.1:8080/requests/status.json --user ":vlc_password"
```

**Response Fields to Monitor:**
```json
{
  "state": "playing",      // "playing", "paused", "stopped"
  "position": 0.25,        // 0.0-1.0 (25% through)
  "time": 45,              // seconds into playback
  "length": 180,           // total duration in seconds
  "volume": 256            // 0-512
}
```

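The status query above can be issued from Python with the standard library alone. The sketch below assumes the launch command shown in this section (localhost, port 8080, empty username); the function names `fetch_vlc_status` and `vlc_to_actual_state` are hypothetical. One detail worth noting is that VLC's `position` is a 0.0-1.0 fraction, while `actual_state.position` in section 3.2 is in seconds, so `time` and `length` are the fields to map.

```python
import base64
import json
import urllib.request


def fetch_vlc_status(password: str, host: str = "127.0.0.1", port: int = 8080,
                     timeout: float = 2.0) -> dict:
    """Query VLC's HTTP interface using basic auth with an empty username."""
    request = urllib.request.Request(f"http://{host}:{port}/requests/status.json")
    token = base64.b64encode(f":{password}".encode()).decode()
    request.add_header("Authorization", f"Basic {token}")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


def vlc_to_actual_state(status: dict) -> dict:
    """Map status.json fields onto the `actual_state` fields of section 3.2."""
    return {
        "process": "vlc",
        "status": "running",  # VLC answered the query, so the process is alive
        "position": float(status.get("time", 0)),    # seconds into playback
        "duration": float(status.get("length", 0)),  # total seconds
    }
```

A timeout or connection error from `fetch_vlc_status` is itself a useful signal: it suggests the process is gone or unresponsive and should feed into Check 1 and Check 2.
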
---

### 5.2 Chromium Control

**Requirement:** Enable Chrome DevTools Protocol (CDP)

**Launch Command:**
```bash
chromium --remote-debugging-port=9222 --kiosk --app=https://example.com
```

**Status Query:**
```bash
curl http://127.0.0.1:9222/json
```

**Response Fields to Monitor:**
```json
[
  {
    "url": "https://example.com",
    "title": "Page Title",
    "type": "page"
  }
]
```

**Advanced:** Use CDP WebSocket for events (page load, navigation, errors)

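For the basic (non-WebSocket) check, the `/json` target list is enough to learn which URL Chromium is showing. The following sketch assumes the debugging port from the launch command above; `fetch_cdp_targets` and `current_page_url` are hypothetical helper names.

```python
import json
import urllib.request
from typing import Optional


def fetch_cdp_targets(port: int = 9222, timeout: float = 2.0) -> list:
    """Fetch the list of debuggable targets from Chromium's local CDP endpoint."""
    url = f"http://127.0.0.1:{port}/json"
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.load(response)


def current_page_url(targets: list) -> Optional[str]:
    """Return the URL of the first target of type "page", if there is one."""
    for target in targets:
        if target.get("type") == "page":
            return target.get("url")
    return None
```

Comparing the returned URL with the scheduled one gives a simple form of the content-match check for `website` and `webuntis` events.
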
---

### 5.3 PDF Viewer (Custom or Standard)

**Option A: Standard Viewer (e.g., Evince)**
- No built-in API
- Monitor via process check + screenshot comparison

**Option B: Custom Python Viewer**
- Implement REST API for status queries
- Track: current page, total pages, last transition time

---

## 6. Watchdog Service Architecture

### 6.1 Service Components

**Component 1: Process Monitor Thread**
```
Responsibilities:
- Check process alive every 5 seconds
- Detect crashes and frozen processes
- Attempt automatic restart
- Send health metrics via MQTT

State Machine:
  IDLE → STARTING → RUNNING → (if crash) → RESTARTING → RUNNING
                                           RESTARTING → (if max restarts) → FAILED
```

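The state machine of the process monitor can be written as an explicit transition table, which makes illegal transitions easy to spot. This is a sketch: the state names come from the diagram above, while the event names (`launch`, `process_up`, `crash`, `max_restarts`) are assumptions introduced for the example.

```python
from enum import Enum


class State(Enum):
    IDLE = "idle"
    STARTING = "starting"
    RUNNING = "running"
    RESTARTING = "restarting"
    FAILED = "failed"


# (current state, event) -> next state; event names are hypothetical.
TRANSITIONS = {
    (State.IDLE, "launch"): State.STARTING,
    (State.STARTING, "process_up"): State.RUNNING,
    (State.RUNNING, "crash"): State.RESTARTING,
    (State.RESTARTING, "process_up"): State.RUNNING,
    (State.RESTARTING, "max_restarts"): State.FAILED,
}


def next_state(state: State, event: str) -> State:
    """Return the next state; events that do not apply leave the state unchanged."""
    return TRANSITIONS.get((state, event), state)
```

Keeping the table as data also makes it straightforward to log every transition as an INFO message.
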
**Component 2: MQTT Publisher Thread**
```
Responsibilities:
- Maintain MQTT connection
- Send heartbeat every 60 seconds
- Send logs on-demand (queued from other components)
- Send health metrics every 5 seconds
- Reconnect on connection loss
```

**Component 3: Event Manager Integration**
```
Responsibilities:
- Receive event schedule from server
- Notify watchdog of expected process/content
- Launch media player processes
- Handle event transitions
```

### 6.2 Service Lifecycle

**On Startup:**
1. Load configuration (client UUID, MQTT broker, etc.)
2. Connect to MQTT broker
3. Send INFO log: "Watchdog service started"
4. Wait for first event from scheduler

**During Operation:**
1. Monitor loop runs every 5 seconds
2. Check expected vs actual process state
3. Send health metrics
4. Handle failures (log + restart)

**On Shutdown:**
1. Send INFO log: "Watchdog service stopping"
2. Gracefully stop monitored processes
3. Disconnect from MQTT
4. Exit cleanly

---

## 7. Auto-Recovery Logic

### 7.1 Restart Strategy

**Step 1: Detect Failure**
```
Trigger: Process not found in process list
Action:
  - Log ERROR: "Process {name} crashed"
  - Increment restart counter
  - Check if within retry limit (max 3)
```

**Step 2: Attempt Restart**
```
If restart_attempts < MAX_RESTARTS:
  - Log WARN: "Attempting restart ({attempt}/{MAX_RESTARTS})"
  - Kill any zombie processes
  - Wait 2 seconds (cooldown)
  - Launch process with same parameters
  - Wait 5 seconds for startup
  - Verify process is running
  - If success: reset restart counter, log INFO
  - If fail: increment counter, repeat
```

**Step 3: Permanent Failure**
```
If restart_attempts >= MAX_RESTARTS:
  - Log ERROR: "Max restart attempts exceeded, failing over"
  - Display fallback content (static image with error message)
  - Send notification to server (separate alert topic, optional)
  - Wait for manual intervention or scheduler event change
```

### 7.2 Restart Cooldown

**Purpose:** Prevent rapid restart loops that waste resources

**Implementation:**
```
After each restart attempt:
  - Wait 2 seconds before next restart
  - After 3 failures: wait 30 seconds before trying again
  - Reset counter on successful run >5 minutes
```

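The counters and waits from sections 7.1 and 7.2 fit into a small policy object that the monitor loop can consult. The sketch below uses the values stated above (3 attempts, 2-second cooldown, 30-second wait, 5-minute stability window); the class name `RestartPolicy` and its method names are hypothetical.

```python
class RestartPolicy:
    """Tracks restart failures and tells the caller how long to wait."""

    def __init__(self, max_attempts: int = 3, cooldown: float = 2.0,
                 long_cooldown: float = 30.0, stable_seconds: float = 300.0):
        self.max_attempts = max_attempts
        self.cooldown = cooldown
        self.long_cooldown = long_cooldown
        self.stable_seconds = stable_seconds
        self.failures = 0

    def record_failure(self) -> float:
        """Count a crash or failed restart; return seconds to wait before retrying."""
        self.failures += 1
        if self.failures >= self.max_attempts:
            return self.long_cooldown
        return self.cooldown

    def record_uptime(self, uptime_seconds: float) -> None:
        """Reset the counter once the process has run longer than the stable window."""
        if uptime_seconds > self.stable_seconds:
            self.failures = 0

    @property
    def exhausted(self) -> bool:
        """True once the retry limit is reached and fallback content should be shown."""
        return self.failures >= self.max_attempts
```

When `exhausted` becomes true, the watchdog would log the "Max restart attempts exceeded" error and switch to fallback content as described in Step 3.
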
---

## 8. Resource Monitoring

### 8.1 System Metrics to Track

**CPU Usage:**
```
Method: Read /proc/stat or use psutil.cpu_percent()
Frequency: Every 5 seconds
Threshold: Warn if >80% for >60 seconds
```

**Memory Usage:**
```
Method: Read /proc/meminfo or use psutil.virtual_memory()
Frequency: Every 5 seconds
Threshold: Warn if >90% for >30 seconds
```

**Display Status:**
```
Method: Check DPMS state or xset query
Frequency: Every 30 seconds
Threshold: Error if display off (unexpected)
```

**Network Connectivity:**
```
Method: Ping server or check MQTT connection
Frequency: Every 60 seconds
Threshold: Warn if no server connectivity
```

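The CPU and memory thresholds above only fire when a value stays high for a period, not on a single spike. A minimal sketch of that "sustained" check follows; the class name `SustainedThreshold` is hypothetical, and the caller is assumed to pass in a monotonic time in seconds along with each sample.

```python
from typing import Optional


class SustainedThreshold:
    """Reports True once a value has stayed above `limit` for more than `hold_seconds`."""

    def __init__(self, limit: float, hold_seconds: float):
        self.limit = limit
        self.hold_seconds = hold_seconds
        self._above_since: Optional[float] = None

    def update(self, value: float, now: float) -> bool:
        """Feed one sample taken at time `now`; return True if a WARN is due."""
        if value <= self.limit:
            self._above_since = None  # any sample at or below the limit resets the timer
            return False
        if self._above_since is None:
            self._above_since = now
        return (now - self._above_since) > self.hold_seconds
```

With the values from this section, CPU would use `SustainedThreshold(80.0, 60.0)` and memory `SustainedThreshold(90.0, 30.0)`. Combined with the rate limiting in section 9.2, this avoids repeating the same warning on every 5-second sample.
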
---

## 9. Development vs Production Mode

### 9.1 Development Mode

**Enable via:** Environment variable `DEBUG=true` or `ENV=development`

**Behavior:**
- Send INFO level logs
- More verbose logging to console
- Shorter monitoring intervals (faster feedback)
- Screenshot capture every 30 seconds
- No rate limiting on logs

### 9.2 Production Mode

**Enable via:** `ENV=production`

**Behavior:**
- Send only ERROR and WARN logs
- Minimal console output
- Standard monitoring intervals
- Screenshot capture every 60 seconds
- Rate limiting: max 10 logs per minute per level

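The production rate limit (max 10 logs per minute per level) can be enforced with a sliding window per level. This is one possible sketch, not a required design: the class name `LogRateLimiter` is hypothetical, and a sliding window is an assumption since the specification does not say how the minute is measured.

```python
from collections import defaultdict, deque


class LogRateLimiter:
    """Sliding-window limiter: at most `max_per_window` logs per level per window."""

    def __init__(self, max_per_window: int = 10, window_seconds: float = 60.0):
        self.max_per_window = max_per_window
        self.window_seconds = window_seconds
        self._sent = defaultdict(deque)  # level -> timestamps of recent sends

    def allow(self, level: str, now: float) -> bool:
        """Return True and record the send if the level is still under its limit."""
        sent = self._sent[level]
        while sent and now - sent[0] >= self.window_seconds:
            sent.popleft()  # drop sends that have left the window
        if len(sent) >= self.max_per_window:
            return False
        sent.append(now)
        return True
```

In development mode the limiter would simply be bypassed, since section 9.1 specifies no rate limiting there.
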
---

## 10. Configuration File Format

### 10.1 Recommended Config: JSON

**File:** `/etc/infoscreen/config.json` or `~/.config/infoscreen/config.json`

```json
{
  "client": {
    "uuid": "9b8d1856-ff34-4864-a726-12de072d0f77",
    "hostname": "infoscreen-room-101"
  },
  "mqtt": {
    "broker": "192.168.43.201",
    "port": 1883,
    "username": "",
    "password": "",
    "keepalive": 60
  },
  "monitoring": {
    "enabled": true,
    "health_interval_seconds": 5,
    "heartbeat_interval_seconds": 60,
    "max_restart_attempts": 3,
    "restart_cooldown_seconds": 2
  },
  "logging": {
    "level": "INFO",
    "send_info_logs": false,
    "console_output": true,
    "local_log_file": "/var/log/infoscreen/watchdog.log"
  },
  "processes": {
    "vlc": {
      "http_port": 8080,
      "http_password": "vlc_password"
    },
    "chromium": {
      "debug_port": 9222
    }
  }
}
```

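Loading this file could look like the sketch below, which tries the two locations in the order listed and fills in the `monitoring` defaults shown in the example. The function names and the decision to treat `client.uuid` as the only mandatory key are assumptions made for illustration.

```python
import json
from pathlib import Path

# Candidate locations, in the order given above.
CONFIG_PATHS = ["/etc/infoscreen/config.json", "~/.config/infoscreen/config.json"]

# Defaults mirror the values in the example config.
MONITORING_DEFAULTS = {
    "enabled": True,
    "health_interval_seconds": 5,
    "heartbeat_interval_seconds": 60,
    "max_restart_attempts": 3,
    "restart_cooldown_seconds": 2,
}


def parse_config(text: str) -> dict:
    """Parse config JSON, require client.uuid, and apply monitoring defaults."""
    config = json.loads(text)
    if not config.get("client", {}).get("uuid"):
        raise ValueError("client.uuid is required")
    config["monitoring"] = {**MONITORING_DEFAULTS, **config.get("monitoring", {})}
    return config


def load_config(paths=CONFIG_PATHS) -> dict:
    """Load the first config file that exists among the candidate paths."""
    for raw in paths:
        path = Path(raw).expanduser()
        if path.is_file():
            return parse_config(path.read_text(encoding="utf-8"))
    raise FileNotFoundError("no infoscreen config found in: " + ", ".join(paths))
```

Failing early on a missing UUID is deliberate: without it every log would be rejected by the server's foreign-key constraint anyway.
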
---

## 11. Error Scenarios & Expected Behavior

### Scenario 1: VLC Crashes Mid-Video
```
1. Watchdog detects: process_status = "crashed"
2. Send ERROR log: "VLC process crashed"
3. Attempt 1: Restart VLC with same video, seek to last position
4. If success: Send INFO log "VLC restarted successfully"
5. If fail: Repeat 2 more times
6. After 3 failures: Send ERROR "Max restarts exceeded", show fallback
```

### Scenario 2: Network Timeout Loading Website
```
1. Chromium fails to load page (CDP reports error)
2. Send WARN log: "Page load timeout"
3. Attempt reload (Chromium refresh)
4. If success after 10s: Continue monitoring
5. If timeout again: Send ERROR, try restarting Chromium
```

### Scenario 3: Display Powers Off (Hardware)
```
1. DPMS check detects display off
2. Send ERROR log: "Display powered off"
3. Attempt to wake display (xset dpms force on)
4. If success: Send INFO log
5. If fail: Hardware issue, alert admin
```

### Scenario 4: High CPU Usage
```
1. CPU >80% for 60 seconds
2. Send WARN log: "High CPU usage: 85%"
3. Check if expected (e.g., video playback is normal)
4. If unexpected: investigate process causing it
5. If critical (>95%): consider restarting offending process
```

---

## 12. Testing & Validation

### 12.1 Manual Tests (During Development)

**Test 1: Process Crash Simulation**
```bash
# Start video, then kill VLC manually
killall vlc
# Expected: ERROR log sent within 5 seconds, followed by an automatic restart attempt
```

**Test 2: MQTT Connectivity**
```bash
# Subscribe to all client topics on server
mosquitto_sub -h 192.168.43.201 -t "infoscreen/{uuid}/#" -v
# Expected: See heartbeat every 60s, health every 5s
```

**Test 3: Log Levels**
```bash
# Trigger error condition and verify log appears in database
curl http://192.168.43.201:8000/api/client-logs/test
# Expected: See new log entry with correct level/message
```

### 12.2 Acceptance Criteria

✅ **Client must:**
1. Send heartbeat every 60 seconds without gaps
2. Send ERROR log within 5 seconds of process crash
3. Attempt automatic restart (max 3 times)
4. Report health metrics every 5 seconds
5. Survive MQTT broker restart (reconnect automatically)
6. Survive network interruption (buffer logs, send when reconnected)
7. Use correct timestamp format (ISO 8601 UTC)
8. Send logs only under a client UUID that exists in the `clients` table (FK constraint)

---

## 13. Python Libraries (Recommended)

**For process monitoring:**
- `psutil` - Cross-platform process and system utilities

**For MQTT:**
- `paho-mqtt` - Official MQTT client (use v2.x with Callback API v2)

**For VLC control:**
- `requests` - HTTP client for status queries

**For Chromium control:**
- `websocket-client` or `pychrome` - Chrome DevTools Protocol

**For datetime:**
- `datetime` (stdlib) - Use `datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")`; plain `isoformat()` emits `+00:00` and microseconds instead of the `Z` form required in section 3.1

**Example requirements.txt:**
```
paho-mqtt>=2.0.0
psutil>=5.9.0
requests>=2.31.0
python-dateutil>=2.8.0
```

---

## 14. Security Considerations

### 14.1 MQTT Security
- If broker requires auth, store credentials in config file with restricted permissions (`chmod 600`)
- Consider TLS/SSL for MQTT (port 8883) if on untrusted network
- Use unique client ID to prevent impersonation

### 14.2 Process Control APIs
- VLC HTTP password should be random, not default
- Chromium debug port should bind to `127.0.0.1` only (not `0.0.0.0`)
- Restrict file system access for media player processes

### 14.3 Log Content
- **Do not log:** Passwords, API keys, personal data
- **Sanitize:** File paths (strip user directories), URLs (remove query params with tokens)

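The sanitization rules in 14.3 can be applied just before a value is placed into a log's `context`. The sketch below takes the conservative route of dropping the whole query string, fragment, and any embedded credentials rather than trying to detect which parameters carry tokens; that simplification, the `/home/<user>` pattern, and the function names are all assumptions.

```python
import re
from urllib.parse import urlsplit, urlunsplit

# Matches a Linux home directory prefix such as /home/alice
_HOME_DIR = re.compile(r"/home/[^/\s]+")


def sanitize_url(url: str) -> str:
    """Strip credentials, query string and fragment from a URL before logging."""
    parts = urlsplit(url)
    netloc = parts.netloc.rsplit("@", 1)[-1]  # drop any "user:password@" prefix
    return urlunsplit((parts.scheme, netloc, parts.path, "", ""))


def sanitize_path(path: str) -> str:
    """Replace a user's home directory with "~" so usernames are not logged."""
    return _HOME_DIR.sub("~", path)
```

For example, a media path under `/home/alice/` would be logged with `~/` in place of the home directory, and a URL would be logged without its query parameters.
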
---

## 15. Performance Targets

| Metric | Target | Acceptable | Critical |
|--------|--------|------------|----------|
| Health check interval | 5s | 10s | 30s |
| Crash detection time | <5s | <10s | <30s |
| Restart time | <10s | <20s | <60s |
| MQTT publish latency | <100ms | <500ms | <2s |
| CPU usage (watchdog) | <2% | <5% | <10% |
| RAM usage (watchdog) | <50MB | <100MB | <200MB |
| Log message size | <1KB | <10KB | <100KB |

---

## 16. Troubleshooting Guide (For Client Development)

### Issue: Logs not appearing in server database
**Check:**
1. Is MQTT broker reachable? (`mosquitto_pub` test from client)
2. Is the client UUID correct, and does it exist in the `clients` table?
3. Is timestamp format correct (ISO 8601 with 'Z')?
4. Check server listener logs for errors

### Issue: Health metrics not updating
**Check:**
1. Is health loop running? (check watchdog service status)
2. Is MQTT connected? (check connection status in logs)
3. Is payload JSON valid? (use JSON validator)

### Issue: Process restarts in loop
**Check:**
1. Is media file/URL accessible?
2. Is process command correct? (test manually)
3. Check process exit code (crash reason)
4. Increase restart cooldown to avoid rapid loops

---

## 17. Complete Message Flow Diagram

```
┌─────────────────────────────────────────────────────────┐
│                    Infoscreen Client                    │
│                                                         │
│  Event Occurs:                                          │
│  - Process crashed                                      │
│  - High CPU usage                                       │
│  - Content loaded                                       │
│                                                         │
│  ┌────────────────┐                                     │
│  │ Decision Logic │                                     │
│  │ - Is it ERROR? │                                     │
│  │ - Is it WARN?  │                                     │
│  │ - Is it INFO?  │                                     │
│  └────────┬───────┘                                     │
│           │                                             │
│           ▼                                             │
│  ┌──────────────────────────────────────────┐           │
│  │ Build JSON Payload                       │           │
│  │ {                                        │           │
│  │   "timestamp": "...",                    │           │
│  │   "message": "...",                      │           │
│  │   "context": {...}                       │           │
│  │ }                                        │           │
│  └────────┬─────────────────────────────────┘           │
│           │                                             │
│           ▼                                             │
│  ┌──────────────────────────────────────────┐           │
│  │ MQTT Publish                             │           │
│  │ Topic: infoscreen/{uuid}/logs/error      │           │
│  │ QoS: 1                                   │           │
│  └────────┬─────────────────────────────────┘           │
└───────────┼─────────────────────────────────────────────┘
            │
            │ TCP/IP (MQTT Protocol)
            │
            ▼
     ┌─────────────┐
     │ MQTT Broker │
     │ (Mosquitto) │
     └──────┬──────┘
            │
            │ Topic: infoscreen/+/logs/#
            │
            ▼
     ┌────────────────────────────────────┐
     │ Listener Service                   │
     │ (Python)                           │
     │                                    │
     │ - Parse JSON                       │
     │ - Validate UUID                    │
     │ - Store in database                │
     └──────┬─────────────────────────────┘
            │
            ▼
     ┌────────────────────────────────────┐
     │ MariaDB Database                   │
     │                                    │
     │ Table: client_logs                 │
     │ - client_uuid                      │
     │ - timestamp                        │
     │ - level                            │
     │ - message                          │
     │ - context (JSON)                   │
     └──────┬─────────────────────────────┘
            │
            │ SQL Query
            │
            ▼
     ┌────────────────────────────────────┐
     │ API Server (Flask)                 │
     │                                    │
     │ GET /api/client-logs/{uuid}/logs   │
     │ GET /api/client-logs/summary       │
     └──────┬─────────────────────────────┘
            │
            │ HTTP/JSON
            │
            ▼
     ┌────────────────────────────────────┐
     │ Dashboard (React)                  │
     │                                    │
     │ - Display logs                     │
     │ - Filter by level                  │
     │ - Show health status               │
     └────────────────────────────────────┘
```

---

## 18. Quick Reference Card

### MQTT Topics Summary
```
infoscreen/{uuid}/logs/error    → Critical failures
infoscreen/{uuid}/logs/warn     → Non-critical issues
infoscreen/{uuid}/logs/info     → Informational (dev mode)
infoscreen/{uuid}/health        → Health metrics (every 5s)
infoscreen/{uuid}/heartbeat     → Enhanced heartbeat (every 60s)
```

### JSON Timestamp Format
```python
from datetime import datetime, timezone

# isoformat() would emit "+00:00" and microseconds, so format the "Z" form explicitly.
timestamp = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
# Example value: "2026-03-10T07:30:00Z"
```

### Process Status Values
```
"running"   - Process is alive and responding
"crashed"   - Process terminated unexpectedly
"starting"  - Process is launching (startup phase)
"stopped"   - Process intentionally stopped
```

### Restart Logic
```
Max attempts: 3
Cooldown: 2 seconds between attempts
Reset: After 5 minutes of successful operation
```

---

## 19. Contact & Support

**Server API Documentation:**
- Base URL: `http://192.168.43.201:8000`
- Health check: `GET /health`
- Test logs: `GET /api/client-logs/test` (no auth)
- Full API docs: See `CLIENT_MONITORING_IMPLEMENTATION_GUIDE.md` on server

**MQTT Broker:**
- Host: `192.168.43.201`
- Port: `1883` (standard), `9001` (WebSocket)
- Test tool: `mosquitto_pub` / `mosquitto_sub`

**Database Schema:**
- Table: `client_logs`
- Foreign Key: `client_uuid` → `clients.uuid` (ON DELETE CASCADE)
- Constraint: UUID must exist in clients table before logging

**Server-Side Logs:**
```bash
# View listener logs (processes MQTT messages)
docker compose logs -f listener

# View server logs (API requests)
docker compose logs -f server
```

---

## 20. Appendix: Example Implementations

### A. Minimal Python Watchdog (Sketch)

```python
import json
import subprocess
import time
from datetime import datetime, timezone

import paho.mqtt.client as mqtt
import psutil


class MinimalWatchdog:
    MAX_RESTARTS = 3

    def __init__(self, client_uuid, mqtt_broker, launch_command=None):
        self.uuid = client_uuid
        # paho-mqtt 2.x requires an explicit callback API version.
        self.mqtt_client = mqtt.Client(
            callback_api_version=mqtt.CallbackAPIVersion.VERSION2,
            client_id=f"infoscreen-{client_uuid}",
            clean_session=False,
        )
        self.mqtt_client.connect(mqtt_broker, 1883, 60)
        self.mqtt_client.loop_start()

        self.expected_process = None
        # Command used to relaunch the player, e.g. ["vlc", "--fullscreen", path]
        self.launch_command = launch_command
        self.restart_attempts = 0
        self.failed = False

    def send_log(self, level, message, context=None):
        topic = f"infoscreen/{self.uuid}/logs/{level}"
        payload = {
            # Section 3.1 requires the "Z" form; isoformat() would emit "+00:00".
            "timestamp": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
            "message": message,
            "context": context or {},
        }
        qos = 0 if level == "info" else 1  # QoS levels from section 2.2
        self.mqtt_client.publish(topic, json.dumps(payload), qos=qos)

    def is_process_running(self, process_name):
        for proc in psutil.process_iter(["name"]):
            # The name can be None if the process info could not be read.
            if process_name in (proc.info["name"] or ""):
                return True
        return False

    def restart_process(self):
        self.restart_attempts += 1
        self.send_log(
            "warn",
            f"Attempting restart ({self.restart_attempts}/{self.MAX_RESTARTS})",
        )
        time.sleep(2)  # cooldown from section 7.2
        if self.launch_command:
            subprocess.Popen(self.launch_command)

    def monitor_loop(self):
        while True:
            if self.expected_process and not self.failed:
                if not self.is_process_running(self.expected_process):
                    self.send_log("error", f"{self.expected_process} crashed")
                    if self.restart_attempts < self.MAX_RESTARTS:
                        self.restart_process()
                    else:
                        # Log once, then wait for manual intervention.
                        self.send_log("error", "Max restarts exceeded")
                        self.failed = True

            time.sleep(5)


# Usage:
if __name__ == "__main__":
    watchdog = MinimalWatchdog("9b8d1856-ff34-4864-a726-12de072d0f77", "192.168.43.201")
    watchdog.expected_process = "vlc"
    watchdog.monitor_loop()
```

---

**END OF SPECIFICATION**

Questions? Refer to:
- `CLIENT_MONITORING_IMPLEMENTATION_GUIDE.md` (server repo)
- Server API: `http://192.168.43.201:8000/api/client-logs/test`
- MQTT test: `mosquitto_sub -h 192.168.43.201 -t "infoscreen/#"`