olafn 3107d0f671 feat(monitoring): add server-side client logging and health infrastructure

- add Alembic migration c1d2e3f4g5h6 for client monitoring:
  - create client_logs table with FK to clients.uuid and performance indexes
  - extend clients with process/health tracking fields
- extend data model with ClientLog, LogLevel, ProcessStatus, and ScreenHealthStatus
- enhance listener MQTT handling:
  - subscribe to logs and health topics
  - persist client logs from infoscreen/{uuid}/logs/{level}
  - process health payloads and enrich heartbeat-derived client state
- add monitoring API blueprint server/routes/client_logs.py:
  - GET /api/client-logs/<uuid>/logs
  - GET /api/client-logs/summary
  - GET /api/client-logs/recent-errors
  - GET /api/client-logs/test
- register client_logs blueprint in server/wsgi.py
- align compose/dev runtime for listener live-code execution
- add client-side implementation docs:
  - CLIENT_MONITORING_SPECIFICATION.md
  - CLIENT_MONITORING_IMPLEMENTATION_GUIDE.md
- update TECH-CHANGELOG.md and copilot-instructions.md:
  - document monitoring changes
  - codify post-release technical-notes/no-version-bump convention

2026-03-10 07:33:38 +00:00

Client-Side Monitoring Specification

Version: 1.0
Date: 2026-03-10
For: Infoscreen Client Implementation
Server Endpoint: 192.168.43.201:8000 (or your production server)
MQTT Broker: 192.168.43.201:1883 (or your production MQTT broker)

1. Overview

Each infoscreen client must implement health monitoring and logging capabilities to report status to the central server via MQTT.

1.1 Goals

Detect failures: Process crashes, frozen screens, content mismatches
Provide visibility: Real-time health status visible on server dashboard
Enable remote diagnosis: Centralized log storage for debugging
Auto-recovery: Attempt automatic restart on failure

1.2 Architecture

┌─────────────────────────────────────────┐
│         Infoscreen Client               │
│                                         │
│  ┌──────────────┐    ┌──────────────┐  │
│  │ Media Player │    │   Watchdog   │  │
│  │ (VLC/Chrome) │◄───│   Monitor    │  │
│  └──────────────┘    └──────┬───────┘  │
│                              │          │
│  ┌──────────────┐            │          │
│  │  Event Mgr   │            │          │
│  │  (receives   │            │          │
│  │   schedule)  │◄───────────┘          │
│  └──────┬───────┘                       │
│         │                               │
│  ┌──────▼───────────────────────┐      │
│  │     MQTT Client               │      │
│  │  - Heartbeat (every 60s)      │      │
│  │  - Logs (error/warn/info)     │      │
│  │  - Health metrics (every 5s)  │      │
│  └──────┬────────────────────────┘      │
└─────────┼──────────────────────────────┘
          │
          │ MQTT over TCP
          ▼
    ┌─────────────┐
    │ MQTT Broker │
    │  (server)   │
    └─────────────┘

2. MQTT Protocol Specification

2.1 Connection Parameters

Broker: 192.168.43.201 (or DNS hostname)
Port: 1883 (standard MQTT)
Protocol: MQTT v3.1.1
Client ID: "infoscreen-{client_uuid}"
Clean Session: false (broker keeps the session, including subscriptions, across reconnects)
Keep Alive: 60 seconds
Username/Password: (if configured on broker)

2.2 QoS Levels

Heartbeat: QoS 0 (fire and forget; a lost heartbeat is superseded by the next one)
Logs (ERROR/WARN): QoS 1 (at least once delivery, important)
Logs (INFO): QoS 0 (optional, high volume)
Health metrics: QoS 0 (frequent, latest value matters)
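
The QoS rules above can be captured in one small lookup so that every publish call uses the same policy. This is a minimal sketch; the function name `qos_for` and the `kind` labels are hypothetical and not part of the specification.

```python
from typing import Optional


def qos_for(kind: str, level: Optional[str] = None) -> int:
    """Return the MQTT QoS for a message kind, following section 2.2.

    `kind` is one of "heartbeat", "health" or "logs" (labels assumed here);
    `level` is only used for logs.
    """
    if kind == "logs":
        # ERROR and WARN must arrive at least once; INFO is best effort.
        return 1 if level in ("error", "warn") else 0
    if kind in ("heartbeat", "health"):
        return 0
    raise ValueError(f"unknown message kind: {kind!r}")
```

A publisher would then call something like `client.publish(topic, payload, qos=qos_for("logs", "error"))`.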

3. Topic Structure & Payload Formats

3.1 Log Messages

Topic Pattern:

infoscreen/{client_uuid}/logs/{level}

Where {level} is one of: error, warn, info

Payload Format (JSON):

{
  "timestamp": "2026-03-10T07:30:00Z",
  "message": "Human-readable error description",
  "context": {
    "event_id": 42,
    "process": "vlc",
    "error_code": "NETWORK_TIMEOUT",
    "additional_key": "any relevant data"
  }
}

Field Specifications:

Field	Type	Required	Description
`timestamp`	string (ISO 8601 UTC)	Yes	When the event occurred. Use `YYYY-MM-DDTHH:MM:SSZ` format
`message`	string	Yes	Human-readable description of the event (max 1000 chars)
`context`	object	No	Additional structured data (will be stored as JSON)

Example Topics:

infoscreen/9b8d1856-ff34-4864-a726-12de072d0f77/logs/error
infoscreen/9b8d1856-ff34-4864-a726-12de072d0f77/logs/warn
infoscreen/9b8d1856-ff34-4864-a726-12de072d0f77/logs/info
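
As an illustration of the topic pattern and payload format above, the following sketch builds a topic and a JSON payload for one log entry. The helper names (`utc_timestamp`, `build_log_message`) are assumptions made for this example. It truncates the message to the 1000-character limit and omits `context` when it is empty, since that field is optional.

```python
import json
from datetime import datetime, timezone

LOG_LEVELS = ("error", "warn", "info")
MAX_MESSAGE_CHARS = 1000  # limit from the field table above


def utc_timestamp(now=None):
    """Format a datetime as YYYY-MM-DDTHH:MM:SSZ (UTC)."""
    now = now or datetime.now(timezone.utc)
    return now.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


def build_log_message(client_uuid, level, message, context=None, now=None):
    """Return (topic, json_payload) for one log entry."""
    if level not in LOG_LEVELS:
        raise ValueError(f"level must be one of {LOG_LEVELS}, got {level!r}")
    topic = f"infoscreen/{client_uuid}/logs/{level}"
    payload = {
        "timestamp": utc_timestamp(now),
        "message": message[:MAX_MESSAGE_CHARS],
    }
    if context:
        payload["context"] = context
    return topic, json.dumps(payload)
```

Passing `now` explicitly is only there to make the function easy to test; production code would normally leave it unset.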

When to Send Logs:

ERROR (Always send):

Process crashed (VLC/Chromium/PDF viewer terminated unexpectedly)
Content failed to load (404, network timeout, corrupt file)
Hardware failure detected (display off, audio device missing)
Exception caught in main event loop
Maximum restart attempts exceeded

WARN (Always send):

Process restarted automatically (after crash)
High resource usage (CPU >80%, RAM >90%)
Slow performance (frame drops, lag)
Non-critical failures (screenshot capture failed, cache full)
Fallback content displayed (primary source unavailable)

INFO (Send in development, optional in production):

Process started successfully
Event transition (switched from video to presentation)
Content loaded successfully
Watchdog service started/stopped

3.2 Health Metrics

Topic Pattern:

infoscreen/{client_uuid}/health

Payload Format (JSON):

{
  "timestamp": "2026-03-10T07:30:00Z",
  "expected_state": {
    "event_id": 42,
    "event_type": "video",
    "media_file": "presentation.mp4",
    "started_at": "2026-03-10T07:15:00Z"
  },
  "actual_state": {
    "process": "vlc",
    "pid": 1234,
    "status": "running",
    "uptime_seconds": 900,
    "position": 45.3,
    "duration": 180.0
  },
  "health_metrics": {
    "screen_on": true,
    "last_frame_update": "2026-03-10T07:29:58Z",
    "frames_dropped": 2,
    "network_errors": 0,
    "cpu_percent": 15.3,
    "memory_mb": 234
  }
}

Field Specifications:

expected_state:

Field	Type	Required	Description
`event_id`	integer	Yes	Current event ID from scheduler
`event_type`	string	Yes	`presentation`, `video`, `website`, `webuntis`, `message`
`media_file`	string	No	Filename or URL of current content
`started_at`	string (ISO 8601)	Yes	When this event started playing

actual_state:

Field	Type	Required	Description
`process`	string	Yes	`vlc`, `chromium`, `pdf_viewer`, `none`
`pid`	integer	No	Process ID (if running)
`status`	string	Yes	`running`, `crashed`, `starting`, `stopped`
`uptime_seconds`	integer	No	How long process has been running
`position`	float	No	Current playback position (seconds, for video/audio)
`duration`	float	No	Total content duration (seconds)

health_metrics:

Field	Type	Required	Description
`screen_on`	boolean	Yes	Is display powered on?
`last_frame_update`	string (ISO 8601)	No	Last time screen content changed
`frames_dropped`	integer	No	Video frames dropped (performance indicator)
`network_errors`	integer	No	Count of network errors in last interval
`cpu_percent`	float	No	CPU usage (0-100)
`memory_mb`	integer	No	RAM usage in megabytes

Sending Frequency:

Normal operation: Every 5 seconds
During startup/transition: Every 1 second
After error: Immediately + every 2 seconds until recovered

3.3 Enhanced Heartbeat

The existing heartbeat topic should be enhanced to include process status.

Topic Pattern:

infoscreen/{client_uuid}/heartbeat

Enhanced Payload Format (JSON):

{
  "uuid": "9b8d1856-ff34-4864-a726-12de072d0f77",
  "timestamp": "2026-03-10T07:30:00Z",
  "current_process": "vlc",
  "process_pid": 1234,
  "process_status": "running",
  "current_event_id": 42
}

New Fields (add to existing heartbeat):

Field	Type	Required	Description
`current_process`	string	No	Name of active media player process
`process_pid`	integer	No	Process ID
`process_status`	string	No	`running`, `crashed`, `starting`, `stopped`
`current_event_id`	integer	No	Event ID currently being displayed

Sending Frequency:

Keep existing: Every 60 seconds
Include new fields if available
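
One way to "include new fields if available" is to add each optional field only when a value is known, so older server versions still see a valid heartbeat. A minimal sketch, with a hypothetical function name and arguments:

```python
def build_heartbeat(client_uuid, timestamp, process=None, pid=None,
                    status=None, event_id=None):
    """Return the heartbeat payload; optional fields are skipped when unknown."""
    payload = {"uuid": client_uuid, "timestamp": timestamp}
    optional = {
        "current_process": process,
        "process_pid": pid,
        "process_status": status,
        "current_event_id": event_id,
    }
    # Only send fields that actually have a value.
    payload.update({key: value for key, value in optional.items()
                    if value is not None})
    return payload
```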

4. Process Monitoring Requirements

4.1 Processes to Monitor

Media Type	Process Name	How to Detect
Video	`vlc`	`ps aux | grep '[v]lc'` or `pgrep vlc`
Website/WebUntis	`chromium` or `chromium-browser`	`pgrep chromium`
PDF Presentation	`evince`, `okular`, or custom viewer	`pgrep {viewer_name}`

4.2 Monitoring Checks (Every 5 seconds)

Check 1: Process Alive

Goal: Verify expected process is running
Method: 
  - Get list of running processes (psutil or `ps`)
  - Check if expected process name exists
  - Match PID if known
Result:
  - If missing → status = "crashed"
  - If found → status = "running"
Action on crash:
  - Send ERROR log immediately
  - Attempt restart (max 3 attempts)
  - Send WARN log on each restart
  - If max restarts exceeded → send ERROR log, display fallback

Check 2: Process Responsive

Goal: Detect frozen processes
Method:
  - For VLC: Query HTTP interface (status.json)
  - For Chromium: Use DevTools Protocol (CDP)
  - For custom viewers: Check last screen update time
Result:
  - If same frame >30 seconds → likely frozen
  - If playback position not advancing → frozen
Action on freeze:
  - Send WARN log
  - Force refresh (reload page, seek video, next slide)
  - If refresh fails → restart process
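
The "playback position not advancing" rule from Check 2 can be sketched as a small detector that remembers when the position last changed. The class name and the way samples are fed in are assumptions for illustration. Note that a deliberately paused player would also look frozen to this check, so a real watchdog would need to exclude that case.

```python
class FreezeDetector:
    """Flags a process as frozen when its position stops changing."""

    def __init__(self, threshold_seconds=30.0):
        self.threshold = threshold_seconds
        self._last_position = None
        self._last_change = None

    def update(self, position, now):
        """Record one sample; return True if the position has been
        unchanged for longer than the threshold.

        `now` is a monotonic time in seconds (e.g. time.monotonic()).
        """
        if self._last_position is None or position != self._last_position:
            self._last_position = position
            self._last_change = now
            return False
        return (now - self._last_change) > self.threshold
```

Called once per 5-second monitoring cycle with the current playback position, it returns `True` on the first sample taken more than 30 seconds after the last change.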

Check 3: Content Match

Goal: Verify correct content is displayed
Method:
  - Compare expected event_id with actual media/URL
  - Check scheduled time window (is event still active?)
Result:
  - Mismatch → content error
Action:
  - Send WARN log
  - Reload correct event from scheduler

5. Process Control Interface Requirements

5.1 VLC Control

Requirement: Enable VLC HTTP interface for monitoring

Launch Command:

vlc --intf http --http-host 127.0.0.1 --http-port 8080 --http-password "vlc_password" \
    --fullscreen --loop /path/to/video.mp4

Status Query:

curl http://127.0.0.1:8080/requests/status.json --user ":vlc_password"

Response Fields to Monitor:

{
  "state": "playing",     // "playing", "paused", "stopped"
  "position": 0.25,       // 0.0-1.0 (25% through)
  "time": 45,             // seconds into playback
  "length": 180,          // total duration in seconds
  "volume": 256           // 0-512
}
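
Note that VLC's `position` is a 0.0-1.0 fraction, while the `position` field in the health payload (section 3.2) is in seconds, so the health payload should be filled from `time` and `length` instead. The sketch below shows one possible mapping; the function name and the state-to-status table are assumptions, not part of the specification.

```python
# Assumed mapping from VLC's "state" to the status values of section 3.2.
VLC_STATE_TO_STATUS = {
    "playing": "running",
    "paused": "running",
    "stopped": "stopped",
}


def actual_state_from_vlc(status, pid=None):
    """Convert a parsed status.json dict into an `actual_state` object."""
    state = {
        "process": "vlc",
        "status": VLC_STATE_TO_STATUS.get(status.get("state"), "stopped"),
        "position": float(status.get("time", 0)),    # seconds into playback
        "duration": float(status.get("length", 0)),  # total seconds
    }
    if pid is not None:
        state["pid"] = pid
    return state
```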

5.2 Chromium Control

Requirement: Enable Chrome DevTools Protocol (CDP)

Launch Command:

chromium --remote-debugging-port=9222 --kiosk --app=https://example.com

Status Query:

curl http://127.0.0.1:9222/json

Response Fields to Monitor:

[
  {
    "url": "https://example.com",
    "title": "Page Title",
    "type": "page"
  }
]
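
Once the `/json` response has been parsed, checking that the expected page is open reduces to a search over the target list. A minimal sketch, assuming a prefix match on the URL is good enough (Chromium may append a trailing slash or redirect to a longer URL); the function name is hypothetical.

```python
def find_page(targets, expected_url):
    """Return the first target of type "page" whose URL starts with
    expected_url, or None if no such page is open."""
    for target in targets:
        if target.get("type") == "page" and \
                target.get("url", "").startswith(expected_url):
            return target
    return None
```

A result of `None` would correspond to the content mismatch case in Check 3.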

Advanced: Use CDP WebSocket for events (page load, navigation, errors)

5.3 PDF Viewer (Custom or Standard)

Option A: Standard Viewer (e.g., Evince)

No built-in API
Monitor via process check + screenshot comparison

Option B: Custom Python Viewer

Implement REST API for status queries
Track: current page, total pages, last transition time

6. Watchdog Service Architecture

6.1 Service Components

Component 1: Process Monitor Thread

Responsibilities:
  - Check process alive every 5 seconds
  - Detect crashes and frozen processes
  - Attempt automatic restart
  - Send health metrics via MQTT

State Machine:
  IDLE → STARTING → RUNNING → (if crash) → RESTARTING → RUNNING
                             → (if max restarts) → FAILED
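
The state machine above can be written down as an explicit transition table, which makes illegal transitions fail loudly. This is a sketch only: the names are hypothetical, and the RESTARTING → FAILED edge is an assumption added for restarts that never come up.

```python
from enum import Enum


class WatchdogState(Enum):
    IDLE = "idle"
    STARTING = "starting"
    RUNNING = "running"
    RESTARTING = "restarting"
    FAILED = "failed"


ALLOWED = {
    WatchdogState.IDLE: {WatchdogState.STARTING},
    WatchdogState.STARTING: {WatchdogState.RUNNING},
    WatchdogState.RUNNING: {WatchdogState.RESTARTING, WatchdogState.FAILED},
    # Assumption: a restart that does not succeed may also end in FAILED.
    WatchdogState.RESTARTING: {WatchdogState.RUNNING, WatchdogState.FAILED},
    WatchdogState.FAILED: set(),
}


def transition(current, target):
    """Return target if the move is allowed, otherwise raise ValueError."""
    if target not in ALLOWED[current]:
        raise ValueError(f"illegal transition {current.name} -> {target.name}")
    return target
```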

Component 2: MQTT Publisher Thread

Responsibilities:
  - Maintain MQTT connection
  - Send heartbeat every 60 seconds
  - Send logs on-demand (queued from other components)
  - Send health metrics every 5 seconds
  - Reconnect on connection loss

Component 3: Event Manager Integration

Responsibilities:
  - Receive event schedule from server
  - Notify watchdog of expected process/content
  - Launch media player processes
  - Handle event transitions

6.2 Service Lifecycle

On Startup:

Load configuration (client UUID, MQTT broker, etc.)
Connect to MQTT broker
Send INFO log: "Watchdog service started"
Wait for first event from scheduler

During Operation:

Monitor loop runs every 5 seconds
Check expected vs actual process state
Send health metrics
Handle failures (log + restart)

On Shutdown:

Send INFO log: "Watchdog service stopping"
Gracefully stop monitored processes
Disconnect from MQTT
Exit cleanly

7. Auto-Recovery Logic

7.1 Restart Strategy

Step 1: Detect Failure

Trigger: Process not found in process list
Action:
  - Log ERROR: "Process {name} crashed"
  - Increment restart counter
  - Check if within retry limit (max 3)

Step 2: Attempt Restart

If restart_attempts < MAX_RESTARTS:
  - Log WARN: "Attempting restart ({attempt}/{MAX_RESTARTS})"
  - Kill any zombie processes
  - Wait 2 seconds (cooldown)
  - Launch process with same parameters
  - Wait 5 seconds for startup
  - Verify process is running
  - If success: log INFO; reset the restart counter only after a stable run of more than 5 minutes (see 7.2)
  - If fail: repeat from Step 1, which increments the counter

Step 3: Permanent Failure

If restart_attempts >= MAX_RESTARTS:
  - Log ERROR: "Max restart attempts exceeded, failing over"
  - Display fallback content (static image with error message)
  - Send notification to server (separate alert topic, optional)
  - Wait for manual intervention or scheduler event change

7.2 Restart Cooldown

Purpose: Prevent rapid restart loops that waste resources

Implementation:

After each restart attempt:
  - Wait 2 seconds before next restart
  - After 3 failures: wait 30 seconds before trying again
  - Reset counter on successful run >5 minutes
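
The cooldown rules above can be sketched as a small policy object. The class and parameter names are hypothetical; the numbers (2 seconds, 30 seconds after 3 failures, reset after 5 minutes) come from this section.

```python
class RestartPolicy:
    """Tracks restart attempts and decides how long to wait before the next one."""

    def __init__(self, max_restarts=3, cooldown=2.0,
                 long_cooldown=30.0, stable_after=300.0):
        self.max_restarts = max_restarts
        self.cooldown = cooldown
        self.long_cooldown = long_cooldown
        self.stable_after = stable_after
        self.attempts = 0

    def next_delay(self):
        """Count one restart attempt and return the wait time in seconds."""
        if self.attempts >= self.max_restarts:
            delay = self.long_cooldown
        else:
            delay = self.cooldown
        self.attempts += 1
        return delay

    def record_uptime(self, uptime_seconds):
        """Reset the counter once the process has run long enough."""
        if uptime_seconds > self.stable_after:
            self.attempts = 0
```

The caller sleeps for `next_delay()` seconds before relaunching and reports the process uptime on each monitoring cycle.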

8. Resource Monitoring

8.1 System Metrics to Track

CPU Usage:

Method: Read /proc/stat or use psutil.cpu_percent()
Frequency: Every 5 seconds
Threshold: Warn if >80% for >60 seconds

Memory Usage:

Method: Read /proc/meminfo or use psutil.virtual_memory()
Frequency: Every 5 seconds
Threshold: Warn if >90% for >30 seconds

Display Status:

Method: Check DPMS state or xset query
Frequency: Every 30 seconds
Threshold: Error if display off (unexpected)

Network Connectivity:

Method: Ping server or check MQTT connection
Frequency: Every 60 seconds
Threshold: Warn if no server connectivity
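
Thresholds of the form "warn if >80% for >60 seconds" need a little state, because a single high sample should not trigger a warning. One possible sketch (class name assumed):

```python
class SustainedThreshold:
    """Reports True only after a value has stayed above `limit`
    for longer than `hold_seconds`."""

    def __init__(self, limit, hold_seconds):
        self.limit = limit
        self.hold = hold_seconds
        self._since = None  # time at which the value first exceeded the limit

    def update(self, value, now):
        if value <= self.limit:
            self._since = None
            return False
        if self._since is None:
            self._since = now
        return (now - self._since) > self.hold
```

For example, `SustainedThreshold(80, 60)` would be fed `psutil.cpu_percent()` readings every 5 seconds, and `SustainedThreshold(90, 30)` the memory percentage.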

9. Development vs Production Mode

9.1 Development Mode

Enable via: Environment variable DEBUG=true or ENV=development

Behavior:

Send INFO level logs
More verbose logging to console
Shorter monitoring intervals (faster feedback)
Screenshot capture every 30 seconds
No rate limiting on logs

9.2 Production Mode

Enable via: ENV=production

Behavior:

Send only ERROR and WARN logs
Minimal console output
Standard monitoring intervals
Screenshot capture every 60 seconds
Rate limiting: max 10 logs per minute per level
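
The production rate limit (max 10 logs per minute per level) could be enforced with a sliding window per level. This is a sketch under that interpretation; the specification does not say whether the window is sliding or fixed, and the class name is hypothetical.

```python
from collections import defaultdict, deque


class LogRateLimiter:
    """Allows at most `max_per_window` logs per level in any window."""

    def __init__(self, max_per_window=10, window_seconds=60.0):
        self.max_per_window = max_per_window
        self.window = window_seconds
        self._sent = defaultdict(deque)  # level -> send times

    def allow(self, level, now):
        """Return True and record the send if the log may go out."""
        times = self._sent[level]
        # Forget sends that have left the window.
        while times and now - times[0] >= self.window:
            times.popleft()
        if len(times) >= self.max_per_window:
            return False
        times.append(now)
        return True
```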

10. Configuration File Format

10.1 Recommended Config: JSON

File: /etc/infoscreen/config.json or ~/.config/infoscreen/config.json

{
  "client": {
    "uuid": "9b8d1856-ff34-4864-a726-12de072d0f77",
    "hostname": "infoscreen-room-101"
  },
  "mqtt": {
    "broker": "192.168.43.201",
    "port": 1883,
    "username": "",
    "password": "",
    "keepalive": 60
  },
  "monitoring": {
    "enabled": true,
    "health_interval_seconds": 5,
    "heartbeat_interval_seconds": 60,
    "max_restart_attempts": 3,
    "restart_cooldown_seconds": 2
  },
  "logging": {
    "level": "INFO",
    "send_info_logs": false,
    "console_output": true,
    "local_log_file": "/var/log/infoscreen/watchdog.log"
  },
  "processes": {
    "vlc": {
      "http_port": 8080,
      "http_password": "vlc_password"
    },
    "chromium": {
      "debug_port": 9222
    }
  }
}
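
A loader for this file might merge the user's values over built-in defaults, so that a partial config still works. In the sketch below the search order (system path first, then user path) and the choice of defaults are assumptions; the default values themselves are taken from the example above.

```python
import json
from pathlib import Path

CONFIG_PATHS = [
    Path("/etc/infoscreen/config.json"),
    Path.home() / ".config" / "infoscreen" / "config.json",
]

DEFAULTS = {
    "mqtt": {"port": 1883, "keepalive": 60},
    "monitoring": {
        "enabled": True,
        "health_interval_seconds": 5,
        "heartbeat_interval_seconds": 60,
        "max_restart_attempts": 3,
        "restart_cooldown_seconds": 2,
    },
}


def merge(defaults, override):
    """Recursively overlay `override` on `defaults` without mutating either."""
    merged = dict(defaults)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge(merged[key], value)
        else:
            merged[key] = value
    return merged


def load_config(paths=CONFIG_PATHS):
    """Load the first config file that exists and apply defaults."""
    for path in paths:
        if path.is_file():
            return merge(DEFAULTS, json.loads(path.read_text(encoding="utf-8")))
    raise FileNotFoundError("no infoscreen config file found")
```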

11. Error Scenarios & Expected Behavior

Scenario 1: VLC Crashes Mid-Video

1. Watchdog detects: process_status = "crashed"
2. Send ERROR log: "VLC process crashed"
3. Attempt 1: Restart VLC with same video, seek to last position
4. If success: Send INFO log "VLC restarted successfully"
5. If fail: Repeat 2 more times
6. After 3 failures: Send ERROR "Max restarts exceeded", show fallback

Scenario 2: Network Timeout Loading Website

1. Chromium fails to load page (CDP reports error)
2. Send WARN log: "Page load timeout"
3. Attempt reload (Chromium refresh)
4. If success after 10s: Continue monitoring
5. If timeout again: Send ERROR, try restarting Chromium

Scenario 3: Display Powers Off (Hardware)

1. DPMS check detects display off
2. Send ERROR log: "Display powered off"
3. Attempt to wake display (xset dpms force on)
4. If success: Send INFO log
5. If fail: Hardware issue, alert admin

Scenario 4: High CPU Usage

1. CPU >80% for 60 seconds
2. Send WARN log: "High CPU usage: 85%"
3. Check if expected (e.g., video playback is normal)
4. If unexpected: investigate process causing it
5. If critical (>95%): consider restarting offending process

12. Testing & Validation

12.1 Manual Tests (During Development)

Test 1: Process Crash Simulation

# Start video, then kill VLC manually
killall vlc
# Expected: ERROR log sent within 5 seconds, automatic restart within 10 seconds

Test 2: MQTT Connectivity

# Subscribe to all client topics on server
mosquitto_sub -h 192.168.43.201 -t "infoscreen/{uuid}/#" -v
# Expected: See heartbeat every 60s, health every 5s

Test 3: Log Levels

# Trigger error condition and verify log appears in database
curl http://192.168.43.201:8000/api/client-logs/test
# Expected: See new log entry with correct level/message

12.2 Acceptance Criteria

✅ Client must:

Send heartbeat every 60 seconds without gaps
Send ERROR log within 5 seconds of process crash
Attempt automatic restart (max 3 times)
Report health metrics every 5 seconds
Survive MQTT broker restart (reconnect automatically)
Survive network interruption (buffer logs, send when reconnected)
Use correct timestamp format (ISO 8601 UTC)
Only send logs for a client UUID that exists in the server's clients table (enforced by a foreign-key constraint)
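
The "buffer logs, send when reconnected" criterion can be met with a bounded in-memory queue in front of the publish call. The sketch below is one possible shape, not the required design: `publish` is a hypothetical callable that returns `True` on success, and the buffer size of 500 is an arbitrary assumption. When the buffer is full, the oldest entries are dropped.

```python
from collections import deque


class LogBuffer:
    """Queues log messages while offline and replays them in order."""

    def __init__(self, publish, max_items=500):
        self._publish = publish  # callable(topic, payload) -> bool
        self._pending = deque(maxlen=max_items)

    def flush(self):
        """Send queued messages oldest first; stop at the first failure."""
        while self._pending:
            topic, payload = self._pending[0]
            if not self._publish(topic, payload):
                return False
            self._pending.popleft()
        return True

    def send(self, topic, payload, connected):
        """Send now if possible, otherwise queue. Returns True if sent."""
        if connected and self.flush() and self._publish(topic, payload):
            return True
        self._pending.append((topic, payload))
        return False
```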

13. Python Libraries (Recommended)

For process monitoring:

psutil - Cross-platform process and system utilities

For MQTT:

paho-mqtt - Official MQTT client (use v2.x with Callback API v2)

For VLC control:

requests - HTTP client for status queries

For Chromium control:

websocket-client or pychrome - Chrome DevTools Protocol

For datetime:

datetime (stdlib) - Use datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ") to produce the timestamp format required in section 3.1

Example requirements.txt:

paho-mqtt>=2.0.0
psutil>=5.9.0
requests>=2.31.0
python-dateutil>=2.8.0

14. Security Considerations

14.1 MQTT Security

If broker requires auth, store credentials in config file with restricted permissions (chmod 600)
Consider TLS/SSL for MQTT (port 8883) if on untrusted network
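Use a unique client ID per device (a duplicate ID makes the broker drop the existing connection); client IDs do not authenticate a client, so rely on broker credentials or TLS for that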

14.2 Process Control APIs

VLC HTTP password should be random, not default
Chromium debug port should bind to 127.0.0.1 only (not 0.0.0.0)
Restrict file system access for media player processes

14.3 Log Content

Do not log: Passwords, API keys, personal data
Sanitize: File paths (strip user directories), URLs (remove query params with tokens)
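
A minimal sketch of the two sanitizing steps, assuming Linux-style `/home/<user>` directories. For simplicity it drops the whole query string and fragment rather than only the parameters that carry tokens; both function names are hypothetical.

```python
import re
from urllib.parse import urlsplit, urlunsplit


def strip_query(url):
    """Remove the query string and fragment from a URL."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))


def strip_home(path):
    """Replace a leading /home/<user> with ~ so the user name is not logged."""
    return re.sub(r"^/home/[^/]+", "~", path)
```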

15. Performance Targets

Metric	Target	Acceptable	Critical
Health check interval	5s	10s	30s
Crash detection time	<5s	<10s	<30s
Restart time	<10s	<20s	<60s
MQTT publish latency	<100ms	<500ms	<2s
CPU usage (watchdog)	<2%	<5%	<10%
RAM usage (watchdog)	<50MB	<100MB	<200MB
Log message size	<1KB	<10KB	<100KB

16. Troubleshooting Guide (For Client Development)

Issue: Logs not appearing in server database

Check:

Is MQTT broker reachable? (mosquitto_pub test from client)
Is client UUID correct and exists in clients table?
Is timestamp format correct (ISO 8601 with 'Z')?
Check server listener logs for errors

Issue: Health metrics not updating

Check:

Is health loop running? (check watchdog service status)
Is MQTT connected? (check connection status in logs)
Is payload JSON valid? (use JSON validator)

Issue: Process restarts in loop

Check:

Is media file/URL accessible?
Is process command correct? (test manually)
Check process exit code (crash reason)
Increase restart cooldown to avoid rapid loops

17. Complete Message Flow Diagram

┌─────────────────────────────────────────────────────────┐
│                    Infoscreen Client                     │
│                                                          │
│  Event Occurs:                                           │
│    - Process crashed                                     │
│    - High CPU usage                                      │
│    - Content loaded                                      │
│                                                          │
│  ┌────────────────┐                                     │
│  │ Decision Logic │                                     │
│  │  - Is it ERROR?│                                     │
│  │  - Is it WARN? │                                     │
│  │  - Is it INFO? │                                     │
│  └────────┬───────┘                                     │
│           │                                              │
│           ▼                                              │
│  ┌────────────────────────────────┐                    │
│  │ Build JSON Payload              │                    │
│  │ {                               │                    │
│  │   "timestamp": "...",           │                    │
│  │   "message": "...",             │                    │
│  │   "context": {...}              │                    │
│  │ }                               │                    │
│  └────────┬───────────────────────┘                    │
│           │                                              │
│           ▼                                              │
│  ┌────────────────────────────────┐                    │
│  │ MQTT Publish                    │                    │
│  │ Topic: infoscreen/{uuid}/logs/error                 │
│  │ QoS: 1                          │                    │
│  └────────┬───────────────────────┘                    │
└───────────┼──────────────────────────────────────────┘
            │
            │ TCP/IP (MQTT Protocol)
            │
            ▼
     ┌──────────────┐
     │ MQTT Broker  │
     │ (Mosquitto)  │
     └──────┬───────┘
            │
            │ Topic: infoscreen/+/logs/#
            │
            ▼
     ┌──────────────────────────────┐
     │   Listener Service            │
     │   (Python)                    │
     │                               │
     │  - Parse JSON                 │
     │  - Validate UUID              │
     │  - Store in database          │
     └──────┬───────────────────────┘
            │
            ▼
     ┌──────────────────────────────┐
     │   MariaDB Database            │
     │                               │
     │   Table: client_logs          │
     │   - client_uuid               │
     │   - timestamp                 │
     │   - level                     │
     │   - message                   │
     │   - context (JSON)            │
     └──────┬───────────────────────┘
            │
            │ SQL Query
            │
            ▼
     ┌──────────────────────────────┐
     │   API Server (Flask)          │
     │                               │
     │   GET /api/client-logs/{uuid}/logs
     │   GET /api/client-logs/summary
     └──────┬───────────────────────┘
            │
            │ HTTP/JSON
            │
            ▼
     ┌──────────────────────────────┐
     │   Dashboard (React)           │
     │                               │
     │   - Display logs              │
     │   - Filter by level           │
     │   - Show health status        │
     └───────────────────────────────┘

18. Quick Reference Card

MQTT Topics Summary

infoscreen/{uuid}/logs/error    → Critical failures
infoscreen/{uuid}/logs/warn     → Non-critical issues
infoscreen/{uuid}/logs/info     → Informational (dev mode)
infoscreen/{uuid}/health        → Health metrics (every 5s)
infoscreen/{uuid}/heartbeat     → Enhanced heartbeat (every 60s)

JSON Timestamp Format

from datetime import datetime, timezone
# isoformat() would add microseconds and "+00:00"; strftime gives the required "Z" form
timestamp = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
# Example output: "2026-03-10T07:30:00Z"

Process Status Values

"running"  - Process is alive and responding
"crashed"  - Process terminated unexpectedly
"starting" - Process is launching (startup phase)
"stopped"  - Process intentionally stopped

Restart Logic

Max attempts: 3
Cooldown: 2 seconds between attempts
Reset: After 5 minutes of successful operation

19. Contact & Support

Server API Documentation:

Base URL: http://192.168.43.201:8000
Health check: GET /health
Test logs: GET /api/client-logs/test (no auth)
Full API docs: See CLIENT_MONITORING_IMPLEMENTATION_GUIDE.md on server

MQTT Broker:

Host: 192.168.43.201
Port: 1883 (standard), 9001 (WebSocket)
Test tool: mosquitto_pub / mosquitto_sub

Database Schema:

Table: client_logs
Foreign Key: client_uuid → clients.uuid (ON DELETE CASCADE)
Constraint: UUID must exist in clients table before logging

Server-Side Logs:

# View listener logs (processes MQTT messages)
docker compose logs -f listener

# View server logs (API requests)
docker compose logs -f server

20. Appendix: Example Implementations

A. Minimal Python Watchdog (Pseudocode)

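import json
import subprocess
import time
from datetime import datetime, timezone

import paho.mqtt.client as mqtt
import psutil


class MinimalWatchdog:
    MAX_RESTARTS = 3
    CHECK_INTERVAL = 5      # seconds between process checks
    RESTART_COOLDOWN = 2    # seconds to wait before relaunching

    def __init__(self, client_uuid, mqtt_broker):
        self.uuid = client_uuid
        # Connection parameters follow section 2.1
        self.mqtt_client = mqtt.Client(
            callback_api_version=mqtt.CallbackAPIVersion.VERSION2,
            client_id=f"infoscreen-{client_uuid}",
            clean_session=False,
        )
        self.mqtt_client.connect(mqtt_broker, 1883, 60)
        self.mqtt_client.loop_start()

        self.expected_process = None   # process name, e.g. "vlc"
        self.launch_command = None     # argv list used to (re)start the process
        self.restart_attempts = 0
        self.failed = False            # set once MAX_RESTARTS is exhausted

    def send_log(self, level, message, context=None):
        topic = f"infoscreen/{self.uuid}/logs/{level}"
        payload = {
            # Format required by section 3.1: YYYY-MM-DDTHH:MM:SSZ
            "timestamp": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
            "message": message,
            "context": context or {},
        }
        # ERROR/WARN use QoS 1, INFO uses QoS 0 (section 2.2)
        qos = 0 if level == "info" else 1
        self.mqtt_client.publish(topic, json.dumps(payload), qos=qos)

    def is_process_running(self, process_name):
        for proc in psutil.process_iter(['name']):
            # name is None when the process information is not accessible
            if process_name in (proc.info['name'] or ""):
                return True
        return False

    def restart_process(self):
        self.restart_attempts += 1
        self.send_log(
            "warn",
            f"Attempting restart ({self.restart_attempts}/{self.MAX_RESTARTS})",
        )
        time.sleep(self.RESTART_COOLDOWN)
        subprocess.Popen(self.launch_command)

    def check_once(self):
        if not self.expected_process or self.failed:
            return
        if self.is_process_running(self.expected_process):
            return
        self.send_log("error", f"{self.expected_process} crashed")
        if self.restart_attempts < self.MAX_RESTARTS:
            self.restart_process()
        else:
            # Log the permanent failure once instead of on every cycle
            self.send_log("error", "Max restarts exceeded")
            self.failed = True

    def monitor_loop(self):
        while True:
            self.check_once()
            time.sleep(self.CHECK_INTERVAL)


# Usage:
if __name__ == "__main__":
    watchdog = MinimalWatchdog("9b8d1856-ff34-4864-a726-12de072d0f77", "192.168.43.201")
    watchdog.expected_process = "vlc"
    watchdog.launch_command = ["vlc", "--fullscreen", "--loop", "/path/to/video.mp4"]
    watchdog.monitor_loop()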

END OF SPECIFICATION

Questions? Refer to:

CLIENT_MONITORING_IMPLEMENTATION_GUIDE.md (server repo)
Server API: http://192.168.43.201:8000/api/client-logs/test
MQTT test: mosquitto_sub -h 192.168.43.201 -t infoscreen/#
