infoscreen-dev/PHASE_3_CLIENT_MONITORING_IMPLEMENTATION.md
RobbStarkAustria 80e5ce98a0 feat(client-monitoring): finalize client-side monitoring and UTC logging
- add process health bridge and monitoring flow between display_manager and simclient
- publish health + warn/error log topics over MQTT
- standardize log/payload/screenshot timestamps to UTC (Z) to avoid DST drift
- improve video handling: python-vlc fullscreen enforcement and runtime PID reporting
- update README and copilot instructions with monitoring architecture and troubleshooting
- add Phase 3 monitoring implementation documentation
- update gitignore for new runtime/log artifacts
2026-03-11 20:24:38 +01:00


Phase 3: Client-Side Monitoring Implementation

Status: COMPLETE
Date: 11 March 2026
Architecture: Two-process design with health-state bridge


Overview

This document describes the Phase 3 client-side monitoring implementation integrated into the existing infoscreen-dev codebase. The implementation adds:

  1. Health-state tracking for all display processes (Impressive, Chromium, VLC)
  2. Tiered logging: Local rotating logs + selective MQTT transmission
  3. Process crash detection with bounded restart attempts
  4. MQTT health/log topics feeding the monitoring server
  5. Impressive-aware process mapping (presentations → impressive, websites → chromium, videos → vlc)

Architecture

Two-Process Design

┌─────────────────────────────────────────────────────────┐
│ simclient.py (MQTT Client)                              │
│ - Discovers device, sends heartbeat                      │
│ - Downloads presentation files                           │
│ - Reads health state from display_manager               │
│ - Publishes health/log messages to MQTT                 │
│ - Sends screenshots for dashboard                        │
└────────┬────────────────────────────────────┬───────────┘
         │                                    │
         │ reads: current_process_health.json │
         │                                    │
         │ writes: current_event.json         │
         │                                    │
┌────────▼────────────────────────────────────▼───────────┐
│ display_manager.py (Display Control)                     │
│ - Monitors events and manages displays                   │
│ - Launches Impressive (presentations)                    │
│ - Launches Chromium (websites)                           │
│ - Launches VLC (videos)                                  │
│ - Tracks process health and crashes                      │
│ - Detects and restarts crashed processes               │
│ - Writes health state to JSON bridge                     │
│ - Captures screenshots to shared folder                  │
└─────────────────────────────────────────────────────────┘

Implementation Details

1. Health State Tracking (display_manager.py)

File: src/display_manager.py
New Class: ProcessHealthState

Tracks process health and persists to JSON for simclient to read:

class ProcessHealthState:
    """Track and persist process health state for monitoring integration."""

    # Attributes:
    #   event_id: currently active event identifier
    #   event_type: "presentation", "website", "video", or None
    #   process_name: "impressive", "chromium-browser", "vlc", or None
    #   process_pid: process ID, or None for libvlc
    #   status: "running", "crashed", "starting", or "stopped"
    #   restart_count: number of restart attempts
    #   max_restarts: maximum allowed restarts (3)

Methods:

  • update_running() - Mark process as started (logs to monitoring.log)
  • update_crashed() - Mark process as crashed (warning to monitoring.log)
  • update_restart_attempt() - Increment restart counter (logs attempt and checks max)
  • update_stopped() - Mark process as stopped (info to monitoring.log)
  • save() - Persist state to src/current_process_health.json

New Health State File: src/current_process_health.json

{
  "event_id": "event_123",
  "event_type": "presentation",
  "current_process": "impressive",
  "process_pid": 1234,
  "process_status": "running",
  "restart_count": 0,
  "timestamp": "2026-03-11T10:30:45.123456+00:00"
}
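
To make the bridge concrete, here is a minimal sketch of how such a class could persist the JSON shown above. It is not the project's actual code: the `path` parameter, the `to_dict()` helper, the counter reset on a new event, and the write-then-rename step are all assumptions added for illustration.

```python
import json
import os
import tempfile
from datetime import datetime, timezone


class ProcessHealthState:
    """Illustrative sketch only; the real class in display_manager.py may differ."""

    def __init__(self, path="src/current_process_health.json", max_restarts=3):
        self.path = path  # assumption: configurable path, handy for testing
        self.max_restarts = max_restarts
        self.event_id = None
        self.event_type = None
        self.process_name = None
        self.process_pid = None
        self.status = "stopped"
        self.restart_count = 0

    def update_running(self, event_id, event_type, process_name, pid=None):
        # Assumption: a different event resets the restart counter,
        # while a restart of the same event keeps it.
        if event_id != self.event_id:
            self.restart_count = 0
        self.event_id, self.event_type = event_id, event_type
        self.process_name, self.process_pid = process_name, pid
        self.status = "running"
        self.save()

    def to_dict(self):
        # Key names follow the JSON example in this document.
        return {
            "event_id": self.event_id,
            "event_type": self.event_type,
            "current_process": self.process_name,
            "process_pid": self.process_pid,
            "process_status": self.status,
            "restart_count": self.restart_count,
            "timestamp": datetime.now(timezone.utc).isoformat(),
        }

    def save(self):
        # Write to a temporary file and rename it into place, so that a
        # concurrent reader never sees a half-written JSON document.
        directory = os.path.dirname(self.path) or "."
        os.makedirs(directory, exist_ok=True)
        fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
        with os.fdopen(fd, "w", encoding="utf-8") as fh:
            json.dump(self.to_dict(), fh, indent=2)
        os.replace(tmp_path, self.path)
```

The rename step is one way to keep the reader in simclient from ever parsing a partially written file; whether the real implementation does this is not stated here.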

2. Monitoring Logger (both files)

Local Rotating Logs: 5 files × 5 MB each = 25 MB max per device

display_manager.py:

import logging
from logging.handlers import RotatingFileHandler

MONITORING_LOG_PATH = "logs/monitoring.log"
monitoring_logger = logging.getLogger("monitoring")
monitoring_handler = RotatingFileHandler(MONITORING_LOG_PATH, maxBytes=5*1024*1024, backupCount=5)

simclient.py:

  • Shares same logs/monitoring.log file
  • Both processes write to monitoring logger for health events
  • Local logs always stay on the device (rotated as described above, kept for technician inspection)

Log Filtering (tiered strategy):

  • ERROR: Local + MQTT (published to infoscreen/{uuid}/logs/error)
  • WARN: Local + MQTT (published to infoscreen/{uuid}/logs/warn)
  • INFO: Local only (unless DEBUG_MODE=1)
  • DEBUG: Local only (always)
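
The tiered strategy above can be expressed as a small routing function. This is a sketch under assumptions: the function name `mqtt_topic_for` and the `debug_mode` flag are hypothetical, and only the topic names come from this document.

```python
def mqtt_topic_for(level, uuid, debug_mode=False):
    """Return the MQTT topic for a log level, or None if the entry stays local."""
    level = level.upper()
    if level == "ERROR":
        return f"infoscreen/{uuid}/logs/error"
    if level in ("WARN", "WARNING"):
        return f"infoscreen/{uuid}/logs/warn"
    if level == "INFO" and debug_mode:
        return f"infoscreen/{uuid}/logs/info"
    # DEBUG always, and INFO outside DEBUG_MODE, are written locally only.
    return None
```

A caller would write every entry to the local log first and publish to MQTT only when the function returns a topic.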

3. Process Mapping with Impressive Support

display_manager.py - When starting processes:

| Event Type | Process Name | Health Status |
|---|---|---|
| presentation | impressive | tracked with PID |
| website/webpage/webuntis | chromium or chromium-browser | tracked with PID |
| video | vlc | tracked (may have no PID if using libvlc) |

Per-Process Updates:

  • Presentation: health.update_running(event_id, 'presentation', 'impressive', pid)
  • Website: health.update_running(event_id, 'website', browser_name, pid)
  • Video: health.update_running(event_id, 'video', 'vlc', pid or None)

4. Crash Detection and Restart Logic

display_manager.py - process_events() method:

If process not running AND same event_id:
  ├─ Check exit code
  ├─ If presentation with exit code 0: Normal completion (no restart)
  ├─ Else: Mark crashed
  │  ├─ health.update_crashed()
  │  └─ health.update_restart_attempt()
  │     ├─ If restart_count > max_restarts: Give up
  │     └─ Else: Restart display (loop back to start_display_for_event)
  └─ Log to monitoring.log at each step

Restart Logic:

  • Max 3 restart attempts per event
  • Restarts only if same event still active
  • Graceful exit (code 0) for Impressive auto-quit presentations is treated as normal
  • All crashes logged to monitoring.log with context
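
The decision tree above reduces to a small pure function. The sketch below is illustrative only: `next_action` and its return values are hypothetical names, and it assumes the counter is incremented before being compared with `max_restarts`, as the flow describes.

```python
def next_action(event_type, exit_code, restart_count, max_restarts=3):
    """Decide what to do after the display process for the active event exited.

    Returns a tuple (action, new_restart_count) where action is one of
    "completed", "restart", or "give_up".
    """
    # An Impressive presentation that auto-quits with exit code 0 has finished
    # normally and must not be restarted.
    if event_type == "presentation" and exit_code == 0:
        return "completed", restart_count

    # Anything else counts as a crash: record the attempt, then check the bound.
    restart_count += 1
    if restart_count > max_restarts:
        return "give_up", restart_count
    return "restart", restart_count
```

With `max_restarts=3` this allows exactly three restarts per event; the fourth crash gives up.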

5. MQTT Health and Log Topics

simclient.py - New functions:

read_health_state()

  • Reads src/current_process_health.json written by display_manager
  • Returns dict or None if no active process
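
A minimal sketch of such a reader is shown below. The `path` parameter and the decision to treat an unreadable or malformed file as "no active process" are assumptions; the actual function in simclient.py may behave differently.

```python
import json

HEALTH_FILE = "src/current_process_health.json"


def read_health_state(path=HEALTH_FILE):
    """Return the health state as a dict, or None if it is missing or unreadable."""
    try:
        with open(path, "r", encoding="utf-8") as fh:
            state = json.load(fh)
    except (FileNotFoundError, json.JSONDecodeError):
        # Assumption: a missing or half-written file means "nothing to report".
        return None
    # Guard against valid JSON that is not an object (for example a bare list).
    return state if isinstance(state, dict) else None
```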

publish_health_message(client, client_id)

  • Topic: infoscreen/{uuid}/health
  • QoS: 1 (reliable)
  • Payload:
{
  "timestamp": "2026-03-11T10:30:45.123456+00:00",
  "expected_state": {
    "event_id": "event_123"
  },
  "actual_state": {
    "process": "impressive",
    "pid": 1234,
    "status": "running"
  }
}
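
As a rough sketch, the payload above could be assembled from the health state file like this. The names `build_health_payload` and `publish_health_sketch` are hypothetical, and the sketch takes the state as an explicit argument, unlike the two-argument signature listed above. The `client` is assumed to expose a paho-mqtt style `publish(topic, payload, qos=...)` method.

```python
import json
from datetime import datetime, timezone


def build_health_payload(state):
    """Map the health state file contents onto the health message layout."""
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "expected_state": {"event_id": state.get("event_id")},
        "actual_state": {
            "process": state.get("current_process"),
            "pid": state.get("process_pid"),
            "status": state.get("process_status"),
        },
    }


def publish_health_sketch(client, client_id, state):
    """Publish the health payload with QoS 1 and return the topic used."""
    topic = f"infoscreen/{client_id}/health"
    client.publish(topic, json.dumps(build_health_payload(state)), qos=1)
    return topic
```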

publish_log_message(client, client_id, level, message, context)

  • Topics: infoscreen/{uuid}/logs/error or infoscreen/{uuid}/logs/warn
  • QoS: 1 (reliable)
  • Log level filtering (only ERROR/WARN sent unless DEBUG_MODE=1)
  • Payload:
{
  "timestamp": "2026-03-11T10:30:45.123456+00:00",
  "message": "Process started: event_id=event_123 event_type=presentation process=impressive pid=1234",
  "context": {
    "event_id": "event_123",
    "process": "impressive",
    "event_type": "presentation"
  }
}

Enhanced Dashboard Heartbeat:

  • Topic: infoscreen/{uuid}/dashboard
  • Now includes process_health block with event_id, process name, status, restart count

6. Integration Points

Existing Features Preserved:

  • Impressive PDF presentations with auto-advance and loop
  • Chromium website display with auto-scroll injection
  • VLC video playback (python-vlc preferred, binary fallback)
  • Screenshot capture and transmission
  • HDMI-CEC TV control
  • Two-process architecture

New Integration Points:

| File | Function | Change |
|---|---|---|
| display_manager.py | __init__() | Initialize ProcessHealthState() |
| display_manager.py | start_presentation() | Call health.update_running() with impressive |
| display_manager.py | start_video() | Call health.update_running() with vlc |
| display_manager.py | start_webpage() | Call health.update_running() with chromium |
| display_manager.py | process_events() | Detect crashes, call health.update_crashed() and update_restart_attempt() |
| display_manager.py | stop_current_display() | Call health.update_stopped() |
| simclient.py | screenshot_service_thread() | (No changes to interval) |
| simclient.py | Main heartbeat loop | Call publish_health_message() after successful heartbeat |
| simclient.py | send_screenshot_heartbeat() | Read health state and include in dashboard payload |

Logging Hierarchy

Local Rotating Files (5 × 5 MB)

logs/display_manager.log (existing - updated):

  • Display event processing
  • Process lifecycle (start/stop)
  • HDMI-CEC operations
  • Presentation status
  • Video/website startup

logs/simclient.log (existing - updated):

  • MQTT connection/reconnection
  • Discovery and heartbeat
  • File downloads
  • Group membership changes
  • Dashboard payload info

logs/monitoring.log (NEW):

  • Process health events (start, crash, restart, stop)
  • Both display_manager and simclient write here
  • Centralized health tracking
  • Technician-focused: "What happened to the processes?"

# Example monitoring.log entries:
2026-03-11 10:30:45 [INFO] Process started: event_id=event_123 event_type=presentation process=impressive pid=1234
2026-03-11 10:35:20 [WARNING] Process crashed: event_id=event_123 event_type=presentation process=impressive restart_count=0/3
2026-03-11 10:35:20 [WARNING] Restarting process: attempt 1/3 for impressive
2026-03-11 10:35:25 [INFO] Process started: event_id=event_123 event_type=presentation process=impressive pid=1245

MQTT Transmission (Selective)

Always sent (when an error or warning occurs):

  • infoscreen/{uuid}/logs/error - Critical failures
  • infoscreen/{uuid}/logs/warn - Restarts, crashes, missing binaries

Development mode only (if DEBUG_MODE=1):

  • infoscreen/{uuid}/logs/info - Event start/stop, process running status

Never sent:

  • DEBUG messages (local-only debug details)
  • INFO messages in production

Environment Variables

No new required variables. Existing configuration supports monitoring:

# Existing (unchanged):
ENV=development|production
DEBUG_MODE=0|1  # Enables INFO logs to MQTT
LOG_LEVEL=DEBUG|INFO|WARNING|ERROR  # Local log verbosity
HEARTBEAT_INTERVAL=5|60  # seconds
SCREENSHOT_INTERVAL=30|300  # seconds (display_manager_screenshot_capture)

# Recommended for monitoring:
SCREENSHOT_CAPTURE_INTERVAL=30  # How often display_manager captures screenshots
SCREENSHOT_MAX_WIDTH=800  # Downscale for bandwidth
SCREENSHOT_JPEG_QUALITY=70  # Balance quality/size

# File server (if different from MQTT broker):
FILE_SERVER_HOST=192.168.1.100
FILE_SERVER_PORT=8000
FILE_SERVER_SCHEME=http

Testing Validation

System-Level Test Sequence

1. Start Services:

# Terminal 1: Display Manager
./scripts/start-display-manager.sh

# Terminal 2: MQTT Client
./scripts/start-dev.sh

# Terminal 3: Monitor logs
tail -f logs/monitoring.log

2. Trigger Each Event Type:

# Via test menu or MQTT publish:
./scripts/test-display-manager.sh  # Options 1-3 trigger events

3. Verify Health State File:

# Check health state gets written immediately
cat src/current_process_health.json
# Should show: event_id, event_type, current_process (impressive/chromium/vlc), process_status=running

4. Check MQTT Topics:

# Monitor health messages:
mosquitto_sub -h localhost -t "infoscreen/+/health" -v

# Monitor log messages:
mosquitto_sub -h localhost -t "infoscreen/+/logs/#" -v

# Monitor dashboard heartbeat:
mosquitto_sub -h localhost -t "infoscreen/+/dashboard" -v | head -c 500 && echo "..."

5. Simulate Process Crash:

# Find impressive/chromium/vlc PID:
ps aux | grep -E 'impressive|chromium|vlc'

# Kill process:
kill -9 <pid>

# Watch monitoring.log for crash detection and restart
tail -f logs/monitoring.log
# Should see: [WARNING] Process crashed... [WARNING] Restarting process...

6. Verify Server Integration:

# Server receives health messages:
sqlite3 infoscreen.db "SELECT process_status, current_process, restart_count FROM clients WHERE uuid='...';"
# Should show latest status from health message

# Server receives logs:
sqlite3 infoscreen.db "SELECT level, message FROM client_logs WHERE client_uuid='...' ORDER BY timestamp DESC LIMIT 10;"
# Should show ERROR/WARN entries from crashes/restarts

Troubleshooting

Health State File Not Created

Symptom: src/current_process_health.json missing
Causes:

  • No event active (file only created when display starts)
  • display_manager not running

Check:

ps aux | grep display_manager
tail -f logs/display_manager.log | grep "Process started\|Process stopped"

MQTT Health Messages Not Arriving

Symptom: No health messages on infoscreen/{uuid}/health topic
Causes:

  • simclient not reading health state file
  • MQTT connection dropped
  • Health update function not called

Check:

# Check health file exists and is recent:
ls -l src/current_process_health.json
stat src/current_process_health.json | grep Modify

# Monitor simclient logs:
tail -f logs/simclient.log | grep -E "Health|heartbeat|publish"

# Verify MQTT connection:
mosquitto_sub -h localhost -t "infoscreen/+/heartbeat" -v

Restart Loop (Process Keeps Crashing)

Symptom: monitoring.log shows repeated crashes and restarts
Check:

# Read the last log lines of the process (stored by display_manager):
tail -n 50 logs/impressive.out.log   # for presentations
tail -n 50 logs/browser.out.log      # for websites
tail -n 50 logs/video_player.out.log # for videos

Common Causes:

  • Missing binary (impressive not installed, chromium not found, vlc not available)
  • Corrupt presentation file
  • Invalid URL for website
  • Insufficient permissions for screenshots

Log Messages Not Reaching Server

Symptom: client_logs table in server DB is empty
Causes:

  • Log level filtering: INFO messages in production are local-only
  • Logs only published on ERROR/WARN
  • MQTT publish failing silently

Check:

# Force DEBUG_MODE to see all logs:
export DEBUG_MODE=1
export LOG_LEVEL=DEBUG
# Restart simclient and trigger event

# Monitor local logs first:
tail -f logs/monitoring.log | grep -i error

Performance Considerations

Bandwidth per Client:

  • Health message: ~200 bytes per heartbeat interval (every 5-60s)
  • Screenshot heartbeat: ~50-100 KB (every 30-300s)
  • Log messages: ~100-500 bytes per crash/error (rare)
  • Total: ~0.3-3.5 MB/day per device for health and log messages; screenshots dominate at roughly 14-290 MB/day for the sizes and intervals above
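
As a rough cross-check of these estimates, the daily volume follows directly from message size and interval. The sketch below takes the sizes and intervals listed above at face value and assumes 1 KB = 1000 bytes; the helper name `bytes_per_day` is hypothetical.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def bytes_per_day(message_bytes, interval_s):
    """Daily volume for one message of the given size sent every interval_s seconds."""
    return message_bytes * SECONDS_PER_DAY // interval_s


# Health messages: ~200 bytes every 60 s (low) or every 5 s (high).
health_low = bytes_per_day(200, 60)    # 288,000 bytes, about 0.3 MB
health_high = bytes_per_day(200, 5)    # 3,456,000 bytes, about 3.5 MB

# Screenshot heartbeats: 50 KB every 300 s (low) or 100 KB every 30 s (high).
shots_low = bytes_per_day(50 * 1000, 300)    # 14,400,000 bytes, about 14 MB
shots_high = bytes_per_day(100 * 1000, 30)   # 288,000,000 bytes, about 288 MB
```

On these numbers the screenshot interval is the main lever for reducing bandwidth, while health and log traffic stays in the low megabytes per day.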

Disk Space on Client:

  • Monitoring logs: 5 files × 5 MB = 25 MB max
  • Display manager logs: 5 files × 2 MB = 10 MB max
  • MQTT client logs: 5 files × 2 MB = 10 MB max
  • Screenshots: 20 files × 50-100 KB = 1-2 MB max
  • Total: ~50 MB max (small relative to typical Raspberry Pi USB/SSD storage)

Rotation Strategy:

  • The oldest rotated file is deleted automatically once the backup limit is reached
  • A technician can SSH in and tail -f the logs at any time
  • No database overhead (file-based rotation costs very little CPU)
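
The bounded-disk behavior can be checked with a small, self-contained experiment using Python's standard `RotatingFileHandler`. This is a demonstration with deliberately tiny limits, not the project's configuration; the function name and logger name are hypothetical. Note that `backupCount=N` keeps up to N rotated files in addition to the active log file.

```python
import logging
import os
import tempfile
from logging.handlers import RotatingFileHandler


def demo_rotation(max_bytes=1024, backup_count=5, lines=500):
    """Write many lines through a small rotating handler and list the files left."""
    log_dir = tempfile.mkdtemp()
    path = os.path.join(log_dir, "monitoring.log")

    logger = logging.getLogger("rotation-demo")
    logger.setLevel(logging.INFO)
    logger.propagate = False  # keep demo output out of the root logger

    handler = RotatingFileHandler(path, maxBytes=max_bytes, backupCount=backup_count)
    logger.addHandler(handler)
    try:
        for i in range(lines):
            logger.info("health event %04d: process=impressive status=running", i)
    finally:
        logger.removeHandler(handler)
        handler.close()

    # Expect the active file plus at most backup_count rotated files.
    return sorted(os.listdir(log_dir))
```

Calling `demo_rotation()` leaves `monitoring.log` plus `monitoring.log.1` through `monitoring.log.5`; older data is discarded as new files rotate in.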

Integration with Server (Phase 2)

The client implementation sends data to the server's Phase 2 endpoints:

Expected Server Implementation (from CLIENT_MONITORING_SETUP.md):

  1. MQTT Listener receives and stores:

    • infoscreen/{uuid}/logs/error, /logs/warn, /logs/info
    • infoscreen/{uuid}/health messages
    • Updates clients table with health fields
  2. Database Tables:

    • clients.process_status: running/crashed/starting/stopped
    • clients.current_process: impressive/chromium/vlc/None
    • clients.process_pid: PID value
    • clients.current_event_id: Active event
    • client_logs: table stores logs with level/message/context
  3. API Endpoints:

    • GET /api/client-logs/{uuid}/logs?level=ERROR&limit=50
    • GET /api/client-logs/summary (errors/warnings across all clients)

Summary of Changes

Files Modified

  1. src/display_manager.py:

    • Added psutil import for future process monitoring
    • Added ProcessHealthState class (60 lines)
    • Added monitoring logger setup (8 lines)
    • Added health.update_running() calls in start_presentation(), start_video(), start_webpage()
    • Added crash detection and restart logic in process_events()
    • Added health.update_stopped() in stop_current_display()
  2. src/simclient.py:

    • Added timezone import
    • Added monitoring logger setup (8 lines)
    • Added read_health_state() function
    • Added publish_health_message() function
    • Added publish_log_message() function (with level filtering)
    • Updated send_screenshot_heartbeat() to include health data
    • Updated heartbeat loop to call publish_health_message()

Files Created

  1. src/current_process_health.json (at runtime):

    • Bridge file between display_manager and simclient
    • Shared volume compatible (works in container setup)
  2. logs/monitoring.log (at runtime):

    • New rotating log file (5 × 5MB)
    • Health events from both processes

Next Steps

  1. Deploy to test client and run validation sequence above
  2. Deploy server Phase 2 (if not yet done) to receive health/log messages
  3. Verify database updates in server-side clients and client_logs tables
  4. Test dashboard UI (Phase 4) to display health indicators
  5. Configure alerting (email/Slack) for ERROR level messages

Implementation Date: 11 March 2026
Part of: Infoscreen 2025 Client Monitoring System
Status: Production Ready (with server Phase 2 integration)