diff --git a/.github/copilot-instructions.md b/.github/copilot-instructions.md index 8cade6b..2662adb 100644 --- a/.github/copilot-instructions.md +++ b/.github/copilot-instructions.md @@ -14,8 +14,9 @@ - **Display logic**: `src/display_manager.py` (controls presentations/video/web) - **MQTT client**: `src/simclient.py` (event management, heartbeat, discovery) - **Runtime state**: `src/current_event.json` (current active event) +- **Process health bridge**: `src/current_process_health.json` (display_manager -> simclient) - **Config**: `src/config/client_uuid.txt`, `src/config/last_group_id.txt`, `.env` -- **Logs**: `logs/display_manager.log`, `logs/simclient.log` +- **Logs**: `logs/display_manager.log`, `logs/simclient.log`, `logs/monitoring.log` - **Screenshots**: `src/screenshots/` (shared volume between processes) ### Common Tasks Quick Reference @@ -23,6 +24,8 @@ |------|------|-------------------| | Add event type | `display_manager.py` | `start_display_for_event()` | | Modify presentation | `display_manager.py` | `start_presentation()` | +| Modify process monitoring | `display_manager.py` | `ProcessHealthState`, `process_events()` | +| Publish health/log topics | `simclient.py` | `read_health_state()`, `publish_health_message()`, `publish_log_message()` | | Change MQTT topics | `simclient.py` | Topic constants/handlers | | Update screenshot | `display_manager.py` | `_capture_screenshot()` | | File downloads | `simclient.py` | `resolve_file_url()` | @@ -62,6 +65,8 @@ ### MQTT Communication Patterns - **Discovery**: `infoscreen/discovery` → `infoscreen/{client_id}/discovery_ack` - **Heartbeat**: Regular `infoscreen/{client_id}/heartbeat` messages +- **Health**: `infoscreen/{client_id}/health` (event/process/pid/status) +- **Client logs**: `infoscreen/{client_id}/logs/error|warn` (selective forwarding) ### MQTT Reconnection & Heartbeat (Nov 2025) - The client uses Paho MQTT v2 callback API with `client.loop_start()` and `client.reconnect_delay_set()` to handle automatic reconnection. - `on_connect` re-subscribes to all topics (`discovery_ack`, `config`, `group_id`, current group events) and re-sends discovery on reconnect to re-register with the server. @@ -255,11 +260,20 @@ FILE_SERVER_SCHEME=http # http or https - **Fallback**: External vlc binary - **Fields**: `url`, `autoplay` (bool), `loop` (bool), `volume` (0.0-1.0 → 0-100) - **URL rewriting**: `server` host → configured file server +- **Fullscreen**: enforced for python-vlc on startup (with short retry toggles); external fallback uses `--fullscreen` +- **Monitoring PID semantics**: python-vlc runs in-process, so PID is `display_manager.py` runtime PID; external fallback uses external `vlc` PID - **HW decode errors**: `h264_v4l2m2m` failures are normal if V4L2 M2M unavailable; use software decode - Robust payload parsing with fallbacks - Topic-specific message handlers - Retained message support where appropriate +### Logging & Timestamp Policy (Mar 2026) +- Client logs are standardized to UTC with `Z` suffix to avoid DST/localtime drift. +- Applies to `display_manager.log`, `simclient.log`, and `monitoring.log`. +- MQTT payload timestamps for heartbeat/dashboard/health/log messages are UTC ISO timestamps. +- Screenshot metadata timestamps included by `simclient.py` are UTC ISO timestamps. +- Prefer UTC-aware calls (`datetime.now(timezone.utc)`) and UTC log formatters for new code. + ## Hardware Considerations ### Target Platform @@ -490,6 +504,7 @@ The screenshot capture and transmission system has been implemented with separat - **Large payloads**: Reduce `SCREENSHOT_MAX_WIDTH` or `SCREENSHOT_JPEG_QUALITY` - **Stale screenshots**: Check `latest.jpg` symlink, verify display_manager is running - **MQTT errors**: Check dashboard topic logs for publish return codes +- **Pulse overflow in remote sessions**: warnings like `pulse audio output error: overflow, flushing` can occur with NoMachine/dummy displays; if HDMI playback is stable, treat as environment-related ### Testing & Troubleshooting **Setup:** - X11: `sudo apt install scrot imagemagick` diff --git a/.gitignore b/.gitignore index bb5dffb..990b1f6 100644 --- a/.gitignore +++ b/.gitignore @@ -134,3 +134,4 @@ logs/ .git.legacy_backup* src/pi-dev-setup-new.sh +src/current_process_health.json diff --git a/PHASE_3_CLIENT_MONITORING_IMPLEMENTATION.md b/PHASE_3_CLIENT_MONITORING_IMPLEMENTATION.md new file mode 100644 index 0000000..2b30a95 --- /dev/null +++ b/PHASE_3_CLIENT_MONITORING_IMPLEMENTATION.md @@ -0,0 +1,533 @@ +# Phase 3: Client-Side Monitoring Implementation + +**Status**: ✅ COMPLETE +**Date**: 11. März 2026 +**Architecture**: Two-process design with health-state bridge + +--- + +## Overview + +This document describes the **Phase 3** client-side monitoring implementation integrated into the existing infoscreen-dev codebase. The implementation adds: + +1. ✅ **Health-state tracking** for all display processes (Impressive, Chromium, VLC) +2. ✅ **Tiered logging**: Local rotating logs + selective MQTT transmission +3. ✅ **Process crash detection** with bounded restart attempts +4. ✅ **MQTT health/log topics** feeding the monitoring server +5. ✅ **Impressive-aware process mapping** (presentations → impressive, websites → chromium, videos → vlc) + +--- + +## Architecture + +### Two-Process Design + +``` +┌─────────────────────────────────────────────────────────┐ +│ simclient.py (MQTT Client) │ +│ - Discovers device, sends heartbeat │ +│ - Downloads presentation files │ +│ - Reads health state from display_manager │ +│ - Publishes health/log messages to MQTT │ +│ - Sends screenshots for dashboard │ +└────────┬────────────────────────────────────┬───────────┘ + │ │ + │ reads: current_process_health.json │ + │ │ + │ writes: current_event.json │ + │ │ +┌────────▼────────────────────────────────────▼───────────┐ +│ display_manager.py (Display Control) │ +│ - Monitors events and manages displays │ +│ - Launches Impressive (presentations) │ +│ - Launches Chromium (websites) │ +│ - Launches VLC (videos) │ +│ - Tracks process health and crashes │ +│ - Detects and restarts crashed processes │ +│ - Writes health state to JSON bridge │ +│ - Captures screenshots to shared folder │ +└─────────────────────────────────────────────────────────┘ +``` + +--- + +## Implementation Details + +### 1. Health State Tracking (display_manager.py) + +**File**: `src/display_manager.py` +**New Class**: `ProcessHealthState` + +Tracks process health and persists to JSON for simclient to read: + +```python +class ProcessHealthState: + """Track and persist process health state for monitoring integration""" + + - event_id: Currently active event identifier + - event_type: presentation, website, video, or None + - process_name: impressive, chromium-browser, vlc, or None + - process_pid: Process ID or None for libvlc + - status: running, crashed, starting, stopped + - restart_count: Number of restart attempts + - max_restarts: Maximum allowed restarts (3) +``` + +Methods: +- `update_running()` - Mark process as started (logs to monitoring.log) +- `update_crashed()` - Mark process as crashed (warning to monitoring.log) +- `update_restart_attempt()` - Increment restart counter (logs attempt and checks max) +- `update_stopped()` - Mark process as stopped (info to monitoring.log) +- `save()` - Persist state to `src/current_process_health.json` + +**New Health State File**: `src/current_process_health.json` + +```json +{ + "event_id": "event_123", + "event_type": "presentation", + "current_process": "impressive", + "process_pid": 1234, + "process_status": "running", + "restart_count": 0, + "timestamp": "2026-03-11T10:30:45.123456+00:00" +} +``` + +### 2. Monitoring Logger (both files) + +**Local Rotating Logs**: 5 files × 5 MB each = 25 MB max per device + +**display_manager.py**: +```python +MONITORING_LOG_PATH = "logs/monitoring.log" +monitoring_logger = logging.getLogger("monitoring") +monitoring_handler = RotatingFileHandler(MONITORING_LOG_PATH, maxBytes=5*1024*1024, backupCount=5) +``` + +**simclient.py**: +- Shares same `logs/monitoring.log` file +- Both processes write to monitoring logger for health events +- Local logs never rotate (persisted for technician inspection) + +**Log Filtering** (tiered strategy): +- **ERROR**: Local + MQTT (published to `infoscreen/{uuid}/logs/error`) +- **WARN**: Local + MQTT (published to `infoscreen/{uuid}/logs/warn`) +- **INFO**: Local only (unless `DEBUG_MODE=1`) +- **DEBUG**: Local only (always) + +### 3. Process Mapping with Impressive Support + +**display_manager.py** - When starting processes: + +| Event Type | Process Name | Health Status | +|-----------|--------------|---------------| +| presentation | `impressive` | tracked with PID | +| website/webpage/webuntis | `chromium` or `chromium-browser` | tracked with PID | +| video | `vlc` | tracked (may have no PID if using libvlc) | + +**Per-Process Updates**: +- Presentation: `health.update_running('event_id', 'presentation', 'impressive', pid)` +- Website: `health.update_running('event_id', 'website', browser_name, pid)` +- Video: `health.update_running('event_id', 'video', 'vlc', pid or None)` + +### 4. Crash Detection and Restart Logic + +**display_manager.py** - `process_events()` method: + +``` +If process not running AND same event_id: + ├─ Check exit code + ├─ If presentation with exit code 0: Normal completion (no restart) + ├─ Else: Mark crashed + │ ├─ health.update_crashed() + │ └─ health.update_restart_attempt() + │ ├─ If restart_count > max_restarts: Give up + │ └─ Else: Restart display (loop back to start_display_for_event) + └─ Log to monitoring.log at each step +``` + +**Restart Logic**: +- Max 3 restart attempts per event +- Restarts only if same event still active +- Graceful exit (code 0) for Impressive auto-quit presentations is treated as normal +- All crashes logged to monitoring.log with context + +### 5. MQTT Health and Log Topics + +**simclient.py** - New functions: + +**`read_health_state()`** +- Reads `src/current_process_health.json` written by display_manager +- Returns dict or None if no active process + +**`publish_health_message(client, client_id)`** +- Topic: `infoscreen/{uuid}/health` +- QoS: 1 (reliable) +- Payload: +```json +{ + "timestamp": "2026-03-11T10:30:45.123456+00:00", + "expected_state": { + "event_id": "event_123" + }, + "actual_state": { + "process": "impressive", + "pid": 1234, + "status": "running" + } +} +``` + +**`publish_log_message(client, client_id, level, message, context)`** +- Topics: `infoscreen/{uuid}/logs/error` or `infoscreen/{uuid}/logs/warn` +- QoS: 1 (reliable) +- Log level filtering (only ERROR/WARN sent unless DEBUG_MODE=1) +- Payload: +```json +{ + "timestamp": "2026-03-11T10:30:45.123456+00:00", + "message": "Process started: event_id=123 event_type=presentation process=impressive pid=1234", + "context": { + "event_id": "event_123", + "process": "impressive", + "event_type": "presentation" + } +} +``` + +**Enhanced Dashboard Heartbeat**: +- Topic: `infoscreen/{uuid}/dashboard` +- Now includes `process_health` block with event_id, process name, status, restart count + +### 6. Integration Points + +**Existing Features Preserved**: +- ✅ Impressive PDF presentations with auto-advance and loop +- ✅ Chromium website display with auto-scroll injection +- ✅ VLC video playback (python-vlc preferred, binary fallback) +- ✅ Screenshot capture and transmission +- ✅ HDMI-CEC TV control +- ✅ Two-process architecture + +**New Integration Points**: + +| File | Function | Change | +|------|----------|--------| +| display_manager.py | `__init__()` | Initialize `ProcessHealthState()` | +| display_manager.py | `start_presentation()` | Call `health.update_running()` with impressive | +| display_manager.py | `start_video()` | Call `health.update_running()` with vlc | +| display_manager.py | `start_webpage()` | Call `health.update_running()` with chromium | +| display_manager.py | `process_events()` | Detect crashes, call `health.update_crashed()` and `update_restart_attempt()` | +| display_manager.py | `stop_current_display()` | Call `health.update_stopped()` | +| simclient.py | `screenshot_service_thread()` | (No changes to interval) | +| simclient.py | Main heartbeat loop | Call `publish_health_message()` after successful heartbeat | +| simclient.py | `send_screenshot_heartbeat()` | Read health state and include in dashboard payload | + +--- + +## Logging Hierarchy + +### Local Rotating Files (5 × 5 MB) + +**`logs/display_manager.log`** (existing - updated): +- Display event processing +- Process lifecycle (start/stop) +- HDMI-CEC operations +- Presentation status +- Video/website startup + +**`logs/simclient.log`** (existing - updated): +- MQTT connection/reconnection +- Discovery and heartbeat +- File downloads +- Group membership changes +- Dashboard payload info + +**`logs/monitoring.log`** (NEW): +- Process health events (start, crash, restart, stop) +- Both display_manager and simclient write here +- Centralized health tracking +- Technician-focused: "What happened to the processes?" + +``` +# Example monitoring.log entries: +2026-03-11 10:30:45 [INFO] Process started: event_id=event_123 event_type=presentation process=impressive pid=1234 +2026-03-11 10:35:20 [WARNING] Process crashed: event_id=event_123 event_type=presentation process=impressive restart_count=0/3 +2026-03-11 10:35:20 [WARNING] Restarting process: attempt 1/3 for impressive +2026-03-11 10:35:25 [INFO] Process started: event_id=event_123 event_type=presentation process=impressive pid=1245 +``` + +### MQTT Transmission (Selective) + +**Always sent** (when error occurs): +- `infoscreen/{uuid}/logs/error` - Critical failures +- `infoscreen/{uuid}/logs/warn` - Restarts, crashes, missing binaries + +**Development mode only** (if DEBUG_MODE=1): +- `infoscreen/{uuid}/logs/info` - Event start/stop, process running status + +**Never sent**: +- DEBUG messages (local-only debug details) +- INFO messages in production + +--- + +## Environment Variables + +No new required variables. Existing configuration supports monitoring: + +```bash +# Existing (unchanged): +ENV=development|production +DEBUG_MODE=0|1 # Enables INFO logs to MQTT +LOG_LEVEL=DEBUG|INFO|WARNING|ERROR # Local log verbosity +HEARTBEAT_INTERVAL=5|60 # seconds +SCREENSHOT_INTERVAL=30|300 # seconds (display_manager_screenshot_capture) + +# Recommended for monitoring: +SCREENSHOT_CAPTURE_INTERVAL=30 # How often display_manager captures screenshots +SCREENSHOT_MAX_WIDTH=800 # Downscale for bandwidth +SCREENSHOT_JPEG_QUALITY=70 # Balance quality/size + +# File server (if different from MQTT broker): +FILE_SERVER_HOST=192.168.1.100 +FILE_SERVER_PORT=8000 +FILE_SERVER_SCHEME=http +``` + +--- + +## Testing Validation + +### System-Level Test Sequence + +**1. Start Services**: +```bash +# Terminal 1: Display Manager +./scripts/start-display-manager.sh + +# Terminal 2: MQTT Client +./scripts/start-dev.sh + +# Terminal 3: Monitor logs +tail -f logs/monitoring.log +``` + +**2. Trigger Each Event Type**: +```bash +# Via test menu or MQTT publish: +./scripts/test-display-manager.sh # Options 1-3 trigger events +``` + +**3. Verify Health State File**: +```bash +# Check health state gets written immediately +cat src/current_process_health.json +# Should show: event_id, event_type, current_process (impressive/chromium/vlc), process_status=running +``` + +**4. Check MQTT Topics**: +```bash +# Monitor health messages: +mosquitto_sub -h localhost -t "infoscreen/+/health" -v + +# Monitor log messages: +mosquitto_sub -h localhost -t "infoscreen/+/logs/#" -v + +# Monitor dashboard heartbeat: +mosquitto_sub -h localhost -t "infoscreen/+/dashboard" -v | head -c 500 && echo "..." +``` + +**5. Simulate Process Crash**: +```bash +# Find impressive/chromium/vlc PID: +ps aux | grep -E 'impressive|chromium|vlc' + +# Kill process: +kill -9 + +# Watch monitoring.log for crash detection and restart +tail -f logs/monitoring.log +# Should see: [WARNING] Process crashed... [WARNING] Restarting process... +``` + +**6. Verify Server Integration**: +```bash +# Server receives health messages: +sqlite3 infoscreen.db "SELECT process_status, current_process, restart_count FROM clients WHERE uuid='...';" +# Should show latest status from health message + +# Server receives logs: +sqlite3 infoscreen.db "SELECT level, message FROM client_logs WHERE client_uuid='...' ORDER BY timestamp DESC LIMIT 10;" +# Should show ERROR/WARN entries from crashes/restarts +``` + +--- + +## Troubleshooting + +### Health State File Not Created + +**Symptom**: `src/current_process_health.json` missing +**Causes**: +- No event active (file only created when display starts) +- display_manager not running + +**Check**: +```bash +ps aux | grep display_manager +tail -f logs/display_manager.log | grep "Process started\|Process stopped" +``` + +### MQTT Health Messages Not Arriving + +**Symptom**: No health messages on `infoscreen/{uuid}/health` topic +**Causes**: +- simclient not reading health state file +- MQTT connection dropped +- Health update function not called + +**Check**: +```bash +# Check health file exists and is recent: +ls -l src/current_process_health.json +stat src/current_process_health.json | grep Modify + +# Monitor simclient logs: +tail -f logs/simclient.log | grep -E "Health|heartbeat|publish" + +# Verify MQTT connection: +mosquitto_sub -h localhost -t "infoscreen/+/heartbeat" -v +``` + +### Restart Loop (Process Keeps Crashing) + +**Symptom**: monitoring.log shows repeated crashes and restarts +**Check**: +```bash +# Read last log lines of the process (stored by display_manager): +tail -f logs/impressive.out.log # for presentations +tail -f logs/browser.out.log # for websites +tail -f logs/video_player.out.log # for videos +``` + +**Common Causes**: +- Missing binary (impressive not installed, chromium not found, vlc not available) +- Corrupt presentation file +- Invalid URL for website +- Insufficient permissions for screenshots + +### Log Messages Not Reaching Server + +**Symptom**: client_logs table in server DB is empty +**Causes**: +- Log level filtering: INFO messages in production are local-only +- Logs only published on ERROR/WARN +- MQTT publish failing silently + +**Check**: +```bash +# Force DEBUG_MODE to see all logs: +export DEBUG_MODE=1 +export LOG_LEVEL=DEBUG +# Restart simclient and trigger event + +# Monitor local logs first: +tail -f logs/monitoring.log | grep -i error +``` + +--- + +## Performance Considerations + +**Bandwidth per Client**: +- Health message: ~200 bytes per heartbeat interval (every 5-60s) +- Screenshot heartbeat: ~50-100 KB (every 30-300s) +- Log messages: ~100-500 bytes per crash/error (rare) +- **Total**: ~0.5-2 MB/day per device (very minimal) + +**Disk Space on Client**: +- Monitoring logs: 5 files × 5 MB = 25 MB max +- Display manager logs: 5 files × 2 MB = 10 MB max +- MQTT client logs: 5 files × 2 MB = 10 MB max +- Screenshots: 20 files × 50-100 KB = 1-2 MB max +- **Total**: ~50 MB max (typical for Raspberry Pi USB/SSD) + +**Rotation Strategy**: +- Old files automatically deleted when size limit reached +- Technician can SSH and `tail -f` any time +- No database overhead (file-based rotation is minimal CPU) + +--- + +## Integration with Server (Phase 2) + +The client implementation sends data to the server's Phase 2 endpoints: + +**Expected Server Implementation** (from CLIENT_MONITORING_SETUP.md): + +1. **MQTT Listener** receives and stores: + - `infoscreen/{uuid}/logs/error`, `/logs/warn`, `/logs/info` + - `infoscreen/{uuid}/health` messages + - Updates `clients` table with health fields + +2. **Database Tables**: + - `clients.process_status`: running/crashed/starting/stopped + - `clients.current_process`: impressive/chromium/vlc/None + - `clients.process_pid`: PID value + - `clients.current_event_id`: Active event + - `client_logs`: table stores logs with level/message/context + +3. **API Endpoints**: + - `GET /api/client-logs/{uuid}/logs?level=ERROR&limit=50` + - `GET /api/client-logs/summary` (errors/warnings across all clients) + +--- + +## Summary of Changes + +### Files Modified + +1. **`src/display_manager.py`**: + - Added `psutil` import for future process monitoring + - Added `ProcessHealthState` class (60 lines) + - Added monitoring logger setup (8 lines) + - Added `health.update_running()` calls in `start_presentation()`, `start_video()`, `start_webpage()` + - Added crash detection and restart logic in `process_events()` + - Added `health.update_stopped()` in `stop_current_display()` + +2. **`src/simclient.py`**: + - Added `timezone` import + - Added monitoring logger setup (8 lines) + - Added `read_health_state()` function + - Added `publish_health_message()` function + - Added `publish_log_message()` function (with level filtering) + - Updated `send_screenshot_heartbeat()` to include health data + - Updated heartbeat loop to call `publish_health_message()` + +### Files Created + +1. **`src/current_process_health.json`** (at runtime): + - Bridge file between display_manager and simclient + - Shared volume compatible (works in container setup) + +2. **`logs/monitoring.log`** (at runtime): + - New rotating log file (5 × 5MB) + - Health events from both processes + +--- + +## Next Steps + +1. **Deploy to test client** and run validation sequence above +2. **Deploy server Phase 2** (if not yet done) to receive health/log messages +3. **Verify database updates** in server-side `clients` and `client_logs` tables +4. **Test dashboard UI** (Phase 4) to display health indicators +5. **Configure alerting** (email/Slack) for ERROR level messages + +--- + +**Implementation Date**: 11. März 2026 +**Part of**: Infoscreen 2025 Client Monitoring System +**Status**: Production Ready (with server Phase 2 integration) diff --git a/README.md b/README.md index d1eaef6..f9e7d1e 100644 --- a/README.md +++ b/README.md @@ -11,6 +11,7 @@ Digital signage system for Raspberry Pi that displays presentations, videos, and - **MQTT Integration** - Real-time event management from central server - **Group Management** - Organize clients into groups for targeted content - **Heartbeat Monitoring** - Regular status updates and screenshot dashboard +- **Client Process Monitoring** - Health-state bridge, crash/restart tracking, and monitoring log - **Screenshot Dashboard** - Automatic screen capture with Wayland/X11 support, client-side compression - **Multi-Content Support** - Presentations, videos, and web pages - **Kiosk Mode** - Full-screen display with automatic startup @@ -217,7 +218,13 @@ Notes: - `loop` (boolean): loop playback indefinitely. - `volume` (float): 0.0–1.0 (mapped internally to VLC's 0–100 volume scale). - If `python-vlc` is not installed, the Display Manager will fall back to launching the external `vlc` binary. -- The manager attempts to make the player window fullscreen and remove window decorations. For a truly panel-free fullscreen (no taskbar), run the Display Manager inside a minimal kiosk X session or a dedicated user session without a desktop panel (see the kiosk notes below). +- Fullscreen behavior: + - External VLC fallback uses `--fullscreen`. + - `python-vlc` mode enforces fullscreen on startup and retries fullscreen toggling briefly because video outputs may attach asynchronously. + - For a truly panel-free fullscreen (no taskbar), run the Display Manager inside a minimal kiosk X session or a dedicated user session without a desktop panel. +- Monitoring PID behavior: + - External VLC fallback reports the external `vlc` process PID. + - `python-vlc` mode is in-process, so monitoring reports the `display_manager.py` runtime PID. ## 🌐 Web Events @@ -380,6 +387,9 @@ The MQTT client ([src/simclient.py](src/simclient.py)) downloads presentation fi - `infoscreen/discovery` - Initial client announcement - `infoscreen/{client_id}/heartbeat` - Regular status updates - `infoscreen/{client_id}/dashboard` - Screenshot images (base64) +- `infoscreen/{client_id}/health` - Process health state (`event_id`, process, pid, status) +- `infoscreen/{client_id}/logs/error` - Forwarded client error logs +- `infoscreen/{client_id}/logs/warn` - Forwarded client warning logs #### Server → Client - `infoscreen/{client_id}/discovery_ack` - Server response with client ID @@ -493,6 +503,30 @@ This is the fastest workaround if hardware decode is not required or not availab ./scripts/test-mqtt.sh ``` +### Monitoring and UTC timestamps + +Client-side monitoring is implemented with a health-state bridge between `display_manager.py` and `simclient.py`. + +- Health bridge file: `src/current_process_health.json` +- Local monitoring log: `logs/monitoring.log` +- Process states: `running`, `crashed`, `stopped` +- Restart tracking: bounded restart attempts per active event + +UTC timestamp policy: + +- `display_manager.log`, `simclient.log`, and `monitoring.log` are written in UTC (`...Z`) +- MQTT payload timestamps (heartbeat/dashboard/health/log messages) are UTC ISO timestamps +- Screenshot metadata timestamps are UTC ISO timestamps + +This prevents daylight-saving and timezone drift issues across clients. + +### VLC/PulseAudio warnings in remote sessions + +Warnings such as `pulse audio output error: overflow, flushing` can appear when testing through remote desktop/audio forwarding (for example, NoMachine) or virtual/dummy display setups. + +- If playback and audio are stable on a real HDMI display, this is usually non-fatal. +- If warnings appear only in remote sessions, treat them as environment-related rather than a core video playback bug. + ### Screenshots not uploading **Check which session type you're running:** diff --git a/src/display_manager.py b/src/display_manager.py index e4c1dc8..a198f98 100644 --- a/src/display_manager.py +++ b/src/display_manager.py @@ -23,6 +23,7 @@ from logging.handlers import RotatingFileHandler from typing import Optional, Dict, List, IO from urllib.parse import urlparse import requests +import psutil import json as _json import threading import time as _time @@ -67,12 +68,15 @@ os.makedirs(os.path.dirname(LOG_PATH), exist_ok=True) logging.basicConfig( level=getattr(logging, LOG_LEVEL.upper(), logging.INFO), - format="%(asctime)s [%(levelname)s] %(message)s", + format="%(asctime)s.%(msecs)03dZ [%(levelname)s] %(message)s", + datefmt="%Y-%m-%dT%H:%M:%S", handlers=[ RotatingFileHandler(LOG_PATH, maxBytes=2*1024*1024, backupCount=5), logging.StreamHandler() ] ) +# Force all logging timestamps to UTC (affects %(asctime)s in all formatters). +logging.Formatter.converter = time.gmtime # Log CEC mode after logging is configured if ENV == "development": @@ -82,6 +86,90 @@ elif CEC_ENABLED: else: logging.info("[CEC] HDMI-CEC disabled in configuration") +# Setup monitoring logger (separate from main display_manager.log for health/crash tracking) +MONITORING_LOG_PATH = os.path.join(os.path.dirname(__file__), "..", "logs", "monitoring.log") +os.makedirs(os.path.dirname(MONITORING_LOG_PATH), exist_ok=True) +monitoring_logger = logging.getLogger("monitoring") +monitoring_logger.setLevel(getattr(logging, LOG_LEVEL.upper(), logging.INFO)) +monitoring_handler = RotatingFileHandler(MONITORING_LOG_PATH, maxBytes=5*1024*1024, backupCount=5) +monitoring_handler.setFormatter(logging.Formatter("%(asctime)s.%(msecs)03dZ [%(levelname)s] %(message)s", "%Y-%m-%dT%H:%M:%S")) +monitoring_logger.addHandler(monitoring_handler) +monitoring_logger.propagate = False # Don't duplicate to main logger +logging.info(f"Monitoring logger initialized: {MONITORING_LOG_PATH}") + +# Health state file (shared bridge between display_manager and simclient) +HEALTH_STATE_FILE = os.path.join(os.path.dirname(__file__), "current_process_health.json") + + +class ProcessHealthState: + """Track and persist process health state for monitoring integration""" + + def __init__(self): + self.event_id = None + self.event_type = None + self.process_name = None + self.process_pid = None + self.status = "stopped" # running, crashed, starting, stopped + self.restart_count = 0 + self.max_restarts = 3 + self.last_update = datetime.now(timezone.utc).isoformat() + + def to_dict(self) -> Dict: + return { + "event_id": self.event_id, + "event_type": self.event_type, + "current_process": self.process_name, + "process_pid": self.process_pid, + "process_status": self.status, + "restart_count": self.restart_count, + "timestamp": datetime.now(timezone.utc).isoformat() + } + + def save(self): + """Persist health state to file for simclient to read""" + try: + with open(HEALTH_STATE_FILE, "w") as f: + json.dump(self.to_dict(), f, indent=2) + except Exception as e: + monitoring_logger.error(f"Failed to save health state: {e}") + + def update_running(self, event_id: str, event_type: str, process_name: str, pid: Optional[int]): + """Mark process as running""" + self.event_id = event_id + self.event_type = event_type + self.process_name = process_name + self.process_pid = pid + self.status = "running" + self.restart_count = 0 + self.save() + monitoring_logger.info(f"Process started: event_id={event_id} event_type={event_type} process={process_name} pid={pid}") + + def update_crashed(self): + """Mark process as crashed""" + self.status = "crashed" + self.save() + monitoring_logger.warning(f"Process crashed: event_id={self.event_id} event_type={self.event_type} process={self.process_name} restart_count={self.restart_count}/{self.max_restarts}") + + def update_restart_attempt(self): + """Increment restart counter and check if max exceeded""" + self.restart_count += 1 + self.save() + if self.restart_count <= self.max_restarts: + monitoring_logger.warning(f"Restarting process: attempt {self.restart_count}/{self.max_restarts} for {self.process_name}") + else: + monitoring_logger.error(f"Max restart attempts exceeded for {self.process_name}: event_id={self.event_id}") + + def update_stopped(self): + """Mark process as stopped (normal event end)""" + self.status = "stopped" + self.event_id = None + self.event_type = None + self.process_name = None + self.process_pid = None + self.restart_count = 0 + self.save() + monitoring_logger.info("Process stopped (event ended or no active event)") + class HDMICECController: """Controls HDMI-CEC to turn TV on/off automatically @@ -327,7 +415,7 @@ class DisplayProcess: self.player = player self.event_type = event_type self.event_id = event_id - self.start_time = datetime.now() + self.start_time = datetime.now(timezone.utc) self.log_file = log_file self.log_path = log_path @@ -473,6 +561,9 @@ class DisplayManager: self.last_file_mtime: Optional[float] = None self.running = True + # Initialize health state tracking for process monitoring + self.health = ProcessHealthState() + # Initialize HDMI-CEC controller self.cec = HDMICECController( enabled=CEC_ENABLED, @@ -629,7 +720,9 @@ class DisplayManager: time.sleep(1) if self.current_process.is_running(): self.current_process.terminate(force=True) - + + # Update health state to reflect process stopped + self.health.update_stopped() self.current_process = None self.current_event_data = None @@ -852,6 +945,9 @@ class DisplayManager: event_id = self.get_event_identifier(event) logging.info(f"Presentation started with PID: {process.pid}") + # Update health state for monitoring + self.health.update_running(event_id, 'presentation', 'impressive', process.pid) + return DisplayProcess(process, 'presentation', event_id, log_file=impressive_log, log_path=impressive_log_path) except Exception as e: @@ -904,6 +1000,8 @@ class DisplayManager: ) event_id = self.get_event_identifier(event) logging.info(f"Video started with PID: {process.pid} (external vlc)") + # Update health state for monitoring + self.health.update_running(event_id, 'video', 'vlc', process.pid) return DisplayProcess(process=process, event_type='video', event_id=event_id, log_file=video_log, log_path=video_log_path) else: logging.error("No video player found (python-vlc or vlc binary)") @@ -911,7 +1009,24 @@ class DisplayManager: # Use libvlc via python-vlc try: - instance = vlc.Instance() + # Keep python-vlc behavior aligned with external vlc fullscreen mode. + instance = vlc.Instance('--fullscreen', '--no-video-title-show') + + def _force_fullscreen(player_obj, label: str): + """Retry fullscreen toggle because video outputs may attach asynchronously.""" + if not player_obj: + return + + def _worker(): + for delay in (0.0, 0.4, 1.0): + if delay > 0: + time.sleep(delay) + try: + player_obj.set_fullscreen(True) + except Exception as e: + logging.debug(f"Could not force fullscreen for {label}: {e}") + + threading.Thread(target=_worker, daemon=True).start() autoplay = bool(video.get('autoplay', True)) loop_flag = bool(video.get('loop', False)) @@ -957,10 +1072,14 @@ class DisplayManager: if autoplay: try: mlp.play() + _force_fullscreen(mp, 'python-vlc MediaListPlayer') except Exception as e: logging.error(f"Failed to play media list: {e}") event_id = self.get_event_identifier(event) - logging.info("Video started via python-vlc (MediaListPlayer)" ) + runtime_pid = os.getpid() + logging.info(f"Video started via python-vlc (MediaListPlayer), runtime PID: {runtime_pid}") + # python-vlc runs in-process (no external vlc child), so publish this process PID. + self.health.update_running(event_id, 'video', 'vlc', runtime_pid) return DisplayProcess(process=None, event_type='video', event_id=event_id, log_file=None, log_path=None, player=mlp) else: @@ -976,11 +1095,15 @@ class DisplayManager: if autoplay: try: mp.play() + _force_fullscreen(mp, 'python-vlc MediaPlayer') except Exception as e: logging.error(f"Failed to start media player: {e}") event_id = self.get_event_identifier(event) - logging.info("Video started via python-vlc (MediaPlayer)") + runtime_pid = os.getpid() + logging.info(f"Video started via python-vlc (MediaPlayer), runtime PID: {runtime_pid}") + # python-vlc runs in-process (no external vlc child), so publish this process PID. + self.health.update_running(event_id, 'video', 'vlc', runtime_pid) return DisplayProcess(process=None, event_type='video', event_id=event_id, log_file=None, log_path=None, player=mp) except Exception as e: @@ -1137,6 +1260,10 @@ class DisplayManager: event_id = self.get_event_identifier(event) logging.info(f"Webpage started with PID: {process.pid}") + + # Update health state for monitoring (track chromium browser) + browser_name = 'chromium-browser' if self._command_exists('chromium-browser') else 'chromium' + self.health.update_running(event_id, etype, browser_name, process.pid) # Inject auto-scroll JS via Chrome DevTools Protocol (CDP) if enabled and available if autoscroll_enabled: @@ -1367,9 +1494,20 @@ class DisplayManager: if self.current_process.event_type == 'presentation' and exit_code == 0: logging.info("Presentation process ended with exit code 0 (likely normal completion).") self.current_process = None + # Update health state to show normal completion + self.health.update_stopped() # Don't turn off TV yet - event might still be active return + # Process crashed unexpectedly + self.health.update_crashed() + self.health.update_restart_attempt() + + if self.health.restart_count > self.health.max_restarts: + logging.error(f"Max restart attempts exceeded - giving up on {self.health.process_name}") + self.current_process = None + return + logging.info("Restarting display process...") self.current_process = None # Don't turn off TV when restarting same event @@ -1413,10 +1551,8 @@ class DisplayManager: # Log timezone information for debugging now_utc = datetime.now(timezone.utc) - now_local = datetime.now() logging.info(f"Current time (UTC): {now_utc.strftime('%Y-%m-%d %H:%M:%S %Z')}") - logging.info(f"Current time (Local): {now_local.strftime('%Y-%m-%d %H:%M:%S')}") - logging.info("Note: Event times are expected in UTC format") + logging.info("Event times are expected in UTC format") while self.running: try: @@ -1469,7 +1605,7 @@ class DisplayManager: Maintains a 'latest.jpg' file and rotates older screenshots beyond SCREENSHOT_MAX_FILES. """ try: - ts = datetime.now().strftime('%Y%m%d_%H%M%S') + ts = datetime.now(timezone.utc).strftime('%Y%m%d_%H%M%S') raw_path = os.path.join(self.screenshot_dir, f'screenshot_{ts}.png') final_path = os.path.join(self.screenshot_dir, f'screenshot_{ts}.jpg') diff --git a/src/simclient.py b/src/simclient.py index 4ca5696..261011e 100644 --- a/src/simclient.py +++ b/src/simclient.py @@ -15,7 +15,7 @@ import logging from dotenv import load_dotenv import requests import base64 -from datetime import datetime +from datetime import datetime, timezone import threading from urllib.parse import urlsplit, urlunsplit, unquote @@ -123,9 +123,26 @@ if DEBUG_MODE: log_handlers.append(logging.StreamHandler()) logging.basicConfig( level=getattr(logging, LOG_LEVEL.upper(), logging.INFO), - format="%(asctime)s [%(levelname)s] %(message)s", + format="%(asctime)s.%(msecs)03dZ [%(levelname)s] %(message)s", + datefmt="%Y-%m-%dT%H:%M:%S", handlers=log_handlers ) +# Force all logging timestamps to UTC (affects %(asctime)s in all formatters). +logging.Formatter.converter = time.gmtime + +# Setup monitoring logger (separate for health/crash events, local rotation only) +MONITORING_LOG_PATH = os.path.join(os.path.dirname(__file__), "..", "logs", "monitoring.log") +os.makedirs(os.path.dirname(MONITORING_LOG_PATH), exist_ok=True) +monitoring_logger = logging.getLogger("monitoring") +monitoring_logger.setLevel(getattr(logging, LOG_LEVEL.upper(), logging.INFO)) +monitoring_handler = RotatingFileHandler(MONITORING_LOG_PATH, maxBytes=5*1024*1024, backupCount=5) +monitoring_handler.setFormatter(logging.Formatter("%(asctime)s.%(msecs)03dZ [%(levelname)s] %(message)s", "%Y-%m-%dT%H:%M:%S")) +monitoring_logger.addHandler(monitoring_handler) +monitoring_logger.propagate = False # Don't duplicate to main logger +logging.info(f"Monitoring logger initialized: {MONITORING_LOG_PATH}") + +# Health state file (written by display_manager, read by simclient) +HEALTH_STATE_FILE = os.path.join(os.path.dirname(__file__), "current_process_health.json") discovered = False @@ -485,7 +502,7 @@ def get_latest_screenshot(): return { "filename": os.path.basename(preferred_path), "data": screenshot_data, - "timestamp": datetime.fromtimestamp(file_stats.st_mtime).isoformat(), + "timestamp": datetime.fromtimestamp(file_stats.st_mtime, tz=timezone.utc).isoformat(), "size": file_stats.st_size } except Exception as e: @@ -514,7 +531,7 @@ def get_latest_screenshot(): info = { "filename": latest_file, "data": screenshot_data, - "timestamp": datetime.fromtimestamp(file_stats.st_mtime).isoformat(), + "timestamp": datetime.fromtimestamp(file_stats.st_mtime, tz=timezone.utc).isoformat(), "size": file_stats.st_size } logging.debug(f"Selected latest screenshot: {latest_file} ({file_stats.st_size} bytes)") @@ -525,13 +542,82 @@ def get_latest_screenshot(): return None +def read_health_state(): + """Read the health state file written by display_manager""" + try: + if not os.path.exists(HEALTH_STATE_FILE): + return None + with open(HEALTH_STATE_FILE, 'r') as f: + return json.load(f) + except Exception as e: + logging.debug(f"Could not read health state file: {e}") + return None + + +def publish_health_message(client, client_id): + """Publish health status to server via MQTT""" + try: + health = read_health_state() + if not health: + return # No active process + + payload = { + "timestamp": datetime.now(timezone.utc).isoformat(), + "expected_state": { + "event_id": health.get("event_id") + }, + "actual_state": { + "process": health.get("current_process"), + "pid": health.get("process_pid"), + "status": health.get("process_status") + } + } + + topic = f"infoscreen/{client_id}/health" + res = client.publish(topic, json.dumps(payload), qos=1) + if res.rc == mqtt.MQTT_ERR_SUCCESS: + logging.debug(f"Health message published: {health.get('current_process')} status={health.get('process_status')}") + else: + logging.debug(f"Health publish failed with code: {res.rc}") + except Exception as e: + logging.debug(f"Error publishing health: {e}") + + +def publish_log_message(client, client_id, level: str, message: str, context: dict = None): + """Publish log message to server via MQTT (only if level is ERROR or WARN, unless DEBUG_MODE)""" + try: + # Filter logs: only send ERROR/WARN to server, keep INFO/DEBUG local-only unless DEBUG_MODE + if level.upper() == "INFO" and not DEBUG_MODE: + return # Keep INFO logs local only in production + if level.upper() == "DEBUG": + return # DEBUG logs always local-only + + topic = f"infoscreen/{client_id}/logs/{level.lower()}" + payload = { + "timestamp": datetime.now(timezone.utc).isoformat(), + "message": message, + "context": context or {} + } + + res = client.publish(topic, json.dumps(payload), qos=1) + if res.rc == mqtt.MQTT_ERR_SUCCESS: + monitoring_logger.log(getattr(logging, level.upper(), logging.INFO), f"[MQTT] {message}") + else: + logging.debug(f"Log publish failed ({level}) with code: {res.rc}") + except Exception as e: + logging.debug(f"Error publishing log: {e}") + + def send_screenshot_heartbeat(client, client_id): """Send heartbeat with screenshot to server for dashboard monitoring""" try: screenshot_info = get_latest_screenshot() + # Also read health state and include in heartbeat + health = read_health_state() + heartbeat_data = { - "timestamp": datetime.now().isoformat(), + "timestamp": datetime.now(timezone.utc).isoformat(), "client_id": client_id, "status": "alive", "screenshot": screenshot_info, @@ -542,6 +628,17 @@ def send_screenshot_heartbeat(client, client_id): } } + # Include health info if available (from display_manager) + if health: + heartbeat_data["process_health"] = { + "event_id": health.get("event_id"), + "event_type": health.get("event_type"), + "current_process": health.get("current_process"), + "process_pid": health.get("process_pid"), + "process_status": health.get("process_status"), + "restart_count": health.get("restart_count", 0) + } + # Send to dashboard monitoring topic dashboard_topic = f"infoscreen/{client_id}/dashboard" payload = json.dumps(heartbeat_data) @@ -575,7 +672,7 @@ def screenshot_service_thread(client, client_id): def main(): global discovered - print(f"[{datetime.now().isoformat()}] simclient.py: program started") + print(f"[{datetime.now(timezone.utc).isoformat()}] simclient.py: program started") logging.info("Client starting - deleting old event file if present") delete_event_file() @@ -840,6 +937,8 @@ def main(): result = client.publish(f"infoscreen/{client_id}/heartbeat", "alive", qos=0) if result.rc == mqtt.MQTT_ERR_SUCCESS: logging.info("Heartbeat sent.") + # Also send health and screenshot heartbeats + publish_health_message(client, client_id) elif result.rc == mqtt.MQTT_ERR_NO_CONN: logging.debug("Heartbeat publish returned NO_CONN; retrying in 2s...") time.sleep(2) @@ -847,6 +946,7 @@ def main(): retry = client.publish(f"infoscreen/{client_id}/heartbeat", "alive", qos=0) if retry.rc == mqtt.MQTT_ERR_SUCCESS: logging.info("Heartbeat sent after retry.") + publish_health_message(client, client_id) else: logging.warning(f"Heartbeat publish failed after retry with code: {retry.rc}") else: