feat(client-monitoring): finalize client-side monitoring and UTC logging
- add process health bridge and monitoring flow between display_manager and simclient - publish health + warn/error log topics over MQTT - standardize log/payload/screenshot timestamps to UTC (Z) to avoid DST drift - improve video handling: python-vlc fullscreen enforcement and runtime PID reporting - update README and copilot instructions with monitoring architecture and troubleshooting - add Phase 3 monitoring implementation documentation - update gitignore for new runtime/log artifacts
This commit is contained in:
17
.github/copilot-instructions.md
vendored
17
.github/copilot-instructions.md
vendored
@@ -14,8 +14,9 @@
|
|||||||
- **Display logic**: `src/display_manager.py` (controls presentations/video/web)
|
- **Display logic**: `src/display_manager.py` (controls presentations/video/web)
|
||||||
- **MQTT client**: `src/simclient.py` (event management, heartbeat, discovery)
|
- **MQTT client**: `src/simclient.py` (event management, heartbeat, discovery)
|
||||||
- **Runtime state**: `src/current_event.json` (current active event)
|
- **Runtime state**: `src/current_event.json` (current active event)
|
||||||
|
- **Process health bridge**: `src/current_process_health.json` (display_manager -> simclient)
|
||||||
- **Config**: `src/config/client_uuid.txt`, `src/config/last_group_id.txt`, `.env`
|
- **Config**: `src/config/client_uuid.txt`, `src/config/last_group_id.txt`, `.env`
|
||||||
- **Logs**: `logs/display_manager.log`, `logs/simclient.log`
|
- **Logs**: `logs/display_manager.log`, `logs/simclient.log`, `logs/monitoring.log`
|
||||||
- **Screenshots**: `src/screenshots/` (shared volume between processes)
|
- **Screenshots**: `src/screenshots/` (shared volume between processes)
|
||||||
|
|
||||||
### Common Tasks Quick Reference
|
### Common Tasks Quick Reference
|
||||||
@@ -23,6 +24,8 @@
|
|||||||
|------|------|-------------------|
|
|------|------|-------------------|
|
||||||
| Add event type | `display_manager.py` | `start_display_for_event()` |
|
| Add event type | `display_manager.py` | `start_display_for_event()` |
|
||||||
| Modify presentation | `display_manager.py` | `start_presentation()` |
|
| Modify presentation | `display_manager.py` | `start_presentation()` |
|
||||||
|
| Modify process monitoring | `display_manager.py` | `ProcessHealthState`, `process_events()` |
|
||||||
|
| Publish health/log topics | `simclient.py` | `read_health_state()`, `publish_health_message()`, `publish_log_message()` |
|
||||||
| Change MQTT topics | `simclient.py` | Topic constants/handlers |
|
| Change MQTT topics | `simclient.py` | Topic constants/handlers |
|
||||||
| Update screenshot | `display_manager.py` | `_capture_screenshot()` |
|
| Update screenshot | `display_manager.py` | `_capture_screenshot()` |
|
||||||
| File downloads | `simclient.py` | `resolve_file_url()` |
|
| File downloads | `simclient.py` | `resolve_file_url()` |
|
||||||
@@ -62,6 +65,8 @@
|
|||||||
### MQTT Communication Patterns
|
### MQTT Communication Patterns
|
||||||
- **Discovery**: `infoscreen/discovery` → `infoscreen/{client_id}/discovery_ack`
|
- **Discovery**: `infoscreen/discovery` → `infoscreen/{client_id}/discovery_ack`
|
||||||
- **Heartbeat**: Regular `infoscreen/{client_id}/heartbeat` messages
|
- **Heartbeat**: Regular `infoscreen/{client_id}/heartbeat` messages
|
||||||
|
- **Health**: `infoscreen/{client_id}/health` (event/process/pid/status)
|
||||||
|
- **Client logs**: `infoscreen/{client_id}/logs/error|warn` (selective forwarding)
|
||||||
### MQTT Reconnection & Heartbeat (Nov 2025)
|
### MQTT Reconnection & Heartbeat (Nov 2025)
|
||||||
- The client uses Paho MQTT v2 callback API with `client.loop_start()` and `client.reconnect_delay_set()` to handle automatic reconnection.
|
- The client uses Paho MQTT v2 callback API with `client.loop_start()` and `client.reconnect_delay_set()` to handle automatic reconnection.
|
||||||
- `on_connect` re-subscribes to all topics (`discovery_ack`, `config`, `group_id`, current group events) and re-sends discovery on reconnect to re-register with the server.
|
- `on_connect` re-subscribes to all topics (`discovery_ack`, `config`, `group_id`, current group events) and re-sends discovery on reconnect to re-register with the server.
|
||||||
@@ -255,11 +260,20 @@ FILE_SERVER_SCHEME=http # http or https
|
|||||||
- **Fallback**: External vlc binary
|
- **Fallback**: External vlc binary
|
||||||
- **Fields**: `url`, `autoplay` (bool), `loop` (bool), `volume` (0.0-1.0 → 0-100)
|
- **Fields**: `url`, `autoplay` (bool), `loop` (bool), `volume` (0.0-1.0 → 0-100)
|
||||||
- **URL rewriting**: `server` host → configured file server
|
- **URL rewriting**: `server` host → configured file server
|
||||||
|
- **Fullscreen**: enforced for python-vlc on startup (with short retry toggles); external fallback uses `--fullscreen`
|
||||||
|
- **Monitoring PID semantics**: python-vlc runs in-process, so PID is `display_manager.py` runtime PID; external fallback uses external `vlc` PID
|
||||||
- **HW decode errors**: `h264_v4l2m2m` failures are normal if V4L2 M2M unavailable; use software decode
|
- **HW decode errors**: `h264_v4l2m2m` failures are normal if V4L2 M2M unavailable; use software decode
|
||||||
- Robust payload parsing with fallbacks
|
- Robust payload parsing with fallbacks
|
||||||
- Topic-specific message handlers
|
- Topic-specific message handlers
|
||||||
- Retained message support where appropriate
|
- Retained message support where appropriate
|
||||||
|
|
||||||
|
### Logging & Timestamp Policy (Mar 2026)
|
||||||
|
- Client logs are standardized to UTC with `Z` suffix to avoid DST/localtime drift.
|
||||||
|
- Applies to `display_manager.log`, `simclient.log`, and `monitoring.log`.
|
||||||
|
- MQTT payload timestamps for heartbeat/dashboard/health/log messages are UTC ISO timestamps.
|
||||||
|
- Screenshot metadata timestamps included by `simclient.py` are UTC ISO timestamps.
|
||||||
|
- Prefer UTC-aware calls (`datetime.now(timezone.utc)`) and UTC log formatters for new code.
|
||||||
|
|
||||||
## Hardware Considerations
|
## Hardware Considerations
|
||||||
|
|
||||||
### Target Platform
|
### Target Platform
|
||||||
@@ -490,6 +504,7 @@ The screenshot capture and transmission system has been implemented with separat
|
|||||||
- **Large payloads**: Reduce `SCREENSHOT_MAX_WIDTH` or `SCREENSHOT_JPEG_QUALITY`
|
- **Large payloads**: Reduce `SCREENSHOT_MAX_WIDTH` or `SCREENSHOT_JPEG_QUALITY`
|
||||||
- **Stale screenshots**: Check `latest.jpg` symlink, verify display_manager is running
|
- **Stale screenshots**: Check `latest.jpg` symlink, verify display_manager is running
|
||||||
- **MQTT errors**: Check dashboard topic logs for publish return codes
|
- **MQTT errors**: Check dashboard topic logs for publish return codes
|
||||||
|
- **Pulse overflow in remote sessions**: warnings like `pulse audio output error: overflow, flushing` can occur with NoMachine/dummy displays; if HDMI playback is stable, treat as environment-related
|
||||||
### Testing & Troubleshooting
|
### Testing & Troubleshooting
|
||||||
**Setup:**
|
**Setup:**
|
||||||
- X11: `sudo apt install scrot imagemagick`
|
- X11: `sudo apt install scrot imagemagick`
|
||||||
|
|||||||
1
.gitignore
vendored
1
.gitignore
vendored
@@ -134,3 +134,4 @@ logs/
|
|||||||
.git.legacy_backup*
|
.git.legacy_backup*
|
||||||
|
|
||||||
src/pi-dev-setup-new.sh
|
src/pi-dev-setup-new.sh
|
||||||
|
src/current_process_health.json
|
||||||
|
|||||||
533
PHASE_3_CLIENT_MONITORING_IMPLEMENTATION.md
Normal file
533
PHASE_3_CLIENT_MONITORING_IMPLEMENTATION.md
Normal file
@@ -0,0 +1,533 @@
|
|||||||
|
# Phase 3: Client-Side Monitoring Implementation
|
||||||
|
|
||||||
|
**Status**: ✅ COMPLETE
|
||||||
|
**Date**: 11. März 2026
|
||||||
|
**Architecture**: Two-process design with health-state bridge
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Overview
|
||||||
|
|
||||||
|
This document describes the **Phase 3** client-side monitoring implementation integrated into the existing infoscreen-dev codebase. The implementation adds:
|
||||||
|
|
||||||
|
1. ✅ **Health-state tracking** for all display processes (Impressive, Chromium, VLC)
|
||||||
|
2. ✅ **Tiered logging**: Local rotating logs + selective MQTT transmission
|
||||||
|
3. ✅ **Process crash detection** with bounded restart attempts
|
||||||
|
4. ✅ **MQTT health/log topics** feeding the monitoring server
|
||||||
|
5. ✅ **Impressive-aware process mapping** (presentations → impressive, websites → chromium, videos → vlc)
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Architecture
|
||||||
|
|
||||||
|
### Two-Process Design
|
||||||
|
|
||||||
|
```
|
||||||
|
┌─────────────────────────────────────────────────────────┐
|
||||||
|
│ simclient.py (MQTT Client) │
|
||||||
|
│ - Discovers device, sends heartbeat │
|
||||||
|
│ - Downloads presentation files │
|
||||||
|
│ - Reads health state from display_manager │
|
||||||
|
│ - Publishes health/log messages to MQTT │
|
||||||
|
│ - Sends screenshots for dashboard │
|
||||||
|
└────────┬────────────────────────────────────┬───────────┘
|
||||||
|
│ │
|
||||||
|
│ reads: current_process_health.json │
|
||||||
|
│ │
|
||||||
|
│ writes: current_event.json │
|
||||||
|
│ │
|
||||||
|
┌────────▼────────────────────────────────────▼───────────┐
|
||||||
|
│ display_manager.py (Display Control) │
|
||||||
|
│ - Monitors events and manages displays │
|
||||||
|
│ - Launches Impressive (presentations) │
|
||||||
|
│ - Launches Chromium (websites) │
|
||||||
|
│ - Launches VLC (videos) │
|
||||||
|
│ - Tracks process health and crashes │
|
||||||
|
│ - Detects and restarts crashed processes │
|
||||||
|
│ - Writes health state to JSON bridge │
|
||||||
|
│ - Captures screenshots to shared folder │
|
||||||
|
└─────────────────────────────────────────────────────────┘
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Implementation Details
|
||||||
|
|
||||||
|
### 1. Health State Tracking (display_manager.py)
|
||||||
|
|
||||||
|
**File**: `src/display_manager.py`
|
||||||
|
**New Class**: `ProcessHealthState`
|
||||||
|
|
||||||
|
Tracks process health and persists to JSON for simclient to read:
|
||||||
|
|
||||||
|
```python
|
||||||
|
class ProcessHealthState:
|
||||||
|
"""Track and persist process health state for monitoring integration"""
|
||||||
|
|
||||||
|
- event_id: Currently active event identifier
|
||||||
|
- event_type: presentation, website, video, or None
|
||||||
|
- process_name: impressive, chromium-browser, vlc, or None
|
||||||
|
- process_pid: Process ID or None for libvlc
|
||||||
|
- status: running, crashed, starting, stopped
|
||||||
|
- restart_count: Number of restart attempts
|
||||||
|
- max_restarts: Maximum allowed restarts (3)
|
||||||
|
```
|
||||||
|
|
||||||
|
Methods:
|
||||||
|
- `update_running()` - Mark process as started (logs to monitoring.log)
|
||||||
|
- `update_crashed()` - Mark process as crashed (warning to monitoring.log)
|
||||||
|
- `update_restart_attempt()` - Increment restart counter (logs attempt and checks max)
|
||||||
|
- `update_stopped()` - Mark process as stopped (info to monitoring.log)
|
||||||
|
- `save()` - Persist state to `src/current_process_health.json`
|
||||||
|
|
||||||
|
**New Health State File**: `src/current_process_health.json`
|
||||||
|
|
||||||
|
```json
|
||||||
|
{
|
||||||
|
"event_id": "event_123",
|
||||||
|
"event_type": "presentation",
|
||||||
|
"current_process": "impressive",
|
||||||
|
"process_pid": 1234,
|
||||||
|
"process_status": "running",
|
||||||
|
"restart_count": 0,
|
||||||
|
"timestamp": "2026-03-11T10:30:45.123456+00:00"
|
||||||
|
}
|
||||||
|
```
|
||||||
|
|
||||||
|
### 2. Monitoring Logger (both files)
|
||||||
|
|
||||||
|
**Local Rotating Logs**: 5 files × 5 MB each = 25 MB max per device
|
||||||
|
|
||||||
|
**display_manager.py**:
|
||||||
|
```python
|
||||||
|
MONITORING_LOG_PATH = "logs/monitoring.log"
|
||||||
|
monitoring_logger = logging.getLogger("monitoring")
|
||||||
|
monitoring_handler = RotatingFileHandler(MONITORING_LOG_PATH, maxBytes=5*1024*1024, backupCount=5)
|
||||||
|
```
|
||||||
|
|
||||||
|
**simclient.py**:
|
||||||
|
- Shares same `logs/monitoring.log` file
|
||||||
|
- Both processes write to monitoring logger for health events
|
||||||
|
- Local logs never rotate (persisted for technician inspection)
|
||||||
|
|
||||||
|
**Log Filtering** (tiered strategy):
|
||||||
|
- **ERROR**: Local + MQTT (published to `infoscreen/{uuid}/logs/error`)
|
||||||
|
- **WARN**: Local + MQTT (published to `infoscreen/{uuid}/logs/warn`)
|
||||||
|
- **INFO**: Local only (unless `DEBUG_MODE=1`)
|
||||||
|
- **DEBUG**: Local only (always)
|
||||||
|
|
||||||
|
### 3. Process Mapping with Impressive Support
|
||||||
|
|
||||||
|
**display_manager.py** - When starting processes:
|
||||||
|
|
||||||
|
| Event Type | Process Name | Health Status |
|
||||||
|
|-----------|--------------|---------------|
|
||||||
|
| presentation | `impressive` | tracked with PID |
|
||||||
|
| website/webpage/webuntis | `chromium` or `chromium-browser` | tracked with PID |
|
||||||
|
| video | `vlc` | tracked (may have no PID if using libvlc) |
|
||||||
|
|
||||||
|
**Per-Process Updates**:
|
||||||
|
- Presentation: `health.update_running('event_id', 'presentation', 'impressive', pid)`
|
||||||
|
- Website: `health.update_running('event_id', 'website', browser_name, pid)`
|
||||||
|
- Video: `health.update_running('event_id', 'video', 'vlc', pid or None)`
|
||||||
|
|
||||||
|
### 4. Crash Detection and Restart Logic
|
||||||
|
|
||||||
|
**display_manager.py** - `process_events()` method:
|
||||||
|
|
||||||
|
```
|
||||||
|
If process not running AND same event_id:
|
||||||
|
├─ Check exit code
|
||||||
|
├─ If presentation with exit code 0: Normal completion (no restart)
|
||||||
|
├─ Else: Mark crashed
|
||||||
|
│ ├─ health.update_crashed()
|
||||||
|
│ └─ health.update_restart_attempt()
|
||||||
|
│ ├─ If restart_count > max_restarts: Give up
|
||||||
|
│ └─ Else: Restart display (loop back to start_display_for_event)
|
||||||
|
└─ Log to monitoring.log at each step
|
||||||
|
```
|
||||||
|
|
||||||
|
**Restart Logic**:
|
||||||
|
- Max 3 restart attempts per event
|
||||||
|
- Restarts only if same event still active
|
||||||
|
- Graceful exit (code 0) for Impressive auto-quit presentations is treated as normal
|
||||||
|
- All crashes logged to monitoring.log with context
|
||||||
|
|
||||||
|
### 5. MQTT Health and Log Topics
|
||||||
|
|
||||||
|
**simclient.py** - New functions:
|
||||||
|
|
||||||
|
**`read_health_state()`**
|
||||||
|
- Reads `src/current_process_health.json` written by display_manager
|
||||||
|
- Returns dict or None if no active process
|
||||||
|
|
||||||
|
**`publish_health_message(client, client_id)`**
|
||||||
|
- Topic: `infoscreen/{uuid}/health`
|
||||||
|
- QoS: 1 (reliable)
|
||||||
|
- Payload:
|
||||||
|
```json
|
||||||
|
{
|
||||||
|
"timestamp": "2026-03-11T10:30:45.123456+00:00",
|
||||||
|
"expected_state": {
|
||||||
|
"event_id": "event_123"
|
||||||
|
},
|
||||||
|
"actual_state": {
|
||||||
|
"process": "impressive",
|
||||||
|
"pid": 1234,
|
||||||
|
"status": "running"
|
||||||
|
}
|
||||||
|
}
|
||||||
|
```
|
||||||
|
|
||||||
|
**`publish_log_message(client, client_id, level, message, context)`**
|
||||||
|
- Topics: `infoscreen/{uuid}/logs/error` or `infoscreen/{uuid}/logs/warn`
|
||||||
|
- QoS: 1 (reliable)
|
||||||
|
- Log level filtering (only ERROR/WARN sent unless DEBUG_MODE=1)
|
||||||
|
- Payload:
|
||||||
|
```json
|
||||||
|
{
|
||||||
|
"timestamp": "2026-03-11T10:30:45.123456+00:00",
|
||||||
|
"message": "Process started: event_id=123 event_type=presentation process=impressive pid=1234",
|
||||||
|
"context": {
|
||||||
|
"event_id": "event_123",
|
||||||
|
"process": "impressive",
|
||||||
|
"event_type": "presentation"
|
||||||
|
}
|
||||||
|
}
|
||||||
|
```
|
||||||
|
|
||||||
|
**Enhanced Dashboard Heartbeat**:
|
||||||
|
- Topic: `infoscreen/{uuid}/dashboard`
|
||||||
|
- Now includes `process_health` block with event_id, process name, status, restart count
|
||||||
|
|
||||||
|
### 6. Integration Points
|
||||||
|
|
||||||
|
**Existing Features Preserved**:
|
||||||
|
- ✅ Impressive PDF presentations with auto-advance and loop
|
||||||
|
- ✅ Chromium website display with auto-scroll injection
|
||||||
|
- ✅ VLC video playback (python-vlc preferred, binary fallback)
|
||||||
|
- ✅ Screenshot capture and transmission
|
||||||
|
- ✅ HDMI-CEC TV control
|
||||||
|
- ✅ Two-process architecture
|
||||||
|
|
||||||
|
**New Integration Points**:
|
||||||
|
|
||||||
|
| File | Function | Change |
|
||||||
|
|------|----------|--------|
|
||||||
|
| display_manager.py | `__init__()` | Initialize `ProcessHealthState()` |
|
||||||
|
| display_manager.py | `start_presentation()` | Call `health.update_running()` with impressive |
|
||||||
|
| display_manager.py | `start_video()` | Call `health.update_running()` with vlc |
|
||||||
|
| display_manager.py | `start_webpage()` | Call `health.update_running()` with chromium |
|
||||||
|
| display_manager.py | `process_events()` | Detect crashes, call `health.update_crashed()` and `update_restart_attempt()` |
|
||||||
|
| display_manager.py | `stop_current_display()` | Call `health.update_stopped()` |
|
||||||
|
| simclient.py | `screenshot_service_thread()` | (No changes to interval) |
|
||||||
|
| simclient.py | Main heartbeat loop | Call `publish_health_message()` after successful heartbeat |
|
||||||
|
| simclient.py | `send_screenshot_heartbeat()` | Read health state and include in dashboard payload |
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Logging Hierarchy
|
||||||
|
|
||||||
|
### Local Rotating Files (5 × 5 MB)
|
||||||
|
|
||||||
|
**`logs/display_manager.log`** (existing - updated):
|
||||||
|
- Display event processing
|
||||||
|
- Process lifecycle (start/stop)
|
||||||
|
- HDMI-CEC operations
|
||||||
|
- Presentation status
|
||||||
|
- Video/website startup
|
||||||
|
|
||||||
|
**`logs/simclient.log`** (existing - updated):
|
||||||
|
- MQTT connection/reconnection
|
||||||
|
- Discovery and heartbeat
|
||||||
|
- File downloads
|
||||||
|
- Group membership changes
|
||||||
|
- Dashboard payload info
|
||||||
|
|
||||||
|
**`logs/monitoring.log`** (NEW):
|
||||||
|
- Process health events (start, crash, restart, stop)
|
||||||
|
- Both display_manager and simclient write here
|
||||||
|
- Centralized health tracking
|
||||||
|
- Technician-focused: "What happened to the processes?"
|
||||||
|
|
||||||
|
```
|
||||||
|
# Example monitoring.log entries:
|
||||||
|
2026-03-11 10:30:45 [INFO] Process started: event_id=event_123 event_type=presentation process=impressive pid=1234
|
||||||
|
2026-03-11 10:35:20 [WARNING] Process crashed: event_id=event_123 event_type=presentation process=impressive restart_count=0/3
|
||||||
|
2026-03-11 10:35:20 [WARNING] Restarting process: attempt 1/3 for impressive
|
||||||
|
2026-03-11 10:35:25 [INFO] Process started: event_id=event_123 event_type=presentation process=impressive pid=1245
|
||||||
|
```
|
||||||
|
|
||||||
|
### MQTT Transmission (Selective)
|
||||||
|
|
||||||
|
**Always sent** (when error occurs):
|
||||||
|
- `infoscreen/{uuid}/logs/error` - Critical failures
|
||||||
|
- `infoscreen/{uuid}/logs/warn` - Restarts, crashes, missing binaries
|
||||||
|
|
||||||
|
**Development mode only** (if DEBUG_MODE=1):
|
||||||
|
- `infoscreen/{uuid}/logs/info` - Event start/stop, process running status
|
||||||
|
|
||||||
|
**Never sent**:
|
||||||
|
- DEBUG messages (local-only debug details)
|
||||||
|
- INFO messages in production
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Environment Variables
|
||||||
|
|
||||||
|
No new required variables. Existing configuration supports monitoring:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
# Existing (unchanged):
|
||||||
|
ENV=development|production
|
||||||
|
DEBUG_MODE=0|1 # Enables INFO logs to MQTT
|
||||||
|
LOG_LEVEL=DEBUG|INFO|WARNING|ERROR # Local log verbosity
|
||||||
|
HEARTBEAT_INTERVAL=5|60 # seconds
|
||||||
|
SCREENSHOT_INTERVAL=30|300 # seconds (display_manager_screenshot_capture)
|
||||||
|
|
||||||
|
# Recommended for monitoring:
|
||||||
|
SCREENSHOT_CAPTURE_INTERVAL=30 # How often display_manager captures screenshots
|
||||||
|
SCREENSHOT_MAX_WIDTH=800 # Downscale for bandwidth
|
||||||
|
SCREENSHOT_JPEG_QUALITY=70 # Balance quality/size
|
||||||
|
|
||||||
|
# File server (if different from MQTT broker):
|
||||||
|
FILE_SERVER_HOST=192.168.1.100
|
||||||
|
FILE_SERVER_PORT=8000
|
||||||
|
FILE_SERVER_SCHEME=http
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Testing Validation
|
||||||
|
|
||||||
|
### System-Level Test Sequence
|
||||||
|
|
||||||
|
**1. Start Services**:
|
||||||
|
```bash
|
||||||
|
# Terminal 1: Display Manager
|
||||||
|
./scripts/start-display-manager.sh
|
||||||
|
|
||||||
|
# Terminal 2: MQTT Client
|
||||||
|
./scripts/start-dev.sh
|
||||||
|
|
||||||
|
# Terminal 3: Monitor logs
|
||||||
|
tail -f logs/monitoring.log
|
||||||
|
```
|
||||||
|
|
||||||
|
**2. Trigger Each Event Type**:
|
||||||
|
```bash
|
||||||
|
# Via test menu or MQTT publish:
|
||||||
|
./scripts/test-display-manager.sh # Options 1-3 trigger events
|
||||||
|
```
|
||||||
|
|
||||||
|
**3. Verify Health State File**:
|
||||||
|
```bash
|
||||||
|
# Check health state gets written immediately
|
||||||
|
cat src/current_process_health.json
|
||||||
|
# Should show: event_id, event_type, current_process (impressive/chromium/vlc), process_status=running
|
||||||
|
```
|
||||||
|
|
||||||
|
**4. Check MQTT Topics**:
|
||||||
|
```bash
|
||||||
|
# Monitor health messages:
|
||||||
|
mosquitto_sub -h localhost -t "infoscreen/+/health" -v
|
||||||
|
|
||||||
|
# Monitor log messages:
|
||||||
|
mosquitto_sub -h localhost -t "infoscreen/+/logs/#" -v
|
||||||
|
|
||||||
|
# Monitor dashboard heartbeat:
|
||||||
|
mosquitto_sub -h localhost -t "infoscreen/+/dashboard" -v | head -c 500 && echo "..."
|
||||||
|
```
|
||||||
|
|
||||||
|
**5. Simulate Process Crash**:
|
||||||
|
```bash
|
||||||
|
# Find impressive/chromium/vlc PID:
|
||||||
|
ps aux | grep -E 'impressive|chromium|vlc'
|
||||||
|
|
||||||
|
# Kill process:
|
||||||
|
kill -9 <pid>
|
||||||
|
|
||||||
|
# Watch monitoring.log for crash detection and restart
|
||||||
|
tail -f logs/monitoring.log
|
||||||
|
# Should see: [WARNING] Process crashed... [WARNING] Restarting process...
|
||||||
|
```
|
||||||
|
|
||||||
|
**6. Verify Server Integration**:
|
||||||
|
```bash
|
||||||
|
# Server receives health messages:
|
||||||
|
sqlite3 infoscreen.db "SELECT process_status, current_process, restart_count FROM clients WHERE uuid='...';"
|
||||||
|
# Should show latest status from health message
|
||||||
|
|
||||||
|
# Server receives logs:
|
||||||
|
sqlite3 infoscreen.db "SELECT level, message FROM client_logs WHERE client_uuid='...' ORDER BY timestamp DESC LIMIT 10;"
|
||||||
|
# Should show ERROR/WARN entries from crashes/restarts
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Troubleshooting
|
||||||
|
|
||||||
|
### Health State File Not Created
|
||||||
|
|
||||||
|
**Symptom**: `src/current_process_health.json` missing
|
||||||
|
**Causes**:
|
||||||
|
- No event active (file only created when display starts)
|
||||||
|
- display_manager not running
|
||||||
|
|
||||||
|
**Check**:
|
||||||
|
```bash
|
||||||
|
ps aux | grep display_manager
|
||||||
|
tail -f logs/display_manager.log | grep "Process started\|Process stopped"
|
||||||
|
```
|
||||||
|
|
||||||
|
### MQTT Health Messages Not Arriving
|
||||||
|
|
||||||
|
**Symptom**: No health messages on `infoscreen/{uuid}/health` topic
|
||||||
|
**Causes**:
|
||||||
|
- simclient not reading health state file
|
||||||
|
- MQTT connection dropped
|
||||||
|
- Health update function not called
|
||||||
|
|
||||||
|
**Check**:
|
||||||
|
```bash
|
||||||
|
# Check health file exists and is recent:
|
||||||
|
ls -l src/current_process_health.json
|
||||||
|
stat src/current_process_health.json | grep Modify
|
||||||
|
|
||||||
|
# Monitor simclient logs:
|
||||||
|
tail -f logs/simclient.log | grep -E "Health|heartbeat|publish"
|
||||||
|
|
||||||
|
# Verify MQTT connection:
|
||||||
|
mosquitto_sub -h localhost -t "infoscreen/+/heartbeat" -v
|
||||||
|
```
|
||||||
|
|
||||||
|
### Restart Loop (Process Keeps Crashing)
|
||||||
|
|
||||||
|
**Symptom**: monitoring.log shows repeated crashes and restarts
|
||||||
|
**Check**:
|
||||||
|
```bash
|
||||||
|
# Read last log lines of the process (stored by display_manager):
|
||||||
|
tail -f logs/impressive.out.log # for presentations
|
||||||
|
tail -f logs/browser.out.log # for websites
|
||||||
|
tail -f logs/video_player.out.log # for videos
|
||||||
|
```
|
||||||
|
|
||||||
|
**Common Causes**:
|
||||||
|
- Missing binary (impressive not installed, chromium not found, vlc not available)
|
||||||
|
- Corrupt presentation file
|
||||||
|
- Invalid URL for website
|
||||||
|
- Insufficient permissions for screenshots
|
||||||
|
|
||||||
|
### Log Messages Not Reaching Server
|
||||||
|
|
||||||
|
**Symptom**: client_logs table in server DB is empty
|
||||||
|
**Causes**:
|
||||||
|
- Log level filtering: INFO messages in production are local-only
|
||||||
|
- Logs only published on ERROR/WARN
|
||||||
|
- MQTT publish failing silently
|
||||||
|
|
||||||
|
**Check**:
|
||||||
|
```bash
|
||||||
|
# Force DEBUG_MODE to see all logs:
|
||||||
|
export DEBUG_MODE=1
|
||||||
|
export LOG_LEVEL=DEBUG
|
||||||
|
# Restart simclient and trigger event
|
||||||
|
|
||||||
|
# Monitor local logs first:
|
||||||
|
tail -f logs/monitoring.log | grep -i error
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Performance Considerations
|
||||||
|
|
||||||
|
**Bandwidth per Client**:
|
||||||
|
- Health message: ~200 bytes per heartbeat interval (every 5-60s)
|
||||||
|
- Screenshot heartbeat: ~50-100 KB (every 30-300s)
|
||||||
|
- Log messages: ~100-500 bytes per crash/error (rare)
|
||||||
|
- **Total**: ~0.5-2 MB/day per device (very minimal)
|
||||||
|
|
||||||
|
**Disk Space on Client**:
|
||||||
|
- Monitoring logs: 5 files × 5 MB = 25 MB max
|
||||||
|
- Display manager logs: 5 files × 2 MB = 10 MB max
|
||||||
|
- MQTT client logs: 5 files × 2 MB = 10 MB max
|
||||||
|
- Screenshots: 20 files × 50-100 KB = 1-2 MB max
|
||||||
|
- **Total**: ~50 MB max (typical for Raspberry Pi USB/SSD)
|
||||||
|
|
||||||
|
**Rotation Strategy**:
|
||||||
|
- Old files automatically deleted when size limit reached
|
||||||
|
- Technician can SSH and `tail -f` any time
|
||||||
|
- No database overhead (file-based rotation is minimal CPU)
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Integration with Server (Phase 2)
|
||||||
|
|
||||||
|
The client implementation sends data to the server's Phase 2 endpoints:
|
||||||
|
|
||||||
|
**Expected Server Implementation** (from CLIENT_MONITORING_SETUP.md):
|
||||||
|
|
||||||
|
1. **MQTT Listener** receives and stores:
|
||||||
|
- `infoscreen/{uuid}/logs/error`, `/logs/warn`, `/logs/info`
|
||||||
|
- `infoscreen/{uuid}/health` messages
|
||||||
|
- Updates `clients` table with health fields
|
||||||
|
|
||||||
|
2. **Database Tables**:
|
||||||
|
- `clients.process_status`: running/crashed/starting/stopped
|
||||||
|
- `clients.current_process`: impressive/chromium/vlc/None
|
||||||
|
- `clients.process_pid`: PID value
|
||||||
|
- `clients.current_event_id`: Active event
|
||||||
|
- `client_logs`: table stores logs with level/message/context
|
||||||
|
|
||||||
|
3. **API Endpoints**:
|
||||||
|
- `GET /api/client-logs/{uuid}/logs?level=ERROR&limit=50`
|
||||||
|
- `GET /api/client-logs/summary` (errors/warnings across all clients)
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Summary of Changes
|
||||||
|
|
||||||
|
### Files Modified
|
||||||
|
|
||||||
|
1. **`src/display_manager.py`**:
|
||||||
|
- Added `psutil` import for future process monitoring
|
||||||
|
- Added `ProcessHealthState` class (60 lines)
|
||||||
|
- Added monitoring logger setup (8 lines)
|
||||||
|
- Added `health.update_running()` calls in `start_presentation()`, `start_video()`, `start_webpage()`
|
||||||
|
- Added crash detection and restart logic in `process_events()`
|
||||||
|
- Added `health.update_stopped()` in `stop_current_display()`
|
||||||
|
|
||||||
|
2. **`src/simclient.py`**:
|
||||||
|
- Added `timezone` import
|
||||||
|
- Added monitoring logger setup (8 lines)
|
||||||
|
- Added `read_health_state()` function
|
||||||
|
- Added `publish_health_message()` function
|
||||||
|
- Added `publish_log_message()` function (with level filtering)
|
||||||
|
- Updated `send_screenshot_heartbeat()` to include health data
|
||||||
|
- Updated heartbeat loop to call `publish_health_message()`
|
||||||
|
|
||||||
|
### Files Created
|
||||||
|
|
||||||
|
1. **`src/current_process_health.json`** (at runtime):
|
||||||
|
- Bridge file between display_manager and simclient
|
||||||
|
- Shared volume compatible (works in container setup)
|
||||||
|
|
||||||
|
2. **`logs/monitoring.log`** (at runtime):
|
||||||
|
- New rotating log file (5 × 5MB)
|
||||||
|
- Health events from both processes
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Next Steps
|
||||||
|
|
||||||
|
1. **Deploy to test client** and run validation sequence above
|
||||||
|
2. **Deploy server Phase 2** (if not yet done) to receive health/log messages
|
||||||
|
3. **Verify database updates** in server-side `clients` and `client_logs` tables
|
||||||
|
4. **Test dashboard UI** (Phase 4) to display health indicators
|
||||||
|
5. **Configure alerting** (email/Slack) for ERROR level messages
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
**Implementation Date**: 11. März 2026
|
||||||
|
**Part of**: Infoscreen 2025 Client Monitoring System
|
||||||
|
**Status**: Production Ready (with server Phase 2 integration)
|
||||||
36
README.md
36
README.md
@@ -11,6 +11,7 @@ Digital signage system for Raspberry Pi that displays presentations, videos, and
|
|||||||
- **MQTT Integration** - Real-time event management from central server
|
- **MQTT Integration** - Real-time event management from central server
|
||||||
- **Group Management** - Organize clients into groups for targeted content
|
- **Group Management** - Organize clients into groups for targeted content
|
||||||
- **Heartbeat Monitoring** - Regular status updates and screenshot dashboard
|
- **Heartbeat Monitoring** - Regular status updates and screenshot dashboard
|
||||||
|
- **Client Process Monitoring** - Health-state bridge, crash/restart tracking, and monitoring log
|
||||||
- **Screenshot Dashboard** - Automatic screen capture with Wayland/X11 support, client-side compression
|
- **Screenshot Dashboard** - Automatic screen capture with Wayland/X11 support, client-side compression
|
||||||
- **Multi-Content Support** - Presentations, videos, and web pages
|
- **Multi-Content Support** - Presentations, videos, and web pages
|
||||||
- **Kiosk Mode** - Full-screen display with automatic startup
|
- **Kiosk Mode** - Full-screen display with automatic startup
|
||||||
@@ -217,7 +218,13 @@ Notes:
|
|||||||
- `loop` (boolean): loop playback indefinitely.
|
- `loop` (boolean): loop playback indefinitely.
|
||||||
- `volume` (float): 0.0–1.0 (mapped internally to VLC's 0–100 volume scale).
|
- `volume` (float): 0.0–1.0 (mapped internally to VLC's 0–100 volume scale).
|
||||||
- If `python-vlc` is not installed, the Display Manager will fall back to launching the external `vlc` binary.
|
- If `python-vlc` is not installed, the Display Manager will fall back to launching the external `vlc` binary.
|
||||||
- The manager attempts to make the player window fullscreen and remove window decorations. For a truly panel-free fullscreen (no taskbar), run the Display Manager inside a minimal kiosk X session or a dedicated user session without a desktop panel (see the kiosk notes below).
|
- Fullscreen behavior:
|
||||||
|
- External VLC fallback uses `--fullscreen`.
|
||||||
|
- `python-vlc` mode enforces fullscreen on startup and retries fullscreen toggling briefly because video outputs may attach asynchronously.
|
||||||
|
- For a truly panel-free fullscreen (no taskbar), run the Display Manager inside a minimal kiosk X session or a dedicated user session without a desktop panel.
|
||||||
|
- Monitoring PID behavior:
|
||||||
|
- External VLC fallback reports the external `vlc` process PID.
|
||||||
|
- `python-vlc` mode is in-process, so monitoring reports the `display_manager.py` runtime PID.
|
||||||
|
|
||||||
## 🌐 Web Events
|
## 🌐 Web Events
|
||||||
|
|
||||||
@@ -380,6 +387,9 @@ The MQTT client ([src/simclient.py](src/simclient.py)) downloads presentation fi
|
|||||||
- `infoscreen/discovery` - Initial client announcement
|
- `infoscreen/discovery` - Initial client announcement
|
||||||
- `infoscreen/{client_id}/heartbeat` - Regular status updates
|
- `infoscreen/{client_id}/heartbeat` - Regular status updates
|
||||||
- `infoscreen/{client_id}/dashboard` - Screenshot images (base64)
|
- `infoscreen/{client_id}/dashboard` - Screenshot images (base64)
|
||||||
|
- `infoscreen/{client_id}/health` - Process health state (`event_id`, process, pid, status)
|
||||||
|
- `infoscreen/{client_id}/logs/error` - Forwarded client error logs
|
||||||
|
- `infoscreen/{client_id}/logs/warn` - Forwarded client warning logs
|
||||||
|
|
||||||
#### Server → Client
|
#### Server → Client
|
||||||
- `infoscreen/{client_id}/discovery_ack` - Server response with client ID
|
- `infoscreen/{client_id}/discovery_ack` - Server response with client ID
|
||||||
@@ -493,6 +503,30 @@ This is the fastest workaround if hardware decode is not required or not availab
|
|||||||
./scripts/test-mqtt.sh
|
./scripts/test-mqtt.sh
|
||||||
```
|
```
|
||||||
|
|
||||||
|
### Monitoring and UTC timestamps
|
||||||
|
|
||||||
|
Client-side monitoring is implemented with a health-state bridge between `display_manager.py` and `simclient.py`.
|
||||||
|
|
||||||
|
- Health bridge file: `src/current_process_health.json`
|
||||||
|
- Local monitoring log: `logs/monitoring.log`
|
||||||
|
- Process states: `running`, `crashed`, `stopped`
|
||||||
|
- Restart tracking: bounded restart attempts per active event
|
||||||
|
|
||||||
|
UTC timestamp policy:
|
||||||
|
|
||||||
|
- `display_manager.log`, `simclient.log`, and `monitoring.log` are written in UTC (`...Z`)
|
||||||
|
- MQTT payload timestamps (heartbeat/dashboard/health/log messages) are UTC ISO timestamps
|
||||||
|
- Screenshot metadata timestamps are UTC ISO timestamps
|
||||||
|
|
||||||
|
This prevents daylight-saving and timezone drift issues across clients.
|
||||||
|
|
||||||
|
### VLC/PulseAudio warnings in remote sessions
|
||||||
|
|
||||||
|
Warnings such as `pulse audio output error: overflow, flushing` can appear when testing through remote desktop/audio forwarding (for example, NoMachine) or virtual/dummy display setups.
|
||||||
|
|
||||||
|
- If playback and audio are stable on a real HDMI display, this is usually non-fatal.
|
||||||
|
- If warnings appear only in remote sessions, treat them as environment-related rather than a core video playback bug.
|
||||||
|
|
||||||
### Screenshots not uploading
|
### Screenshots not uploading
|
||||||
|
|
||||||
**Check which session type you're running:**
|
**Check which session type you're running:**
|
||||||
|
|||||||
@@ -23,6 +23,7 @@ from logging.handlers import RotatingFileHandler
|
|||||||
from typing import Optional, Dict, List, IO
|
from typing import Optional, Dict, List, IO
|
||||||
from urllib.parse import urlparse
|
from urllib.parse import urlparse
|
||||||
import requests
|
import requests
|
||||||
|
import psutil
|
||||||
import json as _json
|
import json as _json
|
||||||
import threading
|
import threading
|
||||||
import time as _time
|
import time as _time
|
||||||
@@ -67,12 +68,15 @@ os.makedirs(os.path.dirname(LOG_PATH), exist_ok=True)
|
|||||||
|
|
||||||
logging.basicConfig(
|
logging.basicConfig(
|
||||||
level=getattr(logging, LOG_LEVEL.upper(), logging.INFO),
|
level=getattr(logging, LOG_LEVEL.upper(), logging.INFO),
|
||||||
format="%(asctime)s [%(levelname)s] %(message)s",
|
format="%(asctime)s.%(msecs)03dZ [%(levelname)s] %(message)s",
|
||||||
|
datefmt="%Y-%m-%dT%H:%M:%S",
|
||||||
handlers=[
|
handlers=[
|
||||||
RotatingFileHandler(LOG_PATH, maxBytes=2*1024*1024, backupCount=5),
|
RotatingFileHandler(LOG_PATH, maxBytes=2*1024*1024, backupCount=5),
|
||||||
logging.StreamHandler()
|
logging.StreamHandler()
|
||||||
]
|
]
|
||||||
)
|
)
|
||||||
|
# Force all logging timestamps to UTC (affects %(asctime)s in all formatters).
|
||||||
|
logging.Formatter.converter = time.gmtime
|
||||||
|
|
||||||
# Log CEC mode after logging is configured
|
# Log CEC mode after logging is configured
|
||||||
if ENV == "development":
|
if ENV == "development":
|
||||||
@@ -82,6 +86,90 @@ elif CEC_ENABLED:
|
|||||||
else:
|
else:
|
||||||
logging.info("[CEC] HDMI-CEC disabled in configuration")
|
logging.info("[CEC] HDMI-CEC disabled in configuration")
|
||||||
|
|
||||||
|
# Setup monitoring logger (separate from main display_manager.log for health/crash tracking)
|
||||||
|
MONITORING_LOG_PATH = os.path.join(os.path.dirname(__file__), "..", "logs", "monitoring.log")
|
||||||
|
os.makedirs(os.path.dirname(MONITORING_LOG_PATH), exist_ok=True)
|
||||||
|
monitoring_logger = logging.getLogger("monitoring")
|
||||||
|
monitoring_logger.setLevel(getattr(logging, LOG_LEVEL.upper(), logging.INFO))
|
||||||
|
monitoring_handler = RotatingFileHandler(MONITORING_LOG_PATH, maxBytes=5*1024*1024, backupCount=5)
|
||||||
|
monitoring_handler.setFormatter(logging.Formatter("%(asctime)s.%(msecs)03dZ [%(levelname)s] %(message)s", "%Y-%m-%dT%H:%M:%S"))
|
||||||
|
monitoring_logger.addHandler(monitoring_handler)
|
||||||
|
monitoring_logger.propagate = False # Don't duplicate to main logger
|
||||||
|
logging.info(f"Monitoring logger initialized: {MONITORING_LOG_PATH}")
|
||||||
|
|
||||||
|
# Health state file (shared bridge between display_manager and simclient)
|
||||||
|
HEALTH_STATE_FILE = os.path.join(os.path.dirname(__file__), "current_process_health.json")
|
||||||
|
|
||||||
|
|
||||||
|
class ProcessHealthState:
|
||||||
|
"""Track and persist process health state for monitoring integration"""
|
||||||
|
|
||||||
|
def __init__(self):
|
||||||
|
self.event_id = None
|
||||||
|
self.event_type = None
|
||||||
|
self.process_name = None
|
||||||
|
self.process_pid = None
|
||||||
|
self.status = "stopped" # running, crashed, starting, stopped
|
||||||
|
self.restart_count = 0
|
||||||
|
self.max_restarts = 3
|
||||||
|
self.last_update = datetime.now(timezone.utc).isoformat()
|
||||||
|
|
||||||
|
def to_dict(self) -> Dict:
|
||||||
|
return {
|
||||||
|
"event_id": self.event_id,
|
||||||
|
"event_type": self.event_type,
|
||||||
|
"current_process": self.process_name,
|
||||||
|
"process_pid": self.process_pid,
|
||||||
|
"process_status": self.status,
|
||||||
|
"restart_count": self.restart_count,
|
||||||
|
"timestamp": datetime.now(timezone.utc).isoformat()
|
||||||
|
}
|
||||||
|
|
||||||
|
def save(self):
|
||||||
|
"""Persist health state to file for simclient to read"""
|
||||||
|
try:
|
||||||
|
with open(HEALTH_STATE_FILE, "w") as f:
|
||||||
|
json.dump(self.to_dict(), f, indent=2)
|
||||||
|
except Exception as e:
|
||||||
|
monitoring_logger.error(f"Failed to save health state: {e}")
|
||||||
|
|
||||||
|
def update_running(self, event_id: str, event_type: str, process_name: str, pid: Optional[int]):
|
||||||
|
"""Mark process as running"""
|
||||||
|
self.event_id = event_id
|
||||||
|
self.event_type = event_type
|
||||||
|
self.process_name = process_name
|
||||||
|
self.process_pid = pid
|
||||||
|
self.status = "running"
|
||||||
|
self.restart_count = 0
|
||||||
|
self.save()
|
||||||
|
monitoring_logger.info(f"Process started: event_id={event_id} event_type={event_type} process={process_name} pid={pid}")
|
||||||
|
|
||||||
|
def update_crashed(self):
|
||||||
|
"""Mark process as crashed"""
|
||||||
|
self.status = "crashed"
|
||||||
|
self.save()
|
||||||
|
monitoring_logger.warning(f"Process crashed: event_id={self.event_id} event_type={self.event_type} process={self.process_name} restart_count={self.restart_count}/{self.max_restarts}")
|
||||||
|
|
||||||
|
def update_restart_attempt(self):
|
||||||
|
"""Increment restart counter and check if max exceeded"""
|
||||||
|
self.restart_count += 1
|
||||||
|
self.save()
|
||||||
|
if self.restart_count <= self.max_restarts:
|
||||||
|
monitoring_logger.warning(f"Restarting process: attempt {self.restart_count}/{self.max_restarts} for {self.process_name}")
|
||||||
|
else:
|
||||||
|
monitoring_logger.error(f"Max restart attempts exceeded for {self.process_name}: event_id={self.event_id}")
|
||||||
|
|
||||||
|
def update_stopped(self):
|
||||||
|
"""Mark process as stopped (normal event end)"""
|
||||||
|
self.status = "stopped"
|
||||||
|
self.event_id = None
|
||||||
|
self.event_type = None
|
||||||
|
self.process_name = None
|
||||||
|
self.process_pid = None
|
||||||
|
self.restart_count = 0
|
||||||
|
self.save()
|
||||||
|
monitoring_logger.info("Process stopped (event ended or no active event)")
|
||||||
|
|
||||||
|
|
||||||
class HDMICECController:
|
class HDMICECController:
|
||||||
"""Controls HDMI-CEC to turn TV on/off automatically
|
"""Controls HDMI-CEC to turn TV on/off automatically
|
||||||
@@ -327,7 +415,7 @@ class DisplayProcess:
|
|||||||
self.player = player
|
self.player = player
|
||||||
self.event_type = event_type
|
self.event_type = event_type
|
||||||
self.event_id = event_id
|
self.event_id = event_id
|
||||||
self.start_time = datetime.now()
|
self.start_time = datetime.now(timezone.utc)
|
||||||
self.log_file = log_file
|
self.log_file = log_file
|
||||||
self.log_path = log_path
|
self.log_path = log_path
|
||||||
|
|
||||||
@@ -473,6 +561,9 @@ class DisplayManager:
|
|||||||
self.last_file_mtime: Optional[float] = None
|
self.last_file_mtime: Optional[float] = None
|
||||||
self.running = True
|
self.running = True
|
||||||
|
|
||||||
|
# Initialize health state tracking for process monitoring
|
||||||
|
self.health = ProcessHealthState()
|
||||||
|
|
||||||
# Initialize HDMI-CEC controller
|
# Initialize HDMI-CEC controller
|
||||||
self.cec = HDMICECController(
|
self.cec = HDMICECController(
|
||||||
enabled=CEC_ENABLED,
|
enabled=CEC_ENABLED,
|
||||||
@@ -630,6 +721,8 @@ class DisplayManager:
|
|||||||
if self.current_process.is_running():
|
if self.current_process.is_running():
|
||||||
self.current_process.terminate(force=True)
|
self.current_process.terminate(force=True)
|
||||||
|
|
||||||
|
# Update health state to reflect process stopped
|
||||||
|
self.health.update_stopped()
|
||||||
self.current_process = None
|
self.current_process = None
|
||||||
self.current_event_data = None
|
self.current_event_data = None
|
||||||
|
|
||||||
@@ -852,6 +945,9 @@ class DisplayManager:
|
|||||||
event_id = self.get_event_identifier(event)
|
event_id = self.get_event_identifier(event)
|
||||||
logging.info(f"Presentation started with PID: {process.pid}")
|
logging.info(f"Presentation started with PID: {process.pid}")
|
||||||
|
|
||||||
|
# Update health state for monitoring
|
||||||
|
self.health.update_running(event_id, 'presentation', 'impressive', process.pid)
|
||||||
|
|
||||||
return DisplayProcess(process, 'presentation', event_id, log_file=impressive_log, log_path=impressive_log_path)
|
return DisplayProcess(process, 'presentation', event_id, log_file=impressive_log, log_path=impressive_log_path)
|
||||||
|
|
||||||
except Exception as e:
|
except Exception as e:
|
||||||
@@ -904,6 +1000,8 @@ class DisplayManager:
|
|||||||
)
|
)
|
||||||
event_id = self.get_event_identifier(event)
|
event_id = self.get_event_identifier(event)
|
||||||
logging.info(f"Video started with PID: {process.pid} (external vlc)")
|
logging.info(f"Video started with PID: {process.pid} (external vlc)")
|
||||||
|
# Update health state for monitoring
|
||||||
|
self.health.update_running(event_id, 'video', 'vlc', process.pid)
|
||||||
return DisplayProcess(process=process, event_type='video', event_id=event_id, log_file=video_log, log_path=video_log_path)
|
return DisplayProcess(process=process, event_type='video', event_id=event_id, log_file=video_log, log_path=video_log_path)
|
||||||
else:
|
else:
|
||||||
logging.error("No video player found (python-vlc or vlc binary)")
|
logging.error("No video player found (python-vlc or vlc binary)")
|
||||||
@@ -911,7 +1009,24 @@ class DisplayManager:
|
|||||||
|
|
||||||
# Use libvlc via python-vlc
|
# Use libvlc via python-vlc
|
||||||
try:
|
try:
|
||||||
instance = vlc.Instance()
|
# Keep python-vlc behavior aligned with external vlc fullscreen mode.
|
||||||
|
instance = vlc.Instance('--fullscreen', '--no-video-title-show')
|
||||||
|
|
||||||
|
def _force_fullscreen(player_obj, label: str):
|
||||||
|
"""Retry fullscreen toggle because video outputs may attach asynchronously."""
|
||||||
|
if not player_obj:
|
||||||
|
return
|
||||||
|
|
||||||
|
def _worker():
|
||||||
|
for delay in (0.0, 0.4, 1.0):
|
||||||
|
if delay > 0:
|
||||||
|
time.sleep(delay)
|
||||||
|
try:
|
||||||
|
player_obj.set_fullscreen(True)
|
||||||
|
except Exception as e:
|
||||||
|
logging.debug(f"Could not force fullscreen for {label}: {e}")
|
||||||
|
|
||||||
|
threading.Thread(target=_worker, daemon=True).start()
|
||||||
|
|
||||||
autoplay = bool(video.get('autoplay', True))
|
autoplay = bool(video.get('autoplay', True))
|
||||||
loop_flag = bool(video.get('loop', False))
|
loop_flag = bool(video.get('loop', False))
|
||||||
@@ -957,10 +1072,14 @@ class DisplayManager:
|
|||||||
if autoplay:
|
if autoplay:
|
||||||
try:
|
try:
|
||||||
mlp.play()
|
mlp.play()
|
||||||
|
_force_fullscreen(mp, 'python-vlc MediaListPlayer')
|
||||||
except Exception as e:
|
except Exception as e:
|
||||||
logging.error(f"Failed to play media list: {e}")
|
logging.error(f"Failed to play media list: {e}")
|
||||||
event_id = self.get_event_identifier(event)
|
event_id = self.get_event_identifier(event)
|
||||||
logging.info("Video started via python-vlc (MediaListPlayer)" )
|
runtime_pid = os.getpid()
|
||||||
|
logging.info(f"Video started via python-vlc (MediaListPlayer), runtime PID: {runtime_pid}")
|
||||||
|
# python-vlc runs in-process (no external vlc child), so publish this process PID.
|
||||||
|
self.health.update_running(event_id, 'video', 'vlc', runtime_pid)
|
||||||
return DisplayProcess(process=None, event_type='video', event_id=event_id, log_file=None, log_path=None, player=mlp)
|
return DisplayProcess(process=None, event_type='video', event_id=event_id, log_file=None, log_path=None, player=mlp)
|
||||||
|
|
||||||
else:
|
else:
|
||||||
@@ -976,11 +1095,15 @@ class DisplayManager:
|
|||||||
if autoplay:
|
if autoplay:
|
||||||
try:
|
try:
|
||||||
mp.play()
|
mp.play()
|
||||||
|
_force_fullscreen(mp, 'python-vlc MediaPlayer')
|
||||||
except Exception as e:
|
except Exception as e:
|
||||||
logging.error(f"Failed to start media player: {e}")
|
logging.error(f"Failed to start media player: {e}")
|
||||||
|
|
||||||
event_id = self.get_event_identifier(event)
|
event_id = self.get_event_identifier(event)
|
||||||
logging.info("Video started via python-vlc (MediaPlayer)")
|
runtime_pid = os.getpid()
|
||||||
|
logging.info(f"Video started via python-vlc (MediaPlayer), runtime PID: {runtime_pid}")
|
||||||
|
# python-vlc runs in-process (no external vlc child), so publish this process PID.
|
||||||
|
self.health.update_running(event_id, 'video', 'vlc', runtime_pid)
|
||||||
return DisplayProcess(process=None, event_type='video', event_id=event_id, log_file=None, log_path=None, player=mp)
|
return DisplayProcess(process=None, event_type='video', event_id=event_id, log_file=None, log_path=None, player=mp)
|
||||||
|
|
||||||
except Exception as e:
|
except Exception as e:
|
||||||
@@ -1138,6 +1261,10 @@ class DisplayManager:
|
|||||||
event_id = self.get_event_identifier(event)
|
event_id = self.get_event_identifier(event)
|
||||||
logging.info(f"Webpage started with PID: {process.pid}")
|
logging.info(f"Webpage started with PID: {process.pid}")
|
||||||
|
|
||||||
|
# Update health state for monitoring (track chromium browser)
|
||||||
|
browser_name = 'chromium-browser' if self._command_exists('chromium-browser') else 'chromium'
|
||||||
|
self.health.update_running(event_id, etype, browser_name, process.pid)
|
||||||
|
|
||||||
# Inject auto-scroll JS via Chrome DevTools Protocol (CDP) if enabled and available
|
# Inject auto-scroll JS via Chrome DevTools Protocol (CDP) if enabled and available
|
||||||
if autoscroll_enabled:
|
if autoscroll_enabled:
|
||||||
try:
|
try:
|
||||||
@@ -1367,9 +1494,20 @@ class DisplayManager:
|
|||||||
if self.current_process.event_type == 'presentation' and exit_code == 0:
|
if self.current_process.event_type == 'presentation' and exit_code == 0:
|
||||||
logging.info("Presentation process ended with exit code 0 (likely normal completion).")
|
logging.info("Presentation process ended with exit code 0 (likely normal completion).")
|
||||||
self.current_process = None
|
self.current_process = None
|
||||||
|
# Update health state to show normal completion
|
||||||
|
self.health.update_stopped()
|
||||||
# Don't turn off TV yet - event might still be active
|
# Don't turn off TV yet - event might still be active
|
||||||
return
|
return
|
||||||
|
|
||||||
|
# Process crashed unexpectedly
|
||||||
|
self.health.update_crashed()
|
||||||
|
self.health.update_restart_attempt()
|
||||||
|
|
||||||
|
if self.health.restart_count > self.health.max_restarts:
|
||||||
|
logging.error(f"Max restart attempts exceeded - giving up on {self.health.process_name}")
|
||||||
|
self.current_process = None
|
||||||
|
return
|
||||||
|
|
||||||
logging.info("Restarting display process...")
|
logging.info("Restarting display process...")
|
||||||
self.current_process = None
|
self.current_process = None
|
||||||
# Don't turn off TV when restarting same event
|
# Don't turn off TV when restarting same event
|
||||||
@@ -1413,10 +1551,8 @@ class DisplayManager:
|
|||||||
|
|
||||||
# Log timezone information for debugging
|
# Log timezone information for debugging
|
||||||
now_utc = datetime.now(timezone.utc)
|
now_utc = datetime.now(timezone.utc)
|
||||||
now_local = datetime.now()
|
|
||||||
logging.info(f"Current time (UTC): {now_utc.strftime('%Y-%m-%d %H:%M:%S %Z')}")
|
logging.info(f"Current time (UTC): {now_utc.strftime('%Y-%m-%d %H:%M:%S %Z')}")
|
||||||
logging.info(f"Current time (Local): {now_local.strftime('%Y-%m-%d %H:%M:%S')}")
|
logging.info("Event times are expected in UTC format")
|
||||||
logging.info("Note: Event times are expected in UTC format")
|
|
||||||
|
|
||||||
while self.running:
|
while self.running:
|
||||||
try:
|
try:
|
||||||
@@ -1469,7 +1605,7 @@ class DisplayManager:
|
|||||||
Maintains a 'latest.jpg' file and rotates older screenshots beyond SCREENSHOT_MAX_FILES.
|
Maintains a 'latest.jpg' file and rotates older screenshots beyond SCREENSHOT_MAX_FILES.
|
||||||
"""
|
"""
|
||||||
try:
|
try:
|
||||||
ts = datetime.now().strftime('%Y%m%d_%H%M%S')
|
ts = datetime.now(timezone.utc).strftime('%Y%m%d_%H%M%S')
|
||||||
raw_path = os.path.join(self.screenshot_dir, f'screenshot_{ts}.png')
|
raw_path = os.path.join(self.screenshot_dir, f'screenshot_{ts}.png')
|
||||||
final_path = os.path.join(self.screenshot_dir, f'screenshot_{ts}.jpg')
|
final_path = os.path.join(self.screenshot_dir, f'screenshot_{ts}.jpg')
|
||||||
|
|
||||||
|
|||||||
112
src/simclient.py
112
src/simclient.py
@@ -15,7 +15,7 @@ import logging
|
|||||||
from dotenv import load_dotenv
|
from dotenv import load_dotenv
|
||||||
import requests
|
import requests
|
||||||
import base64
|
import base64
|
||||||
from datetime import datetime
|
from datetime import datetime, timezone
|
||||||
import threading
|
import threading
|
||||||
from urllib.parse import urlsplit, urlunsplit, unquote
|
from urllib.parse import urlsplit, urlunsplit, unquote
|
||||||
|
|
||||||
@@ -123,9 +123,26 @@ if DEBUG_MODE:
|
|||||||
log_handlers.append(logging.StreamHandler())
|
log_handlers.append(logging.StreamHandler())
|
||||||
logging.basicConfig(
|
logging.basicConfig(
|
||||||
level=getattr(logging, LOG_LEVEL.upper(), logging.INFO),
|
level=getattr(logging, LOG_LEVEL.upper(), logging.INFO),
|
||||||
format="%(asctime)s [%(levelname)s] %(message)s",
|
format="%(asctime)s.%(msecs)03dZ [%(levelname)s] %(message)s",
|
||||||
|
datefmt="%Y-%m-%dT%H:%M:%S",
|
||||||
handlers=log_handlers
|
handlers=log_handlers
|
||||||
)
|
)
|
||||||
|
# Force all logging timestamps to UTC (affects %(asctime)s in all formatters).
|
||||||
|
logging.Formatter.converter = time.gmtime
|
||||||
|
|
||||||
|
# Setup monitoring logger (separate for health/crash events, local rotation only)
|
||||||
|
MONITORING_LOG_PATH = os.path.join(os.path.dirname(__file__), "..", "logs", "monitoring.log")
|
||||||
|
os.makedirs(os.path.dirname(MONITORING_LOG_PATH), exist_ok=True)
|
||||||
|
monitoring_logger = logging.getLogger("monitoring")
|
||||||
|
monitoring_logger.setLevel(getattr(logging, LOG_LEVEL.upper(), logging.INFO))
|
||||||
|
monitoring_handler = RotatingFileHandler(MONITORING_LOG_PATH, maxBytes=5*1024*1024, backupCount=5)
|
||||||
|
monitoring_handler.setFormatter(logging.Formatter("%(asctime)s.%(msecs)03dZ [%(levelname)s] %(message)s", "%Y-%m-%dT%H:%M:%S"))
|
||||||
|
monitoring_logger.addHandler(monitoring_handler)
|
||||||
|
monitoring_logger.propagate = False # Don't duplicate to main logger
|
||||||
|
logging.info(f"Monitoring logger initialized: {MONITORING_LOG_PATH}")
|
||||||
|
|
||||||
|
# Health state file (written by display_manager, read by simclient)
|
||||||
|
HEALTH_STATE_FILE = os.path.join(os.path.dirname(__file__), "current_process_health.json")
|
||||||
|
|
||||||
|
|
||||||
discovered = False
|
discovered = False
|
||||||
@@ -485,7 +502,7 @@ def get_latest_screenshot():
|
|||||||
return {
|
return {
|
||||||
"filename": os.path.basename(preferred_path),
|
"filename": os.path.basename(preferred_path),
|
||||||
"data": screenshot_data,
|
"data": screenshot_data,
|
||||||
"timestamp": datetime.fromtimestamp(file_stats.st_mtime).isoformat(),
|
"timestamp": datetime.fromtimestamp(file_stats.st_mtime, tz=timezone.utc).isoformat(),
|
||||||
"size": file_stats.st_size
|
"size": file_stats.st_size
|
||||||
}
|
}
|
||||||
except Exception as e:
|
except Exception as e:
|
||||||
@@ -514,7 +531,7 @@ def get_latest_screenshot():
|
|||||||
info = {
|
info = {
|
||||||
"filename": latest_file,
|
"filename": latest_file,
|
||||||
"data": screenshot_data,
|
"data": screenshot_data,
|
||||||
"timestamp": datetime.fromtimestamp(file_stats.st_mtime).isoformat(),
|
"timestamp": datetime.fromtimestamp(file_stats.st_mtime, tz=timezone.utc).isoformat(),
|
||||||
"size": file_stats.st_size
|
"size": file_stats.st_size
|
||||||
}
|
}
|
||||||
logging.debug(f"Selected latest screenshot: {latest_file} ({file_stats.st_size} bytes)")
|
logging.debug(f"Selected latest screenshot: {latest_file} ({file_stats.st_size} bytes)")
|
||||||
@@ -525,13 +542,82 @@ def get_latest_screenshot():
|
|||||||
return None
|
return None
|
||||||
|
|
||||||
|
|
||||||
|
def read_health_state():
|
||||||
|
"""Read the health state file written by display_manager"""
|
||||||
|
try:
|
||||||
|
if not os.path.exists(HEALTH_STATE_FILE):
|
||||||
|
return None
|
||||||
|
with open(HEALTH_STATE_FILE, 'r') as f:
|
||||||
|
return json.load(f)
|
||||||
|
except Exception as e:
|
||||||
|
logging.debug(f"Could not read health state file: {e}")
|
||||||
|
return None
|
||||||
|
|
||||||
|
|
||||||
|
def publish_health_message(client, client_id):
|
||||||
|
"""Publish health status to server via MQTT"""
|
||||||
|
try:
|
||||||
|
health = read_health_state()
|
||||||
|
if not health:
|
||||||
|
return # No active process
|
||||||
|
|
||||||
|
payload = {
|
||||||
|
"timestamp": datetime.now(timezone.utc).isoformat(),
|
||||||
|
"expected_state": {
|
||||||
|
"event_id": health.get("event_id")
|
||||||
|
},
|
||||||
|
"actual_state": {
|
||||||
|
"process": health.get("current_process"),
|
||||||
|
"pid": health.get("process_pid"),
|
||||||
|
"status": health.get("process_status")
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
topic = f"infoscreen/{client_id}/health"
|
||||||
|
res = client.publish(topic, json.dumps(payload), qos=1)
|
||||||
|
if res.rc == mqtt.MQTT_ERR_SUCCESS:
|
||||||
|
logging.debug(f"Health message published: {health.get('current_process')} status={health.get('process_status')}")
|
||||||
|
else:
|
||||||
|
logging.debug(f"Health publish failed with code: {res.rc}")
|
||||||
|
except Exception as e:
|
||||||
|
logging.debug(f"Error publishing health: {e}")
|
||||||
|
|
||||||
|
|
||||||
|
def publish_log_message(client, client_id, level: str, message: str, context: dict = None):
|
||||||
|
"""Publish log message to server via MQTT (only if level is ERROR or WARN, unless DEBUG_MODE)"""
|
||||||
|
try:
|
||||||
|
# Filter logs: only send ERROR/WARN to server, keep INFO/DEBUG local-only unless DEBUG_MODE
|
||||||
|
if level.upper() == "INFO" and not DEBUG_MODE:
|
||||||
|
return # Keep INFO logs local only in production
|
||||||
|
if level.upper() == "DEBUG":
|
||||||
|
return # DEBUG logs always local-only
|
||||||
|
|
||||||
|
topic = f"infoscreen/{client_id}/logs/{level.lower()}"
|
||||||
|
payload = {
|
||||||
|
"timestamp": datetime.now(timezone.utc).isoformat(),
|
||||||
|
"message": message,
|
||||||
|
"context": context or {}
|
||||||
|
}
|
||||||
|
|
||||||
|
res = client.publish(topic, json.dumps(payload), qos=1)
|
||||||
|
if res.rc == mqtt.MQTT_ERR_SUCCESS:
|
||||||
|
monitoring_logger.log(getattr(logging, level.upper(), logging.INFO), f"[MQTT] {message}")
|
||||||
|
else:
|
||||||
|
logging.debug(f"Log publish failed ({level}) with code: {res.rc}")
|
||||||
|
except Exception as e:
|
||||||
|
logging.debug(f"Error publishing log: {e}")
|
||||||
|
|
||||||
|
|
||||||
def send_screenshot_heartbeat(client, client_id):
|
def send_screenshot_heartbeat(client, client_id):
|
||||||
"""Send heartbeat with screenshot to server for dashboard monitoring"""
|
"""Send heartbeat with screenshot to server for dashboard monitoring"""
|
||||||
try:
|
try:
|
||||||
screenshot_info = get_latest_screenshot()
|
screenshot_info = get_latest_screenshot()
|
||||||
|
|
||||||
|
# Also read health state and include in heartbeat
|
||||||
|
health = read_health_state()
|
||||||
|
|
||||||
heartbeat_data = {
|
heartbeat_data = {
|
||||||
"timestamp": datetime.now().isoformat(),
|
"timestamp": datetime.now(timezone.utc).isoformat(),
|
||||||
"client_id": client_id,
|
"client_id": client_id,
|
||||||
"status": "alive",
|
"status": "alive",
|
||||||
"screenshot": screenshot_info,
|
"screenshot": screenshot_info,
|
||||||
@@ -542,6 +628,17 @@ def send_screenshot_heartbeat(client, client_id):
|
|||||||
}
|
}
|
||||||
}
|
}
|
||||||
|
|
||||||
|
# Include health info if available (from display_manager)
|
||||||
|
if health:
|
||||||
|
heartbeat_data["process_health"] = {
|
||||||
|
"event_id": health.get("event_id"),
|
||||||
|
"event_type": health.get("event_type"),
|
||||||
|
"current_process": health.get("current_process"),
|
||||||
|
"process_pid": health.get("process_pid"),
|
||||||
|
"process_status": health.get("process_status"),
|
||||||
|
"restart_count": health.get("restart_count", 0)
|
||||||
|
}
|
||||||
|
|
||||||
# Send to dashboard monitoring topic
|
# Send to dashboard monitoring topic
|
||||||
dashboard_topic = f"infoscreen/{client_id}/dashboard"
|
dashboard_topic = f"infoscreen/{client_id}/dashboard"
|
||||||
payload = json.dumps(heartbeat_data)
|
payload = json.dumps(heartbeat_data)
|
||||||
@@ -575,7 +672,7 @@ def screenshot_service_thread(client, client_id):
|
|||||||
|
|
||||||
def main():
|
def main():
|
||||||
global discovered
|
global discovered
|
||||||
print(f"[{datetime.now().isoformat()}] simclient.py: program started")
|
print(f"[{datetime.now(timezone.utc).isoformat()}] simclient.py: program started")
|
||||||
logging.info("Client starting - deleting old event file if present")
|
logging.info("Client starting - deleting old event file if present")
|
||||||
delete_event_file()
|
delete_event_file()
|
||||||
|
|
||||||
@@ -840,6 +937,8 @@ def main():
|
|||||||
result = client.publish(f"infoscreen/{client_id}/heartbeat", "alive", qos=0)
|
result = client.publish(f"infoscreen/{client_id}/heartbeat", "alive", qos=0)
|
||||||
if result.rc == mqtt.MQTT_ERR_SUCCESS:
|
if result.rc == mqtt.MQTT_ERR_SUCCESS:
|
||||||
logging.info("Heartbeat sent.")
|
logging.info("Heartbeat sent.")
|
||||||
|
# Also send health and screenshot heartbeats
|
||||||
|
publish_health_message(client, client_id)
|
||||||
elif result.rc == mqtt.MQTT_ERR_NO_CONN:
|
elif result.rc == mqtt.MQTT_ERR_NO_CONN:
|
||||||
logging.debug("Heartbeat publish returned NO_CONN; retrying in 2s...")
|
logging.debug("Heartbeat publish returned NO_CONN; retrying in 2s...")
|
||||||
time.sleep(2)
|
time.sleep(2)
|
||||||
@@ -847,6 +946,7 @@ def main():
|
|||||||
retry = client.publish(f"infoscreen/{client_id}/heartbeat", "alive", qos=0)
|
retry = client.publish(f"infoscreen/{client_id}/heartbeat", "alive", qos=0)
|
||||||
if retry.rc == mqtt.MQTT_ERR_SUCCESS:
|
if retry.rc == mqtt.MQTT_ERR_SUCCESS:
|
||||||
logging.info("Heartbeat sent after retry.")
|
logging.info("Heartbeat sent after retry.")
|
||||||
|
publish_health_message(client, client_id)
|
||||||
else:
|
else:
|
||||||
logging.warning(f"Heartbeat publish failed after retry with code: {retry.rc}")
|
logging.warning(f"Heartbeat publish failed after retry with code: {retry.rc}")
|
||||||
else:
|
else:
|
||||||
|
|||||||
Reference in New Issue
Block a user