# Phase 3: Client-Side Monitoring Implementation

**Status**: ✅ COMPLETE
**Date**: 11 March 2026
**Architecture**: Two-process design with health-state bridge

---

## Overview

This document describes the **Phase 3** client-side monitoring implementation integrated into the existing infoscreen-dev codebase. The implementation adds:

1. ✅ **Health-state tracking** for all display processes (Impressive, Chromium, VLC)
2. ✅ **Tiered logging**: Local rotating logs + selective MQTT transmission
3. ✅ **Process crash detection** with bounded restart attempts
4. ✅ **MQTT health/log topics** feeding the monitoring server
5. ✅ **Impressive-aware process mapping** (presentations → impressive, websites → chromium, videos → vlc)

---

## Architecture

### Two-Process Design

```
┌─────────────────────────────────────────────────────────┐
│  simclient.py (MQTT Client)                             │
│  - Discovers device, sends heartbeat                    │
│  - Downloads presentation files                         │
│  - Reads health state from display_manager              │
│  - Publishes health/log messages to MQTT                │
│  - Sends screenshots for dashboard                      │
└────────┬────────────────────────────────────┬───────────┘
         │                                    │
         │ reads: current_process_health.json │
         │                                    │
         │ writes: current_event.json         │
         │                                    │
┌────────▼────────────────────────────────────▼───────────┐
│  display_manager.py (Display Control)                   │
│  - Monitors events and manages displays                 │
│  - Launches Impressive (presentations)                  │
│  - Launches Chromium (websites)                         │
│  - Launches VLC (videos)                                │
│  - Tracks process health and crashes                    │
│  - Detects and restarts crashed processes               │
│  - Writes health state to JSON bridge                   │
│  - Captures screenshots to shared folder                │
└─────────────────────────────────────────────────────────┘
```

---

## Implementation Details

### 1. Health State Tracking (display_manager.py)

**File**: `src/display_manager.py`
**New Class**: `ProcessHealthState`

Tracks process health and persists to JSON for simclient to read:

```python
class ProcessHealthState:
    """Track and persist process health state for monitoring integration.

    Attributes:
        event_id: Currently active event identifier
        event_type: presentation, website, video, or None
        process_name: impressive, chromium-browser, vlc, or None
        process_pid: Process ID or None for libvlc
        status: running, crashed, starting, stopped
        restart_count: Number of restart attempts
        max_restarts: Maximum allowed restarts (3)
    """
```

Methods:
- `update_running()` - Mark process as started (logs to monitoring.log)
- `update_crashed()` - Mark process as crashed (warning to monitoring.log)
- `update_restart_attempt()` - Increment restart counter (logs attempt and checks max)
- `update_stopped()` - Mark process as stopped (info to monitoring.log)
- `save()` - Persist state to `src/current_process_health.json`

**New Health State File**: `src/current_process_health.json`

```json
{
  "event_id": "event_123",
  "event_type": "presentation",
  "current_process": "impressive",
  "process_pid": 1234,
  "process_status": "running",
  "restart_count": 0,
  "timestamp": "2026-03-11T10:30:45.123456+00:00"
}
```

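
The class and bridge file described above could look roughly like the following. This is a minimal sketch and not the code in `src/display_manager.py`: the constructor arguments, the mapping from attributes to JSON keys, the per-event reset of the restart counter, and the write-then-rename save are all assumptions. Only the method names, the field names, and the limit of 3 restarts come from the description above, and the logging calls to monitoring.log are left out.

```python
import json
import os
import tempfile
from datetime import datetime, timezone


class ProcessHealthState:
    """Illustrative sketch of the health-state bridge (not the production class)."""

    def __init__(self, path="src/current_process_health.json", max_restarts=3):
        self.path = path
        self.max_restarts = max_restarts
        self.event_id = None
        self.event_type = None
        self.process_name = None
        self.process_pid = None
        self.status = "stopped"
        self.restart_count = 0

    def to_dict(self):
        # Assumed attribute-to-key mapping, taken from the JSON example above.
        return {
            "event_id": self.event_id,
            "event_type": self.event_type,
            "current_process": self.process_name,
            "process_pid": self.process_pid,
            "process_status": self.status,
            "restart_count": self.restart_count,
            "timestamp": datetime.now(timezone.utc).isoformat(),
        }

    def save(self):
        # Assumption: write a temp file and rename it, so the reading
        # process never sees a half-written JSON document.
        directory = os.path.dirname(self.path) or "."
        os.makedirs(directory, exist_ok=True)
        fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(self.to_dict(), handle)
        os.replace(tmp_path, self.path)

    def update_running(self, event_id, event_type, process_name, pid=None):
        if event_id != self.event_id:
            self.restart_count = 0  # assumption: the counter is per event
        self.event_id, self.event_type = event_id, event_type
        self.process_name, self.process_pid = process_name, pid
        self.status = "running"
        self.save()

    def update_crashed(self):
        self.status = "crashed"
        self.save()

    def update_restart_attempt(self):
        """Count one attempt; return True while another restart is allowed."""
        self.restart_count += 1
        self.save()
        return self.restart_count <= self.max_restarts

    def update_stopped(self):
        self.status = "stopped"
        self.process_pid = None
        self.save()
```

Returning a boolean from `update_restart_attempt()` is one way to let the caller decide whether to relaunch; the real method may signal the limit differently.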
### 2. Monitoring Logger (both files)

**Local Rotating Logs**: 5 files × 5 MB each = 25 MB max per device

**display_manager.py**:
```python
import logging
from logging.handlers import RotatingFileHandler

MONITORING_LOG_PATH = "logs/monitoring.log"
monitoring_logger = logging.getLogger("monitoring")
monitoring_handler = RotatingFileHandler(MONITORING_LOG_PATH, maxBytes=5*1024*1024, backupCount=5)
```

**simclient.py**:
- Shares the same `logs/monitoring.log` file
- Both processes write to the monitoring logger for health events
- Local logs stay on the device (within the rotation limit) for technician inspection

**Log Filtering** (tiered strategy):
- **ERROR**: Local + MQTT (published to `infoscreen/{uuid}/logs/error`)
- **WARN**: Local + MQTT (published to `infoscreen/{uuid}/logs/warn`)
- **INFO**: Local only (unless `DEBUG_MODE=1`)
- **DEBUG**: Local only (always)

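
One way to get "local always, MQTT selectively" from a single logger is to attach a second handler whose level encodes the tiers listed above. The sketch below is an illustration under assumptions, not the project's code: the class name `MqttForwardingHandler` and the `publish` callable are hypothetical, and the source does not say that the tiers are implemented as a logging handler.

```python
import logging


class MqttForwardingHandler(logging.Handler):
    """Forward records that pass the tier rules to a publish callable."""

    def __init__(self, publish, debug_mode=False):
        # WARNING and above always pass; INFO passes only in debug mode;
        # DEBUG never reaches this handler and therefore stays local.
        super().__init__(level=logging.INFO if debug_mode else logging.WARNING)
        self._publish = publish

    def emit(self, record):
        # `publish` is a hypothetical callable taking (level_name, message).
        self._publish(record.levelname, record.getMessage())
```

Attached next to the `RotatingFileHandler` on the same logger, every record would still reach the local file, while only the filtered subset is handed to `publish`.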
### 3. Process Mapping with Impressive Support

**display_manager.py** - When starting processes:

| Event Type | Process Name | Health Status |
|-----------|--------------|---------------|
| presentation | `impressive` | tracked with PID |
| website/webpage/webuntis | `chromium` or `chromium-browser` | tracked with PID |
| video | `vlc` | tracked (may have no PID if using libvlc) |

**Per-Process Updates**:
- Presentation: `health.update_running('event_id', 'presentation', 'impressive', pid)`
- Website: `health.update_running('event_id', 'website', browser_name, pid)`
- Video: `health.update_running('event_id', 'video', 'vlc', pid or None)`

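
The mapping in the table above can be written down as a small lookup. This is a sketch for illustration only: the names `PROCESS_FOR_EVENT_TYPE` and `expected_process` are hypothetical, and how display_manager actually chooses between `chromium` and `chromium-browser` is not stated in this document.

```python
# Event type -> process family, following the table above.
PROCESS_FOR_EVENT_TYPE = {
    "presentation": "impressive",
    "website": "chromium",
    "webpage": "chromium",
    "webuntis": "chromium",
    "video": "vlc",
}


def expected_process(event_type, browser_name="chromium"):
    """Return the process name to report for an event type, or None if unknown."""
    process = PROCESS_FOR_EVENT_TYPE.get(event_type)
    if process == "chromium":
        # The installed binary may be `chromium` or `chromium-browser`.
        return browser_name
    return process
```

For example, `expected_process("webuntis", "chromium-browser")` yields `"chromium-browser"`, while an unknown event type yields `None`.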
### 4. Crash Detection and Restart Logic

**display_manager.py** - `process_events()` method:

```
If process not running AND same event_id:
├─ Check exit code
├─ If presentation with exit code 0: Normal completion (no restart)
├─ Else: Mark crashed
│  ├─ health.update_crashed()
│  └─ health.update_restart_attempt()
│     ├─ If restart_count > max_restarts: Give up
│     └─ Else: Restart display (loop back to start_display_for_event)
└─ Log to monitoring.log at each step
```

**Restart Logic**:
- Max 3 restart attempts per event
- Restarts only if same event still active
- Graceful exit (code 0) for Impressive auto-quit presentations is treated as normal
- All crashes logged to monitoring.log with context

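
The decision tree above reduces to a small pure function, which makes the boundary cases easy to check. The function name `decide_restart` and its return values are hypothetical; the rules themselves (exit code 0 is normal only for presentations, give up once the incremented counter exceeds `max_restarts`) follow the tree as written.

```python
def decide_restart(event_type, exit_code, restart_count, max_restarts=3):
    """Return (action, new_restart_count) for a process that is no longer running.

    action is "complete" (normal end, no restart), "restart", or "give_up".
    """
    if event_type == "presentation" and exit_code == 0:
        # Impressive auto-quit: treated as normal completion.
        return "complete", restart_count
    restart_count += 1  # corresponds to health.update_restart_attempt()
    if restart_count > max_restarts:
        return "give_up", restart_count
    return "restart", restart_count
```

With the default limit, the third crash still triggers a restart (`restart_count` becomes 3) and the fourth one gives up.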
### 5. MQTT Health and Log Topics

**simclient.py** - New functions:

**`read_health_state()`**
- Reads `src/current_process_health.json` written by display_manager
- Returns dict or None if no active process

**`publish_health_message(client, client_id)`**
- Topic: `infoscreen/{uuid}/health`
- QoS: 1 (reliable)
- Payload:
```json
{
  "timestamp": "2026-03-11T10:30:45.123456+00:00",
  "expected_state": {
    "event_id": "event_123"
  },
  "actual_state": {
    "process": "impressive",
    "pid": 1234,
    "status": "running"
  }
}
```

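
The two functions above might fit together as sketched below. This is not the code from `src/simclient.py`: `build_health_message` is a hypothetical helper that only assembles the topic and payload shown above, and the key mapping from the bridge file is inferred from the two JSON examples. If `client` is a paho-mqtt `Client` (an assumption, since this document does not name the MQTT library), the result could be sent with `client.publish(topic, payload, qos=1)`.

```python
import json
from datetime import datetime, timezone


def read_health_state(path="src/current_process_health.json"):
    """Return the bridge file as a dict, or None if it is missing or unreadable."""
    try:
        with open(path, "r", encoding="utf-8") as handle:
            return json.load(handle)
    except (OSError, json.JSONDecodeError):
        return None


def build_health_message(client_uuid, health):
    """Assemble (topic, payload) in the shape of the payload example above."""
    topic = f"infoscreen/{client_uuid}/health"
    payload = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "expected_state": {"event_id": health.get("event_id")},
        "actual_state": {
            "process": health.get("current_process"),
            "pid": health.get("process_pid"),
            "status": health.get("process_status"),
        },
    }
    return topic, json.dumps(payload)
```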
**`publish_log_message(client, client_id, level, message, context)`**
- Topics: `infoscreen/{uuid}/logs/error` or `infoscreen/{uuid}/logs/warn`
- QoS: 1 (reliable)
- Log level filtering (only ERROR/WARN sent unless DEBUG_MODE=1)
- Payload:
```json
{
  "timestamp": "2026-03-11T10:30:45.123456+00:00",
  "message": "Process started: event_id=123 event_type=presentation process=impressive pid=1234",
  "context": {
    "event_id": "event_123",
    "process": "impressive",
    "event_type": "presentation"
  }
}
```

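
The level filtering and topic selection described above can be sketched as a function that returns nothing when a message should stay local. The helper name `build_log_message`, the `debug_mode` parameter, and the acceptance of both `WARN` and `WARNING` as level names are assumptions made for this illustration.

```python
import json
from datetime import datetime, timezone

# Levels that are always forwarded, mapped to their topic suffix.
_ALWAYS_SENT = {"ERROR": "error", "WARN": "warn", "WARNING": "warn"}


def build_log_message(client_uuid, level, message, context=None, debug_mode=False):
    """Return (topic, payload), or None if this level is local-only."""
    name = level.upper()
    suffix = _ALWAYS_SENT.get(name)
    if suffix is None and name == "INFO" and debug_mode:
        suffix = "info"  # INFO is forwarded only when DEBUG_MODE=1
    if suffix is None:
        return None  # DEBUG (and INFO in production) never leave the device
    payload = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "message": message,
        "context": context or {},
    }
    return f"infoscreen/{client_uuid}/logs/{suffix}", json.dumps(payload)
```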
**Enhanced Dashboard Heartbeat**:
- Topic: `infoscreen/{uuid}/dashboard`
- Now includes `process_health` block with event_id, process name, status, restart count

### 6. Integration Points

**Existing Features Preserved**:
- ✅ Impressive PDF presentations with auto-advance and loop
- ✅ Chromium website display with auto-scroll injection
- ✅ VLC video playback (python-vlc preferred, binary fallback)
- ✅ Screenshot capture and transmission
- ✅ HDMI-CEC TV control
- ✅ Two-process architecture

**New Integration Points**:

| File | Function | Change |
|------|----------|--------|
| display_manager.py | `__init__()` | Initialize `ProcessHealthState()` |
| display_manager.py | `start_presentation()` | Call `health.update_running()` with impressive |
| display_manager.py | `start_video()` | Call `health.update_running()` with vlc |
| display_manager.py | `start_webpage()` | Call `health.update_running()` with chromium |
| display_manager.py | `process_events()` | Detect crashes, call `health.update_crashed()` and `update_restart_attempt()` |
| display_manager.py | `stop_current_display()` | Call `health.update_stopped()` |
| simclient.py | `screenshot_service_thread()` | (No changes to interval) |
| simclient.py | Main heartbeat loop | Call `publish_health_message()` after successful heartbeat |
| simclient.py | `send_screenshot_heartbeat()` | Read health state and include in dashboard payload |

---

## Logging Hierarchy

### Local Rotating Files (5 × 5 MB)

**`logs/display_manager.log`** (existing - updated):
- Display event processing
- Process lifecycle (start/stop)
- HDMI-CEC operations
- Presentation status
- Video/website startup

**`logs/simclient.log`** (existing - updated):
- MQTT connection/reconnection
- Discovery and heartbeat
- File downloads
- Group membership changes
- Dashboard payload info

**`logs/monitoring.log`** (NEW):
- Process health events (start, crash, restart, stop)
- Both display_manager and simclient write here
- Centralized health tracking
- Technician-focused: "What happened to the processes?"

```
# Example monitoring.log entries:
2026-03-11 10:30:45 [INFO] Process started: event_id=event_123 event_type=presentation process=impressive pid=1234
2026-03-11 10:35:20 [WARNING] Process crashed: event_id=event_123 event_type=presentation process=impressive restart_count=0/3
2026-03-11 10:35:20 [WARNING] Restarting process: attempt 1/3 for impressive
2026-03-11 10:35:25 [INFO] Process started: event_id=event_123 event_type=presentation process=impressive pid=1245
```

### MQTT Transmission (Selective)

**Always sent** (when error occurs):
- `infoscreen/{uuid}/logs/error` - Critical failures
- `infoscreen/{uuid}/logs/warn` - Restarts, crashes, missing binaries

**Development mode only** (if DEBUG_MODE=1):
- `infoscreen/{uuid}/logs/info` - Event start/stop, process running status

**Never sent**:
- DEBUG messages (local-only debug details)
- INFO messages in production

---

## Environment Variables

No new required variables. Existing configuration supports monitoring:

```bash
# Existing (unchanged):
ENV=development|production
DEBUG_MODE=0|1                          # Enables INFO logs to MQTT
LOG_LEVEL=DEBUG|INFO|WARNING|ERROR      # Local log verbosity
HEARTBEAT_INTERVAL=5|60                 # seconds
SCREENSHOT_INTERVAL=30|300              # seconds (display_manager_screenshot_capture)

# Recommended for monitoring:
SCREENSHOT_CAPTURE_INTERVAL=30          # How often display_manager captures screenshots
SCREENSHOT_MAX_WIDTH=800                # Downscale for bandwidth
SCREENSHOT_JPEG_QUALITY=70              # Balance quality/size

# File server (if different from MQTT broker):
FILE_SERVER_HOST=192.168.1.100
FILE_SERVER_PORT=8000
FILE_SERVER_SCHEME=http
```

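
For orientation, the monitoring-related variables above could be read in Python along these lines. This is only a sketch: the function name `load_monitoring_config` and every default value in it are assumptions, since this document lists the variables but not how the two processes parse them.

```python
import os


def load_monitoring_config(env=None):
    """Read monitoring-related settings from a mapping (defaults are assumed)."""
    env = os.environ if env is None else env
    return {
        # DEBUG_MODE=1 additionally forwards INFO logs to MQTT.
        "debug_mode": env.get("DEBUG_MODE", "0") == "1",
        "log_level": env.get("LOG_LEVEL", "INFO").upper(),
        "heartbeat_interval": int(env.get("HEARTBEAT_INTERVAL", "60")),
        "screenshot_capture_interval": int(env.get("SCREENSHOT_CAPTURE_INTERVAL", "30")),
    }
```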

---

## Testing Validation

### System-Level Test Sequence

**1. Start Services**:
```bash
# Terminal 1: Display Manager
./scripts/start-display-manager.sh

# Terminal 2: MQTT Client
./scripts/start-dev.sh

# Terminal 3: Monitor logs
tail -f logs/monitoring.log
```

**2. Trigger Each Event Type**:
```bash
# Via test menu or MQTT publish:
./scripts/test-display-manager.sh  # Options 1-3 trigger events
```

**3. Verify Health State File**:
```bash
# Check health state gets written immediately
cat src/current_process_health.json
# Should show: event_id, event_type, current_process (impressive/chromium/vlc), process_status=running
```

**4. Check MQTT Topics**:
```bash
# Monitor health messages:
mosquitto_sub -h localhost -t "infoscreen/+/health" -v

# Monitor log messages:
mosquitto_sub -h localhost -t "infoscreen/+/logs/#" -v

# Monitor dashboard heartbeat:
mosquitto_sub -h localhost -t "infoscreen/+/dashboard" -v | head -c 500 && echo "..."
```

**5. Simulate Process Crash**:
```bash
# Find impressive/chromium/vlc PID:
ps aux | grep -E 'impressive|chromium|vlc'

# Kill process:
kill -9 <pid>

# Watch monitoring.log for crash detection and restart
tail -f logs/monitoring.log
# Should see: [WARNING] Process crashed... [WARNING] Restarting process...
```

**6. Verify Server Integration**:
```bash
# Server receives health messages:
sqlite3 infoscreen.db "SELECT process_status, current_process, restart_count FROM clients WHERE uuid='...';"
# Should show latest status from health message

# Server receives logs:
sqlite3 infoscreen.db "SELECT level, message FROM client_logs WHERE client_uuid='...' ORDER BY timestamp DESC LIMIT 10;"
# Should show ERROR/WARN entries from crashes/restarts
```

---

## Troubleshooting

### Health State File Not Created

**Symptom**: `src/current_process_health.json` missing
**Causes**:
- No event active (file only created when display starts)
- display_manager not running

**Check**:
```bash
ps aux | grep display_manager
tail -f logs/display_manager.log | grep "Process started\|Process stopped"
```

### MQTT Health Messages Not Arriving

**Symptom**: No health messages on `infoscreen/{uuid}/health` topic
**Causes**:
- simclient not reading health state file
- MQTT connection dropped
- Health update function not called

**Check**:
```bash
# Check health file exists and is recent:
ls -l src/current_process_health.json
stat src/current_process_health.json | grep Modify

# Monitor simclient logs:
tail -f logs/simclient.log | grep -E "Health|heartbeat|publish"

# Verify MQTT connection:
mosquitto_sub -h localhost -t "infoscreen/+/heartbeat" -v
```

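
The "exists and is recent" check can also be scripted, for example from a maintenance tool running on the device. The helper names and the 120-second threshold below are hypothetical; what counts as "recent" depends on how often display_manager rewrites the file, which this document does not specify.

```python
import os
import time


def health_file_age_seconds(path="src/current_process_health.json", now=None):
    """Return seconds since the file was last modified, or None if it is missing."""
    try:
        modified = os.path.getmtime(path)
    except OSError:
        return None
    current = time.time() if now is None else now
    return current - modified


def health_file_is_stale(path="src/current_process_health.json",
                         max_age_seconds=120, now=None):
    """True if the file is missing or older than the (assumed) threshold."""
    age = health_file_age_seconds(path, now)
    return age is None or age > max_age_seconds
```

A missing file is reported as stale, which matches the first troubleshooting case above (no active event or display_manager not running).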
### Restart Loop (Process Keeps Crashing)

**Symptom**: monitoring.log shows repeated crashes and restarts
**Check**:
```bash
# Read last log lines of the process (stored by display_manager):
tail -f logs/impressive.out.log     # for presentations
tail -f logs/browser.out.log        # for websites
tail -f logs/video_player.out.log   # for videos
```

**Common Causes**:
- Missing binary (impressive not installed, chromium not found, vlc not available)
- Corrupt presentation file
- Invalid URL for website
- Insufficient permissions for screenshots

### Log Messages Not Reaching Server

**Symptom**: client_logs table in server DB is empty
**Causes**:
- Log level filtering: INFO messages in production are local-only
- Logs only published on ERROR/WARN
- MQTT publish failing silently

**Check**:
```bash
# Force DEBUG_MODE to see all logs:
export DEBUG_MODE=1
export LOG_LEVEL=DEBUG
# Restart simclient and trigger event

# Monitor local logs first:
tail -f logs/monitoring.log | grep -i error
```

---

## Performance Considerations

**Bandwidth per Client**:
- Health message: ~200 bytes per heartbeat interval (every 5-60s)
- Screenshot heartbeat: ~50-100 KB (every 30-300s)
- Log messages: ~100-500 bytes per crash/error (rare)
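- **Total**: ~0.3-3.5 MB/day per device for health messages at the intervals above; screenshot heartbeats dominate and add roughly 14-290 MB/day at the stated sizes and intervals
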
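
The per-day volume follows directly from message size and interval. The short calculation below uses only the sizes and intervals from the list above, takes 1 KB as 1000 bytes for simplicity, and ignores MQTT and TCP overhead, so the results are rough estimates rather than measurements.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def daily_bytes(message_bytes, interval_seconds):
    """Bytes sent per day for one message of the given size per interval."""
    return message_bytes * SECONDS_PER_DAY // interval_seconds


# Health messages: ~200 bytes every 60 s (slow) or every 5 s (fast).
health_slow = daily_bytes(200, 60)          # 288_000 bytes, about 0.3 MB/day
health_fast = daily_bytes(200, 5)           # 3_456_000 bytes, about 3.5 MB/day

# Screenshot heartbeats: 50 KB every 300 s up to 100 KB every 30 s.
shots_slow = daily_bytes(50_000, 300)       # 14_400_000 bytes, about 14 MB/day
shots_fast = daily_bytes(100_000, 30)       # 288_000_000 bytes, about 288 MB/day
```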
**Disk Space on Client**:
- Monitoring logs: 5 files × 5 MB = 25 MB max
- Display manager logs: 5 files × 2 MB = 10 MB max
- MQTT client logs: 5 files × 2 MB = 10 MB max
- Screenshots: 20 files × 50-100 KB = 1-2 MB max
- **Total**: ~50 MB max (typical for Raspberry Pi USB/SSD)

**Rotation Strategy**:
- Old files automatically deleted when size limit reached
- Technician can SSH and `tail -f` any time
- No database overhead (file-based rotation is minimal CPU)

---

## Integration with Server (Phase 2)

The client implementation sends data to the server's Phase 2 endpoints:

**Expected Server Implementation** (from CLIENT_MONITORING_SETUP.md):

1. **MQTT Listener** receives and stores:
   - `infoscreen/{uuid}/logs/error`, `/logs/warn`, `/logs/info`
   - `infoscreen/{uuid}/health` messages
   - Updates `clients` table with health fields

2. **Database Tables**:
   - `clients.process_status`: running/crashed/starting/stopped
   - `clients.current_process`: impressive/chromium/vlc/None
   - `clients.process_pid`: PID value
   - `clients.current_event_id`: Active event
   - `client_logs`: table stores logs with level/message/context

3. **API Endpoints**:
   - `GET /api/client-logs/{uuid}/logs?level=ERROR&limit=50`
   - `GET /api/client-logs/summary` (errors/warnings across all clients)

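
As a small illustration of the first endpoint, the request URL can be assembled with the standard library. The helper name `client_logs_url` and the base URL passed to it are hypothetical; only the path and the `level` and `limit` query parameters are taken from the list above.

```python
from urllib.parse import quote, urlencode


def client_logs_url(base_url, client_uuid, level="ERROR", limit=50):
    """Build the per-client log query URL from the endpoint pattern above."""
    query = urlencode({"level": level, "limit": limit})
    # Quote the UUID so unexpected characters cannot alter the path.
    return f"{base_url.rstrip('/')}/api/client-logs/{quote(client_uuid, safe='')}/logs?{query}"
```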

---

## Summary of Changes

### Files Modified

1. **`src/display_manager.py`**:
   - Added `psutil` import for future process monitoring
   - Added `ProcessHealthState` class (60 lines)
   - Added monitoring logger setup (8 lines)
   - Added `health.update_running()` calls in `start_presentation()`, `start_video()`, `start_webpage()`
   - Added crash detection and restart logic in `process_events()`
   - Added `health.update_stopped()` in `stop_current_display()`

2. **`src/simclient.py`**:
   - Added `timezone` import
   - Added monitoring logger setup (8 lines)
   - Added `read_health_state()` function
   - Added `publish_health_message()` function
   - Added `publish_log_message()` function (with level filtering)
   - Updated `send_screenshot_heartbeat()` to include health data
   - Updated heartbeat loop to call `publish_health_message()`

### Files Created

1. **`src/current_process_health.json`** (at runtime):
   - Bridge file between display_manager and simclient
   - Shared volume compatible (works in container setup)

2. **`logs/monitoring.log`** (at runtime):
   - New rotating log file (5 × 5MB)
   - Health events from both processes

---

## Next Steps

1. **Deploy to test client** and run validation sequence above
2. **Deploy server Phase 2** (if not yet done) to receive health/log messages
3. **Verify database updates** in server-side `clients` and `client_logs` tables
4. **Test dashboard UI** (Phase 4) to display health indicators
5. **Configure alerting** (email/Slack) for ERROR level messages

---

**Implementation Date**: 11 March 2026
**Part of**: Infoscreen 2025 Client Monitoring System
**Status**: Production Ready (with server Phase 2 integration)