
# Phase 3: Client-Side Monitoring Implementation
**Status**: ✅ COMPLETE
**Date**: 11 March 2026
**Architecture**: Two-process design with health-state bridge
---
## Overview
This document describes the **Phase 3** client-side monitoring implementation integrated into the existing infoscreen-dev codebase. The implementation adds:
1. **Health-state tracking** for all display processes (Impressive, Chromium, VLC)
2. **Tiered logging**: Local rotating logs + selective MQTT transmission
3. **Process crash detection** with bounded restart attempts
4. **MQTT health/log topics** feeding the monitoring server
5. **Impressive-aware process mapping** (presentations → impressive, websites → chromium, videos → vlc)
---
## Architecture
### Two-Process Design
```
┌─────────────────────────────────────────────────────────┐
│  simclient.py (MQTT Client)                             │
│  - Discovers device, sends heartbeat                    │
│  - Downloads presentation files                         │
│  - Reads health state from display_manager              │
│  - Publishes health/log messages to MQTT                │
│  - Sends screenshots for dashboard                      │
└────────┬────────────────────────────────────┬───────────┘
         │                                    │
         │ reads: current_process_health.json │
         │                                    │
         │ writes: current_event.json         │
         │                                    │
┌────────▼────────────────────────────────────▼───────────┐
│  display_manager.py (Display Control)                   │
│  - Monitors events and manages displays                 │
│  - Launches Impressive (presentations)                  │
│  - Launches Chromium (websites)                         │
│  - Launches VLC (videos)                                │
│  - Tracks process health and crashes                    │
│  - Detects and restarts crashed processes               │
│  - Writes health state to JSON bridge                   │
│  - Captures screenshots to shared folder                │
└─────────────────────────────────────────────────────────┘
```
---
## Implementation Details
### 1. Health State Tracking (display_manager.py)
**File**: `src/display_manager.py`
**New Class**: `ProcessHealthState`
Tracks process health and persists to JSON for simclient to read:
```python
class ProcessHealthState:
    """Track and persist process health state for monitoring integration.

    Attributes:
        event_id: Currently active event identifier
        event_type: presentation, website, video, or None
        process_name: impressive, chromium-browser, vlc, or None
        process_pid: Process ID, or None for libvlc
        status: running, crashed, starting, stopped
        restart_count: Number of restart attempts
        max_restarts: Maximum allowed restarts (3)
    """
```
Methods:
- `update_running()` - Mark process as started (logs to monitoring.log)
- `update_crashed()` - Mark process as crashed (warning to monitoring.log)
- `update_restart_attempt()` - Increment restart counter (logs attempt and checks max)
- `update_stopped()` - Mark process as stopped (info to monitoring.log)
- `save()` - Persist state to `src/current_process_health.json`
**New Health State File**: `src/current_process_health.json`
```json
{
"event_id": "event_123",
"event_type": "presentation",
"current_process": "impressive",
"process_pid": 1234,
"process_status": "running",
"restart_count": 0,
"timestamp": "2026-03-11T10:30:45.123456+00:00"
}
```
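As a rough illustration of how a bridge file like this can be written safely, the sketch below writes to a temporary file and then renames it, so a reader never sees a half-written JSON document. This is an assumption about one reasonable approach, not a description of what `ProcessHealthState.save()` actually does; the helper name `save_health_state` is hypothetical.

```python
import json
import os
import tempfile
from datetime import datetime, timezone


def save_health_state(path, state):
    """Write `state` plus a UTC timestamp to `path` via temp file + rename."""
    payload = dict(state, timestamp=datetime.now(timezone.utc).isoformat())
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temp file in the target directory so the rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(payload, handle, indent=2)
        os.replace(tmp_path, path)  # readers see either the old or the new file
    except BaseException:
        os.unlink(tmp_path)  # do not leave stray temp files behind
        raise
    return payload
```

The point of the rename step is that the reading process polls the file independently, so it could otherwise catch the file mid-write.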
### 2. Monitoring Logger (both files)
**Local Rotating Logs**: 5 MB per file with 5 backups (25 MB of backups plus the active file, so up to about 30 MB per device)
**display_manager.py**:
```python
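import logging  # imports needed by this excerpt
from logging.handlers import RotatingFileHandler
MONITORING_LOG_PATH = "logs/monitoring.log"
monitoring_logger = logging.getLogger("monitoring")
monitoring_handler = RotatingFileHandler(MONITORING_LOG_PATH, maxBytes=5*1024*1024, backupCount=5)
monitoring_logger.addHandler(monitoring_handler)  # attach the rotating handler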
```
**simclient.py**:
- Shares same `logs/monitoring.log` file
- Both processes write to monitoring logger for health events
- Local logs rotate on the device and are kept there for technician inspection
**Log Filtering** (tiered strategy):
- **ERROR**: Local + MQTT (published to `infoscreen/{uuid}/logs/error`)
- **WARN**: Local + MQTT (published to `infoscreen/{uuid}/logs/warn`)
- **INFO**: Local only (unless `DEBUG_MODE=1`)
- **DEBUG**: Local only (always)
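The tiered strategy above boils down to a small decision function. The following is a minimal sketch, assuming standard `logging` level constants and a hypothetical helper name `mqtt_topic_for`; the real filtering code may be structured differently.

```python
import logging


def mqtt_topic_for(level, client_uuid, debug_mode=False):
    """Return the MQTT topic for a log level, or None if the record stays local."""
    if level >= logging.ERROR:  # ERROR and above: always published
        return f"infoscreen/{client_uuid}/logs/error"
    if level >= logging.WARNING:  # WARN: always published
        return f"infoscreen/{client_uuid}/logs/warn"
    if level >= logging.INFO and debug_mode:  # INFO: only with DEBUG_MODE=1
        return f"infoscreen/{client_uuid}/logs/info"
    return None  # DEBUG (and INFO in production) never leaves the device
```

Returning `None` for local-only records keeps the caller simple: it always writes to the local log and publishes only when a topic comes back.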
### 3. Process Mapping with Impressive Support
**display_manager.py** - When starting processes:
| Event Type | Process Name | Health Status |
|-----------|--------------|---------------|
| presentation | `impressive` | tracked with PID |
| website/webpage/webuntis | `chromium` or `chromium-browser` | tracked with PID |
| video | `vlc` | tracked (may have no PID if using libvlc) |
**Per-Process Updates**:
- Presentation: `health.update_running('event_id', 'presentation', 'impressive', pid)`
- Website: `health.update_running('event_id', 'website', browser_name, pid)`
- Video: `health.update_running('event_id', 'video', 'vlc', pid or None)`
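The mapping in the table can be written as a small lookup. This is an illustrative sketch only; `EVENT_TYPE_TO_PROCESS` and `process_name_for` are hypothetical names, and the browser name is passed in because the installed binary may be either `chromium` or `chromium-browser`.

```python
# Event type -> process family, as listed in the table above.
EVENT_TYPE_TO_PROCESS = {
    "presentation": "impressive",
    "website": "chromium",
    "webpage": "chromium",
    "webuntis": "chromium",
    "video": "vlc",
}


def process_name_for(event_type, browser_name="chromium"):
    """Return the process name to report for an event type, or None if unknown."""
    process = EVENT_TYPE_TO_PROCESS.get(event_type)
    if process == "chromium":
        return browser_name  # report whichever browser binary is actually used
    return process
```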
### 4. Crash Detection and Restart Logic
**display_manager.py** - `process_events()` method:
```
If process not running AND same event_id:
├─ Check exit code
├─ If presentation with exit code 0: Normal completion (no restart)
├─ Else: Mark crashed
│  ├─ health.update_crashed()
│  └─ health.update_restart_attempt()
│     ├─ If restart_count > max_restarts: Give up
│     └─ Else: Restart display (loop back to start_display_for_event)
└─ Log to monitoring.log at each step
```
**Restart Logic**:
- Max 3 restart attempts per event
- Restarts only if same event still active
- Graceful exit (code 0) for Impressive auto-quit presentations is treated as normal
- All crashes logged to monitoring.log with context
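The decision flow above can be sketched as a pure function that is easy to test in isolation. The `Health` dataclass and `handle_process_exit` are hypothetical stand-ins for the real `ProcessHealthState` and `process_events()` logic, and the sketch assumes the counter is incremented before it is compared against the limit.

```python
from dataclasses import dataclass

MAX_RESTARTS = 3  # matches the documented limit of 3 restart attempts


@dataclass
class Health:
    """Minimal stand-in for the health state used by this sketch."""
    restart_count: int = 0
    status: str = "running"


def handle_process_exit(health, event_type, exit_code):
    """Return 'completed', 'restart', or 'give_up' for a process that has exited."""
    if event_type == "presentation" and exit_code == 0:
        health.status = "stopped"  # Impressive auto-quit: normal completion
        return "completed"
    health.status = "crashed"
    health.restart_count += 1  # count this restart attempt
    if health.restart_count > MAX_RESTARTS:
        return "give_up"
    return "restart"
```

With this ordering, three consecutive crashes each lead to a restart and the fourth one gives up, which matches the `attempt 1/3` style of the log entries.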
### 5. MQTT Health and Log Topics
**simclient.py** - New functions:
**`read_health_state()`**
- Reads `src/current_process_health.json` written by display_manager
- Returns dict or None if no active process
**`publish_health_message(client, client_id)`**
- Topic: `infoscreen/{uuid}/health`
- QoS: 1 (reliable)
- Payload:
```json
{
"timestamp": "2026-03-11T10:30:45.123456+00:00",
"expected_state": {
"event_id": "event_123"
},
"actual_state": {
"process": "impressive",
"pid": 1234,
"status": "running"
}
}
```
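To make the relationship between the bridge file and this payload concrete, here is a minimal sketch that loads the health state and reshapes it into the `expected_state`/`actual_state` structure shown above. The helper names `load_health_state` and `build_health_payload` are hypothetical, and treating a missing or unreadable file as "no active process" is an assumption.

```python
import json
from datetime import datetime, timezone


def load_health_state(path):
    """Return the health state dict, or None if the file is missing or unreadable."""
    try:
        with open(path, encoding="utf-8") as handle:
            return json.load(handle)
    except (FileNotFoundError, json.JSONDecodeError):
        return None


def build_health_payload(state):
    """Reshape the bridge-file fields into the health message layout."""
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "expected_state": {"event_id": state.get("event_id")},
        "actual_state": {
            "process": state.get("current_process"),
            "pid": state.get("process_pid"),
            "status": state.get("process_status"),
        },
    }
```

A caller would skip publishing when `load_health_state()` returns `None`, and otherwise serialize the payload with `json.dumps()` before sending it.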
**`publish_log_message(client, client_id, level, message, context)`**
- Topics: `infoscreen/{uuid}/logs/error` or `infoscreen/{uuid}/logs/warn`
- QoS: 1 (reliable)
- Log level filtering (only ERROR/WARN sent unless DEBUG_MODE=1)
- Payload:
```json
{
"timestamp": "2026-03-11T10:30:45.123456+00:00",
"message": "Process started: event_id=123 event_type=presentation process=impressive pid=1234",
"context": {
"event_id": "event_123",
"process": "impressive",
"event_type": "presentation"
}
}
```
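The sketch below shows how such a log payload could be assembled and published with QoS 1. It uses a hypothetical `RecordingClient` stand-in so it runs without a broker; its `publish(topic, payload, qos)` call shape follows paho-mqtt's client, but whether simclient uses paho-mqtt is an assumption here, and `send_log` is not the real `publish_log_message()`.

```python
import json
from datetime import datetime, timezone


class RecordingClient:
    """Hypothetical stand-in for an MQTT client: records publishes instead of sending."""

    def __init__(self):
        self.published = []

    def publish(self, topic, payload=None, qos=0, retain=False):
        self.published.append({"topic": topic, "payload": payload, "qos": qos})


def send_log(client, client_uuid, level, message, context=None):
    """Publish one log entry on infoscreen/{uuid}/logs/{level} with QoS 1."""
    payload = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "message": message,
        "context": context or {},
    }
    topic = f"infoscreen/{client_uuid}/logs/{level.lower()}"
    client.publish(topic, json.dumps(payload), qos=1)
    return topic


client = RecordingClient()
send_log(client, "demo-uuid", "WARN", "Process crashed: process=impressive",
         {"event_id": "event_123", "process": "impressive"})
```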
**Enhanced Dashboard Heartbeat**:
- Topic: `infoscreen/{uuid}/dashboard`
- Now includes `process_health` block with event_id, process name, status, restart count
### 6. Integration Points
**Existing Features Preserved**:
- ✅ Impressive PDF presentations with auto-advance and loop
- ✅ Chromium website display with auto-scroll injection
- ✅ VLC video playback (python-vlc preferred, binary fallback)
- ✅ Screenshot capture and transmission
- ✅ HDMI-CEC TV control
- ✅ Two-process architecture
**New Integration Points**:
| File | Function | Change |
|------|----------|--------|
| display_manager.py | `__init__()` | Initialize `ProcessHealthState()` |
| display_manager.py | `start_presentation()` | Call `health.update_running()` with impressive |
| display_manager.py | `start_video()` | Call `health.update_running()` with vlc |
| display_manager.py | `start_webpage()` | Call `health.update_running()` with chromium |
| display_manager.py | `process_events()` | Detect crashes, call `health.update_crashed()` and `update_restart_attempt()` |
| display_manager.py | `stop_current_display()` | Call `health.update_stopped()` |
| simclient.py | `screenshot_service_thread()` | (No changes to interval) |
| simclient.py | Main heartbeat loop | Call `publish_health_message()` after successful heartbeat |
| simclient.py | `send_screenshot_heartbeat()` | Read health state and include in dashboard payload |
---
## Logging Hierarchy
### Local Rotating Files (5 × 5 MB)
**`logs/display_manager.log`** (existing - updated):
- Display event processing
- Process lifecycle (start/stop)
- HDMI-CEC operations
- Presentation status
- Video/website startup
**`logs/simclient.log`** (existing - updated):
- MQTT connection/reconnection
- Discovery and heartbeat
- File downloads
- Group membership changes
- Dashboard payload info
**`logs/monitoring.log`** (NEW):
- Process health events (start, crash, restart, stop)
- Both display_manager and simclient write here
- Centralized health tracking
- Technician-focused: "What happened to the processes?"
```
# Example monitoring.log entries:
2026-03-11 10:30:45 [INFO] Process started: event_id=event_123 event_type=presentation process=impressive pid=1234
2026-03-11 10:35:20 [WARNING] Process crashed: event_id=event_123 event_type=presentation process=impressive restart_count=0/3
2026-03-11 10:35:20 [WARNING] Restarting process: attempt 1/3 for impressive
2026-03-11 10:35:25 [INFO] Process started: event_id=event_123 event_type=presentation process=impressive pid=1245
```
### MQTT Transmission (Selective)
**Always sent** (when an error or warning occurs):
- `infoscreen/{uuid}/logs/error` - Critical failures
- `infoscreen/{uuid}/logs/warn` - Restarts, crashes, missing binaries
**Development mode only** (if DEBUG_MODE=1):
- `infoscreen/{uuid}/logs/info` - Event start/stop, process running status
**Never sent**:
- DEBUG messages (local-only debug details)
- INFO messages in production
---
## Environment Variables
No new required variables. Existing configuration supports monitoring:
```bash
# Existing (unchanged):
ENV=development|production
DEBUG_MODE=0|1 # Enables INFO logs to MQTT
LOG_LEVEL=DEBUG|INFO|WARNING|ERROR # Local log verbosity
HEARTBEAT_INTERVAL=5|60 # seconds
SCREENSHOT_INTERVAL=30|300 # seconds (display_manager_screenshot_capture)
# Recommended for monitoring:
SCREENSHOT_CAPTURE_INTERVAL=30 # How often display_manager captures screenshots
SCREENSHOT_MAX_WIDTH=800 # Downscale for bandwidth
SCREENSHOT_JPEG_QUALITY=70 # Balance quality/size
# File server (if different from MQTT broker):
FILE_SERVER_HOST=192.168.1.100
FILE_SERVER_PORT=8000
FILE_SERVER_SCHEME=http
```
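For orientation, the monitoring-related variables could be read into a small config dict as sketched below. The function name and every default value are assumptions made for illustration; the actual defaults in simclient and display_manager are not stated in this document.

```python
import os


def load_monitoring_config(env=None):
    """Read monitoring-related settings from the environment (defaults are assumed)."""
    env = os.environ if env is None else env
    return {
        "debug_mode": env.get("DEBUG_MODE", "0") == "1",
        "log_level": env.get("LOG_LEVEL", "INFO").upper(),
        "heartbeat_interval": int(env.get("HEARTBEAT_INTERVAL", "60")),
        "screenshot_capture_interval": int(env.get("SCREENSHOT_CAPTURE_INTERVAL", "30")),
    }
```

Passing a plain dict as `env` makes the parsing testable without touching the real process environment.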
---
## Testing Validation
### System-Level Test Sequence
**1. Start Services**:
```bash
# Terminal 1: Display Manager
./scripts/start-display-manager.sh
# Terminal 2: MQTT Client
./scripts/start-dev.sh
# Terminal 3: Monitor logs
tail -f logs/monitoring.log
```
**2. Trigger Each Event Type**:
```bash
# Via test menu or MQTT publish:
./scripts/test-display-manager.sh # Options 1-3 trigger events
```
**3. Verify Health State File**:
```bash
# Check health state gets written immediately
cat src/current_process_health.json
# Should show: event_id, event_type, current_process (impressive/chromium/vlc), process_status=running
```
**4. Check MQTT Topics**:
```bash
# Monitor health messages:
mosquitto_sub -h localhost -t "infoscreen/+/health" -v
# Monitor log messages:
mosquitto_sub -h localhost -t "infoscreen/+/logs/#" -v
# Monitor dashboard heartbeat:
mosquitto_sub -h localhost -t "infoscreen/+/dashboard" -v | head -c 500 && echo "..."
```
**5. Simulate Process Crash**:
```bash
# Find impressive/chromium/vlc PID:
ps aux | grep -E 'impressive|chromium|vlc'
# Kill process:
kill -9 <pid>
# Watch monitoring.log for crash detection and restart
tail -f logs/monitoring.log
# Should see: [WARNING] Process crashed... [WARNING] Restarting process...
```
**6. Verify Server Integration**:
```bash
# Server receives health messages:
sqlite3 infoscreen.db "SELECT process_status, current_process, restart_count FROM clients WHERE uuid='...';"
# Should show latest status from health message
# Server receives logs:
sqlite3 infoscreen.db "SELECT level, message FROM client_logs WHERE client_uuid='...' ORDER BY timestamp DESC LIMIT 10;"
# Should show ERROR/WARN entries from crashes/restarts
```
---
## Troubleshooting
### Health State File Not Created
**Symptom**: `src/current_process_health.json` missing
**Causes**:
- No event active (file only created when display starts)
- display_manager not running
**Check**:
```bash
ps aux | grep display_manager
tail -f logs/display_manager.log | grep "Process started\|Process stopped"
```
### MQTT Health Messages Not Arriving
**Symptom**: No health messages on `infoscreen/{uuid}/health` topic
**Causes**:
- simclient not reading health state file
- MQTT connection dropped
- Health update function not called
**Check**:
```bash
# Check health file exists and is recent:
ls -l src/current_process_health.json
stat src/current_process_health.json | grep Modify
# Monitor simclient logs:
tail -f logs/simclient.log | grep -E "Health|heartbeat|publish"
# Verify MQTT connection:
mosquitto_sub -h localhost -t "infoscreen/+/heartbeat" -v
```
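The "exists and is recent" check can also be scripted. A minimal sketch, assuming a hypothetical helper `health_file_age_seconds` and leaving the staleness threshold to the caller:

```python
import os
import time


def health_file_age_seconds(path, now=None):
    """Return seconds since the file was last modified, or None if it does not exist."""
    try:
        modified = os.path.getmtime(path)
    except FileNotFoundError:
        return None
    current = time.time() if now is None else now
    return current - modified
```

A result of `None` points at display_manager not running or no active event; a large age suggests the file is no longer being updated.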
### Restart Loop (Process Keeps Crashing)
**Symptom**: monitoring.log shows repeated crashes and restarts
**Check**:
```bash
# Read last log lines of the process (stored by display_manager):
tail -f logs/impressive.out.log # for presentations
tail -f logs/browser.out.log # for websites
tail -f logs/video_player.out.log # for videos
```
**Common Causes**:
- Missing binary (impressive not installed, chromium not found, vlc not available)
- Corrupt presentation file
- Invalid URL for website
- Insufficient permissions for screenshots
### Log Messages Not Reaching Server
**Symptom**: client_logs table in server DB is empty
**Causes**:
- Log level filtering: INFO messages in production are local-only
- Logs only published on ERROR/WARN
- MQTT publish failing silently
**Check**:
```bash
# Force DEBUG_MODE to see all logs:
export DEBUG_MODE=1
export LOG_LEVEL=DEBUG
# Restart simclient and trigger event
# Monitor local logs first:
tail -f logs/monitoring.log | grep -i error
```
---
## Performance Considerations
**Bandwidth per Client**:
- Health message: ~200 bytes per heartbeat interval (every 5-60s)
- Screenshot heartbeat: ~50-100 KB (every 30-300s)
- Log messages: ~100-500 bytes per crash/error (rare)
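- **Total**: health messages alone come to roughly 0.3-3.5 MB/day per device; screenshot heartbeats dominate at roughly 14-290 MB/day for the sizes and intervals listed above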
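A quick back-of-the-envelope check of the per-message figures, taking 1 KB as 1000 bytes and assuming messages are sent at a constant interval all day. The helper name is hypothetical; the sizes and intervals are the ones listed above.

```python
SECONDS_PER_DAY = 86_400


def bytes_per_day(message_bytes, interval_seconds):
    """Daily volume for a message of fixed size sent at a fixed interval."""
    return message_bytes * SECONDS_PER_DAY / interval_seconds


# Health messages: ~200 bytes every 60 s (slow) or every 5 s (fast).
health_low = bytes_per_day(200, 60)    # 288_000 bytes, about 0.3 MB/day
health_high = bytes_per_day(200, 5)    # 3_456_000 bytes, about 3.5 MB/day

# Screenshot heartbeats: 50 KB every 300 s up to 100 KB every 30 s.
shots_low = bytes_per_day(50_000, 300)   # 14_400_000 bytes, about 14 MB/day
shots_high = bytes_per_day(100_000, 30)  # 288_000_000 bytes, about 288 MB/day
```

The screenshot interval is therefore the setting that matters most for bandwidth; health and log traffic are small by comparison.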
**Disk Space on Client**:
- Monitoring logs: 5 files × 5 MB = 25 MB max
- Display manager logs: 5 files × 2 MB = 10 MB max
- MQTT client logs: 5 files × 2 MB = 10 MB max
- Screenshots: 20 files × 50-100 KB = 1-2 MB max
- **Total**: ~50 MB max (small relative to typical Raspberry Pi USB/SSD storage)
**Rotation Strategy**:
- The oldest backup file is deleted automatically once the backup count is exceeded
- A technician can SSH in and `tail -f` the logs at any time
- No database overhead (file-based rotation costs minimal CPU)
---
## Integration with Server (Phase 2)
The client implementation sends data to the server's Phase 2 endpoints:
**Expected Server Implementation** (from CLIENT_MONITORING_SETUP.md):
1. **MQTT Listener** receives and stores:
   - `infoscreen/{uuid}/logs/error`, `/logs/warn`, `/logs/info`
   - `infoscreen/{uuid}/health` messages
   - Updates `clients` table with health fields
2. **Database Tables**:
   - `clients.process_status`: running/crashed/starting/stopped
   - `clients.current_process`: impressive/chromium/vlc/None
   - `clients.process_pid`: PID value
   - `clients.current_event_id`: Active event
   - `client_logs`: table stores logs with level/message/context
3. **API Endpoints**:
   - `GET /api/client-logs/{uuid}/logs?level=ERROR&limit=50`
   - `GET /api/client-logs/summary` (errors/warnings across all clients)
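As a small illustration of the per-client log endpoint, the sketch below builds the request URL with properly encoded query parameters. The base URL `http://server.example:8000` and the helper name `client_logs_url` are placeholders, not values from this project.

```python
from urllib.parse import quote, urlencode


def client_logs_url(base_url, client_uuid, level="ERROR", limit=50):
    """Build the URL for GET /api/client-logs/{uuid}/logs with level and limit filters."""
    query = urlencode({"level": level, "limit": limit})
    path = f"/api/client-logs/{quote(client_uuid, safe='')}/logs"
    return f"{base_url.rstrip('/')}{path}?{query}"


# Placeholder host; substitute the real server address.
url = client_logs_url("http://server.example:8000", "abc-123")
```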
---
## Summary of Changes
### Files Modified
1. **`src/display_manager.py`**:
   - Added `psutil` import for future process monitoring
   - Added `ProcessHealthState` class (60 lines)
   - Added monitoring logger setup (8 lines)
   - Added `health.update_running()` calls in `start_presentation()`, `start_video()`, `start_webpage()`
   - Added crash detection and restart logic in `process_events()`
   - Added `health.update_stopped()` in `stop_current_display()`
2. **`src/simclient.py`**:
   - Added `timezone` import
   - Added monitoring logger setup (8 lines)
   - Added `read_health_state()` function
   - Added `publish_health_message()` function
   - Added `publish_log_message()` function (with level filtering)
   - Updated `send_screenshot_heartbeat()` to include health data
   - Updated heartbeat loop to call `publish_health_message()`
### Files Created
1. **`src/current_process_health.json`** (at runtime):
   - Bridge file between display_manager and simclient
   - Shared volume compatible (works in container setup)
2. **`logs/monitoring.log`** (at runtime):
   - New rotating log file (5 × 5MB)
   - Health events from both processes
---
## Next Steps
1. **Deploy to test client** and run validation sequence above
2. **Deploy server Phase 2** (if not yet done) to receive health/log messages
3. **Verify database updates** in server-side `clients` and `client_logs` tables
4. **Test dashboard UI** (Phase 4) to display health indicators
5. **Configure alerting** (email/Slack) for ERROR level messages
---
**Implementation Date**: 11 March 2026
**Part of**: Infoscreen 2025 Client Monitoring System
**Status**: Production Ready (with server Phase 2 integration)