- add Alembic migration c1d2e3f4g5h6 for client monitoring:
- create client_logs table with FK to clients.uuid and performance indexes
- extend clients with process/health tracking fields
- extend data model with ClientLog, LogLevel, ProcessStatus, and ScreenHealthStatus
- enhance listener MQTT handling:
- subscribe to logs and health topics
- persist client logs from infoscreen/{uuid}/logs/{level}
- process health payloads and enrich heartbeat-derived client state
- add monitoring API blueprint server/routes/client_logs.py:
- GET /api/client-logs/<uuid>/logs
- GET /api/client-logs/summary
- GET /api/client-logs/recent-errors
- GET /api/client-logs/test
- register client_logs blueprint in server/wsgi.py
- align compose/dev runtime for listener live-code execution
- add client-side implementation docs:
- CLIENT_MONITORING_SPECIFICATION.md
- CLIENT_MONITORING_IMPLEMENTATION_GUIDE.md
- update TECH-CHANGELOG.md and copilot-instructions.md:
- document monitoring changes
- codify post-release technical-notes/no-version-bump convention
24 KiB
🚀 Client Monitoring Implementation Guide
Phase-based implementation guide for basic monitoring in development phase
✅ Phase 1: Server-Side Database Foundation
Status: ✅ COMPLETE
Dependencies: None - Already implemented
Time estimate: Completed
✅ Step 1.1: Database Migration
File: server/alembic/versions/c1d2e3f4g5h6_add_client_monitoring.py
What it does:
- Creates
client_logstable for centralized logging - Adds health monitoring columns to
clientstable - Creates indexes for efficient querying
To apply:
cd /workspace/server
alembic upgrade head
✅ Step 1.2: Update Data Models
File: models/models.py
What was added:
- New enums:
LogLevel,ProcessStatus,ScreenHealthStatus - Updated
Clientmodel with health tracking fields - New
ClientLogmodel for log storage
🔧 Phase 2: Server-Side Backend Logic
Status: 🚧 IN PROGRESS
Dependencies: Phase 1 complete
Time estimate: 2-3 hours
Step 2.1: Extend MQTT Listener
File: listener/listener.py
What to add:
# Add new topic subscriptions in on_connect():
client.subscribe("infoscreen/+/logs/error")
client.subscribe("infoscreen/+/logs/warn")
client.subscribe("infoscreen/+/logs/info") # Dev mode only
client.subscribe("infoscreen/+/health")
# Add new handler in on_message():
def handle_log_message(uuid, level, payload):
"""Store client log in database"""
from models.models import ClientLog, LogLevel
from server.database import Session
import json
session = Session()
try:
log_entry = ClientLog(
client_uuid=uuid,
timestamp=payload.get('timestamp', datetime.now(timezone.utc)),
level=LogLevel[level],
message=payload.get('message', ''),
context=json.dumps(payload.get('context', {}))
)
session.add(log_entry)
session.commit()
print(f"[LOG] {uuid} {level}: {payload.get('message', '')}")
except Exception as e:
print(f"Error saving log: {e}")
session.rollback()
finally:
session.close()
def handle_health_message(uuid, payload):
"""Update client health status"""
from models.models import Client, ProcessStatus
from server.database import Session
session = Session()
try:
client = session.query(Client).filter_by(uuid=uuid).first()
if client:
client.current_event_id = payload.get('expected_state', {}).get('event_id')
client.current_process = payload.get('actual_state', {}).get('process')
status_str = payload.get('actual_state', {}).get('status')
if status_str:
client.process_status = ProcessStatus[status_str]
client.process_pid = payload.get('actual_state', {}).get('pid')
session.commit()
except Exception as e:
print(f"Error updating health: {e}")
session.rollback()
finally:
session.close()
Topic routing logic:
# In on_message callback, add routing:
if topic.endswith('/logs/error'):
handle_log_message(uuid, 'ERROR', payload)
elif topic.endswith('/logs/warn'):
handle_log_message(uuid, 'WARN', payload)
elif topic.endswith('/logs/info'):
handle_log_message(uuid, 'INFO', payload)
elif topic.endswith('/health'):
handle_health_message(uuid, payload)
Step 2.2: Create API Routes
File: server/routes/client_logs.py (NEW)
from flask import Blueprint, jsonify, request
from server.database import Session
from server.permissions import admin_or_higher
from models.models import ClientLog, Client
from sqlalchemy import desc
import json
client_logs_bp = Blueprint("client_logs", __name__, url_prefix="/api/client-logs")
@client_logs_bp.route("/<uuid>/logs", methods=["GET"])
@admin_or_higher
def get_client_logs(uuid):
"""
Get logs for a specific client
Query params:
- level: ERROR, WARN, INFO, DEBUG (optional)
- limit: number of entries (default 50, max 500)
- since: ISO timestamp (optional)
"""
session = Session()
try:
level = request.args.get('level')
limit = min(int(request.args.get('limit', 50)), 500)
since = request.args.get('since')
query = session.query(ClientLog).filter_by(client_uuid=uuid)
if level:
from models.models import LogLevel
query = query.filter_by(level=LogLevel[level])
if since:
from datetime import datetime
since_dt = datetime.fromisoformat(since.replace('Z', '+00:00'))
query = query.filter(ClientLog.timestamp >= since_dt)
logs = query.order_by(desc(ClientLog.timestamp)).limit(limit).all()
result = []
for log in logs:
result.append({
"id": log.id,
"timestamp": log.timestamp.isoformat() if log.timestamp else None,
"level": log.level.value if log.level else None,
"message": log.message,
"context": json.loads(log.context) if log.context else {}
})
session.close()
return jsonify({"logs": result, "count": len(result)})
except Exception as e:
session.close()
return jsonify({"error": str(e)}), 500
@client_logs_bp.route("/summary", methods=["GET"])
@admin_or_higher
def get_logs_summary():
"""Get summary of errors/warnings across all clients"""
session = Session()
try:
from sqlalchemy import func
from models.models import LogLevel
from datetime import datetime, timedelta
# Last 24 hours
since = datetime.utcnow() - timedelta(hours=24)
stats = session.query(
ClientLog.client_uuid,
ClientLog.level,
func.count(ClientLog.id).label('count')
).filter(
ClientLog.timestamp >= since
).group_by(
ClientLog.client_uuid,
ClientLog.level
).all()
result = {}
for stat in stats:
uuid = stat.client_uuid
if uuid not in result:
result[uuid] = {"ERROR": 0, "WARN": 0, "INFO": 0}
result[uuid][stat.level.value] = stat.count
session.close()
return jsonify({"summary": result, "period_hours": 24})
except Exception as e:
session.close()
return jsonify({"error": str(e)}), 500
Register in server/wsgi.py:
from server.routes.client_logs import client_logs_bp
app.register_blueprint(client_logs_bp)
Step 2.3: Add Health Data to Heartbeat Handler
File: listener/listener.py (extend existing heartbeat handler)
# Modify existing heartbeat handler to capture health data
def on_message(client, userdata, message):
topic = message.topic
# Existing heartbeat logic...
if '/heartbeat' in topic:
uuid = extract_uuid_from_topic(topic)
try:
payload = json.loads(message.payload.decode())
# Update last_alive (existing)
session = Session()
client_obj = session.query(Client).filter_by(uuid=uuid).first()
if client_obj:
client_obj.last_alive = datetime.now(timezone.utc)
# NEW: Update health data if present in heartbeat
if 'process_status' in payload:
client_obj.process_status = ProcessStatus[payload['process_status']]
if 'current_process' in payload:
client_obj.current_process = payload['current_process']
if 'process_pid' in payload:
client_obj.process_pid = payload['process_pid']
if 'current_event_id' in payload:
client_obj.current_event_id = payload['current_event_id']
session.commit()
session.close()
except Exception as e:
print(f"Error processing heartbeat: {e}")
🖥️ Phase 3: Client-Side Implementation
Status: ⏳ PENDING (After Phase 2)
Dependencies: Phase 2 complete
Time estimate: 3-4 hours
Step 3.1: Create Client Watchdog Script
File: client/watchdog.py (NEW - on client device)
#!/usr/bin/env python3
"""
Client-side process watchdog
Monitors VLC, Chromium, PDF viewer and reports health to server
"""
import psutil
import paho.mqtt.client as mqtt
import json
import time
from datetime import datetime, timezone
import sys
import os
class MediaWatchdog:
def __init__(self, client_uuid, mqtt_broker, mqtt_port=1883):
self.uuid = client_uuid
self.mqtt_client = mqtt.Client()
self.mqtt_client.connect(mqtt_broker, mqtt_port, 60)
self.mqtt_client.loop_start()
self.current_process = None
self.current_event_id = None
self.restart_attempts = 0
self.MAX_RESTARTS = 3
def send_log(self, level, message, context=None):
"""Send log message to server via MQTT"""
topic = f"infoscreen/{self.uuid}/logs/{level.lower()}"
payload = {
"timestamp": datetime.now(timezone.utc).isoformat(),
"message": message,
"context": context or {}
}
self.mqtt_client.publish(topic, json.dumps(payload), qos=1)
print(f"[{level}] {message}")
def send_health(self, process_name, pid, status, event_id=None):
"""Send health status to server"""
topic = f"infoscreen/{self.uuid}/health"
payload = {
"timestamp": datetime.now(timezone.utc).isoformat(),
"expected_state": {
"event_id": event_id
},
"actual_state": {
"process": process_name,
"pid": pid,
"status": status # 'running', 'crashed', 'starting', 'stopped'
}
}
self.mqtt_client.publish(topic, json.dumps(payload), qos=1, retain=False)
def is_process_running(self, process_name):
"""Check if a process is running"""
for proc in psutil.process_iter(['name', 'pid']):
try:
if process_name.lower() in proc.info['name'].lower():
return proc.info['pid']
except (psutil.NoSuchProcess, psutil.AccessDenied):
pass
return None
def monitor_loop(self):
"""Main monitoring loop"""
print(f"Watchdog started for client {self.uuid}")
self.send_log("INFO", "Watchdog service started", {"uuid": self.uuid})
while True:
try:
# Check expected process (would be set by main event handler)
if self.current_process:
pid = self.is_process_running(self.current_process)
if pid:
# Process is running
self.send_health(
self.current_process,
pid,
"running",
self.current_event_id
)
self.restart_attempts = 0 # Reset on success
else:
# Process crashed
self.send_log(
"ERROR",
f"Process {self.current_process} crashed or stopped",
{
"event_id": self.current_event_id,
"process": self.current_process,
"restart_attempt": self.restart_attempts
}
)
if self.restart_attempts < self.MAX_RESTARTS:
self.send_log("WARN", f"Attempting restart ({self.restart_attempts + 1}/{self.MAX_RESTARTS})")
self.restart_attempts += 1
# TODO: Implement restart logic (call event handler)
else:
self.send_log("ERROR", "Max restart attempts exceeded", {
"event_id": self.current_event_id
})
time.sleep(5) # Check every 5 seconds
except KeyboardInterrupt:
print("Watchdog stopped by user")
break
except Exception as e:
self.send_log("ERROR", f"Watchdog error: {str(e)}", {
"exception": str(e),
"traceback": str(sys.exc_info())
})
time.sleep(10) # Wait longer on error
if __name__ == "__main__":
import sys
if len(sys.argv) < 3:
print("Usage: python watchdog.py <client_uuid> <mqtt_broker>")
sys.exit(1)
uuid = sys.argv[1]
broker = sys.argv[2]
watchdog = MediaWatchdog(uuid, broker)
watchdog.monitor_loop()
Step 3.2: Integrate with Existing Event Handler
File: client/event_handler.py (modify existing)
# When starting a new event, notify watchdog
def play_event(event_data):
event_type = event_data.get('event_type')
event_id = event_data.get('id')
if event_type == 'video':
process_name = 'vlc'
# Start VLC...
elif event_type == 'website':
process_name = 'chromium'
# Start Chromium...
elif event_type == 'presentation':
process_name = 'pdf_viewer' # or your PDF tool
# Start PDF viewer...
# Notify watchdog about expected process
watchdog.current_process = process_name
watchdog.current_event_id = event_id
watchdog.restart_attempts = 0
Step 3.3: Enhanced Heartbeat Payload
File: client/heartbeat.py (modify existing)
# Modify existing heartbeat to include process status
def send_heartbeat(mqtt_client, uuid):
# Get current process status
current_process = None
process_pid = None
process_status = "stopped"
# Check if expected process is running
if watchdog.current_process:
pid = watchdog.is_process_running(watchdog.current_process)
if pid:
current_process = watchdog.current_process
process_pid = pid
process_status = "running"
payload = {
"uuid": uuid,
"timestamp": datetime.now(timezone.utc).isoformat(),
# Existing fields...
# NEW health fields:
"current_process": current_process,
"process_pid": process_pid,
"process_status": process_status,
"current_event_id": watchdog.current_event_id
}
mqtt_client.publish(f"infoscreen/{uuid}/heartbeat", json.dumps(payload))
🎨 Phase 4: Dashboard UI Integration
Status: ⏳ PENDING (After Phase 3)
Dependencies: Phases 2 & 3 complete
Time estimate: 2-3 hours
Step 4.1: Create Log Viewer Component
File: dashboard/src/ClientLogs.tsx (NEW)
import React from 'react';
import { GridComponent, ColumnsDirective, ColumnDirective, Page, Inject } from '@syncfusion/ej2-react-grids';
interface LogEntry {
id: number;
timestamp: string;
level: 'ERROR' | 'WARN' | 'INFO' | 'DEBUG';
message: string;
context: any;
}
interface ClientLogsProps {
clientUuid: string;
}
export const ClientLogs: React.FC<ClientLogsProps> = ({ clientUuid }) => {
const [logs, setLogs] = React.useState<LogEntry[]>([]);
const [loading, setLoading] = React.useState(false);
const loadLogs = async (level?: string) => {
setLoading(true);
try {
const params = new URLSearchParams({ limit: '50' });
if (level) params.append('level', level);
const response = await fetch(`/api/client-logs/${clientUuid}/logs?${params}`);
const data = await response.json();
setLogs(data.logs);
} catch (err) {
console.error('Failed to load logs:', err);
} finally {
setLoading(false);
}
};
React.useEffect(() => {
loadLogs();
const interval = setInterval(() => loadLogs(), 30000); // Refresh every 30s
return () => clearInterval(interval);
}, [clientUuid]);
const levelTemplate = (props: any) => {
const colors = {
ERROR: 'text-red-600 bg-red-100',
WARN: 'text-yellow-600 bg-yellow-100',
INFO: 'text-blue-600 bg-blue-100',
DEBUG: 'text-gray-600 bg-gray-100'
};
return (
<span className={`px-2 py-1 rounded ${colors[props.level as keyof typeof colors]}`}>
{props.level}
</span>
);
};
return (
<div>
<div className="mb-4 flex gap-2">
<button onClick={() => loadLogs()} className="e-btn e-primary">All</button>
<button onClick={() => loadLogs('ERROR')} className="e-btn e-danger">Errors</button>
<button onClick={() => loadLogs('WARN')} className="e-btn e-warning">Warnings</button>
<button onClick={() => loadLogs('INFO')} className="e-btn e-info">Info</button>
</div>
<GridComponent
dataSource={logs}
allowPaging={true}
pageSettings={{ pageSize: 20 }}
>
<ColumnsDirective>
<ColumnDirective field='timestamp' headerText='Time' width='180' format='yMd HH:mm:ss' />
<ColumnDirective field='level' headerText='Level' width='100' template={levelTemplate} />
<ColumnDirective field='message' headerText='Message' width='400' />
</ColumnsDirective>
<Inject services={[Page]} />
</GridComponent>
</div>
);
};
Step 4.2: Add Health Indicators to Client Cards
File: dashboard/src/clients.tsx (modify existing)
// Add health indicator to client card
const getHealthBadge = (client: Client) => {
if (!client.process_status) {
return <span className="badge badge-secondary">Unknown</span>;
}
const badges = {
running: <span className="badge badge-success">✓ Running</span>,
crashed: <span className="badge badge-danger">✗ Crashed</span>,
starting: <span className="badge badge-warning">⟳ Starting</span>,
stopped: <span className="badge badge-secondary">■ Stopped</span>
};
return badges[client.process_status] || null;
};
// In client card render:
<div className="client-card">
<h3>{client.hostname || client.uuid}</h3>
<div>Status: {getHealthBadge(client)}</div>
<div>Process: {client.current_process || 'None'}</div>
<div>Event ID: {client.current_event_id || 'None'}</div>
<button onClick={() => showLogs(client.uuid)}>View Logs</button>
</div>
Step 4.3: Add System Health Dashboard (Superadmin)
File: dashboard/src/SystemMonitor.tsx (NEW)
import React from 'react';
import { ClientLogs } from './ClientLogs';
export const SystemMonitor: React.FC = () => {
const [summary, setSummary] = React.useState<any>({});
const loadSummary = async () => {
const response = await fetch('/api/client-logs/summary');
const data = await response.json();
setSummary(data.summary);
};
React.useEffect(() => {
loadSummary();
const interval = setInterval(loadSummary, 30000);
return () => clearInterval(interval);
}, []);
return (
<div className="system-monitor">
<h2>System Health Monitor (Superadmin)</h2>
<div className="alert-panel">
<h3>Active Issues</h3>
{Object.entries(summary).map(([uuid, stats]: [string, any]) => (
stats.ERROR > 0 || stats.WARN > 5 ? (
<div key={uuid} className="alert">
🔴 {uuid}: {stats.ERROR} errors, {stats.WARN} warnings (24h)
</div>
) : null
))}
</div>
{/* Real-time log stream */}
<div className="log-stream">
<h3>Recent Logs (All Clients)</h3>
{/* Implement real-time log aggregation */}
</div>
</div>
);
};
🧪 Phase 5: Testing & Validation
Status: ⏳ PENDING
Dependencies: All previous phases
Time estimate: 1-2 hours
Step 5.1: Server-Side Tests
# Test database migration
cd /workspace/server
alembic upgrade head
alembic downgrade -1
alembic upgrade head
# Test API endpoints
curl -X GET "http://localhost:8000/api/client-logs/<uuid>/logs?limit=10"
curl -X GET "http://localhost:8000/api/client-logs/summary"
Step 5.2: Client-Side Tests
# On client device
python3 watchdog.py <your-uuid> <mqtt-broker-ip>
# Simulate process crash
pkill vlc # Should trigger error log and restart attempt
# Check MQTT messages
mosquitto_sub -h <broker> -t "infoscreen/+/logs/#" -v
mosquitto_sub -h <broker> -t "infoscreen/+/health" -v
Step 5.3: Dashboard Tests
- Open dashboard and navigate to Clients page
- Verify health indicators show correct status
- Click "View Logs" and verify logs appear
- Navigate to System Monitor (superadmin)
- Verify summary statistics are correct
📝 Configuration Summary
Environment Variables
Server (docker-compose.yml):
- LOG_RETENTION_DAYS=90 # How long to keep logs
- DEBUG_MODE=true # Enable INFO level logging via MQTT
Client:
export MQTT_BROKER="your-server-ip"
export CLIENT_UUID="abc-123-def"
export WATCHDOG_ENABLED=true
MQTT Topics Reference
| Topic Pattern | Direction | Purpose |
|---|---|---|
infoscreen/{uuid}/logs/error |
Client → Server | Error messages |
infoscreen/{uuid}/logs/warn |
Client → Server | Warning messages |
infoscreen/{uuid}/logs/info |
Client → Server | Info (dev only) |
infoscreen/{uuid}/health |
Client → Server | Health metrics |
infoscreen/{uuid}/heartbeat |
Client → Server | Enhanced heartbeat |
Database Tables
client_logs:
- Stores all centralized logs
- Indexed by client_uuid, timestamp, level
- Auto-cleanup after 90 days (recommended)
clients (extended):
current_event_id: Which event should be playingcurrent_process: Expected process nameprocess_status: running/crashed/starting/stoppedprocess_pid: Process IDscreen_health_status: OK/BLACK/FROZEN/UNKNOWNlast_screenshot_analyzed: Last analysis timelast_screenshot_hash: For frozen detection
🎯 Next Steps After Implementation
- Deploy Phase 1-2 to staging environment
- Test with 1-2 pilot clients before full rollout
- Monitor traffic & performance (should be minimal)
- Fine-tune log levels based on actual noise
- Add alerting (email/Slack when errors > threshold)
- Implement screenshot analysis (Phase 2 enhancement)
- Add trending/analytics (which clients are least reliable)
🚨 Troubleshooting
Logs not appearing in database:
- Check MQTT broker logs:
docker logs infoscreen-mqtt - Verify listener subscriptions: Check
listener/listener.pylogs - Test MQTT manually:
mosquitto_pub -h broker -t "infoscreen/test/logs/error" -m '{"message":"test"}'
High database growth:
- Check log_retention cleanup cronjob
- Reduce INFO level logging frequency
- Add sampling (log every 10th occurrence instead of all)
Client watchdog not detecting crashes:
- Verify psutil can see processes:
ps aux | grep vlc - Check permissions (may need sudo for some process checks)
- Increase monitor loop frequency for faster detection
✅ Completion Checklist
- Phase 1: Database migration applied
- Phase 2: Listener extended for log topics
- Phase 2: API endpoints created and tested
- Phase 3: Client watchdog implemented
- Phase 3: Enhanced heartbeat deployed
- Phase 4: Dashboard log viewer working
- Phase 4: Health indicators visible
- Phase 5: End-to-end testing complete
- Documentation updated with new features
- Production deployment plan created
Last Updated: 2026-03-09
Author: GitHub Copilot
For: Infoscreen 2025 Project