feat(monitoring): add server-side client logging and health infrastructure

- add Alembic migration c1d2e3f4g5h6 for client monitoring:
  - create client_logs table with FK to clients.uuid and performance indexes
  - extend clients with process/health tracking fields
- extend data model with ClientLog, LogLevel, ProcessStatus, and ScreenHealthStatus
- enhance listener MQTT handling:
  - subscribe to logs and health topics
  - persist client logs from infoscreen/{uuid}/logs/{level}
  - process health payloads and enrich heartbeat-derived client state
- add monitoring API blueprint server/routes/client_logs.py:
  - GET /api/client-logs/<uuid>/logs
  - GET /api/client-logs/summary
  - GET /api/client-logs/recent-errors
  - GET /api/client-logs/test
- register client_logs blueprint in server/wsgi.py
- align compose/dev runtime for listener live-code execution
- add client-side implementation docs:
  - CLIENT_MONITORING_SPECIFICATION.md
  - CLIENT_MONITORING_IMPLEMENTATION_GUIDE.md
- update TECH-CHANGELOG.md and copilot-instructions.md:
  - document monitoring changes
  - codify post-release technical-notes/no-version-bump convention
---

`.github/copilot-instructions.md` (28 lines changed)
@@ -124,9 +124,11 @@ Keep docs synced with code. When you change services/MQTT/API/UTC/env or dev/pro
 - Scheduler queries a future window (default: 7 days) to expand recurring events using RFC 5545 rules, applies event exceptions (skipped dates, detached occurrences), and publishes only events that are active at the current time (UTC). When a group has no active events, the scheduler clears its retained topic by publishing an empty list. Time comparisons are UTC; naive timestamps are normalized. Logging is concise; conversion lookups are cached and logged only once per media.
 - MQTT topics (paho-mqtt v2, use Callback API v2):
 - Discovery: `infoscreen/discovery` (JSON includes `uuid`, hw/ip data). ACK to `infoscreen/{uuid}/discovery_ack`. See `listener/listener.py`.
-- Heartbeat: `infoscreen/{uuid}/heartbeat` updates `Client.last_alive` (UTC).
+- Heartbeat: `infoscreen/{uuid}/heartbeat` updates `Client.last_alive` (UTC); enhanced payload includes `current_process`, `process_pid`, `process_status`, `current_event_id`.
 - Event lists (retained): `infoscreen/events/{group_id}` from `scheduler/scheduler.py`.
 - Per-client group assignment (retained): `infoscreen/{uuid}/group_id` via `server/mqtt_helper.py`.
+- Client logs: `infoscreen/{uuid}/logs/{error|warn|info}` with JSON payload (timestamp, message, context); QoS 1 for ERROR/WARN, QoS 0 for INFO.
+- Client health: `infoscreen/{uuid}/health` with metrics (expected_state, actual_state, health_metrics); QoS 0, published every 5 seconds.
 - Screenshots: server-side folders `server/received_screenshots/` and `server/screenshots/`; Nginx exposes `/screenshots/{uuid}.jpg` via `server/wsgi.py` route.
 - Dev Container guidance: If extensions reappear inside the container, remove UI-only extensions from `devcontainer.json` `extensions` and map them in `remote.extensionKind` as `"ui"`.
@@ -146,6 +148,11 @@ Keep docs synced with code. When you change services/MQTT/API/UTC/env or dev/pro
 - `locked_until`: TIMESTAMP placeholder for account lockout (infrastructure in place, not yet enforced)
 - `deactivated_at`, `deactivated_by`: Soft-delete audit trail (FK self-reference); soft deactivation is the default, hard delete superadmin-only
 - Role hierarchy (privilege escalation enforced): `user` < `editor` < `admin` < `superadmin`
+- Client monitoring (migration: `c1d2e3f4g5h6_add_client_monitoring.py`):
+  - `ClientLog` model: Centralized log storage with fields (id, client_uuid, timestamp, level, message, context, created_at); FK to clients.uuid (CASCADE)
+  - `Client` model extended: 7 health monitoring fields (`current_event_id`, `current_process`, `process_status`, `process_pid`, `last_screenshot_analyzed`, `screen_health_status`, `last_screenshot_hash`)
+  - Enums: `LogLevel` (ERROR, WARN, INFO, DEBUG), `ProcessStatus` (running, crashed, starting, stopped), `ScreenHealthStatus` (OK, BLACK, FROZEN, UNKNOWN)
+  - Indexes: (client_uuid, timestamp DESC), (level, timestamp DESC), (created_at DESC) for performance
 - System settings: `system_settings` key–value store via `SystemSetting` for global configuration (e.g., WebUntis/Vertretungsplan supplement-table). Managed through routes in `server/routes/system_settings.py`.
 - Presentation defaults (system-wide):
 - `presentation_interval` (seconds, default "10")
@@ -189,6 +196,11 @@ Keep docs synced with code. When you change services/MQTT/API/UTC/env or dev/pro
 - `PUT /api/users/<id>/password` — admin password reset (requires backend check to reject self-reset for consistency)
 - `DELETE /api/users/<id>` — hard delete (superadmin only, with self-deletion check)
 - Auth routes (`server/routes/auth.py`): Enhanced to track login events (sets `last_login_at`, resets `failed_login_attempts` on success; increments `failed_login_attempts` and `last_failed_login_at` on failure). Self-service password change via `PUT /api/auth/change-password` requires current password verification.
+- Client logs (`server/routes/client_logs.py`): Centralized log retrieval for monitoring:
+  - `GET /api/client-logs/<uuid>/logs` – Query client logs with filters (level, limit, since); admin_or_higher
+  - `GET /api/client-logs/summary` – Log counts by level per client (last 24h); admin_or_higher
+  - `GET /api/client-logs/recent-errors` – System-wide error monitoring; admin_or_higher
+  - `GET /api/client-logs/test` – Infrastructure validation (no auth); returns recent logs with counts
 
 Documentation maintenance: keep this file aligned with real patterns; update when routes/session/UTC rules change. Avoid long prose; link exact paths.
@@ -364,7 +376,8 @@ Docs maintenance guardrails (solo-friendly): Update this file alongside code cha
 ## Quick examples
 - Add client description persists to DB and publishes group via MQTT: see `PUT /api/clients/<uuid>/description` in `routes/clients.py`.
 - Bulk group assignment emits retained messages for each client: `PUT /api/clients/group`.
-- Listener heartbeat path: `infoscreen/<uuid>/heartbeat` → sets `clients.last_alive`.
+- Listener heartbeat path: `infoscreen/<uuid>/heartbeat` → sets `clients.last_alive` and captures process health data.
+- Client monitoring flow: Client publishes to `infoscreen/{uuid}/logs/error` → listener stores in `client_logs` table → API serves via `/api/client-logs/<uuid>/logs` → dashboard displays (Phase 4, pending).
 
 ## Scheduler payloads: presentation extras
 - Presentation event payloads now include `page_progress` and `auto_progress` in addition to `slide_interval` and media files. These are sourced from per-event fields in the database (with system defaults applied on event creation).
@@ -393,3 +406,14 @@ Questions or unclear areas? Tell us if you need: exact devcontainer debugging st
 - Breaking changes must be prefixed with `BREAKING:`
 - Keep ≤ 8–10 bullets; summarize or group micro-changes
 - JSON hygiene: valid JSON, no trailing commas, don’t edit historical entries except typos
+
+## Versioning Convention (Tech vs UI)
+
+- Use one unified app version across technical and user-facing release notes.
+- `dashboard/public/program-info.json` is user-facing and should list only user-visible changes.
+- `TECH-CHANGELOG.md` can include deeper technical details for the same released version.
+- If server/infrastructure work is implemented but not yet released or not user-visible, document it under the latest released section as:
+  - `Backend technical work (post-release notes; no version bump)`
+- Do not create a new version header in `TECH-CHANGELOG.md` for internal milestones alone.
+- Bump version numbers when a release is actually cut/deployed (or when user-facing release notes are published), not for intermediate backend-only steps.
+- When UI integration lands later, include the user-visible part in the next release version and reference prior post-release technical groundwork when useful.
---

`CLIENT_MONITORING_IMPLEMENTATION_GUIDE.md` (new file, 757 lines)

@@ -0,0 +1,757 @@
# 🚀 Client Monitoring Implementation Guide

**Phase-based implementation guide for basic monitoring in development phase**

---

## ✅ Phase 1: Server-Side Database Foundation

**Status:** ✅ COMPLETE
**Dependencies:** None – already implemented
**Time estimate:** Completed
### ✅ Step 1.1: Database Migration
**File:** `server/alembic/versions/c1d2e3f4g5h6_add_client_monitoring.py`
**What it does:**
- Creates `client_logs` table for centralized logging
- Adds health monitoring columns to `clients` table
- Creates indexes for efficient querying

**To apply:**
```bash
cd /workspace/server
alembic upgrade head
```
### ✅ Step 1.2: Update Data Models
**File:** `models/models.py`
**What was added:**
- New enums: `LogLevel`, `ProcessStatus`, `ScreenHealthStatus`
- Updated `Client` model with health tracking fields
- New `ClientLog` model for log storage
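For orientation, here is a minimal, runnable sketch of what the `ClientLog` model and `LogLevel` enum could look like. Field names follow the migration summary in this commit; the column types, string lengths, and the stub `Client` model are assumptions to check against the real `models/models.py`.

```python
import enum
from datetime import datetime, timezone

from sqlalchemy import (Column, DateTime, Enum, ForeignKey, Integer, String,
                        Text, create_engine)
from sqlalchemy.orm import declarative_base, sessionmaker

Base = declarative_base()


class LogLevel(enum.Enum):
    ERROR = "ERROR"
    WARN = "WARN"
    INFO = "INFO"
    DEBUG = "DEBUG"


class Client(Base):
    """Stub of the existing clients table (only the FK target column)."""
    __tablename__ = "clients"
    uuid = Column(String(36), primary_key=True)


class ClientLog(Base):
    """Centralized client log storage, mirroring the migration description."""
    __tablename__ = "client_logs"
    id = Column(Integer, primary_key=True)
    client_uuid = Column(String(36),
                         ForeignKey("clients.uuid", ondelete="CASCADE"),
                         nullable=False, index=True)
    timestamp = Column(DateTime, nullable=False)
    level = Column(Enum(LogLevel), nullable=False)
    message = Column(Text, nullable=False)
    context = Column(Text)  # JSON-encoded dict
    created_at = Column(DateTime, default=lambda: datetime.now(timezone.utc))


# Smoke test against in-memory SQLite
engine = create_engine("sqlite://")
Base.metadata.create_all(engine)
session = sessionmaker(bind=engine)()
session.add(Client(uuid="abc"))
session.add(ClientLog(client_uuid="abc",
                      timestamp=datetime.now(timezone.utc),
                      level=LogLevel.ERROR, message="boom"))
session.commit()
stored = session.query(ClientLog).one()
print(stored.level.value)  # ERROR
```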
---
## 🔧 Phase 2: Server-Side Backend Logic

**Status:** 🚧 IN PROGRESS
**Dependencies:** Phase 1 complete
**Time estimate:** 2–3 hours

### Step 2.1: Extend MQTT Listener
**File:** `listener/listener.py`
**What to add:**

```python
from datetime import datetime, timezone
import json

# Add new topic subscriptions in on_connect():
client.subscribe("infoscreen/+/logs/error")
client.subscribe("infoscreen/+/logs/warn")
client.subscribe("infoscreen/+/logs/info")  # Dev mode only
client.subscribe("infoscreen/+/health")


# Add new handlers, dispatched from on_message():
def handle_log_message(uuid, level, payload):
    """Store a client log entry in the database."""
    from models.models import ClientLog, LogLevel
    from server.database import Session

    session = Session()
    try:
        log_entry = ClientLog(
            client_uuid=uuid,
            timestamp=payload.get('timestamp', datetime.now(timezone.utc)),
            level=LogLevel[level],
            message=payload.get('message', ''),
            context=json.dumps(payload.get('context', {}))
        )
        session.add(log_entry)
        session.commit()
        print(f"[LOG] {uuid} {level}: {payload.get('message', '')}")
    except Exception as e:
        print(f"Error saving log: {e}")
        session.rollback()
    finally:
        session.close()


def handle_health_message(uuid, payload):
    """Update the client's health status columns."""
    from models.models import Client, ProcessStatus
    from server.database import Session

    session = Session()
    try:
        client = session.query(Client).filter_by(uuid=uuid).first()
        if client:
            client.current_event_id = payload.get('expected_state', {}).get('event_id')
            client.current_process = payload.get('actual_state', {}).get('process')

            status_str = payload.get('actual_state', {}).get('status')
            if status_str:
                client.process_status = ProcessStatus[status_str]

            client.process_pid = payload.get('actual_state', {}).get('pid')
            session.commit()
    except Exception as e:
        print(f"Error updating health: {e}")
        session.rollback()
    finally:
        session.close()
```

**Topic routing logic:**
```python
# In the on_message callback, route by topic suffix:
if topic.endswith('/logs/error'):
    handle_log_message(uuid, 'ERROR', payload)
elif topic.endswith('/logs/warn'):
    handle_log_message(uuid, 'WARN', payload)
elif topic.endswith('/logs/info'):
    handle_log_message(uuid, 'INFO', payload)
elif topic.endswith('/health'):
    handle_health_message(uuid, payload)
```
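The routing above also needs the `uuid` out of the topic string (the listener already has an `extract_uuid_from_topic` helper for heartbeats). A self-contained sketch of such a parser, with `parse_topic` as an illustrative name rather than repo code:

```python
def parse_topic(topic: str):
    """Split 'infoscreen/{uuid}/logs/{level}' or 'infoscreen/{uuid}/health'
    into (uuid, kind, level). Returns (None, None, None) for unrelated topics."""
    parts = topic.split("/")
    if len(parts) >= 4 and parts[0] == "infoscreen" and parts[2] == "logs":
        return parts[1], "logs", parts[3].upper()
    if len(parts) == 3 and parts[0] == "infoscreen" and parts[2] == "health":
        return parts[1], "health", None
    return None, None, None


print(parse_topic("infoscreen/abc-123/logs/error"))  # ('abc-123', 'logs', 'ERROR')
```

Matching on structure rather than `endswith` also keeps heartbeat and discovery topics from ever reaching the log handlers by accident.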
### Step 2.2: Create API Routes
**File:** `server/routes/client_logs.py` (NEW)

```python
import json
from datetime import datetime, timedelta, timezone

from flask import Blueprint, jsonify, request
from sqlalchemy import desc, func

from models.models import Client, ClientLog, LogLevel
from server.database import Session
from server.permissions import admin_or_higher

client_logs_bp = Blueprint("client_logs", __name__, url_prefix="/api/client-logs")


@client_logs_bp.route("/<uuid>/logs", methods=["GET"])
@admin_or_higher
def get_client_logs(uuid):
    """
    Get logs for a specific client.

    Query params:
    - level: ERROR, WARN, INFO, DEBUG (optional)
    - limit: number of entries (default 50, max 500)
    - since: ISO timestamp (optional)
    """
    session = Session()
    try:
        level = request.args.get('level')
        limit = min(int(request.args.get('limit', 50)), 500)
        since = request.args.get('since')

        query = session.query(ClientLog).filter_by(client_uuid=uuid)

        if level:
            query = query.filter_by(level=LogLevel[level])

        if since:
            since_dt = datetime.fromisoformat(since.replace('Z', '+00:00'))
            query = query.filter(ClientLog.timestamp >= since_dt)

        logs = query.order_by(desc(ClientLog.timestamp)).limit(limit).all()

        result = [{
            "id": log.id,
            "timestamp": log.timestamp.isoformat() if log.timestamp else None,
            "level": log.level.value if log.level else None,
            "message": log.message,
            "context": json.loads(log.context) if log.context else {}
        } for log in logs]

        return jsonify({"logs": result, "count": len(result)})
    except Exception as e:
        return jsonify({"error": str(e)}), 500
    finally:
        session.close()


@client_logs_bp.route("/summary", methods=["GET"])
@admin_or_higher
def get_logs_summary():
    """Get a summary of errors/warnings across all clients (last 24 hours)."""
    session = Session()
    try:
        since = datetime.now(timezone.utc) - timedelta(hours=24)

        stats = session.query(
            ClientLog.client_uuid,
            ClientLog.level,
            func.count(ClientLog.id).label('count')
        ).filter(
            ClientLog.timestamp >= since
        ).group_by(
            ClientLog.client_uuid,
            ClientLog.level
        ).all()

        result = {}
        for stat in stats:
            uuid = stat.client_uuid
            if uuid not in result:
                result[uuid] = {"ERROR": 0, "WARN": 0, "INFO": 0}
            result[uuid][stat.level.value] = stat.count

        return jsonify({"summary": result, "period_hours": 24})
    except Exception as e:
        return jsonify({"error": str(e)}), 500
    finally:
        session.close()
```

The `/recent-errors` and `/test` endpoints follow the same pattern.

**Register in `server/wsgi.py`:**
```python
from server.routes.client_logs import client_logs_bp
app.register_blueprint(client_logs_bp)
```
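Independent of Flask and the database, the response shape of `/summary` can be pinned down with plain dictionaries; the tuples below stand in for the rows the `group_by` query returns (a sketch for testing the shape, not repo code):

```python
from collections import defaultdict


def build_summary(rows):
    """rows: iterable of (client_uuid, level_name, count) tuples,
    mirroring what the group_by query in get_logs_summary() yields."""
    summary = defaultdict(lambda: {"ERROR": 0, "WARN": 0, "INFO": 0})
    for uuid, level, count in rows:
        summary[uuid][level] = count
    return dict(summary)


rows = [("abc", "ERROR", 3), ("abc", "WARN", 1), ("def", "INFO", 7)]
print(build_summary(rows))
```

Every client present in the window gets all three counters, so the dashboard never has to guard against missing keys.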
### Step 2.3: Add Health Data to Heartbeat Handler
**File:** `listener/listener.py` (extend existing heartbeat handler)

```python
# Modify the existing heartbeat handler to also capture health data
def on_message(client, userdata, message):
    topic = message.topic

    # Existing heartbeat logic...
    if '/heartbeat' in topic:
        uuid = extract_uuid_from_topic(topic)
        session = Session()
        try:
            payload = json.loads(message.payload.decode())

            # Update last_alive (existing behavior)
            client_obj = session.query(Client).filter_by(uuid=uuid).first()
            if client_obj:
                client_obj.last_alive = datetime.now(timezone.utc)

                # NEW: Update health data if present in the heartbeat
                if 'process_status' in payload:
                    client_obj.process_status = ProcessStatus[payload['process_status']]
                if 'current_process' in payload:
                    client_obj.current_process = payload['current_process']
                if 'process_pid' in payload:
                    client_obj.process_pid = payload['process_pid']
                if 'current_event_id' in payload:
                    client_obj.current_event_id = payload['current_event_id']

            session.commit()
        except Exception as e:
            print(f"Error processing heartbeat: {e}")
            session.rollback()
        finally:
            session.close()
```
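Note that `ProcessStatus[payload['process_status']]` raises `KeyError` for any status string the server does not know, which would abort the whole heartbeat update. A defensive coercion helper avoids that; the helper name is illustrative, and the enum values mirror the migration summary above:

```python
import enum


class ProcessStatus(enum.Enum):
    running = "running"
    crashed = "crashed"
    starting = "starting"
    stopped = "stopped"


def coerce_status(raw, default=None):
    """Map a raw string to ProcessStatus, falling back to `default`
    instead of raising on unknown or missing values."""
    if not raw:
        return default
    try:
        return ProcessStatus[raw]
    except KeyError:
        return default


print(coerce_status("running"), coerce_status("zombie"))
```

With this in place, a misbehaving client can send garbage without knocking out `last_alive` tracking for its own heartbeat.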
---
## 🖥️ Phase 3: Client-Side Implementation

**Status:** ⏳ PENDING (after Phase 2)
**Dependencies:** Phase 2 complete
**Time estimate:** 3–4 hours

### Step 3.1: Create Client Watchdog Script
**File:** `client/watchdog.py` (NEW – on client device)

```python
#!/usr/bin/env python3
"""
Client-side process watchdog.
Monitors VLC, Chromium, and the PDF viewer and reports health to the server.
"""
import json
import sys
import time
import traceback
from datetime import datetime, timezone

import paho.mqtt.client as mqtt
import psutil


class MediaWatchdog:
    def __init__(self, client_uuid, mqtt_broker, mqtt_port=1883):
        self.uuid = client_uuid
        # paho-mqtt v2: use Callback API v2 (repo convention)
        self.mqtt_client = mqtt.Client(mqtt.CallbackAPIVersion.VERSION2)
        self.mqtt_client.connect(mqtt_broker, mqtt_port, 60)
        self.mqtt_client.loop_start()

        self.current_process = None
        self.current_event_id = None
        self.restart_attempts = 0
        self.MAX_RESTARTS = 3

    def send_log(self, level, message, context=None):
        """Send a log message to the server (QoS 1 for ERROR/WARN, QoS 0 for INFO)."""
        topic = f"infoscreen/{self.uuid}/logs/{level.lower()}"
        payload = {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "message": message,
            "context": context or {}
        }
        qos = 1 if level.upper() in ("ERROR", "WARN") else 0
        self.mqtt_client.publish(topic, json.dumps(payload), qos=qos)
        print(f"[{level}] {message}")

    def send_health(self, process_name, pid, status, event_id=None):
        """Send health status to the server (QoS 0, not retained)."""
        topic = f"infoscreen/{self.uuid}/health"
        payload = {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "expected_state": {
                "event_id": event_id
            },
            "actual_state": {
                "process": process_name,
                "pid": pid,
                "status": status  # 'running', 'crashed', 'starting', 'stopped'
            }
        }
        self.mqtt_client.publish(topic, json.dumps(payload), qos=0, retain=False)

    def is_process_running(self, process_name):
        """Return the PID if a matching process is running, else None."""
        for proc in psutil.process_iter(['name', 'pid']):
            try:
                if process_name.lower() in proc.info['name'].lower():
                    return proc.info['pid']
            except (psutil.NoSuchProcess, psutil.AccessDenied):
                pass
        return None

    def monitor_loop(self):
        """Main monitoring loop."""
        print(f"Watchdog started for client {self.uuid}")
        self.send_log("INFO", "Watchdog service started", {"uuid": self.uuid})

        while True:
            try:
                # Check the expected process (set by the main event handler)
                if self.current_process:
                    pid = self.is_process_running(self.current_process)

                    if pid:
                        # Process is running
                        self.send_health(
                            self.current_process,
                            pid,
                            "running",
                            self.current_event_id
                        )
                        self.restart_attempts = 0  # Reset on success
                    else:
                        # Process crashed or stopped
                        self.send_log(
                            "ERROR",
                            f"Process {self.current_process} crashed or stopped",
                            {
                                "event_id": self.current_event_id,
                                "process": self.current_process,
                                "restart_attempt": self.restart_attempts
                            }
                        )

                        if self.restart_attempts < self.MAX_RESTARTS:
                            self.send_log("WARN", f"Attempting restart ({self.restart_attempts + 1}/{self.MAX_RESTARTS})")
                            self.restart_attempts += 1
                            # TODO: Implement restart logic (call event handler)
                        else:
                            self.send_log("ERROR", "Max restart attempts exceeded", {
                                "event_id": self.current_event_id
                            })

                time.sleep(5)  # Check every 5 seconds

            except KeyboardInterrupt:
                print("Watchdog stopped by user")
                break
            except Exception as e:
                self.send_log("ERROR", f"Watchdog error: {e}", {
                    "exception": str(e),
                    "traceback": traceback.format_exc()
                })
                time.sleep(10)  # Back off on error


if __name__ == "__main__":
    if len(sys.argv) < 3:
        print("Usage: python watchdog.py <client_uuid> <mqtt_broker>")
        sys.exit(1)

    watchdog = MediaWatchdog(sys.argv[1], sys.argv[2])
    watchdog.monitor_loop()
```

### Step 3.2: Integrate with Existing Event Handler
**File:** `client/event_handler.py` (modify existing)

```python
# When starting a new event, notify the watchdog (the shared MediaWatchdog instance)
def play_event(event_data):
    event_type = event_data.get('event_type')
    event_id = event_data.get('id')

    if event_type == 'video':
        process_name = 'vlc'
        # Start VLC...
    elif event_type == 'website':
        process_name = 'chromium'
        # Start Chromium...
    elif event_type == 'presentation':
        process_name = 'pdf_viewer'  # or your PDF tool
        # Start PDF viewer...

    # Tell the watchdog which process to expect
    watchdog.current_process = process_name
    watchdog.current_event_id = event_id
    watchdog.restart_attempts = 0
```
### Step 3.3: Enhanced Heartbeat Payload
**File:** `client/heartbeat.py` (modify existing)

```python
# Modify the existing heartbeat to include process status
def send_heartbeat(mqtt_client, uuid):
    # Default: no managed process
    current_process = None
    process_pid = None
    process_status = "stopped"

    # Check whether the expected process is running
    if watchdog.current_process:
        pid = watchdog.is_process_running(watchdog.current_process)
        if pid:
            current_process = watchdog.current_process
            process_pid = pid
            process_status = "running"

    payload = {
        "uuid": uuid,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        # Existing fields...
        # NEW health fields:
        "current_process": current_process,
        "process_pid": process_pid,
        "process_status": process_status,
        "current_event_id": watchdog.current_event_id
    }

    mqtt_client.publish(f"infoscreen/{uuid}/heartbeat", json.dumps(payload))
```
---
## 🎨 Phase 4: Dashboard UI Integration
|
||||||
|
**Status:** ⏳ PENDING (After Phase 3)
|
||||||
|
**Dependencies:** Phases 2 & 3 complete
|
||||||
|
**Time estimate:** 2-3 hours
|
||||||
|
|
||||||
|
### Step 4.1: Create Log Viewer Component
|
||||||
|
**File:** `dashboard/src/ClientLogs.tsx` (NEW)
|
||||||
|
|
||||||
|
```typescript
|
||||||
|
import React from 'react';
|
||||||
|
import { GridComponent, ColumnsDirective, ColumnDirective, Page, Inject } from '@syncfusion/ej2-react-grids';
|
||||||
|
|
||||||
|
interface LogEntry {
|
||||||
|
id: number;
|
||||||
|
timestamp: string;
|
||||||
|
level: 'ERROR' | 'WARN' | 'INFO' | 'DEBUG';
|
||||||
|
message: string;
|
||||||
|
context: any;
|
||||||
|
}
|
||||||
|
|
||||||
|
interface ClientLogsProps {
|
||||||
|
clientUuid: string;
|
||||||
|
}
|
||||||
|
|
||||||
|
export const ClientLogs: React.FC<ClientLogsProps> = ({ clientUuid }) => {
|
||||||
|
const [logs, setLogs] = React.useState<LogEntry[]>([]);
|
||||||
|
const [loading, setLoading] = React.useState(false);
|
||||||
|
|
||||||
|
const loadLogs = async (level?: string) => {
|
||||||
|
setLoading(true);
|
||||||
|
try {
|
||||||
|
const params = new URLSearchParams({ limit: '50' });
|
||||||
|
if (level) params.append('level', level);
      const response = await fetch(`/api/client-logs/${clientUuid}/logs?${params}`);
      const data = await response.json();
      setLogs(data.logs);
    } catch (err) {
      console.error('Failed to load logs:', err);
    } finally {
      setLoading(false);
    }
  };

  React.useEffect(() => {
    loadLogs();
    const interval = setInterval(() => loadLogs(), 30000); // Refresh every 30s
    return () => clearInterval(interval);
  }, [clientUuid]);

  const levelTemplate = (props: any) => {
    const colors = {
      ERROR: 'text-red-600 bg-red-100',
      WARN: 'text-yellow-600 bg-yellow-100',
      INFO: 'text-blue-600 bg-blue-100',
      DEBUG: 'text-gray-600 bg-gray-100'
    };
    return (
      <span className={`px-2 py-1 rounded ${colors[props.level as keyof typeof colors]}`}>
        {props.level}
      </span>
    );
  };

  return (
    <div>
      <div className="mb-4 flex gap-2">
        <button onClick={() => loadLogs()} className="e-btn e-primary">All</button>
        <button onClick={() => loadLogs('ERROR')} className="e-btn e-danger">Errors</button>
        <button onClick={() => loadLogs('WARN')} className="e-btn e-warning">Warnings</button>
        <button onClick={() => loadLogs('INFO')} className="e-btn e-info">Info</button>
      </div>

      <GridComponent
        dataSource={logs}
        allowPaging={true}
        pageSettings={{ pageSize: 20 }}
      >
        <ColumnsDirective>
          <ColumnDirective field='timestamp' headerText='Time' width='180' format='yMd HH:mm:ss' />
          <ColumnDirective field='level' headerText='Level' width='100' template={levelTemplate} />
          <ColumnDirective field='message' headerText='Message' width='400' />
        </ColumnsDirective>
        <Inject services={[Page]} />
      </GridComponent>
    </div>
  );
};
```

### Step 4.2: Add Health Indicators to Client Cards

**File:** `dashboard/src/clients.tsx` (modify existing)

```typescript
// Add health indicator to client card
const getHealthBadge = (client: Client) => {
  if (!client.process_status) {
    return <span className="badge badge-secondary">Unknown</span>;
  }

  const badges = {
    running: <span className="badge badge-success">✓ Running</span>,
    crashed: <span className="badge badge-danger">✗ Crashed</span>,
    starting: <span className="badge badge-warning">⟳ Starting</span>,
    stopped: <span className="badge badge-secondary">■ Stopped</span>
  };

  return badges[client.process_status] || null;
};

// In client card render:
<div className="client-card">
  <h3>{client.hostname || client.uuid}</h3>
  <div>Status: {getHealthBadge(client)}</div>
  <div>Process: {client.current_process || 'None'}</div>
  <div>Event ID: {client.current_event_id || 'None'}</div>
  <button onClick={() => showLogs(client.uuid)}>View Logs</button>
</div>
```

### Step 4.3: Add System Health Dashboard (Superadmin)

**File:** `dashboard/src/SystemMonitor.tsx` (NEW)

```typescript
import React from 'react';
import { ClientLogs } from './ClientLogs';

export const SystemMonitor: React.FC = () => {
  const [summary, setSummary] = React.useState<any>({});

  const loadSummary = async () => {
    const response = await fetch('/api/client-logs/summary');
    const data = await response.json();
    setSummary(data.summary);
  };

  React.useEffect(() => {
    loadSummary();
    const interval = setInterval(loadSummary, 30000);
    return () => clearInterval(interval);
  }, []);

  return (
    <div className="system-monitor">
      <h2>System Health Monitor (Superadmin)</h2>

      <div className="alert-panel">
        <h3>Active Issues</h3>
        {Object.entries(summary).map(([uuid, stats]: [string, any]) => (
          stats.ERROR > 0 || stats.WARN > 5 ? (
            <div key={uuid} className="alert">
              🔴 {uuid}: {stats.ERROR} errors, {stats.WARN} warnings (24h)
            </div>
          ) : null
        ))}
      </div>

      {/* Real-time log stream */}
      <div className="log-stream">
        <h3>Recent Logs (All Clients)</h3>
        {/* Implement real-time log aggregation */}
      </div>
    </div>
  );
};
```

---

## 🧪 Phase 5: Testing & Validation

**Status:** ⏳ PENDING
**Dependencies:** All previous phases
**Time estimate:** 1-2 hours

### Step 5.1: Server-Side Tests

```bash
# Test database migration (apply, roll back, re-apply)
cd /workspace/server
alembic upgrade head
alembic downgrade -1
alembic upgrade head

# Test API endpoints
curl -X GET "http://localhost:8000/api/client-logs/<uuid>/logs?limit=10"
curl -X GET "http://localhost:8000/api/client-logs/summary"
```

### Step 5.2: Client-Side Tests

```bash
# On client device
python3 watchdog.py <your-uuid> <mqtt-broker-ip>

# Simulate process crash
pkill vlc  # Should trigger error log and restart attempt

# Check MQTT messages
mosquitto_sub -h <broker> -t "infoscreen/+/logs/#" -v
mosquitto_sub -h <broker> -t "infoscreen/+/health" -v
```

### Step 5.3: Dashboard Tests

1. Open the dashboard and navigate to the Clients page
2. Verify that the health indicators show the correct status
3. Click "View Logs" and verify that logs appear
4. Navigate to System Monitor (superadmin)
5. Verify that the summary statistics are correct

---

## 📝 Configuration Summary

### Environment Variables

**Server (docker-compose.yml):**
```yaml
- LOG_RETENTION_DAYS=90   # How long to keep logs
- DEBUG_MODE=true         # Enable INFO level logging via MQTT
```

**Client:**
```bash
export MQTT_BROKER="your-server-ip"
export CLIENT_UUID="abc-123-def"
export WATCHDOG_ENABLED=true
```

### MQTT Topics Reference

| Topic Pattern | Direction | Purpose |
|--------------|-----------|---------|
| `infoscreen/{uuid}/logs/error` | Client → Server | Error messages |
| `infoscreen/{uuid}/logs/warn` | Client → Server | Warning messages |
| `infoscreen/{uuid}/logs/info` | Client → Server | Info (dev only) |
| `infoscreen/{uuid}/health` | Client → Server | Health metrics |
| `infoscreen/{uuid}/heartbeat` | Client → Server | Enhanced heartbeat |

### Database Tables

**client_logs:**
- Stores all centralized logs
- Indexed by client_uuid, timestamp, level
- Auto-cleanup after 90 days (recommended)

**clients (extended):**
- `current_event_id`: Which event should be playing
- `current_process`: Expected process name
- `process_status`: running/crashed/starting/stopped
- `process_pid`: Process ID
- `screen_health_status`: OK/BLACK/FROZEN/UNKNOWN
- `last_screenshot_analyzed`: Last analysis time
- `last_screenshot_hash`: For frozen detection

---

## 🎯 Next Steps After Implementation

1. **Deploy Phases 1-2** to a staging environment
2. **Test with 1-2 pilot clients** before full rollout
3. **Monitor traffic & performance** (overhead should be minimal)
4. **Fine-tune log levels** based on actual noise
5. **Add alerting** (email/Slack when errors exceed a threshold)
6. **Implement screenshot analysis** (Phase 2 enhancement)
7. **Add trending/analytics** (identify the least reliable clients)

---

## 🚨 Troubleshooting

**Logs not appearing in the database:**
- Check the MQTT broker logs: `docker logs infoscreen-mqtt`
- Verify the listener subscriptions: check the `listener/listener.py` logs
- Test MQTT manually: `mosquitto_pub -h broker -t "infoscreen/test/logs/error" -m '{"message":"test"}'`

**High database growth:**
- Check the log-retention cleanup cron job
- Reduce the frequency of INFO-level logging
- Add sampling (log every 10th occurrence instead of all)

**Client watchdog not detecting crashes:**
- Verify that psutil can see the processes: `ps aux | grep vlc`
- Check permissions (some process checks may need sudo)
- Increase the monitor-loop frequency for faster detection

---

## ✅ Completion Checklist

- [ ] Phase 1: Database migration applied
- [ ] Phase 2: Listener extended for log topics
- [ ] Phase 2: API endpoints created and tested
- [ ] Phase 3: Client watchdog implemented
- [ ] Phase 3: Enhanced heartbeat deployed
- [ ] Phase 4: Dashboard log viewer working
- [ ] Phase 4: Health indicators visible
- [ ] Phase 5: End-to-end testing complete
- [ ] Documentation updated with new features
- [ ] Production deployment plan created

---

**Last Updated:** 2026-03-09
**Author:** GitHub Copilot
**For:** Infoscreen 2025 Project
972 CLIENT_MONITORING_SPECIFICATION.md Normal file
@@ -0,0 +1,972 @@
# Client-Side Monitoring Specification

**Version:** 1.0
**Date:** 2026-03-10
**For:** Infoscreen Client Implementation
**Server Endpoint:** `192.168.43.201:8000` (or your production server)
**MQTT Broker:** `192.168.43.201:1883` (or your production MQTT broker)

---

## 1. Overview

Each infoscreen client must implement health monitoring and logging capabilities to report status to the central server via MQTT.

### 1.1 Goals
- **Detect failures:** Process crashes, frozen screens, content mismatches
- **Provide visibility:** Real-time health status visible on the server dashboard
- **Enable remote diagnosis:** Centralized log storage for debugging
- **Auto-recovery:** Attempt automatic restart on failure

### 1.2 Architecture
```
┌─────────────────────────────────────────┐
│            Infoscreen Client            │
│                                         │
│  ┌──────────────┐    ┌──────────────┐   │
│  │ Media Player │    │   Watchdog   │   │
│  │ (VLC/Chrome) │◄───│   Monitor    │   │
│  └──────────────┘    └──────┬───────┘   │
│                             │           │
│  ┌──────────────┐           │           │
│  │  Event Mgr   │           │           │
│  │  (receives   │           │           │
│  │   schedule)  │◄──────────┘           │
│  └──────┬───────┘                       │
│         │                               │
│  ┌──────▼───────────────────────┐       │
│  │         MQTT Client          │       │
│  │  - Heartbeat (every 60s)     │       │
│  │  - Logs (error/warn/info)    │       │
│  │  - Health metrics (every 5s) │       │
│  └──────┬───────────────────────┘       │
└─────────┼───────────────────────────────┘
          │
          │ MQTT over TCP
          ▼
   ┌─────────────┐
   │ MQTT Broker │
   │  (server)   │
   └─────────────┘
```

---

## 2. MQTT Protocol Specification

### 2.1 Connection Parameters
```
Broker:            192.168.43.201 (or DNS hostname)
Port:              1883 (standard MQTT)
Protocol:          MQTT v3.1.1
Client ID:         "infoscreen-{client_uuid}"
Clean Session:     false (retain subscriptions)
Keep Alive:        60 seconds
Username/Password: (if configured on broker)
```

### 2.2 QoS Levels
- **Heartbeat:** QoS 0 (fire and forget, high frequency)
- **Logs (ERROR/WARN):** QoS 1 (at-least-once delivery, important)
- **Logs (INFO):** QoS 0 (optional, high volume)
- **Health metrics:** QoS 0 (frequent, latest value matters)

---

## 3. Topic Structure & Payload Formats

### 3.1 Log Messages

#### Topic Pattern:
```
infoscreen/{client_uuid}/logs/{level}
```

Where `{level}` is one of: `error`, `warn`, `info`

#### Payload Format (JSON):
```json
{
  "timestamp": "2026-03-10T07:30:00Z",
  "message": "Human-readable error description",
  "context": {
    "event_id": 42,
    "process": "vlc",
    "error_code": "NETWORK_TIMEOUT",
    "additional_key": "any relevant data"
  }
}
```

#### Field Specifications:
| Field | Type | Required | Description |
|-------|------|----------|-------------|
| `timestamp` | string (ISO 8601 UTC) | Yes | When the event occurred. Use `YYYY-MM-DDTHH:MM:SSZ` format |
| `message` | string | Yes | Human-readable description of the event (max 1000 chars) |
| `context` | object | No | Additional structured data (will be stored as JSON) |

#### Example Topics:
```
infoscreen/9b8d1856-ff34-4864-a726-12de072d0f77/logs/error
infoscreen/9b8d1856-ff34-4864-a726-12de072d0f77/logs/warn
infoscreen/9b8d1856-ff34-4864-a726-12de072d0f77/logs/info
```
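The topic and payload rules above can be sketched as a small publisher-side helper. This is an illustrative sketch only: `build_log_message` is a hypothetical name, not part of the existing client code.

```python
import json
from datetime import datetime, timezone

def build_log_message(client_uuid, level, message, context=None):
    """Return (topic, payload) for a log publish, following the spec above."""
    if level not in ("error", "warn", "info"):
        raise ValueError(f"unsupported log level: {level}")
    topic = f"infoscreen/{client_uuid}/logs/{level}"
    body = {
        # ISO 8601 UTC with trailing 'Z', e.g. 2026-03-10T07:30:00Z
        "timestamp": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "message": message[:1000],  # spec caps message at 1000 chars
    }
    if context:
        body["context"] = context
    return topic, json.dumps(body)

topic, payload = build_log_message(
    "9b8d1856-ff34-4864-a726-12de072d0f77", "error",
    "VLC process crashed", {"event_id": 42, "process": "vlc"})
```

The resulting `(topic, payload)` pair can then be handed to any MQTT client, with QoS 1 for ERROR/WARN per section 2.2.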

#### When to Send Logs:

**ERROR (Always send):**
- Process crashed (VLC/Chromium/PDF viewer terminated unexpectedly)
- Content failed to load (404, network timeout, corrupt file)
- Hardware failure detected (display off, audio device missing)
- Exception caught in main event loop
- Maximum restart attempts exceeded

**WARN (Always send):**
- Process restarted automatically (after crash)
- High resource usage (CPU >80%, RAM >90%)
- Slow performance (frame drops, lag)
- Non-critical failures (screenshot capture failed, cache full)
- Fallback content displayed (primary source unavailable)

**INFO (Send in development, optional in production):**
- Process started successfully
- Event transition (switched from video to presentation)
- Content loaded successfully
- Watchdog service started/stopped

---

### 3.2 Health Metrics

#### Topic Pattern:
```
infoscreen/{client_uuid}/health
```

#### Payload Format (JSON):
```json
{
  "timestamp": "2026-03-10T07:30:00Z",
  "expected_state": {
    "event_id": 42,
    "event_type": "video",
    "media_file": "presentation.mp4",
    "started_at": "2026-03-10T07:15:00Z"
  },
  "actual_state": {
    "process": "vlc",
    "pid": 1234,
    "status": "running",
    "uptime_seconds": 900,
    "position": 45.3,
    "duration": 180.0
  },
  "health_metrics": {
    "screen_on": true,
    "last_frame_update": "2026-03-10T07:29:58Z",
    "frames_dropped": 2,
    "network_errors": 0,
    "cpu_percent": 15.3,
    "memory_mb": 234
  }
}
```

#### Field Specifications:

**expected_state:**
| Field | Type | Required | Description |
|-------|------|----------|-------------|
| `event_id` | integer | Yes | Current event ID from scheduler |
| `event_type` | string | Yes | `presentation`, `video`, `website`, `webuntis`, `message` |
| `media_file` | string | No | Filename or URL of current content |
| `started_at` | string (ISO 8601) | Yes | When this event started playing |

**actual_state:**
| Field | Type | Required | Description |
|-------|------|----------|-------------|
| `process` | string | Yes | `vlc`, `chromium`, `pdf_viewer`, `none` |
| `pid` | integer | No | Process ID (if running) |
| `status` | string | Yes | `running`, `crashed`, `starting`, `stopped` |
| `uptime_seconds` | integer | No | How long process has been running |
| `position` | float | No | Current playback position (seconds, for video/audio) |
| `duration` | float | No | Total content duration (seconds) |

**health_metrics:**
| Field | Type | Required | Description |
|-------|------|----------|-------------|
| `screen_on` | boolean | Yes | Is display powered on? |
| `last_frame_update` | string (ISO 8601) | No | Last time screen content changed |
| `frames_dropped` | integer | No | Video frames dropped (performance indicator) |
| `network_errors` | integer | No | Count of network errors in last interval |
| `cpu_percent` | float | No | CPU usage (0-100) |
| `memory_mb` | integer | No | RAM usage in megabytes |

#### Sending Frequency:
- **Normal operation:** Every 5 seconds
- **During startup/transition:** Every 1 second
- **After error:** Immediately + every 2 seconds until recovered
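Comparing `expected_state` against `actual_state` can be sketched as a pure classifier. The event-type-to-process mapping is taken from section 4.1; how the `message` event type is rendered is not specified there, so it is left unmapped, and `evaluate_health` is a hypothetical helper name.

```python
# Expected media player per event type (from section 4.1)
PROCESS_FOR_EVENT = {
    "video": "vlc",
    "website": "chromium",
    "webuntis": "chromium",
    "presentation": "pdf_viewer",
}

def evaluate_health(expected_state, actual_state):
    """Classify a health payload as 'ok', 'crashed', or 'content_mismatch'."""
    if actual_state.get("status") != "running":
        return "crashed"
    wanted = PROCESS_FOR_EVENT.get(expected_state.get("event_type"))
    if wanted and actual_state.get("process") != wanted:
        return "content_mismatch"  # Check 3: wrong player for this event type
    return "ok"

verdict = evaluate_health({"event_type": "video"},
                          {"process": "vlc", "status": "running"})
```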

---

### 3.3 Enhanced Heartbeat

The existing heartbeat topic should be enhanced to include process status.

#### Topic Pattern:
```
infoscreen/{client_uuid}/heartbeat
```

#### Enhanced Payload Format (JSON):
```json
{
  "uuid": "9b8d1856-ff34-4864-a726-12de072d0f77",
  "timestamp": "2026-03-10T07:30:00Z",
  "current_process": "vlc",
  "process_pid": 1234,
  "process_status": "running",
  "current_event_id": 42
}
```

#### New Fields (add to existing heartbeat):
| Field | Type | Required | Description |
|-------|------|----------|-------------|
| `current_process` | string | No | Name of active media player process |
| `process_pid` | integer | No | Process ID |
| `process_status` | string | No | `running`, `crashed`, `starting`, `stopped` |
| `current_event_id` | integer | No | Event ID currently being displayed |

#### Sending Frequency:
- Keep existing interval: **every 60 seconds**
- Include the new fields if available

---

## 4. Process Monitoring Requirements

### 4.1 Processes to Monitor

| Media Type | Process Name | How to Detect |
|------------|--------------|---------------|
| Video | `vlc` | `ps aux \| grep vlc` or `pgrep vlc` |
| Website/WebUntis | `chromium` or `chromium-browser` | `pgrep chromium` |
| PDF Presentation | `evince`, `okular`, or custom viewer | `pgrep {viewer_name}` |

### 4.2 Monitoring Checks (Every 5 seconds)

#### Check 1: Process Alive
```
Goal: Verify expected process is running
Method:
- Get list of running processes (psutil or `ps`)
- Check if expected process name exists
- Match PID if known
Result:
- If missing → status = "crashed"
- If found → status = "running"
Action on crash:
- Send ERROR log immediately
- Attempt restart (max 3 attempts)
- Send WARN log on each restart
- If max restarts exceeded → send ERROR log, display fallback
```
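A minimal stdlib-only sketch of the PID half of Check 1: on POSIX, `os.kill(pid, 0)` delivers no signal but raises if the process does not exist. A real watchdog would more likely match by process name via psutil, as the check describes.

```python
import os

def pid_alive(pid):
    """Probe a PID with signal 0: nothing is delivered, but the call
    raises ProcessLookupError if the process is gone."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # process exists but belongs to another user
    return True

status = "running" if pid_alive(os.getpid()) else "crashed"
```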

#### Check 2: Process Responsive
```
Goal: Detect frozen processes
Method:
- For VLC: Query HTTP interface (status.json)
- For Chromium: Use DevTools Protocol (CDP)
- For custom viewers: Check last screen update time
Result:
- If same frame >30 seconds → likely frozen
- If playback position not advancing → frozen
Action on freeze:
- Send WARN log
- Force refresh (reload page, seek video, next slide)
- If refresh fails → restart process
```
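The "position not advancing" rule can be implemented as a tiny stateful detector. Sketch only; timestamps are injected so the logic is testable without a real clock or player, and `FreezeDetector` is an illustrative name.

```python
class FreezeDetector:
    """Flag a player as frozen when its playback position stops advancing."""

    def __init__(self, threshold_seconds=30.0):
        self.threshold = threshold_seconds
        self.last_position = None
        self.last_change_at = None

    def update(self, position, now):
        """Record a position sample; return True if frozen."""
        if self.last_position is None or position != self.last_position:
            self.last_position = position
            self.last_change_at = now
            return False
        return (now - self.last_change_at) > self.threshold

det = FreezeDetector()
det.update(10.0, now=0.0)            # first sample
stuck = det.update(10.0, now=45.0)   # same position 45s later → frozen
```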

#### Check 3: Content Match
```
Goal: Verify correct content is displayed
Method:
- Compare expected event_id with actual media/URL
- Check scheduled time window (is event still active?)
Result:
- Mismatch → content error
Action:
- Send WARN log
- Reload correct event from scheduler
```

---

## 5. Process Control Interface Requirements

### 5.1 VLC Control

**Requirement:** Enable the VLC HTTP interface for monitoring

**Launch Command:**
```bash
vlc --intf http --http-host 127.0.0.1 --http-port 8080 --http-password "vlc_password" \
    --fullscreen --loop /path/to/video.mp4
```

**Status Query:**
```bash
curl http://127.0.0.1:8080/requests/status.json --user ":vlc_password"
```

**Response Fields to Monitor:**
```json
{
  "state": "playing",   // "playing", "paused", "stopped"
  "position": 0.25,     // 0.0-1.0 (25% through)
  "time": 45,           // seconds into playback
  "length": 180,        // total duration in seconds
  "volume": 256         // 0-512
}
```
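A small sketch translating these status.json fields into the `actual_state` shape from section 3.2, assuming the fields shown above are present (`position` is a 0.0-1.0 fraction of `length`). `parse_vlc_status` is a hypothetical helper name.

```python
def parse_vlc_status(status):
    """Derive watchdog-relevant values from VLC's status.json fields."""
    length = status.get("length", 0) or 0
    return {
        "status": "running" if status.get("state") == "playing" else "stopped",
        "position": round(status.get("position", 0.0) * length, 1),  # seconds
        "duration": float(length),
    }

actual = parse_vlc_status(
    {"state": "playing", "position": 0.25, "time": 45, "length": 180})
```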

---

### 5.2 Chromium Control

**Requirement:** Enable the Chrome DevTools Protocol (CDP)

**Launch Command:**
```bash
chromium --remote-debugging-port=9222 --kiosk --app=https://example.com
```

**Status Query:**
```bash
curl http://127.0.0.1:9222/json
```

**Response Fields to Monitor:**
```json
[
  {
    "url": "https://example.com",
    "title": "Page Title",
    "type": "page"
  }
]
```

**Advanced:** Use the CDP WebSocket for events (page load, navigation, errors)

---

### 5.3 PDF Viewer (Custom or Standard)

**Option A: Standard Viewer (e.g., Evince)**
- No built-in API
- Monitor via process check + screenshot comparison

**Option B: Custom Python Viewer**
- Implement a REST API for status queries
- Track: current page, total pages, last transition time

---

## 6. Watchdog Service Architecture

### 6.1 Service Components

**Component 1: Process Monitor Thread**
```
Responsibilities:
- Check process alive every 5 seconds
- Detect crashes and frozen processes
- Attempt automatic restart
- Send health metrics via MQTT

State Machine:
IDLE → STARTING → RUNNING → (if crash) → RESTARTING → RUNNING
                          → (if max restarts) → FAILED
```

**Component 2: MQTT Publisher Thread**
```
Responsibilities:
- Maintain MQTT connection
- Send heartbeat every 60 seconds
- Send logs on demand (queued from other components)
- Send health metrics every 5 seconds
- Reconnect on connection loss
```

**Component 3: Event Manager Integration**
```
Responsibilities:
- Receive event schedule from server
- Notify watchdog of expected process/content
- Launch media player processes
- Handle event transitions
```

### 6.2 Service Lifecycle

**On Startup:**
1. Load configuration (client UUID, MQTT broker, etc.)
2. Connect to MQTT broker
3. Send INFO log: "Watchdog service started"
4. Wait for first event from scheduler

**During Operation:**
1. Monitor loop runs every 5 seconds
2. Check expected vs actual process state
3. Send health metrics
4. Handle failures (log + restart)

**On Shutdown:**
1. Send INFO log: "Watchdog service stopping"
2. Gracefully stop monitored processes
3. Disconnect from MQTT
4. Exit cleanly

---

## 7. Auto-Recovery Logic

### 7.1 Restart Strategy

**Step 1: Detect Failure**
```
Trigger: Process not found in process list
Action:
- Log ERROR: "Process {name} crashed"
- Increment restart counter
- Check if within retry limit (max 3)
```

**Step 2: Attempt Restart**
```
If restart_attempts < MAX_RESTARTS:
- Log WARN: "Attempting restart ({attempt}/{MAX_RESTARTS})"
- Kill any zombie processes
- Wait 2 seconds (cooldown)
- Launch process with same parameters
- Wait 5 seconds for startup
- Verify process is running
- If success: reset restart counter, log INFO
- If fail: increment counter, repeat
```

**Step 3: Permanent Failure**
```
If restart_attempts >= MAX_RESTARTS:
- Log ERROR: "Max restart attempts exceeded, failing over"
- Display fallback content (static image with error message)
- Send notification to server (separate alert topic, optional)
- Wait for manual intervention or scheduler event change
```

### 7.2 Restart Cooldown

**Purpose:** Prevent rapid restart loops that waste resources

**Implementation:**
```
After each restart attempt:
- Wait 2 seconds before next restart
- After 3 failures: wait 30 seconds before trying again
- Reset counter on successful run >5 minutes
```
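The cooldown rules above boil down to a small amount of counter state. A sketch with illustrative names (`RestartController`, `next_delay`); the delay is returned rather than slept so the logic stays testable.

```python
class RestartController:
    """Track restart attempts with the cooldown rules above."""

    MAX_RESTARTS = 3
    COOLDOWN = 2         # seconds between attempts
    LONG_COOLDOWN = 30   # seconds after exhausting the retry limit
    STABLE_SECONDS = 5 * 60

    def __init__(self):
        self.attempts = 0

    def next_delay(self):
        """Record an attempt and return how long to wait before it."""
        self.attempts += 1
        if self.attempts <= self.MAX_RESTARTS:
            return self.COOLDOWN
        return self.LONG_COOLDOWN

    def record_uptime(self, seconds):
        """Reset the counter after a stable run (>5 minutes)."""
        if seconds > self.STABLE_SECONDS:
            self.attempts = 0

ctl = RestartController()
delays = [ctl.next_delay() for _ in range(4)]  # [2, 2, 2, 30]
```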

---

## 8. Resource Monitoring

### 8.1 System Metrics to Track

**CPU Usage:**
```
Method:    Read /proc/stat or use psutil.cpu_percent()
Frequency: Every 5 seconds
Threshold: Warn if >80% for >60 seconds
```

**Memory Usage:**
```
Method:    Read /proc/meminfo or use psutil.virtual_memory()
Frequency: Every 5 seconds
Threshold: Warn if >90% for >30 seconds
```

**Display Status:**
```
Method:    Check DPMS state via xset query
Frequency: Every 30 seconds
Threshold: Error if display off (unexpected)
```

**Network Connectivity:**
```
Method:    Ping server or check MQTT connection
Frequency: Every 60 seconds
Threshold: Warn if no server connectivity
```
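All of the "warn if above X for Y seconds" thresholds share one pattern: the metric must stay over the limit for a minimum duration. A reusable sketch with injected timestamps (`SustainedThreshold` is an illustrative name):

```python
class SustainedThreshold:
    """Fire only when a metric stays above a limit for a minimum duration."""

    def __init__(self, limit, hold_seconds):
        self.limit = limit
        self.hold = hold_seconds
        self.exceeded_since = None

    def sample(self, value, now):
        """Feed one metric sample; return True when the warn should fire."""
        if value <= self.limit:
            self.exceeded_since = None  # back below the limit: reset
            return False
        if self.exceeded_since is None:
            self.exceeded_since = now
        return (now - self.exceeded_since) > self.hold

# CPU rule from above: warn if >80% for >60 seconds
cpu = SustainedThreshold(limit=80.0, hold_seconds=60.0)
cpu.sample(85.0, now=0)           # above limit, not yet sustained
warn = cpu.sample(85.0, now=61)   # sustained for >60s
```

The same class covers the memory rule (`limit=90.0, hold_seconds=30.0`) with a second instance.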

---

## 9. Development vs Production Mode

### 9.1 Development Mode

**Enable via:** Environment variable `DEBUG=true` or `ENV=development`

**Behavior:**
- Send INFO level logs
- More verbose logging to console
- Shorter monitoring intervals (faster feedback)
- Screenshot capture every 30 seconds
- No rate limiting on logs

### 9.2 Production Mode

**Enable via:** `ENV=production`

**Behavior:**
- Send only ERROR and WARN logs
- Minimal console output
- Standard monitoring intervals
- Screenshot capture every 60 seconds
- Rate limiting: max 10 logs per minute per level
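The production rate limit can be sketched as a sliding one-minute window per level. The clock is injected for testability; a real client would pass `time.time()`, and `LogRateLimiter` is an illustrative name.

```python
from collections import deque

class LogRateLimiter:
    """Drop logs beyond N per sliding minute, per level."""

    def __init__(self, max_per_minute=10):
        self.max = max_per_minute
        self.sent = {}  # level -> deque of send timestamps

    def allow(self, level, now):
        window = self.sent.setdefault(level, deque())
        while window and now - window[0] >= 60.0:
            window.popleft()  # forget sends older than one minute
        if len(window) >= self.max:
            return False
        window.append(now)
        return True

rl = LogRateLimiter(max_per_minute=10)
decisions = [rl.allow("warn", now=float(i)) for i in range(12)]
```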

---

## 10. Configuration File Format

### 10.1 Recommended Config: JSON

**File:** `/etc/infoscreen/config.json` or `~/.config/infoscreen/config.json`

```json
{
  "client": {
    "uuid": "9b8d1856-ff34-4864-a726-12de072d0f77",
    "hostname": "infoscreen-room-101"
  },
  "mqtt": {
    "broker": "192.168.43.201",
    "port": 1883,
    "username": "",
    "password": "",
    "keepalive": 60
  },
  "monitoring": {
    "enabled": true,
    "health_interval_seconds": 5,
    "heartbeat_interval_seconds": 60,
    "max_restart_attempts": 3,
    "restart_cooldown_seconds": 2
  },
  "logging": {
    "level": "INFO",
    "send_info_logs": false,
    "console_output": true,
    "local_log_file": "/var/log/infoscreen/watchdog.log"
  },
  "processes": {
    "vlc": {
      "http_port": 8080,
      "http_password": "vlc_password"
    },
    "chromium": {
      "debug_port": 9222
    }
  }
}
```
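Loading this file is simplest when missing keys fall back to defaults, so a partial config still works. A sketch with defaults mirroring the `monitoring` section above; `load_config` is a hypothetical helper, not part of the spec.

```python
import json

# Defaults mirror the "monitoring" section of the config above
DEFAULTS = {
    "monitoring": {
        "enabled": True,
        "health_interval_seconds": 5,
        "heartbeat_interval_seconds": 60,
        "max_restart_attempts": 3,
        "restart_cooldown_seconds": 2,
    }
}

def load_config(text):
    """Parse config JSON and fill in missing monitoring keys with defaults."""
    config = json.loads(text)
    for section, defaults in DEFAULTS.items():
        merged = dict(defaults)
        merged.update(config.get(section, {}))
        config[section] = merged
    return config

cfg = load_config(
    '{"client": {"uuid": "abc"}, "monitoring": {"max_restart_attempts": 5}}')
```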

---

## 11. Error Scenarios & Expected Behavior

### Scenario 1: VLC Crashes Mid-Video
```
1. Watchdog detects: process_status = "crashed"
2. Send ERROR log: "VLC process crashed"
3. Attempt 1: Restart VLC with same video, seek to last position
4. If success: Send INFO log "VLC restarted successfully"
5. If fail: Repeat 2 more times
6. After 3 failures: Send ERROR "Max restarts exceeded", show fallback
```

### Scenario 2: Network Timeout Loading Website
```
1. Chromium fails to load page (CDP reports error)
2. Send WARN log: "Page load timeout"
3. Attempt reload (Chromium refresh)
4. If success after 10s: Continue monitoring
5. If timeout again: Send ERROR, try restarting Chromium
```

### Scenario 3: Display Powers Off (Hardware)
```
1. DPMS check detects display off
2. Send ERROR log: "Display powered off"
3. Attempt to wake display (xset dpms force on)
4. If success: Send INFO log
5. If fail: Hardware issue, alert admin
```

### Scenario 4: High CPU Usage
```
1. CPU >80% for 60 seconds
2. Send WARN log: "High CPU usage: 85%"
3. Check if expected (e.g., video playback is normal)
4. If unexpected: investigate process causing it
5. If critical (>95%): consider restarting offending process
```

---
|
||||||
|
|
||||||
|
## 12. Testing & Validation
|
||||||
|
|
||||||
|
### 12.1 Manual Tests (During Development)
|
||||||
|
|
||||||
|
**Test 1: Process Crash Simulation**
|
||||||
|
```bash
|
||||||
|
# Start video, then kill VLC manually
|
||||||
|
killall vlc
|
||||||
|
# Expected: ERROR log sent, automatic restart within 5 seconds
|
||||||
|
```
|
||||||
|
|
||||||
|
**Test 2: MQTT Connectivity**
|
||||||
|
```bash
|
||||||
|
# Subscribe to all client topics on server
|
||||||
|
mosquitto_sub -h 192.168.43.201 -t "infoscreen/{uuid}/#" -v
|
||||||
|
# Expected: See heartbeat every 60s, health every 5s
|
||||||
|
```
|
||||||
|
|
||||||
|
**Test 3: Log Levels**
```bash
# Trigger error condition and verify log appears in database
curl http://192.168.43.201:8000/api/client-logs/test
# Expected: See new log entry with correct level/message
```

### 12.2 Acceptance Criteria

✅ **Client must:**
1. Send heartbeat every 60 seconds without gaps
2. Send ERROR log within 5 seconds of process crash
3. Attempt automatic restart (max 3 times)
4. Report health metrics every 5 seconds
5. Survive MQTT broker restart (reconnect automatically)
6. Survive network interruption (buffer logs, send when reconnected)
7. Use correct timestamp format (ISO 8601 UTC)
8. Only send logs for real client UUID (FK constraint)

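Criterion 6 (buffer logs while offline, flush on reconnect) can be sketched with a bounded queue. `publish` here is an assumed `callable(topic, payload_json) -> bool` wrapper around the MQTT client, not a paho-mqtt API:

```python
import json
from collections import deque

class BufferedLogSender:
    """Buffers log payloads while the connection is down and flushes them
    in order on reconnect; oldest entries drop first when the buffer fills."""

    def __init__(self, publish, maxlen: int = 500):
        self.publish = publish
        self.pending = deque(maxlen=maxlen)
        self.connected = False

    def send(self, topic: str, payload: dict):
        encoded = json.dumps(payload)
        if self.connected and self.publish(topic, encoded):
            return
        self.pending.append((topic, encoded))  # queue for later delivery

    def on_reconnect(self):
        self.connected = True
        while self.pending:
            topic, encoded = self.pending[0]
            if not self.publish(topic, encoded):
                break  # still offline; keep remaining entries
            self.pending.popleft()
```

Wire `on_reconnect` into the MQTT client's on_connect callback so buffered entries drain as soon as the broker is reachable again.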
---

## 13. Python Libraries (Recommended)

**For process monitoring:**
- `psutil` - Cross-platform process and system utilities

**For MQTT:**
- `paho-mqtt` - Official MQTT client (use v2.x with Callback API v2)

**For VLC control:**
- `requests` - HTTP client for status queries

**For Chromium control:**
- `websocket-client` or `pychrome` - Chrome DevTools Protocol

**For datetime:**
- `datetime` (stdlib) - Use `datetime.now(timezone.utc).isoformat()`

**Example requirements.txt:**
```
paho-mqtt>=2.0.0
psutil>=5.9.0
requests>=2.31.0
python-dateutil>=2.8.0
```

---

## 14. Security Considerations

### 14.1 MQTT Security
- If broker requires auth, store credentials in config file with restricted permissions (`chmod 600`)
- Consider TLS/SSL for MQTT (port 8883) if on untrusted network
- Use unique client ID to prevent impersonation

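The `chmod 600` rule above can be enforced at load time rather than trusted. A minimal sketch; the `username:password` one-line file format is an assumption for illustration, not a project convention:

```python
import os
import stat

def load_mqtt_credentials(path: str) -> dict:
    """Refuse to read a credentials file that group/other users can access."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    if mode & (stat.S_IRWXG | stat.S_IRWXO):
        raise PermissionError(f"{path} must be chmod 600 (found {oct(mode)})")
    with open(path) as f:
        username, _, password = f.read().strip().partition(":")
    return {"username": username, "password": password}
```

Failing loudly on a world-readable file turns a silent misconfiguration into a visible startup error.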
### 14.2 Process Control APIs
- VLC HTTP password should be random, not default
- Chromium debug port should bind to `127.0.0.1` only (not `0.0.0.0`)
- Restrict file system access for media player processes

### 14.3 Log Content
- **Do not log:** Passwords, API keys, personal data
- **Sanitize:** File paths (strip user directories), URLs (remove query params with tokens)

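The URL sanitization rule can be sketched with the stdlib URL helpers; the parameter denylist is an illustrative assumption and should be tuned to the sites actually displayed:

```python
from urllib.parse import urlsplit, urlunsplit

SENSITIVE_PARAMS = {"token", "key", "password", "secret", "auth"}  # assumed denylist

def sanitize_url(url: str) -> str:
    """Drop query parameters whose names look like credentials before a URL
    is written into a log message."""
    parts = urlsplit(url)
    kept = [pair for pair in parts.query.split("&")
            if pair and pair.split("=", 1)[0].lower() not in SENSITIVE_PARAMS]
    return urlunsplit(parts._replace(query="&".join(kept)))
```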
---

## 15. Performance Targets

| Metric | Target | Acceptable | Critical |
|--------|--------|------------|----------|
| Health check interval | 5s | 10s | 30s |
| Crash detection time | <5s | <10s | <30s |
| Restart time | <10s | <20s | <60s |
| MQTT publish latency | <100ms | <500ms | <2s |
| CPU usage (watchdog) | <2% | <5% | <10% |
| RAM usage (watchdog) | <50MB | <100MB | <200MB |
| Log message size | <1KB | <10KB | <100KB |

---

## 16. Troubleshooting Guide (For Client Development)

### Issue: Logs not appearing in server database
**Check:**
1. Is MQTT broker reachable? (`mosquitto_pub` test from client)
2. Is client UUID correct and exists in `clients` table?
3. Is timestamp format correct (ISO 8601 with 'Z')?
4. Check server listener logs for errors

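Check 3 can be automated on the client before publishing. A small validator, assuming the listener's behavior of accepting both the `Z` and `+00:00` offset forms:

```python
from datetime import datetime, timezone

def is_valid_utc_timestamp(ts: str) -> bool:
    """True if `ts` parses as ISO 8601 with an explicit UTC offset
    ('Z' or '+00:00')."""
    try:
        parsed = datetime.fromisoformat(ts.replace("Z", "+00:00"))
    except (ValueError, AttributeError):
        return False
    # Reject naive timestamps and non-UTC offsets
    return parsed.tzinfo is not None and parsed.utcoffset() == timezone.utc.utcoffset(None)
```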
### Issue: Health metrics not updating
**Check:**
1. Is health loop running? (check watchdog service status)
2. Is MQTT connected? (check connection status in logs)
3. Is payload JSON valid? (use JSON validator)

### Issue: Process restarts in loop
**Check:**
1. Is media file/URL accessible?
2. Is process command correct? (test manually)
3. Check process exit code (crash reason)
4. Increase restart cooldown to avoid rapid loops

---

## 17. Complete Message Flow Diagram

```
┌──────────────────────────────────────────────────────────┐
│                    Infoscreen Client                     │
│                                                          │
│  Event Occurs:                                           │
│   - Process crashed                                      │
│   - High CPU usage                                       │
│   - Content loaded                                       │
│                                                          │
│        ┌────────────────┐                                │
│        │ Decision Logic │                                │
│        │ - Is it ERROR? │                                │
│        │ - Is it WARN?  │                                │
│        │ - Is it INFO?  │                                │
│        └────────┬───────┘                                │
│                 │                                        │
│                 ▼                                        │
│  ┌──────────────────────────────────────┐                │
│  │ Build JSON Payload                   │                │
│  │ {                                    │                │
│  │   "timestamp": "...",                │                │
│  │   "message": "...",                  │                │
│  │   "context": {...}                   │                │
│  │ }                                    │                │
│  └──────────────┬───────────────────────┘                │
│                 │                                        │
│                 ▼                                        │
│  ┌──────────────────────────────────────┐                │
│  │ MQTT Publish                         │                │
│  │ Topic: infoscreen/{uuid}/logs/error  │                │
│  │ QoS: 1                               │                │
│  └──────────────┬───────────────────────┘                │
└─────────────────┼────────────────────────────────────────┘
                  │
                  │ TCP/IP (MQTT Protocol)
                  │
                  ▼
          ┌──────────────┐
          │ MQTT Broker  │
          │ (Mosquitto)  │
          └──────┬───────┘
                 │
                 │ Topic: infoscreen/+/logs/#
                 │
                 ▼
┌──────────────────────────────┐
│ Listener Service             │
│ (Python)                     │
│                              │
│ - Parse JSON                 │
│ - Validate UUID              │
│ - Store in database          │
└──────┬───────────────────────┘
       │
       ▼
┌──────────────────────────────┐
│ MariaDB Database             │
│                              │
│ Table: client_logs           │
│  - client_uuid               │
│  - timestamp                 │
│  - level                     │
│  - message                   │
│  - context (JSON)            │
└──────┬───────────────────────┘
       │
       │ SQL Query
       │
       ▼
┌──────────────────────────────────┐
│ API Server (Flask)               │
│                                  │
│ GET /api/client-logs/{uuid}/logs │
│ GET /api/client-logs/summary     │
└──────┬───────────────────────────┘
       │
       │ HTTP/JSON
       │
       ▼
┌──────────────────────────────┐
│ Dashboard (React)            │
│                              │
│ - Display logs               │
│ - Filter by level            │
│ - Show health status         │
└──────────────────────────────┘
```

---

## 18. Quick Reference Card

### MQTT Topics Summary
```
infoscreen/{uuid}/logs/error   → Critical failures
infoscreen/{uuid}/logs/warn    → Non-critical issues
infoscreen/{uuid}/logs/info    → Informational (dev mode)
infoscreen/{uuid}/health       → Health metrics (every 5s)
infoscreen/{uuid}/heartbeat    → Enhanced heartbeat (every 60s)
```

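The topic scheme above combines with the JSON payload and the tiered QoS strategy (ERROR/WARN at QoS 1, INFO at QoS 0) into one publish per event. A sketch of assembling a single message; the function name is illustrative:

```python
import json
from datetime import datetime, timezone

def build_log_message(client_uuid, level, message, context=None):
    """Return (topic, json_payload, qos) for one log publish."""
    topic = f"infoscreen/{client_uuid}/logs/{level}"
    payload = json.dumps({
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "message": message,
        "context": context or {},
    })
    qos = 1 if level in ("error", "warn") else 0  # tiered delivery guarantees
    return topic, payload, qos
```

The returned tuple feeds directly into `mqtt_client.publish(topic, payload, qos=qos)`.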
### JSON Timestamp Format
```python
from datetime import datetime, timezone
timestamp = datetime.now(timezone.utc).isoformat()
# Output: "2026-03-10T07:30:00+00:00" (the equivalent "2026-03-10T07:30:00Z" form is also accepted)
```

### Process Status Values
```
"running"  - Process is alive and responding
"crashed"  - Process terminated unexpectedly
"starting" - Process is launching (startup phase)
"stopped"  - Process intentionally stopped
```

### Restart Logic
```
Max attempts: 3
Cooldown: 2 seconds between attempts
Reset: After 5 minutes of successful operation
```

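The restart rules above can be captured in a small policy object; `clock` is injectable so the 5-minute reset is testable. The class name is illustrative:

```python
import time

class RestartPolicy:
    """Max 3 attempts; the counter resets after 5 minutes without a failure.
    The caller is expected to sleep COOLDOWN_S between attempts."""
    MAX_ATTEMPTS = 3
    COOLDOWN_S = 2
    RESET_AFTER_S = 300

    def __init__(self, clock=time.monotonic):
        self.clock = clock
        self.attempts = 0
        self.last_failure = None

    def allow_restart(self) -> bool:
        """Called on crash; True if another restart may be attempted."""
        now = self.clock()
        if self.last_failure is not None and now - self.last_failure >= self.RESET_AFTER_S:
            self.attempts = 0  # stable for 5 minutes since the last failure
        if self.attempts >= self.MAX_ATTEMPTS:
            return False
        self.attempts += 1
        self.last_failure = now
        return True
```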
---

## 19. Contact & Support

**Server API Documentation:**
- Base URL: `http://192.168.43.201:8000`
- Health check: `GET /health`
- Test logs: `GET /api/client-logs/test` (no auth)
- Full API docs: See `CLIENT_MONITORING_IMPLEMENTATION_GUIDE.md` on server

**MQTT Broker:**
- Host: `192.168.43.201`
- Port: `1883` (standard), `9001` (WebSocket)
- Test tool: `mosquitto_pub` / `mosquitto_sub`

**Database Schema:**
- Table: `client_logs`
- Foreign Key: `client_uuid` → `clients.uuid` (ON DELETE CASCADE)
- Constraint: UUID must exist in clients table before logging

**Server-Side Logs:**
```bash
# View listener logs (processes MQTT messages)
docker compose logs -f listener

# View server logs (API requests)
docker compose logs -f server
```

---

## 20. Appendix: Example Implementations

### A. Minimal Python Watchdog (Pseudocode)

```python
import time
import json
import subprocess
import psutil
import paho.mqtt.client as mqtt
from datetime import datetime, timezone

class MinimalWatchdog:
    def __init__(self, client_uuid, mqtt_broker):
        self.uuid = client_uuid
        self.mqtt_client = mqtt.Client(callback_api_version=mqtt.CallbackAPIVersion.VERSION2)
        self.mqtt_client.connect(mqtt_broker, 1883, 60)
        self.mqtt_client.loop_start()

        self.expected_process = None
        self.restart_command = None  # e.g. ["vlc", "--fullscreen", "video.mp4"]
        self.restart_attempts = 0
        self.MAX_RESTARTS = 3

    def send_log(self, level, message, context=None):
        topic = f"infoscreen/{self.uuid}/logs/{level}"
        payload = {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "message": message,
            "context": context or {}
        }
        self.mqtt_client.publish(topic, json.dumps(payload), qos=1)

    def is_process_running(self, process_name):
        for proc in psutil.process_iter(['name']):
            name = proc.info['name'] or ""  # name can be None for some processes
            if process_name in name:
                return True
        return False

    def restart_process(self):
        # Minimal restart: relaunch the configured command and log the attempt
        self.restart_attempts += 1
        subprocess.Popen(self.restart_command)
        self.send_log("info", f"Restart attempt {self.restart_attempts}")

    def monitor_loop(self):
        while True:
            if self.expected_process:
                if not self.is_process_running(self.expected_process):
                    self.send_log("error", f"{self.expected_process} crashed")
                    if self.restart_attempts < self.MAX_RESTARTS:
                        self.restart_process()
                    else:
                        self.send_log("error", "Max restarts exceeded")
            time.sleep(5)

# Usage:
watchdog = MinimalWatchdog("9b8d1856-ff34-4864-a726-12de072d0f77", "192.168.43.201")
watchdog.expected_process = "vlc"
watchdog.restart_command = ["vlc", "--fullscreen"]
watchdog.monitor_loop()
```

---

**END OF SPECIFICATION**

Questions? Refer to:
- `CLIENT_MONITORING_IMPLEMENTATION_GUIDE.md` (server repo)
- Server API: `http://192.168.43.201:8000/api/client-logs/test`
- MQTT test: `mosquitto_sub -h 192.168.43.201 -t infoscreen/#`
@@ -56,6 +56,51 @@ Notes for integrators:
- CSS follows modern Material 3 color-function notation (`rgb(r g b / alpha%)`)
- Syncfusion ScheduleComponent requires TimelineViews, Resize, and DragAndDrop modules injected

Backend technical work (post-release notes; no version bump):
- 📊 **Client Monitoring Infrastructure (Server-Side) (2026-03-10)**:
  - Database schema: New Alembic migration `c1d2e3f4g5h6_add_client_monitoring.py` (idempotent) adds:
    - `client_logs` table: Stores centralized logs with columns (id, client_uuid, timestamp, level, message, context, created_at)
    - Foreign key: `client_logs.client_uuid` → `clients.uuid` (ON DELETE CASCADE)
    - Health monitoring columns added to `clients` table: `current_event_id`, `current_process`, `process_status`, `process_pid`, `last_screenshot_analyzed`, `screen_health_status`, `last_screenshot_hash`
    - Indexes for performance: (client_uuid, timestamp DESC), (level, timestamp DESC), (created_at DESC)
  - Data models (`models/models.py`):
    - New enums: `LogLevel` (ERROR, WARN, INFO, DEBUG), `ProcessStatus` (running, crashed, starting, stopped), `ScreenHealthStatus` (OK, BLACK, FROZEN, UNKNOWN)
    - New model: `ClientLog` with foreign key to `Client` (CASCADE on delete)
    - Extended `Client` model with 7 health monitoring fields
  - MQTT listener extensions (`listener/listener.py`):
    - New topic subscriptions: `infoscreen/+/logs/error`, `infoscreen/+/logs/warn`, `infoscreen/+/logs/info`, `infoscreen/+/health`
    - Log handler: Parses JSON payloads, creates `ClientLog` entries, validates client UUID exists (FK constraint)
    - Health handler: Updates client state from MQTT health messages
    - Enhanced heartbeat handler: Captures `process_status`, `current_process`, `process_pid`, `current_event_id` from payload
  - API endpoints (`server/routes/client_logs.py`):
    - `GET /api/client-logs/<uuid>/logs` – Retrieve client logs with filters (level, limit, since); authenticated (admin_or_higher)
    - `GET /api/client-logs/summary` – Get log counts by level per client for last 24h; authenticated (admin_or_higher)
    - `GET /api/client-logs/recent-errors` – System-wide error monitoring; authenticated (admin_or_higher)
    - `GET /api/client-logs/test` – Infrastructure validation endpoint (no auth required)
    - Blueprint registered in `server/wsgi.py` as `client_logs_bp`
  - Dev environment fix: Updated `docker-compose.override.yml` listener service to use `working_dir: /workspace` and direct command path for live code reload
- 📡 **MQTT Protocol Extensions**:
  - New log topics: `infoscreen/{uuid}/logs/{error|warn|info}` with JSON payload (timestamp, message, context)
  - New health topic: `infoscreen/{uuid}/health` with metrics (expected_state, actual_state, health_metrics)
  - Enhanced heartbeat: `infoscreen/{uuid}/heartbeat` now includes `current_process`, `process_pid`, `process_status`, `current_event_id`
  - QoS levels: ERROR/WARN logs use QoS 1 (at least once), INFO/health use QoS 0 (fire and forget)
- 📖 **Documentation**:
  - New file: `CLIENT_MONITORING_SPECIFICATION.md` – Comprehensive 20-section technical spec for client-side implementation (MQTT protocol, process monitoring, auto-recovery, payload formats, testing guide)
  - New file: `CLIENT_MONITORING_IMPLEMENTATION_GUIDE.md` – 5-phase implementation guide (database, backend, client watchdog, dashboard UI, testing)
  - Updated `.github/copilot-instructions.md`: Added MQTT topics section, client monitoring integration notes
- ✅ **Validation**:
  - End-to-end testing completed: MQTT message → listener → database → API confirmed working
  - Test flow: Published message to `infoscreen/{real-uuid}/logs/error` → listener logs showed receipt → database stored entry → test API returned log data
  - Known client UUIDs validated: 9b8d1856-ff34-4864-a726-12de072d0f77, 7f65c615-5827-4ada-9ac8-4727c2e8ee55, bdbfff95-0b2b-4265-8cc7-b0284509540a

Notes for integrators:
- Tiered logging strategy: ERROR/WARN always centralized (QoS 1), INFO dev-only (QoS 0), DEBUG local-only
- Client-side implementation pending (Phase 3: watchdog service)
- Dashboard UI pending (Phase 4: log viewer and health indicators)
- Foreign key constraint prevents logging for non-existent clients (data integrity enforced)
- Migration is idempotent and can be safely rerun after interruption
- Use `GET /api/client-logs/test` for quick infrastructure validation without authentication

## 2025.1.0-beta.1 (TBD)
- 🔐 **User Management & Role-Based Access Control**:
  - Backend: Implemented comprehensive user management API (`server/routes/users.py`) with 6 endpoints (GET, POST, PUT, DELETE users + password reset).
@@ -18,6 +18,7 @@ services:
    environment:
      - DB_CONN=mysql+pymysql://${DB_USER}:${DB_PASSWORD}@db/${DB_NAME}
      - DB_URL=mysql+pymysql://${DB_USER}:${DB_PASSWORD}@db/${DB_NAME}
      - API_BASE_URL=http://server:8000
      - ENV=${ENV:-development}
      - FLASK_SECRET_KEY=${FLASK_SECRET_KEY:-dev-secret-key-change-in-production}
      - DEFAULT_SUPERADMIN_USERNAME=${DEFAULT_SUPERADMIN_USERNAME:-superadmin}
@@ -7,7 +7,7 @@ import requests
import paho.mqtt.client as mqtt
from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker
from models.models import Client, ClientLog, LogLevel, ProcessStatus
logging.basicConfig(level=logging.DEBUG, format='%(asctime)s [%(levelname)s] %(message)s')

# Load .env in development
@@ -78,7 +78,14 @@ def on_connect(client, userdata, flags, reasonCode, properties):
        client.subscribe("infoscreen/+/heartbeat")
        client.subscribe("infoscreen/+/screenshot")
        client.subscribe("infoscreen/+/dashboard")

        # Subscribe to monitoring topics
        client.subscribe("infoscreen/+/logs/error")
        client.subscribe("infoscreen/+/logs/warn")
        client.subscribe("infoscreen/+/logs/info")
        client.subscribe("infoscreen/+/health")

        logging.info(f"MQTT connected (reasonCode: {reasonCode}); (re)subscribed to discovery, heartbeats, screenshots, dashboards, logs, and health")
    except Exception as e:
        logging.error(f"Subscribe failed on connect: {e}")
@@ -124,16 +131,124 @@ def on_message(client, userdata, msg):
    # Heartbeat-Handling
    if topic.startswith("infoscreen/") and topic.endswith("/heartbeat"):
        uuid = topic.split("/")[1]
        try:
            # Parse payload to get optional health data
            payload_data = json.loads(msg.payload.decode())
        except (json.JSONDecodeError, UnicodeDecodeError):
            payload_data = {}

        session = Session()
        client_obj = session.query(Client).filter_by(uuid=uuid).first()
        if client_obj:
            client_obj.last_alive = datetime.datetime.now(datetime.UTC)

            # Update health fields if present in heartbeat
            if 'process_status' in payload_data:
                try:
                    client_obj.process_status = ProcessStatus[payload_data['process_status']]
                except (KeyError, TypeError):
                    pass

            if 'current_process' in payload_data:
                client_obj.current_process = payload_data.get('current_process')

            if 'process_pid' in payload_data:
                client_obj.process_pid = payload_data.get('process_pid')

            if 'current_event_id' in payload_data:
                client_obj.current_event_id = payload_data.get('current_event_id')

            session.commit()
            logging.info(f"Heartbeat von {uuid} empfangen, last_alive (UTC) aktualisiert.")
        session.close()
        return

    # Log-Handling (ERROR, WARN, INFO)
    if topic.startswith("infoscreen/") and "/logs/" in topic:
        parts = topic.split("/")
        if len(parts) >= 4:
            uuid = parts[1]
            level_str = parts[3].upper()  # 'error', 'warn', 'info' -> 'ERROR', 'WARN', 'INFO'

            try:
                payload_data = json.loads(msg.payload.decode())
                message = payload_data.get('message', '')
                timestamp_str = payload_data.get('timestamp')
                context = payload_data.get('context', {})

                # Parse timestamp or use current time
                if timestamp_str:
                    try:
                        log_timestamp = datetime.datetime.fromisoformat(timestamp_str.replace('Z', '+00:00'))
                        if log_timestamp.tzinfo is None:
                            log_timestamp = log_timestamp.replace(tzinfo=datetime.UTC)
                    except ValueError:
                        log_timestamp = datetime.datetime.now(datetime.UTC)
                else:
                    log_timestamp = datetime.datetime.now(datetime.UTC)

                # Store in database
                session = Session()
                try:
                    log_level = LogLevel[level_str]
                    log_entry = ClientLog(
                        client_uuid=uuid,
                        timestamp=log_timestamp,
                        level=log_level,
                        message=message,
                        context=json.dumps(context) if context else None
                    )
                    session.add(log_entry)
                    session.commit()
                    logging.info(f"[{level_str}] {uuid}: {message}")
                except Exception as e:
                    logging.error(f"Error saving log from {uuid}: {e}")
                    session.rollback()
                finally:
                    session.close()

            except (json.JSONDecodeError, UnicodeDecodeError) as e:
                logging.error(f"Could not parse log payload from {uuid}: {e}")
        return

    # Health-Handling
    if topic.startswith("infoscreen/") and topic.endswith("/health"):
        uuid = topic.split("/")[1]
        try:
            payload_data = json.loads(msg.payload.decode())

            session = Session()
            client_obj = session.query(Client).filter_by(uuid=uuid).first()
            if client_obj:
                # Update expected state
                expected = payload_data.get('expected_state', {})
                if 'event_id' in expected:
                    client_obj.current_event_id = expected['event_id']

                # Update actual state
                actual = payload_data.get('actual_state', {})
                if 'process' in actual:
                    client_obj.current_process = actual['process']

                if 'pid' in actual:
                    client_obj.process_pid = actual['pid']

                if 'status' in actual:
                    try:
                        client_obj.process_status = ProcessStatus[actual['status']]
                    except (KeyError, TypeError):
                        pass

                session.commit()
                logging.debug(f"Health update from {uuid}: {actual.get('process')} ({actual.get('status')})")
            session.close()

        except (json.JSONDecodeError, UnicodeDecodeError) as e:
            logging.error(f"Could not parse health payload from {uuid}: {e}")
        except Exception as e:
            logging.error(f"Error processing health from {uuid}: {e}")
        return

    # Discovery-Handling
    if topic == "infoscreen/discovery":
        payload = json.loads(msg.payload.decode())
@@ -21,6 +21,27 @@ class AcademicPeriodType(enum.Enum):
    trimester = "trimester"


class LogLevel(enum.Enum):
    ERROR = "ERROR"
    WARN = "WARN"
    INFO = "INFO"
    DEBUG = "DEBUG"


class ProcessStatus(enum.Enum):
    running = "running"
    crashed = "crashed"
    starting = "starting"
    stopped = "stopped"


class ScreenHealthStatus(enum.Enum):
    OK = "OK"
    BLACK = "BLACK"
    FROZEN = "FROZEN"
    UNKNOWN = "UNKNOWN"


class User(Base):
    __tablename__ = 'users'
    id = Column(Integer, primary_key=True, autoincrement=True)

@@ -107,6 +128,31 @@ class Client(Base):
    group_id = Column(Integer, ForeignKey(
        'client_groups.id'), nullable=False, default=1)

    # Health monitoring fields
    current_event_id = Column(Integer, nullable=True)
    current_process = Column(String(50), nullable=True)  # 'vlc', 'chromium', 'pdf_viewer'
    process_status = Column(Enum(ProcessStatus), nullable=True)
    process_pid = Column(Integer, nullable=True)
    last_screenshot_analyzed = Column(TIMESTAMP(timezone=True), nullable=True)
    screen_health_status = Column(Enum(ScreenHealthStatus), nullable=True, server_default='UNKNOWN')
    last_screenshot_hash = Column(String(32), nullable=True)


class ClientLog(Base):
    __tablename__ = 'client_logs'
    id = Column(Integer, primary_key=True, autoincrement=True)
    client_uuid = Column(String(36), ForeignKey('clients.uuid', ondelete='CASCADE'), nullable=False, index=True)
    timestamp = Column(TIMESTAMP(timezone=True), nullable=False, index=True)
    level = Column(Enum(LogLevel), nullable=False, index=True)
    message = Column(Text, nullable=False)
    context = Column(Text, nullable=True)  # JSON stored as text
    created_at = Column(TIMESTAMP(timezone=True), server_default=func.current_timestamp(), nullable=False)

    __table_args__ = (
        Index('ix_client_logs_client_timestamp', 'client_uuid', 'timestamp'),
        Index('ix_client_logs_level_timestamp', 'level', 'timestamp'),
    )


class EventType(enum.Enum):
    presentation = "presentation"
@@ -0,0 +1,84 @@
|
|||||||
|
"""add client monitoring tables and columns
|
||||||
|
|
||||||
|
Revision ID: c1d2e3f4g5h6
|
||||||
|
Revises: 4f0b8a3e5c20
|
||||||
|
Create Date: 2026-03-09 21:08:38.000000
|
||||||
|
|
||||||
|
"""
|
||||||
|
from alembic import op
|
||||||
|
import sqlalchemy as sa
|
||||||
|
|
||||||
|
# revision identifiers, used by Alembic.
|
||||||
|
revision = 'c1d2e3f4g5h6'
|
||||||
|
down_revision = '4f0b8a3e5c20'
|
||||||
|
branch_labels = None
|
||||||
|
depends_on = None
|
||||||
|
|
||||||
|
|
||||||
|
def upgrade():
|
||||||
|
bind = op.get_bind()
|
||||||
|
inspector = sa.inspect(bind)
|
||||||
|
|
||||||
|
# 1. Add health monitoring columns to clients table (safe on rerun)
|
||||||
|
existing_client_columns = {c['name'] for c in inspector.get_columns('clients')}
|
||||||
|
if 'current_event_id' not in existing_client_columns:
|
||||||
|
op.add_column('clients', sa.Column('current_event_id', sa.Integer(), nullable=True))
|
||||||
|
if 'current_process' not in existing_client_columns:
|
||||||
|
op.add_column('clients', sa.Column('current_process', sa.String(50), nullable=True))
|
||||||
|
if 'process_status' not in existing_client_columns:
|
||||||
|
op.add_column('clients', sa.Column('process_status', sa.Enum('running', 'crashed', 'starting', 'stopped', name='processstatus'), nullable=True))
|
||||||
|
if 'process_pid' not in existing_client_columns:
|
||||||
|
op.add_column('clients', sa.Column('process_pid', sa.Integer(), nullable=True))
|
||||||
|
if 'last_screenshot_analyzed' not in existing_client_columns:
|
||||||
|
op.add_column('clients', sa.Column('last_screenshot_analyzed', sa.TIMESTAMP(timezone=True), nullable=True))
|
||||||
|
if 'screen_health_status' not in existing_client_columns:
|
||||||
|
op.add_column('clients', sa.Column('screen_health_status', sa.Enum('OK', 'BLACK', 'FROZEN', 'UNKNOWN', name='screenhealthstatus'), nullable=True, server_default='UNKNOWN'))
|
||||||
|
if 'last_screenshot_hash' not in existing_client_columns:
|
||||||
|
    op.add_column('clients', sa.Column('last_screenshot_hash', sa.String(32), nullable=True))

    # 2. Create client_logs table (safe on rerun)
    if not inspector.has_table('client_logs'):
        op.create_table(
            'client_logs',
            sa.Column('id', sa.Integer(), autoincrement=True, nullable=False),
            sa.Column('client_uuid', sa.String(36), nullable=False),
            sa.Column('timestamp', sa.TIMESTAMP(timezone=True), nullable=False),
            sa.Column('level', sa.Enum('ERROR', 'WARN', 'INFO', 'DEBUG', name='loglevel'), nullable=False),
            sa.Column('message', sa.Text(), nullable=False),
            sa.Column('context', sa.JSON(), nullable=True),
            sa.Column('created_at', sa.TIMESTAMP(timezone=True), server_default=sa.func.current_timestamp(), nullable=False),
            sa.PrimaryKeyConstraint('id'),
            sa.ForeignKeyConstraint(['client_uuid'], ['clients.uuid'], ondelete='CASCADE'),
            mysql_charset='utf8mb4',
            mysql_collate='utf8mb4_unicode_ci',
            mysql_engine='InnoDB'
        )

    # 3. Create indexes for efficient querying (safe on rerun)
    client_log_indexes = {idx['name'] for idx in inspector.get_indexes('client_logs')} if inspector.has_table('client_logs') else set()
    client_indexes = {idx['name'] for idx in inspector.get_indexes('clients')}

    if 'ix_client_logs_client_timestamp' not in client_log_indexes:
        op.create_index('ix_client_logs_client_timestamp', 'client_logs', ['client_uuid', 'timestamp'])
    if 'ix_client_logs_level_timestamp' not in client_log_indexes:
        op.create_index('ix_client_logs_level_timestamp', 'client_logs', ['level', 'timestamp'])
    if 'ix_clients_process_status' not in client_indexes:
        op.create_index('ix_clients_process_status', 'clients', ['process_status'])


def downgrade():
    # Drop indexes
    op.drop_index('ix_clients_process_status', table_name='clients')
    op.drop_index('ix_client_logs_level_timestamp', table_name='client_logs')
    op.drop_index('ix_client_logs_client_timestamp', table_name='client_logs')

    # Drop table
    op.drop_table('client_logs')

    # Drop columns from clients
    op.drop_column('clients', 'last_screenshot_hash')
    op.drop_column('clients', 'screen_health_status')
    op.drop_column('clients', 'last_screenshot_analyzed')
    op.drop_column('clients', 'process_pid')
    op.drop_column('clients', 'process_status')
    op.drop_column('clients', 'current_process')
    op.drop_column('clients', 'current_event_id')
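The rerun-safe pattern used throughout this migration — inspect the live schema first, then create only what is missing — can be sketched in isolation. This is an illustrative standalone example, not the project's migration code: the SQLite URL and the reduced column set are stand-ins (the real target is MySQL via Alembic's `op`).

```python
# Minimal sketch of the inspect-then-create idempotency pattern.
import sqlalchemy as sa

engine = sa.create_engine("sqlite:///:memory:")  # illustrative; real target is MySQL
with engine.begin() as conn:
    conn.execute(sa.text("CREATE TABLE clients (uuid VARCHAR(36) PRIMARY KEY)"))

inspector = sa.inspect(engine)
if not inspector.has_table("client_logs"):
    sa.Table(
        "client_logs",
        sa.MetaData(),
        sa.Column("id", sa.Integer, primary_key=True),
        sa.Column("client_uuid", sa.String(36), nullable=False),
    ).create(engine)

# A fresh inspector now sees the table, so a second run skips the create.
inspector = sa.inspect(engine)
print(inspector.has_table("client_logs"))  # True
```

Note that the inspector is re-created after the DDL: SQLAlchemy inspectors cache reflection results, so a stale inspector would report the table as missing.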
server/routes/client_logs.py (new file, 255 lines)
@@ -0,0 +1,255 @@
from flask import Blueprint, jsonify, request
from server.database import Session
from server.permissions import admin_or_higher
from models.models import ClientLog, Client, LogLevel
from sqlalchemy import desc, func
from datetime import datetime, timedelta, timezone
import json

client_logs_bp = Blueprint("client_logs", __name__, url_prefix="/api/client-logs")


@client_logs_bp.route("/test", methods=["GET"])
def test_client_logs():
    """Test endpoint to verify logging infrastructure (no auth required)"""
    session = Session()
    try:
        # Count total logs
        total_logs = session.query(func.count(ClientLog.id)).scalar()

        # Count by level
        error_count = session.query(func.count(ClientLog.id)).filter_by(level=LogLevel.ERROR).scalar()
        warn_count = session.query(func.count(ClientLog.id)).filter_by(level=LogLevel.WARN).scalar()
        info_count = session.query(func.count(ClientLog.id)).filter_by(level=LogLevel.INFO).scalar()

        # Get last 5 logs
        recent_logs = session.query(ClientLog).order_by(desc(ClientLog.timestamp)).limit(5).all()

        recent = []
        for log in recent_logs:
            recent.append({
                "client_uuid": log.client_uuid,
                "level": log.level.value if log.level else None,
                "message": log.message,
                "timestamp": log.timestamp.isoformat() if log.timestamp else None
            })

        session.close()
        return jsonify({
            "status": "ok",
            "infrastructure": "working",
            "total_logs": total_logs,
            "counts": {
                "ERROR": error_count,
                "WARN": warn_count,
                "INFO": info_count
            },
            "recent_5": recent
        })
    except Exception as e:
        session.close()
        return jsonify({"status": "error", "message": str(e)}), 500


@client_logs_bp.route("/<uuid>/logs", methods=["GET"])
@admin_or_higher
def get_client_logs(uuid):
    """
    Get logs for a specific client

    Query params:
    - level: ERROR, WARN, INFO, DEBUG (optional)
    - limit: number of entries (default 50, max 500)
    - since: ISO timestamp (optional)

    Example: /api/client-logs/abc-123/logs?level=ERROR&limit=100
    """
    session = Session()
    try:
        # Verify client exists
        client = session.query(Client).filter_by(uuid=uuid).first()
        if not client:
            session.close()
            return jsonify({"error": "Client not found"}), 404

        # Parse query parameters
        level_param = request.args.get('level')
        limit = min(int(request.args.get('limit', 50)), 500)
        since_param = request.args.get('since')

        # Build query
        query = session.query(ClientLog).filter_by(client_uuid=uuid)

        # Filter by log level
        if level_param:
            try:
                level_enum = LogLevel[level_param.upper()]
                query = query.filter_by(level=level_enum)
            except KeyError:
                session.close()
                return jsonify({"error": f"Invalid level: {level_param}. Must be ERROR, WARN, INFO, or DEBUG"}), 400

        # Filter by timestamp
        if since_param:
            try:
                # Handle both with and without 'Z' suffix
                since_str = since_param.replace('Z', '+00:00')
                since_dt = datetime.fromisoformat(since_str)
                if since_dt.tzinfo is None:
                    since_dt = since_dt.replace(tzinfo=timezone.utc)
                query = query.filter(ClientLog.timestamp >= since_dt)
            except ValueError:
                session.close()
                return jsonify({"error": "Invalid timestamp format. Use ISO 8601"}), 400

        # Execute query
        logs = query.order_by(desc(ClientLog.timestamp)).limit(limit).all()

        # Format results
        result = []
        for log in logs:
            entry = {
                "id": log.id,
                "timestamp": log.timestamp.isoformat() if log.timestamp else None,
                "level": log.level.value if log.level else None,
                "message": log.message,
                "context": {}
            }

            # Parse context JSON
            if log.context:
                try:
                    entry["context"] = json.loads(log.context)
                except json.JSONDecodeError:
                    entry["context"] = {"raw": log.context}

            result.append(entry)

        session.close()
        return jsonify({
            "client_uuid": uuid,
            "logs": result,
            "count": len(result),
            "limit": limit
        })

    except Exception as e:
        session.close()
        return jsonify({"error": f"Server error: {str(e)}"}), 500


@client_logs_bp.route("/summary", methods=["GET"])
@admin_or_higher
def get_logs_summary():
    """
    Get summary of errors/warnings across all clients in last 24 hours
    Returns count of ERROR, WARN, INFO logs per client

    Example response:
    {
        "summary": {
            "client-uuid-1": {"ERROR": 5, "WARN": 12, "INFO": 45},
            "client-uuid-2": {"ERROR": 0, "WARN": 3, "INFO": 20}
        },
        "period_hours": 24,
        "timestamp": "2026-03-09T21:00:00Z"
    }
    """
    session = Session()
    try:
        # Get hours parameter (default 24, max 168 = 1 week)
        hours = min(int(request.args.get('hours', 24)), 168)
        since = datetime.now(timezone.utc) - timedelta(hours=hours)

        # Query log counts grouped by client and level
        stats = session.query(
            ClientLog.client_uuid,
            ClientLog.level,
            func.count(ClientLog.id).label('count')
        ).filter(
            ClientLog.timestamp >= since
        ).group_by(
            ClientLog.client_uuid,
            ClientLog.level
        ).all()

        # Build summary dictionary
        summary = {}
        for stat in stats:
            uuid = stat.client_uuid
            if uuid not in summary:
                # Initialize all levels to 0
                summary[uuid] = {
                    "ERROR": 0,
                    "WARN": 0,
                    "INFO": 0,
                    "DEBUG": 0
                }

            summary[uuid][stat.level.value] = stat.count

        # Get client info for enrichment
        clients = session.query(Client.uuid, Client.hostname, Client.description).all()
        client_info = {c.uuid: {"hostname": c.hostname, "description": c.description} for c in clients}

        # Enrich summary with client info
        enriched_summary = {}
        for uuid, counts in summary.items():
            enriched_summary[uuid] = {
                "counts": counts,
                "info": client_info.get(uuid, {})
            }

        session.close()
        return jsonify({
            "summary": enriched_summary,
            "period_hours": hours,
            "since": since.isoformat(),
            "timestamp": datetime.now(timezone.utc).isoformat()
        })

    except Exception as e:
        session.close()
        return jsonify({"error": f"Server error: {str(e)}"}), 500


@client_logs_bp.route("/recent-errors", methods=["GET"])
@admin_or_higher
def get_recent_errors():
    """
    Get recent ERROR logs across all clients

    Query params:
    - limit: number of entries (default 20, max 100)

    Useful for system-wide error monitoring
    """
    session = Session()
    try:
        limit = min(int(request.args.get('limit', 20)), 100)

        # Get recent errors from all clients
        logs = session.query(ClientLog).filter_by(
            level=LogLevel.ERROR
        ).order_by(
            desc(ClientLog.timestamp)
        ).limit(limit).all()

        result = []
        for log in logs:
            entry = {
                "id": log.id,
                "client_uuid": log.client_uuid,
                "timestamp": log.timestamp.isoformat() if log.timestamp else None,
                "message": log.message,
                "context": json.loads(log.context) if log.context else {}
            }
            result.append(entry)

        session.close()
        return jsonify({
            "errors": result,
            "count": len(result)
        })

    except Exception as e:
        session.close()
        return jsonify({"error": f"Server error: {str(e)}"}), 500
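The `since` handling in `get_client_logs` above rewrites a trailing `Z` to `+00:00` before calling `datetime.fromisoformat` (older Python versions reject the `Z` form), then coerces naive timestamps to UTC. That normalization can be exercised on its own; `parse_since` is a hypothetical helper mirroring the route's logic, not a function in the codebase:

```python
from datetime import datetime, timezone

def parse_since(since_param: str) -> datetime:
    # Mirror of the route's normalization: accept both '...Z' and '+00:00',
    # and assume UTC for naive timestamps.
    since_dt = datetime.fromisoformat(since_param.replace('Z', '+00:00'))
    if since_dt.tzinfo is None:
        since_dt = since_dt.replace(tzinfo=timezone.utc)
    return since_dt

# Both spellings resolve to the same aware UTC instant.
print(parse_since("2026-03-09T21:00:00Z"))
print(parse_since("2026-03-09T21:00:00"))
```

This keeps all timestamp comparisons against the timezone-aware `ClientLog.timestamp` column consistent, matching the project's UTC convention.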
server/wsgi.py
@@ -8,6 +8,7 @@ from server.routes.holidays import holidays_bp
 from server.routes.academic_periods import academic_periods_bp
 from server.routes.groups import groups_bp
 from server.routes.clients import clients_bp
+from server.routes.client_logs import client_logs_bp
 from server.routes.auth import auth_bp
 from server.routes.users import users_bp
 from server.routes.system_settings import system_settings_bp
@@ -46,6 +47,7 @@ else:
 app.register_blueprint(auth_bp)
 app.register_blueprint(users_bp)
 app.register_blueprint(clients_bp)
+app.register_blueprint(client_logs_bp)
 app.register_blueprint(groups_bp)
 app.register_blueprint(events_bp)
 app.register_blueprint(event_exceptions_bp)