feat(monitoring): add server-side client logging and health infrastructure

- add Alembic migration c1d2e3f4g5h6 for client monitoring:
  - create client_logs table with FK to clients.uuid and performance indexes
  - extend clients with process/health tracking fields
- extend data model with ClientLog, LogLevel, ProcessStatus, and ScreenHealthStatus
- enhance listener MQTT handling:
  - subscribe to logs and health topics
  - persist client logs from infoscreen/{uuid}/logs/{level}
  - process health payloads and enrich heartbeat-derived client state
- add monitoring API blueprint server/routes/client_logs.py:
  - GET /api/client-logs/<uuid>/logs
  - GET /api/client-logs/summary
  - GET /api/client-logs/recent-errors
  - GET /api/client-logs/test
- register client_logs blueprint in server/wsgi.py
- align compose/dev runtime for listener live-code execution
- add client-side implementation docs:
  - CLIENT_MONITORING_SPECIFICATION.md
  - CLIENT_MONITORING_IMPLEMENTATION_GUIDE.md
- update TECH-CHANGELOG.md and copilot-instructions.md:
  - document monitoring changes
  - codify post-release technical-notes/no-version-bump convention
---

`.github/copilot-instructions.md` (28 lines changed)
@@ -124,9 +124,11 @@ Keep docs synced with code. When you change services/MQTT/API/UTC/env or dev/pro
 - Scheduler queries a future window (default: 7 days) to expand recurring events using RFC 5545 rules, applies event exceptions (skipped dates, detached occurrences), and publishes only events that are active at the current time (UTC). When a group has no active events, the scheduler clears its retained topic by publishing an empty list. Time comparisons are UTC; naive timestamps are normalized. Logging is concise; conversion lookups are cached and logged only once per media.
 - MQTT topics (paho-mqtt v2, use Callback API v2):
 - Discovery: `infoscreen/discovery` (JSON includes `uuid`, hw/ip data). ACK to `infoscreen/{uuid}/discovery_ack`. See `listener/listener.py`.
-- Heartbeat: `infoscreen/{uuid}/heartbeat` updates `Client.last_alive` (UTC).
+- Heartbeat: `infoscreen/{uuid}/heartbeat` updates `Client.last_alive` (UTC); enhanced payload includes `current_process`, `process_pid`, `process_status`, `current_event_id`.
 - Event lists (retained): `infoscreen/events/{group_id}` from `scheduler/scheduler.py`.
 - Per-client group assignment (retained): `infoscreen/{uuid}/group_id` via `server/mqtt_helper.py`.
+- Client logs: `infoscreen/{uuid}/logs/{error|warn|info}` with JSON payload (timestamp, message, context); QoS 1 for ERROR/WARN, QoS 0 for INFO.
+- Client health: `infoscreen/{uuid}/health` with metrics (expected_state, actual_state, health_metrics); QoS 0, published every 5 seconds.
 - Screenshots: server-side folders `server/received_screenshots/` and `server/screenshots/`; Nginx exposes `/screenshots/{uuid}.jpg` via `server/wsgi.py` route.
 - Dev Container guidance: If extensions reappear inside the container, remove UI-only extensions from `devcontainer.json` `extensions` and map them in `remote.extensionKind` as `"ui"`.
@@ -146,6 +148,11 @@ Keep docs synced with code. When you change services/MQTT/API/UTC/env or dev/pro
 - `locked_until`: TIMESTAMP placeholder for account lockout (infrastructure in place, not yet enforced)
 - `deactivated_at`, `deactivated_by`: Soft-delete audit trail (FK self-reference); soft deactivation is the default, hard delete superadmin-only
 - Role hierarchy (privilege escalation enforced): `user` < `editor` < `admin` < `superadmin`
+- Client monitoring (migration: `c1d2e3f4g5h6_add_client_monitoring.py`):
+  - `ClientLog` model: Centralized log storage with fields (id, client_uuid, timestamp, level, message, context, created_at); FK to clients.uuid (CASCADE)
+  - `Client` model extended: 7 health monitoring fields (`current_event_id`, `current_process`, `process_status`, `process_pid`, `last_screenshot_analyzed`, `screen_health_status`, `last_screenshot_hash`)
+  - Enums: `LogLevel` (ERROR, WARN, INFO, DEBUG), `ProcessStatus` (running, crashed, starting, stopped), `ScreenHealthStatus` (OK, BLACK, FROZEN, UNKNOWN)
+  - Indexes: (client_uuid, timestamp DESC), (level, timestamp DESC), (created_at DESC) for performance
 - System settings: `system_settings` key–value store via `SystemSetting` for global configuration (e.g., WebUntis/Vertretungsplan supplement-table). Managed through routes in `server/routes/system_settings.py`.
 - Presentation defaults (system-wide):
 - `presentation_interval` (seconds, default "10")
@@ -189,6 +196,11 @@ Keep docs synced with code. When you change services/MQTT/API/UTC/env or dev/pro
 - `PUT /api/users/<id>/password` — admin password reset (requires backend check to reject self-reset for consistency)
 - `DELETE /api/users/<id>` — hard delete (superadmin only, with self-deletion check)
 - Auth routes (`server/routes/auth.py`): Enhanced to track login events (sets `last_login_at`, resets `failed_login_attempts` on success; increments `failed_login_attempts` and `last_failed_login_at` on failure). Self-service password change via `PUT /api/auth/change-password` requires current password verification.
+- Client logs (`server/routes/client_logs.py`): Centralized log retrieval for monitoring:
+  - `GET /api/client-logs/<uuid>/logs` – Query client logs with filters (level, limit, since); admin_or_higher
+  - `GET /api/client-logs/summary` – Log counts by level per client (last 24h); admin_or_higher
+  - `GET /api/client-logs/recent-errors` – System-wide error monitoring; admin_or_higher
+  - `GET /api/client-logs/test` – Infrastructure validation (no auth); returns recent logs with counts
 
 Documentation maintenance: keep this file aligned with real patterns; update when routes/session/UTC rules change. Avoid long prose; link exact paths.
@@ -364,7 +376,8 @@ Docs maintenance guardrails (solo-friendly): Update this file alongside code cha
 ## Quick examples
 - Add client description persists to DB and publishes group via MQTT: see `PUT /api/clients/<uuid>/description` in `routes/clients.py`.
 - Bulk group assignment emits retained messages for each client: `PUT /api/clients/group`.
-- Listener heartbeat path: `infoscreen/<uuid>/heartbeat` → sets `clients.last_alive`.
+- Listener heartbeat path: `infoscreen/<uuid>/heartbeat` → sets `clients.last_alive` and captures process health data.
+- Client monitoring flow: Client publishes to `infoscreen/{uuid}/logs/error` → listener stores in `client_logs` table → API serves via `/api/client-logs/<uuid>/logs` → dashboard displays (Phase 4, pending).
 
 ## Scheduler payloads: presentation extras
 - Presentation event payloads now include `page_progress` and `auto_progress` in addition to `slide_interval` and media files. These are sourced from per-event fields in the database (with system defaults applied on event creation).
@@ -393,3 +406,14 @@ Questions or unclear areas? Tell us if you need: exact devcontainer debugging st
 - Breaking changes must be prefixed with `BREAKING:`
 - Keep ≤ 8–10 bullets; summarize or group micro-changes
 - JSON hygiene: valid JSON, no trailing commas, don’t edit historical entries except typos
+
+## Versioning Convention (Tech vs UI)
+
+- Use one unified app version across technical and user-facing release notes.
+- `dashboard/public/program-info.json` is user-facing and should list only user-visible changes.
+- `TECH-CHANGELOG.md` can include deeper technical details for the same released version.
+- If server/infrastructure work is implemented but not yet released or not user-visible, document it under the latest released section as:
+  - `Backend technical work (post-release notes; no version bump)`
+- Do not create a new version header in `TECH-CHANGELOG.md` for internal milestones alone.
+- Bump version numbers when a release is actually cut/deployed (or when user-facing release notes are published), not for intermediate backend-only steps.
+- When UI integration lands later, include the user-visible part in the next release version and reference prior post-release technical groundwork when useful.
---

`CLIENT_MONITORING_IMPLEMENTATION_GUIDE.md` (new file, 757 lines)

@@ -0,0 +1,757 @@
# 🚀 Client Monitoring Implementation Guide

**Phase-based implementation guide for basic monitoring in development phase**

---

## ✅ Phase 1: Server-Side Database Foundation

**Status:** ✅ COMPLETE
**Dependencies:** None – already implemented
**Time estimate:** Completed
### ✅ Step 1.1: Database Migration
**File:** `server/alembic/versions/c1d2e3f4g5h6_add_client_monitoring.py`
**What it does:**
- Creates `client_logs` table for centralized logging
- Adds health monitoring columns to `clients` table
- Creates indexes for efficient querying

**To apply:**
```bash
cd /workspace/server
alembic upgrade head
```
### ✅ Step 1.2: Update Data Models
**File:** `models/models.py`
**What was added:**
- New enums: `LogLevel`, `ProcessStatus`, `ScreenHealthStatus`
- Updated `Client` model with health tracking fields
- New `ClientLog` model for log storage
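For orientation, here is a minimal, runnable sketch of what the `ClientLog` model and `LogLevel` enum could look like. Field names follow the migration summary in this commit; the column types, string lengths, and the stub `Client` model are assumptions to check against the real `models/models.py`.

```python
import enum
from datetime import datetime, timezone

from sqlalchemy import (Column, DateTime, Enum, ForeignKey, Integer, String,
                        Text, create_engine)
from sqlalchemy.orm import declarative_base, sessionmaker

Base = declarative_base()


class LogLevel(enum.Enum):
    ERROR = "ERROR"
    WARN = "WARN"
    INFO = "INFO"
    DEBUG = "DEBUG"


class Client(Base):
    """Stub of the existing clients table (only the FK target column)."""
    __tablename__ = "clients"
    uuid = Column(String(36), primary_key=True)


class ClientLog(Base):
    """Centralized client log storage, mirroring the migration description."""
    __tablename__ = "client_logs"
    id = Column(Integer, primary_key=True)
    client_uuid = Column(String(36),
                         ForeignKey("clients.uuid", ondelete="CASCADE"),
                         nullable=False, index=True)
    timestamp = Column(DateTime, nullable=False)
    level = Column(Enum(LogLevel), nullable=False)
    message = Column(Text, nullable=False)
    context = Column(Text)  # JSON-encoded dict
    created_at = Column(DateTime, default=lambda: datetime.now(timezone.utc))


# Smoke test against in-memory SQLite
engine = create_engine("sqlite://")
Base.metadata.create_all(engine)
session = sessionmaker(bind=engine)()
session.add(Client(uuid="abc"))
session.add(ClientLog(client_uuid="abc",
                      timestamp=datetime.now(timezone.utc),
                      level=LogLevel.ERROR, message="boom"))
session.commit()
stored = session.query(ClientLog).one()
print(stored.level.value)  # ERROR
```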
---
## 🔧 Phase 2: Server-Side Backend Logic

**Status:** 🚧 IN PROGRESS
**Dependencies:** Phase 1 complete
**Time estimate:** 2–3 hours

### Step 2.1: Extend MQTT Listener
**File:** `listener/listener.py`
**What to add:**

```python
from datetime import datetime, timezone
import json

# Add new topic subscriptions in on_connect():
client.subscribe("infoscreen/+/logs/error")
client.subscribe("infoscreen/+/logs/warn")
client.subscribe("infoscreen/+/logs/info")  # Dev mode only
client.subscribe("infoscreen/+/health")


# Add new handlers, dispatched from on_message():
def handle_log_message(uuid, level, payload):
    """Store a client log entry in the database."""
    from models.models import ClientLog, LogLevel
    from server.database import Session

    session = Session()
    try:
        log_entry = ClientLog(
            client_uuid=uuid,
            timestamp=payload.get('timestamp', datetime.now(timezone.utc)),
            level=LogLevel[level],
            message=payload.get('message', ''),
            context=json.dumps(payload.get('context', {}))
        )
        session.add(log_entry)
        session.commit()
        print(f"[LOG] {uuid} {level}: {payload.get('message', '')}")
    except Exception as e:
        print(f"Error saving log: {e}")
        session.rollback()
    finally:
        session.close()


def handle_health_message(uuid, payload):
    """Update the client's health status columns."""
    from models.models import Client, ProcessStatus
    from server.database import Session

    session = Session()
    try:
        client = session.query(Client).filter_by(uuid=uuid).first()
        if client:
            client.current_event_id = payload.get('expected_state', {}).get('event_id')
            client.current_process = payload.get('actual_state', {}).get('process')

            status_str = payload.get('actual_state', {}).get('status')
            if status_str:
                client.process_status = ProcessStatus[status_str]

            client.process_pid = payload.get('actual_state', {}).get('pid')
            session.commit()
    except Exception as e:
        print(f"Error updating health: {e}")
        session.rollback()
    finally:
        session.close()
```

**Topic routing logic:**
```python
# In the on_message callback, route by topic suffix:
if topic.endswith('/logs/error'):
    handle_log_message(uuid, 'ERROR', payload)
elif topic.endswith('/logs/warn'):
    handle_log_message(uuid, 'WARN', payload)
elif topic.endswith('/logs/info'):
    handle_log_message(uuid, 'INFO', payload)
elif topic.endswith('/health'):
    handle_health_message(uuid, payload)
```
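The routing above also needs the `uuid` out of the topic string (the listener already has an `extract_uuid_from_topic` helper for heartbeats). A self-contained sketch of such a parser, with `parse_topic` as an illustrative name rather than repo code:

```python
def parse_topic(topic: str):
    """Split 'infoscreen/{uuid}/logs/{level}' or 'infoscreen/{uuid}/health'
    into (uuid, kind, level). Returns (None, None, None) for unrelated topics."""
    parts = topic.split("/")
    if len(parts) >= 4 and parts[0] == "infoscreen" and parts[2] == "logs":
        return parts[1], "logs", parts[3].upper()
    if len(parts) == 3 and parts[0] == "infoscreen" and parts[2] == "health":
        return parts[1], "health", None
    return None, None, None


print(parse_topic("infoscreen/abc-123/logs/error"))  # ('abc-123', 'logs', 'ERROR')
```

Matching on structure rather than `endswith` also keeps heartbeat and discovery topics from ever reaching the log handlers by accident.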
### Step 2.2: Create API Routes
**File:** `server/routes/client_logs.py` (NEW)

```python
import json
from datetime import datetime, timedelta, timezone

from flask import Blueprint, jsonify, request
from sqlalchemy import desc, func

from models.models import Client, ClientLog, LogLevel
from server.database import Session
from server.permissions import admin_or_higher

client_logs_bp = Blueprint("client_logs", __name__, url_prefix="/api/client-logs")


@client_logs_bp.route("/<uuid>/logs", methods=["GET"])
@admin_or_higher
def get_client_logs(uuid):
    """
    Get logs for a specific client.

    Query params:
    - level: ERROR, WARN, INFO, DEBUG (optional)
    - limit: number of entries (default 50, max 500)
    - since: ISO timestamp (optional)
    """
    session = Session()
    try:
        level = request.args.get('level')
        limit = min(int(request.args.get('limit', 50)), 500)
        since = request.args.get('since')

        query = session.query(ClientLog).filter_by(client_uuid=uuid)

        if level:
            query = query.filter_by(level=LogLevel[level])

        if since:
            since_dt = datetime.fromisoformat(since.replace('Z', '+00:00'))
            query = query.filter(ClientLog.timestamp >= since_dt)

        logs = query.order_by(desc(ClientLog.timestamp)).limit(limit).all()

        result = [{
            "id": log.id,
            "timestamp": log.timestamp.isoformat() if log.timestamp else None,
            "level": log.level.value if log.level else None,
            "message": log.message,
            "context": json.loads(log.context) if log.context else {}
        } for log in logs]

        return jsonify({"logs": result, "count": len(result)})
    except Exception as e:
        return jsonify({"error": str(e)}), 500
    finally:
        session.close()


@client_logs_bp.route("/summary", methods=["GET"])
@admin_or_higher
def get_logs_summary():
    """Get a summary of errors/warnings across all clients (last 24 hours)."""
    session = Session()
    try:
        since = datetime.now(timezone.utc) - timedelta(hours=24)

        stats = session.query(
            ClientLog.client_uuid,
            ClientLog.level,
            func.count(ClientLog.id).label('count')
        ).filter(
            ClientLog.timestamp >= since
        ).group_by(
            ClientLog.client_uuid,
            ClientLog.level
        ).all()

        result = {}
        for stat in stats:
            uuid = stat.client_uuid
            if uuid not in result:
                result[uuid] = {"ERROR": 0, "WARN": 0, "INFO": 0}
            result[uuid][stat.level.value] = stat.count

        return jsonify({"summary": result, "period_hours": 24})
    except Exception as e:
        return jsonify({"error": str(e)}), 500
    finally:
        session.close()
```

The `/recent-errors` and `/test` endpoints follow the same pattern.

**Register in `server/wsgi.py`:**
```python
from server.routes.client_logs import client_logs_bp
app.register_blueprint(client_logs_bp)
```
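Independent of Flask and the database, the response shape of `/summary` can be pinned down with plain dictionaries; the tuples below stand in for the rows the `group_by` query returns (a sketch for testing the shape, not repo code):

```python
from collections import defaultdict


def build_summary(rows):
    """rows: iterable of (client_uuid, level_name, count) tuples,
    mirroring what the group_by query in get_logs_summary() yields."""
    summary = defaultdict(lambda: {"ERROR": 0, "WARN": 0, "INFO": 0})
    for uuid, level, count in rows:
        summary[uuid][level] = count
    return dict(summary)


rows = [("abc", "ERROR", 3), ("abc", "WARN", 1), ("def", "INFO", 7)]
print(build_summary(rows))
```

Every client present in the window gets all three counters, so the dashboard never has to guard against missing keys.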
### Step 2.3: Add Health Data to Heartbeat Handler
**File:** `listener/listener.py` (extend existing heartbeat handler)

```python
# Modify the existing heartbeat handler to also capture health data
def on_message(client, userdata, message):
    topic = message.topic

    # Existing heartbeat logic...
    if '/heartbeat' in topic:
        uuid = extract_uuid_from_topic(topic)
        session = Session()
        try:
            payload = json.loads(message.payload.decode())

            # Update last_alive (existing behavior)
            client_obj = session.query(Client).filter_by(uuid=uuid).first()
            if client_obj:
                client_obj.last_alive = datetime.now(timezone.utc)

                # NEW: Update health data if present in the heartbeat
                if 'process_status' in payload:
                    client_obj.process_status = ProcessStatus[payload['process_status']]
                if 'current_process' in payload:
                    client_obj.current_process = payload['current_process']
                if 'process_pid' in payload:
                    client_obj.process_pid = payload['process_pid']
                if 'current_event_id' in payload:
                    client_obj.current_event_id = payload['current_event_id']

            session.commit()
        except Exception as e:
            print(f"Error processing heartbeat: {e}")
            session.rollback()
        finally:
            session.close()
```
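Note that `ProcessStatus[payload['process_status']]` raises `KeyError` for any status string the server does not know, which would abort the whole heartbeat update. A defensive coercion helper avoids that; the helper name is illustrative, and the enum values mirror the migration summary above:

```python
import enum


class ProcessStatus(enum.Enum):
    running = "running"
    crashed = "crashed"
    starting = "starting"
    stopped = "stopped"


def coerce_status(raw, default=None):
    """Map a raw string to ProcessStatus, falling back to `default`
    instead of raising on unknown or missing values."""
    if not raw:
        return default
    try:
        return ProcessStatus[raw]
    except KeyError:
        return default


print(coerce_status("running"), coerce_status("zombie"))
```

With this in place, a misbehaving client can send garbage without knocking out `last_alive` tracking for its own heartbeat.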
---
## 🖥️ Phase 3: Client-Side Implementation

**Status:** ⏳ PENDING (after Phase 2)
**Dependencies:** Phase 2 complete
**Time estimate:** 3–4 hours

### Step 3.1: Create Client Watchdog Script
**File:** `client/watchdog.py` (NEW – on client device)

```python
#!/usr/bin/env python3
"""
Client-side process watchdog.
Monitors VLC, Chromium, and the PDF viewer and reports health to the server.
"""
import json
import sys
import time
import traceback
from datetime import datetime, timezone

import paho.mqtt.client as mqtt
import psutil


class MediaWatchdog:
    def __init__(self, client_uuid, mqtt_broker, mqtt_port=1883):
        self.uuid = client_uuid
        # paho-mqtt v2: use Callback API v2 (repo convention)
        self.mqtt_client = mqtt.Client(mqtt.CallbackAPIVersion.VERSION2)
        self.mqtt_client.connect(mqtt_broker, mqtt_port, 60)
        self.mqtt_client.loop_start()

        self.current_process = None
        self.current_event_id = None
        self.restart_attempts = 0
        self.MAX_RESTARTS = 3

    def send_log(self, level, message, context=None):
        """Send a log message to the server (QoS 1 for ERROR/WARN, QoS 0 for INFO)."""
        topic = f"infoscreen/{self.uuid}/logs/{level.lower()}"
        payload = {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "message": message,
            "context": context or {}
        }
        qos = 1 if level.upper() in ("ERROR", "WARN") else 0
        self.mqtt_client.publish(topic, json.dumps(payload), qos=qos)
        print(f"[{level}] {message}")

    def send_health(self, process_name, pid, status, event_id=None):
        """Send health status to the server (QoS 0, not retained)."""
        topic = f"infoscreen/{self.uuid}/health"
        payload = {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "expected_state": {
                "event_id": event_id
            },
            "actual_state": {
                "process": process_name,
                "pid": pid,
                "status": status  # 'running', 'crashed', 'starting', 'stopped'
            }
        }
        self.mqtt_client.publish(topic, json.dumps(payload), qos=0, retain=False)

    def is_process_running(self, process_name):
        """Return the PID if a matching process is running, else None."""
        for proc in psutil.process_iter(['name', 'pid']):
            try:
                if process_name.lower() in proc.info['name'].lower():
                    return proc.info['pid']
            except (psutil.NoSuchProcess, psutil.AccessDenied):
                pass
        return None

    def monitor_loop(self):
        """Main monitoring loop."""
        print(f"Watchdog started for client {self.uuid}")
        self.send_log("INFO", "Watchdog service started", {"uuid": self.uuid})

        while True:
            try:
                # Check the expected process (set by the main event handler)
                if self.current_process:
                    pid = self.is_process_running(self.current_process)

                    if pid:
                        # Process is running
                        self.send_health(
                            self.current_process,
                            pid,
                            "running",
                            self.current_event_id
                        )
                        self.restart_attempts = 0  # Reset on success
                    else:
                        # Process crashed or stopped
                        self.send_log(
                            "ERROR",
                            f"Process {self.current_process} crashed or stopped",
                            {
                                "event_id": self.current_event_id,
                                "process": self.current_process,
                                "restart_attempt": self.restart_attempts
                            }
                        )

                        if self.restart_attempts < self.MAX_RESTARTS:
                            self.send_log("WARN", f"Attempting restart ({self.restart_attempts + 1}/{self.MAX_RESTARTS})")
                            self.restart_attempts += 1
                            # TODO: Implement restart logic (call event handler)
                        else:
                            self.send_log("ERROR", "Max restart attempts exceeded", {
                                "event_id": self.current_event_id
                            })

                time.sleep(5)  # Check every 5 seconds

            except KeyboardInterrupt:
                print("Watchdog stopped by user")
                break
            except Exception as e:
                self.send_log("ERROR", f"Watchdog error: {e}", {
                    "exception": str(e),
                    "traceback": traceback.format_exc()
                })
                time.sleep(10)  # Back off on error


if __name__ == "__main__":
    if len(sys.argv) < 3:
        print("Usage: python watchdog.py <client_uuid> <mqtt_broker>")
        sys.exit(1)

    watchdog = MediaWatchdog(sys.argv[1], sys.argv[2])
    watchdog.monitor_loop()
```

### Step 3.2: Integrate with Existing Event Handler
**File:** `client/event_handler.py` (modify existing)

```python
# When starting a new event, notify the watchdog (the shared MediaWatchdog instance)
def play_event(event_data):
    event_type = event_data.get('event_type')
    event_id = event_data.get('id')

    if event_type == 'video':
        process_name = 'vlc'
        # Start VLC...
    elif event_type == 'website':
        process_name = 'chromium'
        # Start Chromium...
    elif event_type == 'presentation':
        process_name = 'pdf_viewer'  # or your PDF tool
        # Start PDF viewer...

    # Tell the watchdog which process to expect
    watchdog.current_process = process_name
    watchdog.current_event_id = event_id
    watchdog.restart_attempts = 0
```
### Step 3.3: Enhanced Heartbeat Payload
**File:** `client/heartbeat.py` (modify existing)

```python
# Modify the existing heartbeat to include process status
def send_heartbeat(mqtt_client, uuid):
    # Default: no managed process
    current_process = None
    process_pid = None
    process_status = "stopped"

    # Check whether the expected process is running
    if watchdog.current_process:
        pid = watchdog.is_process_running(watchdog.current_process)
        if pid:
            current_process = watchdog.current_process
            process_pid = pid
            process_status = "running"

    payload = {
        "uuid": uuid,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        # Existing fields...
        # NEW health fields:
        "current_process": current_process,
        "process_pid": process_pid,
        "process_status": process_status,
        "current_event_id": watchdog.current_event_id
    }

    mqtt_client.publish(f"infoscreen/{uuid}/heartbeat", json.dumps(payload))
```
---
## 🎨 Phase 4: Dashboard UI Integration
|
||||||
|
**Status:** ⏳ PENDING (After Phase 3)
|
||||||
|
**Dependencies:** Phases 2 & 3 complete
|
||||||
|
**Time estimate:** 2-3 hours
|
||||||
|
|
||||||
|
### Step 4.1: Create Log Viewer Component
|
||||||
|
**File:** `dashboard/src/ClientLogs.tsx` (NEW)
|
||||||
|
|
||||||
|
```typescript
|
||||||
|
import React from 'react';
|
||||||
|
import { GridComponent, ColumnsDirective, ColumnDirective, Page, Inject } from '@syncfusion/ej2-react-grids';
|
||||||
|
|
||||||
|
interface LogEntry {
|
||||||
|
id: number;
|
||||||
|
timestamp: string;
|
||||||
|
level: 'ERROR' | 'WARN' | 'INFO' | 'DEBUG';
|
||||||
|
message: string;
|
||||||
|
context: any;
|
||||||
|
}
|
||||||
|
|
||||||
|
interface ClientLogsProps {
|
||||||
|
clientUuid: string;
|
||||||
|
}
|
||||||
|
|
||||||
|
export const ClientLogs: React.FC<ClientLogsProps> = ({ clientUuid }) => {
|
||||||
|
const [logs, setLogs] = React.useState<LogEntry[]>([]);
|
||||||
|
const [loading, setLoading] = React.useState(false);
|
||||||
|
|
||||||
|
const loadLogs = async (level?: string) => {
|
||||||
|
setLoading(true);
|
||||||
|
try {
|
||||||
|
const params = new URLSearchParams({ limit: '50' });
|
||||||
|
if (level) params.append('level', level);
      const response = await fetch(`/api/client-logs/${clientUuid}/logs?${params}`);
      const data = await response.json();
      setLogs(data.logs);
    } catch (err) {
      console.error('Failed to load logs:', err);
    } finally {
      setLoading(false);
    }
  };

  React.useEffect(() => {
    loadLogs();
    const interval = setInterval(() => loadLogs(), 30000); // Refresh every 30s
    return () => clearInterval(interval);
  }, [clientUuid]);

  const levelTemplate = (props: any) => {
    const colors = {
      ERROR: 'text-red-600 bg-red-100',
      WARN: 'text-yellow-600 bg-yellow-100',
      INFO: 'text-blue-600 bg-blue-100',
      DEBUG: 'text-gray-600 bg-gray-100'
    };
    return (
      <span className={`px-2 py-1 rounded ${colors[props.level as keyof typeof colors]}`}>
        {props.level}
      </span>
    );
  };

  return (
    <div>
      <div className="mb-4 flex gap-2">
        <button onClick={() => loadLogs()} className="e-btn e-primary">All</button>
        <button onClick={() => loadLogs('ERROR')} className="e-btn e-danger">Errors</button>
        <button onClick={() => loadLogs('WARN')} className="e-btn e-warning">Warnings</button>
        <button onClick={() => loadLogs('INFO')} className="e-btn e-info">Info</button>
      </div>

      <GridComponent
        dataSource={logs}
        allowPaging={true}
        pageSettings={{ pageSize: 20 }}
      >
        <ColumnsDirective>
          <ColumnDirective field='timestamp' headerText='Time' width='180' format='yMd HH:mm:ss' />
          <ColumnDirective field='level' headerText='Level' width='100' template={levelTemplate} />
          <ColumnDirective field='message' headerText='Message' width='400' />
        </ColumnsDirective>
        <Inject services={[Page]} />
      </GridComponent>
    </div>
  );
};
```

### Step 4.2: Add Health Indicators to Client Cards

**File:** `dashboard/src/clients.tsx` (modify existing)

```typescript
// Add health indicator to client card
const getHealthBadge = (client: Client) => {
  if (!client.process_status) {
    return <span className="badge badge-secondary">Unknown</span>;
  }

  const badges = {
    running: <span className="badge badge-success">✓ Running</span>,
    crashed: <span className="badge badge-danger">✗ Crashed</span>,
    starting: <span className="badge badge-warning">⟳ Starting</span>,
    stopped: <span className="badge badge-secondary">■ Stopped</span>
  };

  return badges[client.process_status] || null;
};

// In client card render:
<div className="client-card">
  <h3>{client.hostname || client.uuid}</h3>
  <div>Status: {getHealthBadge(client)}</div>
  <div>Process: {client.current_process || 'None'}</div>
  <div>Event ID: {client.current_event_id || 'None'}</div>
  <button onClick={() => showLogs(client.uuid)}>View Logs</button>
</div>
```

### Step 4.3: Add System Health Dashboard (Superadmin)

**File:** `dashboard/src/SystemMonitor.tsx` (NEW)

```typescript
import React from 'react';
import { ClientLogs } from './ClientLogs';

export const SystemMonitor: React.FC = () => {
  const [summary, setSummary] = React.useState<any>({});

  const loadSummary = async () => {
    const response = await fetch('/api/client-logs/summary');
    const data = await response.json();
    setSummary(data.summary);
  };

  React.useEffect(() => {
    loadSummary();
    const interval = setInterval(loadSummary, 30000);
    return () => clearInterval(interval);
  }, []);

  return (
    <div className="system-monitor">
      <h2>System Health Monitor (Superadmin)</h2>

      <div className="alert-panel">
        <h3>Active Issues</h3>
        {Object.entries(summary).map(([uuid, stats]: [string, any]) => (
          stats.ERROR > 0 || stats.WARN > 5 ? (
            <div key={uuid} className="alert">
              🔴 {uuid}: {stats.ERROR} errors, {stats.WARN} warnings (24h)
            </div>
          ) : null
        ))}
      </div>

      {/* Real-time log stream */}
      <div className="log-stream">
        <h3>Recent Logs (All Clients)</h3>
        {/* Implement real-time log aggregation */}
      </div>
    </div>
  );
};
```

---

## 🧪 Phase 5: Testing & Validation

**Status:** ⏳ PENDING
**Dependencies:** All previous phases
**Time estimate:** 1-2 hours

### Step 5.1: Server-Side Tests

```bash
# Test database migration (apply, roll back, re-apply)
cd /workspace/server
alembic upgrade head
alembic downgrade -1
alembic upgrade head

# Test API endpoints
curl -X GET "http://localhost:8000/api/client-logs/<uuid>/logs?limit=10"
curl -X GET "http://localhost:8000/api/client-logs/summary"
```

### Step 5.2: Client-Side Tests

```bash
# On client device
python3 watchdog.py <your-uuid> <mqtt-broker-ip>

# Simulate process crash
pkill vlc  # Should trigger error log and restart attempt

# Check MQTT messages
mosquitto_sub -h <broker> -t "infoscreen/+/logs/#" -v
mosquitto_sub -h <broker> -t "infoscreen/+/health" -v
```

### Step 5.3: Dashboard Tests

1. Open the dashboard and navigate to the Clients page
2. Verify that the health indicators show the correct status
3. Click "View Logs" and verify that logs appear
4. Navigate to System Monitor (superadmin)
5. Verify that the summary statistics are correct

---

## 📝 Configuration Summary

### Environment Variables

**Server (docker-compose.yml):**
```yaml
- LOG_RETENTION_DAYS=90   # How long to keep logs
- DEBUG_MODE=true         # Enable INFO level logging via MQTT
```

**Client:**
```bash
export MQTT_BROKER="your-server-ip"
export CLIENT_UUID="abc-123-def"
export WATCHDOG_ENABLED=true
```

### MQTT Topics Reference

| Topic Pattern | Direction | Purpose |
|--------------|-----------|---------|
| `infoscreen/{uuid}/logs/error` | Client → Server | Error messages |
| `infoscreen/{uuid}/logs/warn` | Client → Server | Warning messages |
| `infoscreen/{uuid}/logs/info` | Client → Server | Info (dev only) |
| `infoscreen/{uuid}/health` | Client → Server | Health metrics |
| `infoscreen/{uuid}/heartbeat` | Client → Server | Enhanced heartbeat |

### Database Tables

**client_logs:**
- Stores all centralized logs
- Indexed by client_uuid, timestamp, level
- Auto-cleanup after 90 days (recommended)

**clients (extended):**
- `current_event_id`: Which event should be playing
- `current_process`: Expected process name
- `process_status`: running/crashed/starting/stopped
- `process_pid`: Process ID
- `screen_health_status`: OK/BLACK/FROZEN/UNKNOWN
- `last_screenshot_analyzed`: Last analysis time
- `last_screenshot_hash`: For frozen detection

---

## 🎯 Next Steps After Implementation

1. **Deploy Phases 1-2** to a staging environment
2. **Test with 1-2 pilot clients** before full rollout
3. **Monitor traffic & performance** (overhead should be minimal)
4. **Fine-tune log levels** based on actual noise
5. **Add alerting** (email/Slack when errors exceed a threshold)
6. **Implement screenshot analysis** (Phase 2 enhancement)
7. **Add trending/analytics** (identify the least reliable clients)

---

## 🚨 Troubleshooting

**Logs not appearing in the database:**
- Check the MQTT broker logs: `docker logs infoscreen-mqtt`
- Verify the listener subscriptions: check the `listener/listener.py` logs
- Test MQTT manually: `mosquitto_pub -h broker -t "infoscreen/test/logs/error" -m '{"message":"test"}'`

**High database growth:**
- Check the log-retention cleanup cron job
- Reduce the frequency of INFO-level logging
- Add sampling (log every 10th occurrence instead of all)

**Client watchdog not detecting crashes:**
- Verify that psutil can see the processes: `ps aux | grep vlc`
- Check permissions (some process checks may need sudo)
- Increase the monitor-loop frequency for faster detection

---

## ✅ Completion Checklist

- [ ] Phase 1: Database migration applied
- [ ] Phase 2: Listener extended for log topics
- [ ] Phase 2: API endpoints created and tested
- [ ] Phase 3: Client watchdog implemented
- [ ] Phase 3: Enhanced heartbeat deployed
- [ ] Phase 4: Dashboard log viewer working
- [ ] Phase 4: Health indicators visible
- [ ] Phase 5: End-to-end testing complete
- [ ] Documentation updated with new features
- [ ] Production deployment plan created

---

**Last Updated:** 2026-03-09
**Author:** GitHub Copilot
**For:** Infoscreen 2025 Project
972 CLIENT_MONITORING_SPECIFICATION.md Normal file
@@ -0,0 +1,972 @@
# Client-Side Monitoring Specification

**Version:** 1.0
**Date:** 2026-03-10
**For:** Infoscreen Client Implementation
**Server Endpoint:** `192.168.43.201:8000` (or your production server)
**MQTT Broker:** `192.168.43.201:1883` (or your production MQTT broker)

---

## 1. Overview

Each infoscreen client must implement health monitoring and logging capabilities to report status to the central server via MQTT.

### 1.1 Goals
- **Detect failures:** Process crashes, frozen screens, content mismatches
- **Provide visibility:** Real-time health status visible on the server dashboard
- **Enable remote diagnosis:** Centralized log storage for debugging
- **Auto-recovery:** Attempt automatic restart on failure

### 1.2 Architecture
```
┌─────────────────────────────────────────┐
│            Infoscreen Client            │
│                                         │
│  ┌──────────────┐    ┌──────────────┐   │
│  │ Media Player │    │   Watchdog   │   │
│  │ (VLC/Chrome) │◄───│   Monitor    │   │
│  └──────────────┘    └──────┬───────┘   │
│                             │           │
│  ┌──────────────┐           │           │
│  │  Event Mgr   │           │           │
│  │  (receives   │           │           │
│  │   schedule)  │◄──────────┘           │
│  └──────┬───────┘                       │
│         │                               │
│  ┌──────▼───────────────────────┐       │
│  │         MQTT Client          │       │
│  │  - Heartbeat (every 60s)     │       │
│  │  - Logs (error/warn/info)    │       │
│  │  - Health metrics (every 5s) │       │
│  └──────┬───────────────────────┘       │
└─────────┼───────────────────────────────┘
          │
          │ MQTT over TCP
          ▼
   ┌─────────────┐
   │ MQTT Broker │
   │  (server)   │
   └─────────────┘
```

---

## 2. MQTT Protocol Specification

### 2.1 Connection Parameters
```
Broker:            192.168.43.201 (or DNS hostname)
Port:              1883 (standard MQTT)
Protocol:          MQTT v3.1.1
Client ID:         "infoscreen-{client_uuid}"
Clean Session:     false (retain subscriptions)
Keep Alive:        60 seconds
Username/Password: (if configured on broker)
```

### 2.2 QoS Levels
- **Heartbeat:** QoS 0 (fire and forget, high frequency)
- **Logs (ERROR/WARN):** QoS 1 (at-least-once delivery, important)
- **Logs (INFO):** QoS 0 (optional, high volume)
- **Health metrics:** QoS 0 (frequent, latest value matters)

---

## 3. Topic Structure & Payload Formats

### 3.1 Log Messages

#### Topic Pattern:
```
infoscreen/{client_uuid}/logs/{level}
```

Where `{level}` is one of: `error`, `warn`, `info`

#### Payload Format (JSON):
```json
{
  "timestamp": "2026-03-10T07:30:00Z",
  "message": "Human-readable error description",
  "context": {
    "event_id": 42,
    "process": "vlc",
    "error_code": "NETWORK_TIMEOUT",
    "additional_key": "any relevant data"
  }
}
```

#### Field Specifications:
| Field | Type | Required | Description |
|-------|------|----------|-------------|
| `timestamp` | string (ISO 8601 UTC) | Yes | When the event occurred. Use `YYYY-MM-DDTHH:MM:SSZ` format |
| `message` | string | Yes | Human-readable description of the event (max 1000 chars) |
| `context` | object | No | Additional structured data (will be stored as JSON) |

#### Example Topics:
```
infoscreen/9b8d1856-ff34-4864-a726-12de072d0f77/logs/error
infoscreen/9b8d1856-ff34-4864-a726-12de072d0f77/logs/warn
infoscreen/9b8d1856-ff34-4864-a726-12de072d0f77/logs/info
```
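The topic and payload rules above can be sketched as a small publisher-side helper. This is an illustrative sketch only: `build_log_message` is a hypothetical name, not part of the existing client code.

```python
import json
from datetime import datetime, timezone

def build_log_message(client_uuid, level, message, context=None):
    """Return (topic, payload) for a log publish, following the spec above."""
    if level not in ("error", "warn", "info"):
        raise ValueError(f"unsupported log level: {level}")
    topic = f"infoscreen/{client_uuid}/logs/{level}"
    body = {
        # ISO 8601 UTC with trailing 'Z', e.g. 2026-03-10T07:30:00Z
        "timestamp": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "message": message[:1000],  # spec caps message at 1000 chars
    }
    if context:
        body["context"] = context
    return topic, json.dumps(body)

topic, payload = build_log_message(
    "9b8d1856-ff34-4864-a726-12de072d0f77", "error",
    "VLC process crashed", {"event_id": 42, "process": "vlc"})
```

The resulting `(topic, payload)` pair can then be handed to any MQTT client, with QoS 1 for ERROR/WARN per section 2.2.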

#### When to Send Logs:

**ERROR (Always send):**
- Process crashed (VLC/Chromium/PDF viewer terminated unexpectedly)
- Content failed to load (404, network timeout, corrupt file)
- Hardware failure detected (display off, audio device missing)
- Exception caught in main event loop
- Maximum restart attempts exceeded

**WARN (Always send):**
- Process restarted automatically (after crash)
- High resource usage (CPU >80%, RAM >90%)
- Slow performance (frame drops, lag)
- Non-critical failures (screenshot capture failed, cache full)
- Fallback content displayed (primary source unavailable)

**INFO (Send in development, optional in production):**
- Process started successfully
- Event transition (switched from video to presentation)
- Content loaded successfully
- Watchdog service started/stopped

---

### 3.2 Health Metrics

#### Topic Pattern:
```
infoscreen/{client_uuid}/health
```

#### Payload Format (JSON):
```json
{
  "timestamp": "2026-03-10T07:30:00Z",
  "expected_state": {
    "event_id": 42,
    "event_type": "video",
    "media_file": "presentation.mp4",
    "started_at": "2026-03-10T07:15:00Z"
  },
  "actual_state": {
    "process": "vlc",
    "pid": 1234,
    "status": "running",
    "uptime_seconds": 900,
    "position": 45.3,
    "duration": 180.0
  },
  "health_metrics": {
    "screen_on": true,
    "last_frame_update": "2026-03-10T07:29:58Z",
    "frames_dropped": 2,
    "network_errors": 0,
    "cpu_percent": 15.3,
    "memory_mb": 234
  }
}
```

#### Field Specifications:

**expected_state:**
| Field | Type | Required | Description |
|-------|------|----------|-------------|
| `event_id` | integer | Yes | Current event ID from scheduler |
| `event_type` | string | Yes | `presentation`, `video`, `website`, `webuntis`, `message` |
| `media_file` | string | No | Filename or URL of current content |
| `started_at` | string (ISO 8601) | Yes | When this event started playing |

**actual_state:**
| Field | Type | Required | Description |
|-------|------|----------|-------------|
| `process` | string | Yes | `vlc`, `chromium`, `pdf_viewer`, `none` |
| `pid` | integer | No | Process ID (if running) |
| `status` | string | Yes | `running`, `crashed`, `starting`, `stopped` |
| `uptime_seconds` | integer | No | How long process has been running |
| `position` | float | No | Current playback position (seconds, for video/audio) |
| `duration` | float | No | Total content duration (seconds) |

**health_metrics:**
| Field | Type | Required | Description |
|-------|------|----------|-------------|
| `screen_on` | boolean | Yes | Is display powered on? |
| `last_frame_update` | string (ISO 8601) | No | Last time screen content changed |
| `frames_dropped` | integer | No | Video frames dropped (performance indicator) |
| `network_errors` | integer | No | Count of network errors in last interval |
| `cpu_percent` | float | No | CPU usage (0-100) |
| `memory_mb` | integer | No | RAM usage in megabytes |

#### Sending Frequency:
- **Normal operation:** Every 5 seconds
- **During startup/transition:** Every 1 second
- **After error:** Immediately + every 2 seconds until recovered
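Comparing `expected_state` against `actual_state` can be sketched as a pure classifier. The event-type-to-process mapping is taken from section 4.1; how the `message` event type is rendered is not specified there, so it is left unmapped, and `evaluate_health` is a hypothetical helper name.

```python
# Expected media player per event type (from section 4.1)
PROCESS_FOR_EVENT = {
    "video": "vlc",
    "website": "chromium",
    "webuntis": "chromium",
    "presentation": "pdf_viewer",
}

def evaluate_health(expected_state, actual_state):
    """Classify a health payload as 'ok', 'crashed', or 'content_mismatch'."""
    if actual_state.get("status") != "running":
        return "crashed"
    wanted = PROCESS_FOR_EVENT.get(expected_state.get("event_type"))
    if wanted and actual_state.get("process") != wanted:
        return "content_mismatch"  # Check 3: wrong player for this event type
    return "ok"

verdict = evaluate_health({"event_type": "video"},
                          {"process": "vlc", "status": "running"})
```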

---

### 3.3 Enhanced Heartbeat

The existing heartbeat topic should be enhanced to include process status.

#### Topic Pattern:
```
infoscreen/{client_uuid}/heartbeat
```

#### Enhanced Payload Format (JSON):
```json
{
  "uuid": "9b8d1856-ff34-4864-a726-12de072d0f77",
  "timestamp": "2026-03-10T07:30:00Z",
  "current_process": "vlc",
  "process_pid": 1234,
  "process_status": "running",
  "current_event_id": 42
}
```

#### New Fields (add to existing heartbeat):
| Field | Type | Required | Description |
|-------|------|----------|-------------|
| `current_process` | string | No | Name of active media player process |
| `process_pid` | integer | No | Process ID |
| `process_status` | string | No | `running`, `crashed`, `starting`, `stopped` |
| `current_event_id` | integer | No | Event ID currently being displayed |

#### Sending Frequency:
- Keep existing interval: **every 60 seconds**
- Include the new fields if available

---

## 4. Process Monitoring Requirements

### 4.1 Processes to Monitor

| Media Type | Process Name | How to Detect |
|------------|--------------|---------------|
| Video | `vlc` | `ps aux \| grep vlc` or `pgrep vlc` |
| Website/WebUntis | `chromium` or `chromium-browser` | `pgrep chromium` |
| PDF Presentation | `evince`, `okular`, or custom viewer | `pgrep {viewer_name}` |

### 4.2 Monitoring Checks (Every 5 seconds)

#### Check 1: Process Alive
```
Goal: Verify expected process is running
Method:
- Get list of running processes (psutil or `ps`)
- Check if expected process name exists
- Match PID if known
Result:
- If missing → status = "crashed"
- If found → status = "running"
Action on crash:
- Send ERROR log immediately
- Attempt restart (max 3 attempts)
- Send WARN log on each restart
- If max restarts exceeded → send ERROR log, display fallback
```
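A minimal stdlib-only sketch of the PID half of Check 1: on POSIX, `os.kill(pid, 0)` delivers no signal but raises if the process does not exist. A real watchdog would more likely match by process name via psutil, as the check describes.

```python
import os

def pid_alive(pid):
    """Probe a PID with signal 0: nothing is delivered, but the call
    raises ProcessLookupError if the process is gone."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # process exists but belongs to another user
    return True

status = "running" if pid_alive(os.getpid()) else "crashed"
```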

#### Check 2: Process Responsive
```
Goal: Detect frozen processes
Method:
- For VLC: Query HTTP interface (status.json)
- For Chromium: Use DevTools Protocol (CDP)
- For custom viewers: Check last screen update time
Result:
- If same frame >30 seconds → likely frozen
- If playback position not advancing → frozen
Action on freeze:
- Send WARN log
- Force refresh (reload page, seek video, next slide)
- If refresh fails → restart process
```
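The "position not advancing" rule can be implemented as a tiny stateful detector. Sketch only; timestamps are injected so the logic is testable without a real clock or player, and `FreezeDetector` is an illustrative name.

```python
class FreezeDetector:
    """Flag a player as frozen when its playback position stops advancing."""

    def __init__(self, threshold_seconds=30.0):
        self.threshold = threshold_seconds
        self.last_position = None
        self.last_change_at = None

    def update(self, position, now):
        """Record a position sample; return True if frozen."""
        if self.last_position is None or position != self.last_position:
            self.last_position = position
            self.last_change_at = now
            return False
        return (now - self.last_change_at) > self.threshold

det = FreezeDetector()
det.update(10.0, now=0.0)            # first sample
stuck = det.update(10.0, now=45.0)   # same position 45s later → frozen
```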

#### Check 3: Content Match
```
Goal: Verify correct content is displayed
Method:
- Compare expected event_id with actual media/URL
- Check scheduled time window (is event still active?)
Result:
- Mismatch → content error
Action:
- Send WARN log
- Reload correct event from scheduler
```

---

## 5. Process Control Interface Requirements

### 5.1 VLC Control

**Requirement:** Enable the VLC HTTP interface for monitoring

**Launch Command:**
```bash
vlc --intf http --http-host 127.0.0.1 --http-port 8080 --http-password "vlc_password" \
    --fullscreen --loop /path/to/video.mp4
```

**Status Query:**
```bash
curl http://127.0.0.1:8080/requests/status.json --user ":vlc_password"
```

**Response Fields to Monitor:**
```json
{
  "state": "playing",   // "playing", "paused", "stopped"
  "position": 0.25,     // 0.0-1.0 (25% through)
  "time": 45,           // seconds into playback
  "length": 180,        // total duration in seconds
  "volume": 256         // 0-512
}
```
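A small sketch translating these status.json fields into the `actual_state` shape from section 3.2, assuming the fields shown above are present (`position` is a 0.0-1.0 fraction of `length`). `parse_vlc_status` is a hypothetical helper name.

```python
def parse_vlc_status(status):
    """Derive watchdog-relevant values from VLC's status.json fields."""
    length = status.get("length", 0) or 0
    return {
        "status": "running" if status.get("state") == "playing" else "stopped",
        "position": round(status.get("position", 0.0) * length, 1),  # seconds
        "duration": float(length),
    }

actual = parse_vlc_status(
    {"state": "playing", "position": 0.25, "time": 45, "length": 180})
```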

---

### 5.2 Chromium Control

**Requirement:** Enable the Chrome DevTools Protocol (CDP)

**Launch Command:**
```bash
chromium --remote-debugging-port=9222 --kiosk --app=https://example.com
```

**Status Query:**
```bash
curl http://127.0.0.1:9222/json
```

**Response Fields to Monitor:**
```json
[
  {
    "url": "https://example.com",
    "title": "Page Title",
    "type": "page"
  }
]
```

**Advanced:** Use the CDP WebSocket for events (page load, navigation, errors)

---

### 5.3 PDF Viewer (Custom or Standard)

**Option A: Standard Viewer (e.g., Evince)**
- No built-in API
- Monitor via process check + screenshot comparison

**Option B: Custom Python Viewer**
- Implement a REST API for status queries
- Track: current page, total pages, last transition time

---

## 6. Watchdog Service Architecture

### 6.1 Service Components

**Component 1: Process Monitor Thread**
```
Responsibilities:
- Check process alive every 5 seconds
- Detect crashes and frozen processes
- Attempt automatic restart
- Send health metrics via MQTT

State Machine:
IDLE → STARTING → RUNNING → (if crash) → RESTARTING → RUNNING
                          → (if max restarts) → FAILED
```

**Component 2: MQTT Publisher Thread**
```
Responsibilities:
- Maintain MQTT connection
- Send heartbeat every 60 seconds
- Send logs on demand (queued from other components)
- Send health metrics every 5 seconds
- Reconnect on connection loss
```

**Component 3: Event Manager Integration**
```
Responsibilities:
- Receive event schedule from server
- Notify watchdog of expected process/content
- Launch media player processes
- Handle event transitions
```

### 6.2 Service Lifecycle

**On Startup:**
1. Load configuration (client UUID, MQTT broker, etc.)
2. Connect to MQTT broker
3. Send INFO log: "Watchdog service started"
4. Wait for first event from scheduler

**During Operation:**
1. Monitor loop runs every 5 seconds
2. Check expected vs actual process state
3. Send health metrics
4. Handle failures (log + restart)

**On Shutdown:**
1. Send INFO log: "Watchdog service stopping"
2. Gracefully stop monitored processes
3. Disconnect from MQTT
4. Exit cleanly

---

## 7. Auto-Recovery Logic

### 7.1 Restart Strategy

**Step 1: Detect Failure**
```
Trigger: Process not found in process list
Action:
- Log ERROR: "Process {name} crashed"
- Increment restart counter
- Check if within retry limit (max 3)
```

**Step 2: Attempt Restart**
```
If restart_attempts < MAX_RESTARTS:
- Log WARN: "Attempting restart ({attempt}/{MAX_RESTARTS})"
- Kill any zombie processes
- Wait 2 seconds (cooldown)
- Launch process with same parameters
- Wait 5 seconds for startup
- Verify process is running
- If success: reset restart counter, log INFO
- If fail: increment counter, repeat
```

**Step 3: Permanent Failure**
```
If restart_attempts >= MAX_RESTARTS:
- Log ERROR: "Max restart attempts exceeded, failing over"
- Display fallback content (static image with error message)
- Send notification to server (separate alert topic, optional)
- Wait for manual intervention or scheduler event change
```

### 7.2 Restart Cooldown

**Purpose:** Prevent rapid restart loops that waste resources

**Implementation:**
```
After each restart attempt:
- Wait 2 seconds before next restart
- After 3 failures: wait 30 seconds before trying again
- Reset counter on successful run >5 minutes
```
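The cooldown rules above boil down to a small amount of counter state. A sketch with illustrative names (`RestartController`, `next_delay`); the delay is returned rather than slept so the logic stays testable.

```python
class RestartController:
    """Track restart attempts with the cooldown rules above."""

    MAX_RESTARTS = 3
    COOLDOWN = 2         # seconds between attempts
    LONG_COOLDOWN = 30   # seconds after exhausting the retry limit
    STABLE_SECONDS = 5 * 60

    def __init__(self):
        self.attempts = 0

    def next_delay(self):
        """Record an attempt and return how long to wait before it."""
        self.attempts += 1
        if self.attempts <= self.MAX_RESTARTS:
            return self.COOLDOWN
        return self.LONG_COOLDOWN

    def record_uptime(self, seconds):
        """Reset the counter after a stable run (>5 minutes)."""
        if seconds > self.STABLE_SECONDS:
            self.attempts = 0

ctl = RestartController()
delays = [ctl.next_delay() for _ in range(4)]  # [2, 2, 2, 30]
```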

---

## 8. Resource Monitoring

### 8.1 System Metrics to Track

**CPU Usage:**
```
Method:    Read /proc/stat or use psutil.cpu_percent()
Frequency: Every 5 seconds
Threshold: Warn if >80% for >60 seconds
```

**Memory Usage:**
```
Method:    Read /proc/meminfo or use psutil.virtual_memory()
Frequency: Every 5 seconds
Threshold: Warn if >90% for >30 seconds
```

**Display Status:**
```
Method:    Check DPMS state via xset query
Frequency: Every 30 seconds
Threshold: Error if display off (unexpected)
```

**Network Connectivity:**
```
Method:    Ping server or check MQTT connection
Frequency: Every 60 seconds
Threshold: Warn if no server connectivity
```
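All of the "warn if above X for Y seconds" thresholds share one pattern: the metric must stay over the limit for a minimum duration. A reusable sketch with injected timestamps (`SustainedThreshold` is an illustrative name):

```python
class SustainedThreshold:
    """Fire only when a metric stays above a limit for a minimum duration."""

    def __init__(self, limit, hold_seconds):
        self.limit = limit
        self.hold = hold_seconds
        self.exceeded_since = None

    def sample(self, value, now):
        """Feed one metric sample; return True when the warn should fire."""
        if value <= self.limit:
            self.exceeded_since = None  # back below the limit: reset
            return False
        if self.exceeded_since is None:
            self.exceeded_since = now
        return (now - self.exceeded_since) > self.hold

# CPU rule from above: warn if >80% for >60 seconds
cpu = SustainedThreshold(limit=80.0, hold_seconds=60.0)
cpu.sample(85.0, now=0)           # above limit, not yet sustained
warn = cpu.sample(85.0, now=61)   # sustained for >60s
```

The same class covers the memory rule (`limit=90.0, hold_seconds=30.0`) with a second instance.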

---

## 9. Development vs Production Mode

### 9.1 Development Mode

**Enable via:** Environment variable `DEBUG=true` or `ENV=development`

**Behavior:**
- Send INFO level logs
- More verbose logging to console
- Shorter monitoring intervals (faster feedback)
- Screenshot capture every 30 seconds
- No rate limiting on logs

### 9.2 Production Mode

**Enable via:** `ENV=production`

**Behavior:**
- Send only ERROR and WARN logs
- Minimal console output
- Standard monitoring intervals
- Screenshot capture every 60 seconds
- Rate limiting: max 10 logs per minute per level
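The production rate limit can be sketched as a sliding one-minute window per level. The clock is injected for testability; a real client would pass `time.time()`, and `LogRateLimiter` is an illustrative name.

```python
from collections import deque

class LogRateLimiter:
    """Drop logs beyond N per sliding minute, per level."""

    def __init__(self, max_per_minute=10):
        self.max = max_per_minute
        self.sent = {}  # level -> deque of send timestamps

    def allow(self, level, now):
        window = self.sent.setdefault(level, deque())
        while window and now - window[0] >= 60.0:
            window.popleft()  # forget sends older than one minute
        if len(window) >= self.max:
            return False
        window.append(now)
        return True

rl = LogRateLimiter(max_per_minute=10)
decisions = [rl.allow("warn", now=float(i)) for i in range(12)]
```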

---

## 10. Configuration File Format

### 10.1 Recommended Config: JSON

**File:** `/etc/infoscreen/config.json` or `~/.config/infoscreen/config.json`

```json
{
  "client": {
    "uuid": "9b8d1856-ff34-4864-a726-12de072d0f77",
    "hostname": "infoscreen-room-101"
  },
  "mqtt": {
    "broker": "192.168.43.201",
    "port": 1883,
    "username": "",
    "password": "",
    "keepalive": 60
  },
  "monitoring": {
    "enabled": true,
    "health_interval_seconds": 5,
    "heartbeat_interval_seconds": 60,
    "max_restart_attempts": 3,
    "restart_cooldown_seconds": 2
  },
  "logging": {
    "level": "INFO",
    "send_info_logs": false,
    "console_output": true,
    "local_log_file": "/var/log/infoscreen/watchdog.log"
  },
  "processes": {
    "vlc": {
      "http_port": 8080,
      "http_password": "vlc_password"
    },
    "chromium": {
      "debug_port": 9222
    }
  }
}
```
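Loading this file is simplest when missing keys fall back to defaults, so a partial config still works. A sketch with defaults mirroring the `monitoring` section above; `load_config` is a hypothetical helper, not part of the spec.

```python
import json

# Defaults mirror the "monitoring" section of the config above
DEFAULTS = {
    "monitoring": {
        "enabled": True,
        "health_interval_seconds": 5,
        "heartbeat_interval_seconds": 60,
        "max_restart_attempts": 3,
        "restart_cooldown_seconds": 2,
    }
}

def load_config(text):
    """Parse config JSON and fill in missing monitoring keys with defaults."""
    config = json.loads(text)
    for section, defaults in DEFAULTS.items():
        merged = dict(defaults)
        merged.update(config.get(section, {}))
        config[section] = merged
    return config

cfg = load_config(
    '{"client": {"uuid": "abc"}, "monitoring": {"max_restart_attempts": 5}}')
```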

---

## 11. Error Scenarios & Expected Behavior

### Scenario 1: VLC Crashes Mid-Video
```
1. Watchdog detects: process_status = "crashed"
2. Send ERROR log: "VLC process crashed"
3. Attempt 1: Restart VLC with same video, seek to last position
4. If success: Send INFO log "VLC restarted successfully"
5. If fail: Repeat 2 more times
6. After 3 failures: Send ERROR "Max restarts exceeded", show fallback
```

### Scenario 2: Network Timeout Loading Website
```
1. Chromium fails to load page (CDP reports error)
2. Send WARN log: "Page load timeout"
3. Attempt reload (Chromium refresh)
4. If success after 10s: Continue monitoring
5. If timeout again: Send ERROR, try restarting Chromium
```

### Scenario 3: Display Powers Off (Hardware)
```
1. DPMS check detects display off
2. Send ERROR log: "Display powered off"
3. Attempt to wake display (xset dpms force on)
4. If success: Send INFO log
5. If fail: Hardware issue, alert admin
```

### Scenario 4: High CPU Usage
```
1. CPU >80% for 60 seconds
2. Send WARN log: "High CPU usage: 85%"
3. Check if expected (e.g., video playback is normal)
4. If unexpected: investigate process causing it
5. If critical (>95%): consider restarting offending process
```

---
|
||||||
|
|
||||||
|
## 12. Testing & Validation
|
||||||
|
|
||||||
|
### 12.1 Manual Tests (During Development)
|
||||||
|
|
||||||
|
**Test 1: Process Crash Simulation**
|
||||||
|
```bash
|
||||||
|
# Start video, then kill VLC manually
|
||||||
|
killall vlc
|
||||||
|
# Expected: ERROR log sent, automatic restart within 5 seconds
|
||||||
|
```
|
||||||
|
|
||||||
|
**Test 2: MQTT Connectivity**
|
||||||
|
```bash
|
||||||
|
# Subscribe to all client topics on server
|
||||||
|
mosquitto_sub -h 192.168.43.201 -t "infoscreen/{uuid}/#" -v
|
||||||
|
# Expected: See heartbeat every 60s, health every 5s
|
||||||
|
```
|
||||||
|
|
||||||
|
**Test 3: Log Levels**
```bash
# Trigger error condition and verify log appears in database
curl http://192.168.43.201:8000/api/client-logs/test
# Expected: See new log entry with correct level/message
```

### 12.2 Acceptance Criteria

✅ **Client must:**
1. Send heartbeat every 60 seconds without gaps
2. Send ERROR log within 5 seconds of process crash
3. Attempt automatic restart (max 3 times)
4. Report health metrics every 5 seconds
5. Survive MQTT broker restart (reconnect automatically)
6. Survive network interruption (buffer logs, send when reconnected)
7. Use correct timestamp format (ISO 8601 UTC)
8. Only send logs for real client UUID (FK constraint)

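Criterion 6 (buffer logs while offline, flush on reconnect) can be sketched with a bounded queue. `publish` here is an assumed `callable(topic, payload_json) -> bool` wrapper around the MQTT client, not a paho-mqtt API:

```python
import json
from collections import deque

class BufferedLogSender:
    """Buffers log payloads while the connection is down and flushes them
    in order on reconnect; oldest entries drop first when the buffer fills."""

    def __init__(self, publish, maxlen: int = 500):
        self.publish = publish
        self.pending = deque(maxlen=maxlen)
        self.connected = False

    def send(self, topic: str, payload: dict):
        encoded = json.dumps(payload)
        if self.connected and self.publish(topic, encoded):
            return
        self.pending.append((topic, encoded))  # queue for later delivery

    def on_reconnect(self):
        self.connected = True
        while self.pending:
            topic, encoded = self.pending[0]
            if not self.publish(topic, encoded):
                break  # still offline; keep remaining entries
            self.pending.popleft()
```

Wire `on_reconnect` into the MQTT client's on_connect callback so buffered entries drain as soon as the broker is reachable again.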
---

## 13. Python Libraries (Recommended)

**For process monitoring:**
- `psutil` - Cross-platform process and system utilities

**For MQTT:**
- `paho-mqtt` - Official MQTT client (use v2.x with Callback API v2)

**For VLC control:**
- `requests` - HTTP client for status queries

**For Chromium control:**
- `websocket-client` or `pychrome` - Chrome DevTools Protocol

**For datetime:**
- `datetime` (stdlib) - Use `datetime.now(timezone.utc).isoformat()`

**Example requirements.txt:**
```
paho-mqtt>=2.0.0
psutil>=5.9.0
requests>=2.31.0
python-dateutil>=2.8.0
```

---

## 14. Security Considerations

### 14.1 MQTT Security
- If broker requires auth, store credentials in config file with restricted permissions (`chmod 600`)
- Consider TLS/SSL for MQTT (port 8883) if on untrusted network
- Use unique client ID to prevent impersonation

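The `chmod 600` rule above can be enforced at load time rather than trusted. A minimal sketch; the `username:password` one-line file format is an assumption for illustration, not a project convention:

```python
import os
import stat

def load_mqtt_credentials(path: str) -> dict:
    """Refuse to read a credentials file that group/other users can access."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    if mode & (stat.S_IRWXG | stat.S_IRWXO):
        raise PermissionError(f"{path} must be chmod 600 (found {oct(mode)})")
    with open(path) as f:
        username, _, password = f.read().strip().partition(":")
    return {"username": username, "password": password}
```

Failing loudly on a world-readable file turns a silent misconfiguration into a visible startup error.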
### 14.2 Process Control APIs
- VLC HTTP password should be random, not default
- Chromium debug port should bind to `127.0.0.1` only (not `0.0.0.0`)
- Restrict file system access for media player processes

### 14.3 Log Content
- **Do not log:** Passwords, API keys, personal data
- **Sanitize:** File paths (strip user directories), URLs (remove query params with tokens)

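The URL sanitization rule can be sketched with the stdlib URL helpers; the parameter denylist is an illustrative assumption and should be tuned to the sites actually displayed:

```python
from urllib.parse import urlsplit, urlunsplit

SENSITIVE_PARAMS = {"token", "key", "password", "secret", "auth"}  # assumed denylist

def sanitize_url(url: str) -> str:
    """Drop query parameters whose names look like credentials before a URL
    is written into a log message."""
    parts = urlsplit(url)
    kept = [pair for pair in parts.query.split("&")
            if pair and pair.split("=", 1)[0].lower() not in SENSITIVE_PARAMS]
    return urlunsplit(parts._replace(query="&".join(kept)))
```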
---

## 15. Performance Targets

| Metric | Target | Acceptable | Critical |
|--------|--------|------------|----------|
| Health check interval | 5s | 10s | 30s |
| Crash detection time | <5s | <10s | <30s |
| Restart time | <10s | <20s | <60s |
| MQTT publish latency | <100ms | <500ms | <2s |
| CPU usage (watchdog) | <2% | <5% | <10% |
| RAM usage (watchdog) | <50MB | <100MB | <200MB |
| Log message size | <1KB | <10KB | <100KB |

---

## 16. Troubleshooting Guide (For Client Development)

### Issue: Logs not appearing in server database
**Check:**
1. Is MQTT broker reachable? (`mosquitto_pub` test from client)
2. Is client UUID correct and exists in `clients` table?
3. Is timestamp format correct (ISO 8601 with 'Z')?
4. Check server listener logs for errors

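Check 3 can be automated on the client before publishing. A small validator, assuming the listener's behavior of accepting both the `Z` and `+00:00` offset forms:

```python
from datetime import datetime, timezone

def is_valid_utc_timestamp(ts: str) -> bool:
    """True if `ts` parses as ISO 8601 with an explicit UTC offset
    ('Z' or '+00:00')."""
    try:
        parsed = datetime.fromisoformat(ts.replace("Z", "+00:00"))
    except (ValueError, AttributeError):
        return False
    # Reject naive timestamps and non-UTC offsets
    return parsed.tzinfo is not None and parsed.utcoffset() == timezone.utc.utcoffset(None)
```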
### Issue: Health metrics not updating
**Check:**
1. Is health loop running? (check watchdog service status)
2. Is MQTT connected? (check connection status in logs)
3. Is payload JSON valid? (use JSON validator)

### Issue: Process restarts in loop
**Check:**
1. Is media file/URL accessible?
2. Is process command correct? (test manually)
3. Check process exit code (crash reason)
4. Increase restart cooldown to avoid rapid loops

---

## 17. Complete Message Flow Diagram

```
┌──────────────────────────────────────────────────────────┐
│                    Infoscreen Client                     │
│                                                          │
│  Event Occurs:                                           │
│   - Process crashed                                      │
│   - High CPU usage                                       │
│   - Content loaded                                       │
│                                                          │
│        ┌────────────────┐                                │
│        │ Decision Logic │                                │
│        │ - Is it ERROR? │                                │
│        │ - Is it WARN?  │                                │
│        │ - Is it INFO?  │                                │
│        └────────┬───────┘                                │
│                 │                                        │
│                 ▼                                        │
│  ┌──────────────────────────────────────┐                │
│  │ Build JSON Payload                   │                │
│  │ {                                    │                │
│  │   "timestamp": "...",                │                │
│  │   "message": "...",                  │                │
│  │   "context": {...}                   │                │
│  │ }                                    │                │
│  └──────────────┬───────────────────────┘                │
│                 │                                        │
│                 ▼                                        │
│  ┌──────────────────────────────────────┐                │
│  │ MQTT Publish                         │                │
│  │ Topic: infoscreen/{uuid}/logs/error  │                │
│  │ QoS: 1                               │                │
│  └──────────────┬───────────────────────┘                │
└─────────────────┼────────────────────────────────────────┘
                  │
                  │ TCP/IP (MQTT Protocol)
                  │
                  ▼
          ┌──────────────┐
          │ MQTT Broker  │
          │ (Mosquitto)  │
          └──────┬───────┘
                 │
                 │ Topic: infoscreen/+/logs/#
                 │
                 ▼
┌──────────────────────────────┐
│ Listener Service             │
│ (Python)                     │
│                              │
│ - Parse JSON                 │
│ - Validate UUID              │
│ - Store in database          │
└──────┬───────────────────────┘
       │
       ▼
┌──────────────────────────────┐
│ MariaDB Database             │
│                              │
│ Table: client_logs           │
│  - client_uuid               │
│  - timestamp                 │
│  - level                     │
│  - message                   │
│  - context (JSON)            │
└──────┬───────────────────────┘
       │
       │ SQL Query
       │
       ▼
┌──────────────────────────────────┐
│ API Server (Flask)               │
│                                  │
│ GET /api/client-logs/{uuid}/logs │
│ GET /api/client-logs/summary     │
└──────┬───────────────────────────┘
       │
       │ HTTP/JSON
       │
       ▼
┌──────────────────────────────┐
│ Dashboard (React)            │
│                              │
│ - Display logs               │
│ - Filter by level            │
│ - Show health status         │
└──────────────────────────────┘
```

---

## 18. Quick Reference Card

### MQTT Topics Summary
```
infoscreen/{uuid}/logs/error   → Critical failures
infoscreen/{uuid}/logs/warn    → Non-critical issues
infoscreen/{uuid}/logs/info    → Informational (dev mode)
infoscreen/{uuid}/health       → Health metrics (every 5s)
infoscreen/{uuid}/heartbeat    → Enhanced heartbeat (every 60s)
```

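The topic scheme above combines with the JSON payload and the tiered QoS strategy (ERROR/WARN at QoS 1, INFO at QoS 0) into one publish per event. A sketch of assembling a single message; the function name is illustrative:

```python
import json
from datetime import datetime, timezone

def build_log_message(client_uuid, level, message, context=None):
    """Return (topic, json_payload, qos) for one log publish."""
    topic = f"infoscreen/{client_uuid}/logs/{level}"
    payload = json.dumps({
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "message": message,
        "context": context or {},
    })
    qos = 1 if level in ("error", "warn") else 0  # tiered delivery guarantees
    return topic, payload, qos
```

The returned tuple feeds directly into `mqtt_client.publish(topic, payload, qos=qos)`.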
### JSON Timestamp Format
```python
from datetime import datetime, timezone
timestamp = datetime.now(timezone.utc).isoformat()
# Output: "2026-03-10T07:30:00+00:00" (the equivalent "2026-03-10T07:30:00Z" form is also accepted)
```

### Process Status Values
```
"running"  - Process is alive and responding
"crashed"  - Process terminated unexpectedly
"starting" - Process is launching (startup phase)
"stopped"  - Process intentionally stopped
```

### Restart Logic
```
Max attempts: 3
Cooldown: 2 seconds between attempts
Reset: After 5 minutes of successful operation
```

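The restart rules above can be captured in a small policy object; `clock` is injectable so the 5-minute reset is testable. The class name is illustrative:

```python
import time

class RestartPolicy:
    """Max 3 attempts; the counter resets after 5 minutes without a failure.
    The caller is expected to sleep COOLDOWN_S between attempts."""
    MAX_ATTEMPTS = 3
    COOLDOWN_S = 2
    RESET_AFTER_S = 300

    def __init__(self, clock=time.monotonic):
        self.clock = clock
        self.attempts = 0
        self.last_failure = None

    def allow_restart(self) -> bool:
        """Called on crash; True if another restart may be attempted."""
        now = self.clock()
        if self.last_failure is not None and now - self.last_failure >= self.RESET_AFTER_S:
            self.attempts = 0  # stable for 5 minutes since the last failure
        if self.attempts >= self.MAX_ATTEMPTS:
            return False
        self.attempts += 1
        self.last_failure = now
        return True
```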
---

## 19. Contact & Support

**Server API Documentation:**
- Base URL: `http://192.168.43.201:8000`
- Health check: `GET /health`
- Test logs: `GET /api/client-logs/test` (no auth)
- Full API docs: See `CLIENT_MONITORING_IMPLEMENTATION_GUIDE.md` on server

**MQTT Broker:**
- Host: `192.168.43.201`
- Port: `1883` (standard), `9001` (WebSocket)
- Test tool: `mosquitto_pub` / `mosquitto_sub`

**Database Schema:**
- Table: `client_logs`
- Foreign Key: `client_uuid` → `clients.uuid` (ON DELETE CASCADE)
- Constraint: UUID must exist in clients table before logging

**Server-Side Logs:**
```bash
# View listener logs (processes MQTT messages)
docker compose logs -f listener

# View server logs (API requests)
docker compose logs -f server
```

---

## 20. Appendix: Example Implementations

### A. Minimal Python Watchdog (Pseudocode)

```python
import time
import json
import subprocess
import psutil
import paho.mqtt.client as mqtt
from datetime import datetime, timezone

class MinimalWatchdog:
    def __init__(self, client_uuid, mqtt_broker):
        self.uuid = client_uuid
        self.mqtt_client = mqtt.Client(callback_api_version=mqtt.CallbackAPIVersion.VERSION2)
        self.mqtt_client.connect(mqtt_broker, 1883, 60)
        self.mqtt_client.loop_start()

        self.expected_process = None
        self.restart_command = None  # e.g. ["vlc", "--fullscreen", "video.mp4"]
        self.restart_attempts = 0
        self.MAX_RESTARTS = 3

    def send_log(self, level, message, context=None):
        topic = f"infoscreen/{self.uuid}/logs/{level}"
        payload = {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "message": message,
            "context": context or {}
        }
        self.mqtt_client.publish(topic, json.dumps(payload), qos=1)

    def is_process_running(self, process_name):
        for proc in psutil.process_iter(['name']):
            name = proc.info['name'] or ""  # name can be None for some processes
            if process_name in name:
                return True
        return False

    def restart_process(self):
        # Minimal restart: relaunch the configured command and log the attempt
        self.restart_attempts += 1
        subprocess.Popen(self.restart_command)
        self.send_log("info", f"Restart attempt {self.restart_attempts}")

    def monitor_loop(self):
        while True:
            if self.expected_process:
                if not self.is_process_running(self.expected_process):
                    self.send_log("error", f"{self.expected_process} crashed")
                    if self.restart_attempts < self.MAX_RESTARTS:
                        self.restart_process()
                    else:
                        self.send_log("error", "Max restarts exceeded")
            time.sleep(5)

# Usage:
watchdog = MinimalWatchdog("9b8d1856-ff34-4864-a726-12de072d0f77", "192.168.43.201")
watchdog.expected_process = "vlc"
watchdog.restart_command = ["vlc", "--fullscreen"]
watchdog.monitor_loop()
```

---

**END OF SPECIFICATION**

Questions? Refer to:
- `CLIENT_MONITORING_IMPLEMENTATION_GUIDE.md` (server repo)
- Server API: `http://192.168.43.201:8000/api/client-logs/test`
- MQTT test: `mosquitto_sub -h 192.168.43.201 -t infoscreen/#`
@@ -56,6 +56,51 @@ Notes for integrators:
- CSS follows modern Material 3 color-function notation (`rgb(r g b / alpha%)`)
- Syncfusion ScheduleComponent requires TimelineViews, Resize, and DragAndDrop modules injected

Backend technical work (post-release notes; no version bump):
- 📊 **Client Monitoring Infrastructure (Server-Side) (2026-03-10)**:
  - Database schema: New Alembic migration `c1d2e3f4g5h6_add_client_monitoring.py` (idempotent) adds:
    - `client_logs` table: Stores centralized logs with columns (id, client_uuid, timestamp, level, message, context, created_at)
    - Foreign key: `client_logs.client_uuid` → `clients.uuid` (ON DELETE CASCADE)
    - Health monitoring columns added to `clients` table: `current_event_id`, `current_process`, `process_status`, `process_pid`, `last_screenshot_analyzed`, `screen_health_status`, `last_screenshot_hash`
    - Indexes for performance: (client_uuid, timestamp DESC), (level, timestamp DESC), (created_at DESC)
  - Data models (`models/models.py`):
    - New enums: `LogLevel` (ERROR, WARN, INFO, DEBUG), `ProcessStatus` (running, crashed, starting, stopped), `ScreenHealthStatus` (OK, BLACK, FROZEN, UNKNOWN)
    - New model: `ClientLog` with foreign key to `Client` (CASCADE on delete)
    - Extended `Client` model with 7 health monitoring fields
  - MQTT listener extensions (`listener/listener.py`):
    - New topic subscriptions: `infoscreen/+/logs/error`, `infoscreen/+/logs/warn`, `infoscreen/+/logs/info`, `infoscreen/+/health`
    - Log handler: Parses JSON payloads, creates `ClientLog` entries, validates client UUID exists (FK constraint)
    - Health handler: Updates client state from MQTT health messages
    - Enhanced heartbeat handler: Captures `process_status`, `current_process`, `process_pid`, `current_event_id` from payload
  - API endpoints (`server/routes/client_logs.py`):
    - `GET /api/client-logs/<uuid>/logs` – Retrieve client logs with filters (level, limit, since); authenticated (admin_or_higher)
    - `GET /api/client-logs/summary` – Get log counts by level per client for last 24h; authenticated (admin_or_higher)
    - `GET /api/client-logs/recent-errors` – System-wide error monitoring; authenticated (admin_or_higher)
    - `GET /api/client-logs/test` – Infrastructure validation endpoint (no auth required)
    - Blueprint registered in `server/wsgi.py` as `client_logs_bp`
  - Dev environment fix: Updated `docker-compose.override.yml` listener service to use `working_dir: /workspace` and direct command path for live code reload
- 📡 **MQTT Protocol Extensions**:
  - New log topics: `infoscreen/{uuid}/logs/{error|warn|info}` with JSON payload (timestamp, message, context)
  - New health topic: `infoscreen/{uuid}/health` with metrics (expected_state, actual_state, health_metrics)
  - Enhanced heartbeat: `infoscreen/{uuid}/heartbeat` now includes `current_process`, `process_pid`, `process_status`, `current_event_id`
  - QoS levels: ERROR/WARN logs use QoS 1 (at least once), INFO/health use QoS 0 (fire and forget)
- 📖 **Documentation**:
  - New file: `CLIENT_MONITORING_SPECIFICATION.md` – Comprehensive 20-section technical spec for client-side implementation (MQTT protocol, process monitoring, auto-recovery, payload formats, testing guide)
  - New file: `CLIENT_MONITORING_IMPLEMENTATION_GUIDE.md` – 5-phase implementation guide (database, backend, client watchdog, dashboard UI, testing)
  - Updated `.github/copilot-instructions.md`: Added MQTT topics section, client monitoring integration notes
- ✅ **Validation**:
  - End-to-end testing completed: MQTT message → listener → database → API confirmed working
  - Test flow: Published message to `infoscreen/{real-uuid}/logs/error` → listener logs showed receipt → database stored entry → test API returned log data
  - Known client UUIDs validated: 9b8d1856-ff34-4864-a726-12de072d0f77, 7f65c615-5827-4ada-9ac8-4727c2e8ee55, bdbfff95-0b2b-4265-8cc7-b0284509540a

Notes for integrators:
- Tiered logging strategy: ERROR/WARN always centralized (QoS 1), INFO dev-only (QoS 0), DEBUG local-only
- Client-side implementation pending (Phase 3: watchdog service)
- Dashboard UI pending (Phase 4: log viewer and health indicators)
- Foreign key constraint prevents logging for non-existent clients (data integrity enforced)
- Migration is idempotent and can be safely rerun after interruption
- Use `GET /api/client-logs/test` for quick infrastructure validation without authentication

## 2025.1.0-beta.1 (TBD)
- 🔐 **User Management & Role-Based Access Control**:
  - Backend: Implemented comprehensive user management API (`server/routes/users.py`) with 6 endpoints (GET, POST, PUT, DELETE users + password reset).
@@ -18,6 +18,7 @@ services:
    environment:
      - DB_CONN=mysql+pymysql://${DB_USER}:${DB_PASSWORD}@db/${DB_NAME}
      - DB_URL=mysql+pymysql://${DB_USER}:${DB_PASSWORD}@db/${DB_NAME}
      - API_BASE_URL=http://server:8000
      - ENV=${ENV:-development}
      - FLASK_SECRET_KEY=${FLASK_SECRET_KEY:-dev-secret-key-change-in-production}
      - DEFAULT_SUPERADMIN_USERNAME=${DEFAULT_SUPERADMIN_USERNAME:-superadmin}
@@ -7,7 +7,7 @@ import requests
import paho.mqtt.client as mqtt
from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker
from models.models import Client, ClientLog, LogLevel, ProcessStatus
logging.basicConfig(level=logging.DEBUG, format='%(asctime)s [%(levelname)s] %(message)s')

# Load .env in development
@@ -78,7 +78,14 @@ def on_connect(client, userdata, flags, reasonCode, properties):
        client.subscribe("infoscreen/+/heartbeat")
        client.subscribe("infoscreen/+/screenshot")
        client.subscribe("infoscreen/+/dashboard")

        # Subscribe to monitoring topics
        client.subscribe("infoscreen/+/logs/error")
        client.subscribe("infoscreen/+/logs/warn")
        client.subscribe("infoscreen/+/logs/info")
        client.subscribe("infoscreen/+/health")

        logging.info(f"MQTT connected (reasonCode: {reasonCode}); (re)subscribed to discovery, heartbeats, screenshots, dashboards, logs, and health")
    except Exception as e:
        logging.error(f"Subscribe failed on connect: {e}")
@@ -124,16 +131,124 @@ def on_message(client, userdata, msg):
    # Heartbeat-Handling
    if topic.startswith("infoscreen/") and topic.endswith("/heartbeat"):
        uuid = topic.split("/")[1]
        try:
            # Parse payload to get optional health data
            payload_data = json.loads(msg.payload.decode())
        except (json.JSONDecodeError, UnicodeDecodeError):
            payload_data = {}

        session = Session()
        client_obj = session.query(Client).filter_by(uuid=uuid).first()
        if client_obj:
            client_obj.last_alive = datetime.datetime.now(datetime.UTC)

            # Update health fields if present in heartbeat
            if 'process_status' in payload_data:
                try:
                    client_obj.process_status = ProcessStatus[payload_data['process_status']]
                except (KeyError, TypeError):
                    pass

            if 'current_process' in payload_data:
                client_obj.current_process = payload_data.get('current_process')

            if 'process_pid' in payload_data:
                client_obj.process_pid = payload_data.get('process_pid')

            if 'current_event_id' in payload_data:
                client_obj.current_event_id = payload_data.get('current_event_id')

            session.commit()
            logging.info(f"Heartbeat von {uuid} empfangen, last_alive (UTC) aktualisiert.")
        session.close()
        return

    # Log-Handling (ERROR, WARN, INFO)
    if topic.startswith("infoscreen/") and "/logs/" in topic:
        parts = topic.split("/")
        if len(parts) >= 4:
            uuid = parts[1]
            level_str = parts[3].upper()  # 'error', 'warn', 'info' -> 'ERROR', 'WARN', 'INFO'

            try:
                payload_data = json.loads(msg.payload.decode())
                message = payload_data.get('message', '')
                timestamp_str = payload_data.get('timestamp')
                context = payload_data.get('context', {})

                # Parse timestamp or use current time
                if timestamp_str:
                    try:
                        log_timestamp = datetime.datetime.fromisoformat(timestamp_str.replace('Z', '+00:00'))
                        if log_timestamp.tzinfo is None:
                            log_timestamp = log_timestamp.replace(tzinfo=datetime.UTC)
                    except ValueError:
                        log_timestamp = datetime.datetime.now(datetime.UTC)
                else:
                    log_timestamp = datetime.datetime.now(datetime.UTC)

                # Store in database
                session = Session()
                try:
                    log_level = LogLevel[level_str]
                    log_entry = ClientLog(
                        client_uuid=uuid,
                        timestamp=log_timestamp,
                        level=log_level,
                        message=message,
                        context=json.dumps(context) if context else None
                    )
                    session.add(log_entry)
                    session.commit()
                    logging.info(f"[{level_str}] {uuid}: {message}")
                except Exception as e:
                    logging.error(f"Error saving log from {uuid}: {e}")
                    session.rollback()
                finally:
                    session.close()

            except (json.JSONDecodeError, UnicodeDecodeError) as e:
                logging.error(f"Could not parse log payload from {uuid}: {e}")
        return

    # Health-Handling
    if topic.startswith("infoscreen/") and topic.endswith("/health"):
        uuid = topic.split("/")[1]
        try:
            payload_data = json.loads(msg.payload.decode())

            session = Session()
            client_obj = session.query(Client).filter_by(uuid=uuid).first()
            if client_obj:
                # Update expected state
                expected = payload_data.get('expected_state', {})
                if 'event_id' in expected:
                    client_obj.current_event_id = expected['event_id']

                # Update actual state
                actual = payload_data.get('actual_state', {})
                if 'process' in actual:
                    client_obj.current_process = actual['process']

                if 'pid' in actual:
                    client_obj.process_pid = actual['pid']

                if 'status' in actual:
                    try:
                        client_obj.process_status = ProcessStatus[actual['status']]
                    except (KeyError, TypeError):
                        pass

                session.commit()
                logging.debug(f"Health update from {uuid}: {actual.get('process')} ({actual.get('status')})")
            session.close()

        except (json.JSONDecodeError, UnicodeDecodeError) as e:
            logging.error(f"Could not parse health payload from {uuid}: {e}")
        except Exception as e:
            logging.error(f"Error processing health from {uuid}: {e}")
        return

    # Discovery-Handling
    if topic == "infoscreen/discovery":
        payload = json.loads(msg.payload.decode())
@@ -21,6 +21,27 @@ class AcademicPeriodType(enum.Enum):
    trimester = "trimester"


class LogLevel(enum.Enum):
    ERROR = "ERROR"
    WARN = "WARN"
    INFO = "INFO"
    DEBUG = "DEBUG"


class ProcessStatus(enum.Enum):
    running = "running"
    crashed = "crashed"
    starting = "starting"
    stopped = "stopped"


class ScreenHealthStatus(enum.Enum):
    OK = "OK"
    BLACK = "BLACK"
    FROZEN = "FROZEN"
    UNKNOWN = "UNKNOWN"


class User(Base):
    __tablename__ = 'users'
    id = Column(Integer, primary_key=True, autoincrement=True)

@@ -107,6 +128,31 @@ class Client(Base):
    group_id = Column(Integer, ForeignKey(
        'client_groups.id'), nullable=False, default=1)

    # Health monitoring fields
    current_event_id = Column(Integer, nullable=True)
    current_process = Column(String(50), nullable=True)  # 'vlc', 'chromium', 'pdf_viewer'
    process_status = Column(Enum(ProcessStatus), nullable=True)
    process_pid = Column(Integer, nullable=True)
    last_screenshot_analyzed = Column(TIMESTAMP(timezone=True), nullable=True)
    screen_health_status = Column(Enum(ScreenHealthStatus), nullable=True, server_default='UNKNOWN')
    last_screenshot_hash = Column(String(32), nullable=True)


class ClientLog(Base):
    __tablename__ = 'client_logs'
    id = Column(Integer, primary_key=True, autoincrement=True)
    client_uuid = Column(String(36), ForeignKey('clients.uuid', ondelete='CASCADE'), nullable=False, index=True)
    timestamp = Column(TIMESTAMP(timezone=True), nullable=False, index=True)
    level = Column(Enum(LogLevel), nullable=False, index=True)
    message = Column(Text, nullable=False)
    context = Column(Text, nullable=True)  # JSON stored as text
    created_at = Column(TIMESTAMP(timezone=True), server_default=func.current_timestamp(), nullable=False)

    __table_args__ = (
        Index('ix_client_logs_client_timestamp', 'client_uuid', 'timestamp'),
        Index('ix_client_logs_level_timestamp', 'level', 'timestamp'),
    )


class EventType(enum.Enum):
    presentation = "presentation"
@@ -0,0 +1,84 @@
|
|||||||
|
"""add client monitoring tables and columns
|
||||||
|
|
||||||
|
Revision ID: c1d2e3f4g5h6
|
||||||
|
Revises: 4f0b8a3e5c20
|
||||||
|
Create Date: 2026-03-09 21:08:38.000000
|
||||||
|
|
||||||
|
"""
|
||||||
|
from alembic import op
|
||||||
|
import sqlalchemy as sa
|
||||||
|
|
||||||
|
# revision identifiers, used by Alembic.
|
||||||
|
revision = 'c1d2e3f4g5h6'
|
||||||
|
down_revision = '4f0b8a3e5c20'
|
||||||
|
branch_labels = None
|
||||||
|
depends_on = None
|
||||||
|
|
||||||
|
|
||||||
|
def upgrade():
|
||||||
|
bind = op.get_bind()
|
||||||
|
inspector = sa.inspect(bind)
|
||||||
|
|
||||||
|
# 1. Add health monitoring columns to clients table (safe on rerun)
|
||||||
|
existing_client_columns = {c['name'] for c in inspector.get_columns('clients')}
|
||||||
|
if 'current_event_id' not in existing_client_columns:
|
||||||
|
op.add_column('clients', sa.Column('current_event_id', sa.Integer(), nullable=True))
|
||||||
|
if 'current_process' not in existing_client_columns:
|
||||||
|
op.add_column('clients', sa.Column('current_process', sa.String(50), nullable=True))
|
||||||
|
if 'process_status' not in existing_client_columns:
|
||||||
|
op.add_column('clients', sa.Column('process_status', sa.Enum('running', 'crashed', 'starting', 'stopped', name='processstatus'), nullable=True))
|
||||||
|
if 'process_pid' not in existing_client_columns:
|
||||||
|
op.add_column('clients', sa.Column('process_pid', sa.Integer(), nullable=True))
|
||||||
|
if 'last_screenshot_analyzed' not in existing_client_columns:
|
||||||
|
op.add_column('clients', sa.Column('last_screenshot_analyzed', sa.TIMESTAMP(timezone=True), nullable=True))
|
||||||
|
if 'screen_health_status' not in existing_client_columns:
|
||||||
|
op.add_column('clients', sa.Column('screen_health_status', sa.Enum('OK', 'BLACK', 'FROZEN', 'UNKNOWN', name='screenhealthstatus'), nullable=True, server_default='UNKNOWN'))
|
||||||
|
if 'last_screenshot_hash' not in existing_client_columns:
|
||||||
|
    op.add_column('clients', sa.Column('last_screenshot_hash', sa.String(32), nullable=True))

    # 2. Create client_logs table (safe on rerun)
    if not inspector.has_table('client_logs'):
        op.create_table(
            'client_logs',
            sa.Column('id', sa.Integer(), autoincrement=True, nullable=False),
            sa.Column('client_uuid', sa.String(36), nullable=False),
            sa.Column('timestamp', sa.TIMESTAMP(timezone=True), nullable=False),
            sa.Column('level', sa.Enum('ERROR', 'WARN', 'INFO', 'DEBUG', name='loglevel'), nullable=False),
            sa.Column('message', sa.Text(), nullable=False),
            sa.Column('context', sa.JSON(), nullable=True),
            sa.Column('created_at', sa.TIMESTAMP(timezone=True), server_default=sa.func.current_timestamp(), nullable=False),
            sa.PrimaryKeyConstraint('id'),
            sa.ForeignKeyConstraint(['client_uuid'], ['clients.uuid'], ondelete='CASCADE'),
            mysql_charset='utf8mb4',
            mysql_collate='utf8mb4_unicode_ci',
            mysql_engine='InnoDB'
        )

    # 3. Create indexes for efficient querying (safe on rerun)
    client_log_indexes = {idx['name'] for idx in inspector.get_indexes('client_logs')} if inspector.has_table('client_logs') else set()
    client_indexes = {idx['name'] for idx in inspector.get_indexes('clients')}

    if 'ix_client_logs_client_timestamp' not in client_log_indexes:
        op.create_index('ix_client_logs_client_timestamp', 'client_logs', ['client_uuid', 'timestamp'])
    if 'ix_client_logs_level_timestamp' not in client_log_indexes:
        op.create_index('ix_client_logs_level_timestamp', 'client_logs', ['level', 'timestamp'])
    if 'ix_clients_process_status' not in client_indexes:
        op.create_index('ix_clients_process_status', 'clients', ['process_status'])


def downgrade():
    # Drop indexes
    op.drop_index('ix_clients_process_status', table_name='clients')
    op.drop_index('ix_client_logs_level_timestamp', table_name='client_logs')
    op.drop_index('ix_client_logs_client_timestamp', table_name='client_logs')

    # Drop table
    op.drop_table('client_logs')

    # Drop columns from clients
    op.drop_column('clients', 'last_screenshot_hash')
    op.drop_column('clients', 'screen_health_status')
    op.drop_column('clients', 'last_screenshot_analyzed')
    op.drop_column('clients', 'process_pid')
    op.drop_column('clients', 'process_status')
    op.drop_column('clients', 'current_process')
    op.drop_column('clients', 'current_event_id')
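The rerun-safe pattern used throughout this migration — inspect the live schema first, then create only what is missing — can be sketched in isolation. This is an illustrative standalone example, not the project's migration code: the SQLite URL and the reduced column set are stand-ins (the real target is MySQL via Alembic's `op`).

```python
# Minimal sketch of the inspect-then-create idempotency pattern.
import sqlalchemy as sa

engine = sa.create_engine("sqlite:///:memory:")  # illustrative; real target is MySQL
with engine.begin() as conn:
    conn.execute(sa.text("CREATE TABLE clients (uuid VARCHAR(36) PRIMARY KEY)"))

inspector = sa.inspect(engine)
if not inspector.has_table("client_logs"):
    sa.Table(
        "client_logs",
        sa.MetaData(),
        sa.Column("id", sa.Integer, primary_key=True),
        sa.Column("client_uuid", sa.String(36), nullable=False),
    ).create(engine)

# A fresh inspector now sees the table, so a second run skips the create.
inspector = sa.inspect(engine)
print(inspector.has_table("client_logs"))  # True
```

Note that the inspector is re-created after the DDL: SQLAlchemy inspectors cache reflection results, so a stale inspector would report the table as missing.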
server/routes/client_logs.py (new file, 255 lines)
@@ -0,0 +1,255 @@
from flask import Blueprint, jsonify, request
from server.database import Session
from server.permissions import admin_or_higher
from models.models import ClientLog, Client, LogLevel
from sqlalchemy import desc, func
from datetime import datetime, timedelta, timezone
import json

client_logs_bp = Blueprint("client_logs", __name__, url_prefix="/api/client-logs")


@client_logs_bp.route("/test", methods=["GET"])
def test_client_logs():
    """Test endpoint to verify logging infrastructure (no auth required)"""
    session = Session()
    try:
        # Count total logs
        total_logs = session.query(func.count(ClientLog.id)).scalar()

        # Count by level
        error_count = session.query(func.count(ClientLog.id)).filter_by(level=LogLevel.ERROR).scalar()
        warn_count = session.query(func.count(ClientLog.id)).filter_by(level=LogLevel.WARN).scalar()
        info_count = session.query(func.count(ClientLog.id)).filter_by(level=LogLevel.INFO).scalar()

        # Get last 5 logs
        recent_logs = session.query(ClientLog).order_by(desc(ClientLog.timestamp)).limit(5).all()

        recent = []
        for log in recent_logs:
            recent.append({
                "client_uuid": log.client_uuid,
                "level": log.level.value if log.level else None,
                "message": log.message,
                "timestamp": log.timestamp.isoformat() if log.timestamp else None
            })

        session.close()
        return jsonify({
            "status": "ok",
            "infrastructure": "working",
            "total_logs": total_logs,
            "counts": {
                "ERROR": error_count,
                "WARN": warn_count,
                "INFO": info_count
            },
            "recent_5": recent
        })
    except Exception as e:
        session.close()
        return jsonify({"status": "error", "message": str(e)}), 500


@client_logs_bp.route("/<uuid>/logs", methods=["GET"])
@admin_or_higher
def get_client_logs(uuid):
    """
    Get logs for a specific client

    Query params:
    - level: ERROR, WARN, INFO, DEBUG (optional)
    - limit: number of entries (default 50, max 500)
    - since: ISO timestamp (optional)

    Example: /api/client-logs/abc-123/logs?level=ERROR&limit=100
    """
    session = Session()
    try:
        # Verify client exists
        client = session.query(Client).filter_by(uuid=uuid).first()
        if not client:
            session.close()
            return jsonify({"error": "Client not found"}), 404

        # Parse query parameters
        level_param = request.args.get('level')
        limit = min(int(request.args.get('limit', 50)), 500)
        since_param = request.args.get('since')

        # Build query
        query = session.query(ClientLog).filter_by(client_uuid=uuid)

        # Filter by log level
        if level_param:
            try:
                level_enum = LogLevel[level_param.upper()]
                query = query.filter_by(level=level_enum)
            except KeyError:
                session.close()
                return jsonify({"error": f"Invalid level: {level_param}. Must be ERROR, WARN, INFO, or DEBUG"}), 400

        # Filter by timestamp
        if since_param:
            try:
                # Handle both with and without 'Z' suffix
                since_str = since_param.replace('Z', '+00:00')
                since_dt = datetime.fromisoformat(since_str)
                if since_dt.tzinfo is None:
                    since_dt = since_dt.replace(tzinfo=timezone.utc)
                query = query.filter(ClientLog.timestamp >= since_dt)
            except ValueError:
                session.close()
                return jsonify({"error": "Invalid timestamp format. Use ISO 8601"}), 400

        # Execute query
        logs = query.order_by(desc(ClientLog.timestamp)).limit(limit).all()

        # Format results
        result = []
        for log in logs:
            entry = {
                "id": log.id,
                "timestamp": log.timestamp.isoformat() if log.timestamp else None,
                "level": log.level.value if log.level else None,
                "message": log.message,
                "context": {}
            }

            # Parse context JSON
            if log.context:
                try:
                    entry["context"] = json.loads(log.context)
                except json.JSONDecodeError:
                    entry["context"] = {"raw": log.context}

            result.append(entry)

        session.close()
        return jsonify({
            "client_uuid": uuid,
            "logs": result,
            "count": len(result),
            "limit": limit
        })

    except Exception as e:
        session.close()
        return jsonify({"error": f"Server error: {str(e)}"}), 500


@client_logs_bp.route("/summary", methods=["GET"])
@admin_or_higher
def get_logs_summary():
    """
    Get summary of errors/warnings across all clients in last 24 hours
    Returns count of ERROR, WARN, INFO logs per client

    Example response:
    {
        "summary": {
            "client-uuid-1": {"ERROR": 5, "WARN": 12, "INFO": 45},
            "client-uuid-2": {"ERROR": 0, "WARN": 3, "INFO": 20}
        },
        "period_hours": 24,
        "timestamp": "2026-03-09T21:00:00Z"
    }
    """
    session = Session()
    try:
        # Get hours parameter (default 24, max 168 = 1 week)
        hours = min(int(request.args.get('hours', 24)), 168)
        since = datetime.now(timezone.utc) - timedelta(hours=hours)

        # Query log counts grouped by client and level
        stats = session.query(
            ClientLog.client_uuid,
            ClientLog.level,
            func.count(ClientLog.id).label('count')
        ).filter(
            ClientLog.timestamp >= since
        ).group_by(
            ClientLog.client_uuid,
            ClientLog.level
        ).all()

        # Build summary dictionary
        summary = {}
        for stat in stats:
            uuid = stat.client_uuid
            if uuid not in summary:
                # Initialize all levels to 0
                summary[uuid] = {
                    "ERROR": 0,
                    "WARN": 0,
                    "INFO": 0,
                    "DEBUG": 0
                }

            summary[uuid][stat.level.value] = stat.count

        # Get client info for enrichment
        clients = session.query(Client.uuid, Client.hostname, Client.description).all()
        client_info = {c.uuid: {"hostname": c.hostname, "description": c.description} for c in clients}

        # Enrich summary with client info
        enriched_summary = {}
        for uuid, counts in summary.items():
            enriched_summary[uuid] = {
                "counts": counts,
                "info": client_info.get(uuid, {})
            }

        session.close()
        return jsonify({
            "summary": enriched_summary,
            "period_hours": hours,
            "since": since.isoformat(),
            "timestamp": datetime.now(timezone.utc).isoformat()
        })

    except Exception as e:
        session.close()
        return jsonify({"error": f"Server error: {str(e)}"}), 500


@client_logs_bp.route("/recent-errors", methods=["GET"])
@admin_or_higher
def get_recent_errors():
    """
    Get recent ERROR logs across all clients

    Query params:
    - limit: number of entries (default 20, max 100)

    Useful for system-wide error monitoring
    """
    session = Session()
    try:
        limit = min(int(request.args.get('limit', 20)), 100)

        # Get recent errors from all clients
        logs = session.query(ClientLog).filter_by(
            level=LogLevel.ERROR
        ).order_by(
            desc(ClientLog.timestamp)
        ).limit(limit).all()

        result = []
        for log in logs:
            entry = {
                "id": log.id,
                "client_uuid": log.client_uuid,
                "timestamp": log.timestamp.isoformat() if log.timestamp else None,
                "message": log.message,
                "context": json.loads(log.context) if log.context else {}
            }
            result.append(entry)

        session.close()
        return jsonify({
            "errors": result,
            "count": len(result)
        })

    except Exception as e:
        session.close()
        return jsonify({"error": f"Server error: {str(e)}"}), 500
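The `since` handling in `get_client_logs` above rewrites a trailing `Z` to `+00:00` before calling `datetime.fromisoformat` (older Python versions reject the `Z` form), then coerces naive timestamps to UTC. That normalization can be exercised on its own; `parse_since` is a hypothetical helper mirroring the route's logic, not a function in the codebase:

```python
from datetime import datetime, timezone

def parse_since(since_param: str) -> datetime:
    # Mirror of the route's normalization: accept both '...Z' and '+00:00',
    # and assume UTC for naive timestamps.
    since_dt = datetime.fromisoformat(since_param.replace('Z', '+00:00'))
    if since_dt.tzinfo is None:
        since_dt = since_dt.replace(tzinfo=timezone.utc)
    return since_dt

# Both spellings resolve to the same aware UTC instant.
print(parse_since("2026-03-09T21:00:00Z"))
print(parse_since("2026-03-09T21:00:00"))
```

This keeps all timestamp comparisons against the timezone-aware `ClientLog.timestamp` column consistent, matching the project's UTC convention.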
server/wsgi.py
@@ -8,6 +8,7 @@ from server.routes.holidays import holidays_bp
 from server.routes.academic_periods import academic_periods_bp
 from server.routes.groups import groups_bp
 from server.routes.clients import clients_bp
+from server.routes.client_logs import client_logs_bp
 from server.routes.auth import auth_bp
 from server.routes.users import users_bp
 from server.routes.system_settings import system_settings_bp
@@ -46,6 +47,7 @@ else:
 app.register_blueprint(auth_bp)
 app.register_blueprint(users_bp)
 app.register_blueprint(clients_bp)
+app.register_blueprint(client_logs_bp)
 app.register_blueprint(groups_bp)
 app.register_blueprint(events_bp)
 app.register_blueprint(event_exceptions_bp)