olafn/infoscreen

RobbStarkAustria 1efe40a03b Initial commit - copied workspace after database cleanup

2025-10-10 15:20:14 +00:00

Recommended Implementation: PPTX-to-PDF Conversion System with Gotenberg

Architecture Overview

Asynchronous server-side conversion using Gotenberg with shared storage

User Upload → API saves PPTX → Job in Queue → Worker calls Gotenberg API
                                                      ↓
                                           Gotenberg converts (PPTX sent via HTTP)
                                                      ↓
Client requests → API checks DB status → PDF ready? → Download PDF from shared storage
                                       → Pending? → "Please wait"
                                       → Failed? → Retry/Error

1. Database Schema

CREATE TABLE media_files (
    id UUID PRIMARY KEY,
    filename VARCHAR(255),
    original_path VARCHAR(512),
    file_type VARCHAR(10),
    mime_type VARCHAR(100),
    uploaded_at TIMESTAMP,
    updated_at TIMESTAMP
);

CREATE TABLE conversions (
    id UUID PRIMARY KEY,
    source_file_id UUID REFERENCES media_files(id) ON DELETE CASCADE,
    target_format VARCHAR(10),          -- 'pdf'
    target_path VARCHAR(512),           -- Path to generated PDF
    status VARCHAR(20),                 -- 'pending', 'processing', 'ready', 'failed'
    started_at TIMESTAMP,
    completed_at TIMESTAMP,
    error_message TEXT,
    file_hash VARCHAR(64)               -- Hash of PPTX for cache invalidation
);

CREATE INDEX idx_conversions_source ON conversions(source_file_id, target_format);
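
The status column above implies a small state machine. A minimal sketch of how the allowed transitions could be checked in application code follows; the names `ConversionStatus`, `ALLOWED_TRANSITIONS`, and `can_transition` are hypothetical, and the rule that a failed job may return to pending for a retry is an assumption, not something the schema enforces.

```python
from enum import Enum


class ConversionStatus(str, Enum):
    """Status values taken from the comment on conversions.status."""
    PENDING = "pending"
    PROCESSING = "processing"
    READY = "ready"
    FAILED = "failed"


# Assumed lifecycle: a job is picked up once, then either succeeds or fails;
# a failed job may be queued again for a retry.
ALLOWED_TRANSITIONS = {
    ConversionStatus.PENDING: {ConversionStatus.PROCESSING},
    ConversionStatus.PROCESSING: {ConversionStatus.READY, ConversionStatus.FAILED},
    ConversionStatus.FAILED: {ConversionStatus.PENDING},
    ConversionStatus.READY: set(),
}


def can_transition(current: str, new: str) -> bool:
    """Return True if moving from `current` to `new` is allowed."""
    return ConversionStatus(new) in ALLOWED_TRANSITIONS[ConversionStatus(current)]
```

A worker could call `can_transition(conversion.status, 'processing')` before picking up a job, so that two workers do not process the same conversion.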

2. Components

API Server (existing)

Accepts uploads
Creates DB entries
Enqueues jobs
Delivers status and files

Background Worker (new)

Runs as a separate process from the API, using the same codebase and image (a dedicated worker service in the Docker setup below)
Processes conversion jobs from queue
Calls Gotenberg API for conversion
Updates database with results
Technology: Python RQ, Celery, or similar

Gotenberg Container (new)

Dedicated conversion service
HTTP API for document conversion
Handles LibreOffice conversions internally
Receives files as multipart HTTP uploads (mounting the shared volume is optional)

Message Queue

Redis (recommended for start - simple, fast)
Alternative: RabbitMQ for more features

Redis Container (separate)

Handles job queue
Minimal resource footprint

Shared Storage

Docker volume mounted to all containers that need file access
API and Worker read and write the same files; Gotenberg receives them over HTTP
Simplifies file exchange between services

3. Detailed Workflow

Upload Process:

@app.post("/upload")
async def upload_file(file: UploadFile):  # from fastapi import UploadFile
    # 1. Save PPTX to shared volume
    file_path = save_to_disk(file)  # e.g., /shared/uploads/abc123.pptx

    # 2. DB entry for original file
    file_record = db.create_media_file({
        'filename': file.filename,
        'original_path': file_path,
        'file_type': 'pptx'
    })

    # 3. Create conversion record
    conversion = db.create_conversion({
        'source_file_id': file_record.id,
        'target_format': 'pdf',
        'status': 'pending',
        'file_hash': calculate_hash(file_path)
    })

    # 4. Enqueue job (asynchronous!)
    queue.enqueue(convert_to_pdf_via_gotenberg, conversion.id)

    # 5. Return immediately to user
    return {
        'file_id': file_record.id,
        'status': 'uploaded',
        'conversion_status': 'pending'
    }
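
The upload handler above calls `calculate_hash` without defining it. One possible implementation, assuming SHA-256 (whose 64-character hex digest fits the `file_hash VARCHAR(64)` column), is sketched below; the chunk size is an arbitrary choice.

```python
import hashlib


def calculate_hash(file_path: str, chunk_size: int = 1024 * 1024) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks.

    Reading in chunks keeps memory use flat for large presentations.
    """
    digest = hashlib.sha256()
    with open(file_path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Two uploads with identical bytes produce the same digest, which is what allows an existing conversion to be reused.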

Worker Process (calls Gotenberg):

import requests
import os

GOTENBERG_URL = os.getenv('GOTENBERG_URL', 'http://gotenberg:3000')

def convert_to_pdf_via_gotenberg(conversion_id):
    conversion = db.get_conversion(conversion_id)
    source_file = db.get_media_file(conversion.source_file_id)

    # Status update: processing
    db.update_conversion(conversion_id, {
        'status': 'processing',
        'started_at': now()
    })

    try:
        # Prepare output path
        pdf_filename = f"{conversion.id}.pdf"
        pdf_path = f"/shared/converted/{pdf_filename}"

        # Call Gotenberg API
        # The worker reads the file from the shared volume and uploads it
        # to Gotenberg as multipart form data
        with open(source_file.original_path, 'rb') as f:
            files = {
                'files': (os.path.basename(source_file.original_path), f)
            }

            response = requests.post(
                f'{GOTENBERG_URL}/forms/libreoffice/convert',
                files=files,
                timeout=300  # 5 minutes timeout
            )
            response.raise_for_status()

        # Save PDF to shared volume
        with open(pdf_path, 'wb') as pdf_file:
            pdf_file.write(response.content)

        # Success
        db.update_conversion(conversion_id, {
            'status': 'ready',
            'target_path': pdf_path,
            'completed_at': now()
        })

    except requests.exceptions.Timeout:
        db.update_conversion(conversion_id, {
            'status': 'failed',
            'error_message': 'Conversion timeout after 5 minutes',
            'completed_at': now()
        })
    except requests.exceptions.RequestException as e:
        db.update_conversion(conversion_id, {
            'status': 'failed',
            'error_message': f'Gotenberg API error: {str(e)}',
            'completed_at': now()
        })
    except Exception as e:
        db.update_conversion(conversion_id, {
            'status': 'failed',
            'error_message': str(e),
            'completed_at': now()
        })

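Alternative: Simplified Worker with a Single Error Handler

Gotenberg's LibreOffice route takes the source file as a multipart form upload, so this variant also sends the PPTX over HTTP; the shared volume is used only by the worker to read the source and store the PDF. It differs from the worker above in streaming the open file handle directly and collapsing error handling into one catch-all block: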

def convert_to_pdf_via_gotenberg_shared(conversion_id):
    conversion = db.get_conversion(conversion_id)
    source_file = db.get_media_file(conversion.source_file_id)

    db.update_conversion(conversion_id, {
        'status': 'processing',
        'started_at': now()
    })

    try:
        pdf_filename = f"{conversion.id}.pdf"
        pdf_path = f"/shared/converted/{pdf_filename}"

        # Stream the file from the shared volume to Gotenberg over HTTP
        # (requests takes the upload filename from the open file's name)
        with open(source_file.original_path, 'rb') as f:
            files = {'files': f}

            response = requests.post(
                f'{GOTENBERG_URL}/forms/libreoffice/convert',
                files=files,
                timeout=300
            )
            response.raise_for_status()

        # Write result to shared volume
        with open(pdf_path, 'wb') as pdf_file:
            pdf_file.write(response.content)

        db.update_conversion(conversion_id, {
            'status': 'ready',
            'target_path': pdf_path,
            'completed_at': now()
        })

    except Exception as e:
        db.update_conversion(conversion_id, {
            'status': 'failed',
            'error_message': str(e),
            'completed_at': now()
        })

Client Download:

@app.get("/files/{file_id}/display")
async def get_display_file(file_id):
    file = db.get_media_file(file_id)

    # Only for PPTX: check PDF conversion
    if file.file_type == 'pptx':
        conversion = db.get_latest_conversion(file.id, target_format='pdf')

        if not conversion:
            # Shouldn't happen, but just to be safe
            trigger_new_conversion(file.id)
            return {'status': 'pending', 'message': 'Conversion is being created'}

        if conversion.status == 'ready':
            # Serve PDF from shared storage
            return FileResponse(conversion.target_path)

        elif conversion.status == 'failed':
            # Optional: Auto-retry
            trigger_new_conversion(file.id)
            return {'status': 'failed', 'error': conversion.error_message}

        else:  # pending or processing
            return {'status': conversion.status, 'message': 'Please wait...'}

    # Serve other file types directly
    return FileResponse(file.original_path)

4. Docker Setup

version: '3.8'

services:
  # Your API Server
  api:
    build: ./api
    command: uvicorn main:app --host 0.0.0.0 --port 8000
    ports:
      - "8000:8000"
    volumes:
      - shared-storage:/shared  # Shared volume
    environment:
      - REDIS_URL=redis://redis:6379
      - DATABASE_URL=postgresql://postgres:password@postgres:5432/infoscreen
      - GOTENBERG_URL=http://gotenberg:3000
    depends_on:
      - redis
      - postgres
      - gotenberg
    restart: unless-stopped

  # Worker (same codebase as API, different command)
  worker:
    build: ./api  # Same build as API!
    command: python worker.py  # or: rq worker
    volumes:
      - shared-storage:/shared  # Shared volume
    environment:
      - REDIS_URL=redis://redis:6379
      - DATABASE_URL=postgresql://postgres:password@postgres:5432/infoscreen
      - GOTENBERG_URL=http://gotenberg:3000
    depends_on:
      - redis
      - postgres
      - gotenberg
    restart: unless-stopped
    # Optional: Multiple workers
    deploy:
      replicas: 2

  # Gotenberg - Document Conversion Service
  gotenberg:
    image: gotenberg/gotenberg:8
    # Gotenberg doesn't need the shared volume if files are sent via HTTP
    # But mount it if you want direct file access
    volumes:
      - shared-storage:/shared  # Optional: for direct file access
    command:
      # Gotenberg 8 takes its configuration as command-line flags
      - "gotenberg"
      - "--api-timeout=300s"
      - "--log-level=info"
    restart: unless-stopped
    # Resource limits (optional but recommended)
    deploy:
      resources:
        limits:
          cpus: '2.0'
          memory: 2G
        reservations:
          cpus: '0.5'
          memory: 512M

  # Redis - separate container
  redis:
    image: redis:7-alpine
    volumes:
      - redis-data:/data
    command: redis-server --appendonly yes
    restart: unless-stopped

  # Your existing Postgres
  postgres:
    image: postgres:15
    environment:
      - POSTGRES_DB=infoscreen
      - POSTGRES_PASSWORD=password
    volumes:
      - postgres-data:/var/lib/postgresql/data
    restart: unless-stopped

  # Optional: Redis Commander (UI for debugging)
  redis-commander:
    image: rediscommander/redis-commander
    environment:
      - REDIS_HOSTS=local:redis:6379
    ports:
      - "8081:8081"
    depends_on:
      - redis

volumes:
  shared-storage:  # New: Shared storage for all file operations
  redis-data:
  postgres-data:

5. Storage Structure

/shared/
├── uploads/           # Original uploaded files (PPTX, etc.)
│   ├── abc123.pptx
│   ├── def456.pptx
│   └── ...
└── converted/         # Converted PDF files
    ├── uuid-1.pdf
    ├── uuid-2.pdf
    └── ...
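
To keep generated files inside this layout, the output path can be derived from the conversion ID and checked against the root directory. The sketch below is illustrative only: `pdf_path_for` and `CONVERTED_ROOT` are hypothetical names, and `Path.is_relative_to` requires Python 3.9 or newer.

```python
from pathlib import Path

CONVERTED_ROOT = Path("/shared/converted")  # directory from the layout above


def pdf_path_for(conversion_id: str, root: Path = CONVERTED_ROOT) -> Path:
    """Build the PDF path for a conversion and refuse anything outside `root`."""
    resolved_root = root.resolve()
    candidate = (resolved_root / f"{conversion_id}.pdf").resolve()
    # Reject IDs such as "../x" or "/etc/x" that would escape the root.
    if not candidate.is_relative_to(resolved_root):
        raise ValueError(f"unsafe conversion id: {conversion_id!r}")
    return candidate
```

Because the name depends only on the conversion ID, repeating a conversion overwrites its own output instead of leaving stray files behind.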

6. Gotenberg Integration Details

Gotenberg API Endpoints:

Gotenberg provides various conversion endpoints:

# LibreOffice conversion (for PPTX, DOCX, ODT, etc.)
POST http://gotenberg:3000/forms/libreoffice/convert

# HTML to PDF
POST http://gotenberg:3000/forms/chromium/convert/html

# Markdown to PDF
POST http://gotenberg:3000/forms/chromium/convert/markdown

# Merge PDFs
POST http://gotenberg:3000/forms/pdfengines/merge

Example Conversion Request:

import os
import requests

def convert_with_gotenberg(input_file_path, output_file_path):
    """
    Convert document using Gotenberg
    """
    with open(input_file_path, 'rb') as f:
        files = {
            'files': (os.path.basename(input_file_path), f,
                     'application/vnd.openxmlformats-officedocument.presentationml.presentation')
        }

        # Optional: Add conversion parameters
        data = {
            'landscape': 'false',  # Portrait mode
            'nativePageRanges': '1-',  # All pages
        }

        response = requests.post(
            'http://gotenberg:3000/forms/libreoffice/convert',
            files=files,
            data=data,
            timeout=300
        )

        if response.status_code == 200:
            with open(output_file_path, 'wb') as out:
                out.write(response.content)
            return True
        else:
            raise Exception(f"Gotenberg error: {response.status_code} - {response.text}")

Advanced Options:

# With custom PDF properties
data = {
    'landscape': 'false',
    'nativePageRanges': '1-10',  # Only first 10 pages
    'pdfa': 'PDF/A-1b',          # PDF/A format (form field name in Gotenberg 8)
    'exportFormFields': 'false',
}

# With password protection
data = {
    'userPassword': 'secret123',
    'ownerPassword': 'admin456',
}

7. Client Behavior (Pi5)

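# On the Pi5 client
import subprocess
import time

def display_file(file_id, max_attempts=60):
    # Poll in a loop instead of recursing, so a slow conversion
    # cannot exhaust the call stack
    for _ in range(max_attempts):
        response = api.get(f"/files/{file_id}/display")
        content_type = response.headers.get('Content-Type', '')

        if content_type.startswith('application/pdf'):
            # PDF is ready; the helper is assumed to return the local file path
            downloaded_pdf = download_and_display(response)
            subprocess.run(['impressive', downloaded_pdf])
            return

        if response.json()['status'] in ['pending', 'processing']:
            # Wait and retry
            show_loading_screen("Presentation is being prepared...")
            time.sleep(5)
            continue

        break  # 'failed' or an unknown status

    # Error, or the conversion did not finish in time
    show_error_screen("Error loading presentation")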

8. Additional Features

Cache Invalidation on PPTX Update:

@app.put("/files/{file_id}")
async def update_file(file_id, new_file):
    # Delete old conversions and PDFs
    conversions = db.get_conversions_for_file(file_id)
    for conv in conversions:
        if conv.target_path and os.path.exists(conv.target_path):
            os.remove(conv.target_path)

    db.mark_conversions_as_obsolete(file_id)

    # Update file
    update_media_file(file_id, new_file)

    # Trigger new conversion
    trigger_conversion(file_id, 'pdf')

Status API for Monitoring:

@app.get("/admin/conversions/status")
async def get_conversion_stats():
    return {
        'pending': db.count(status='pending'),
        'processing': db.count(status='processing'),
        'failed': db.count(status='failed'),
        'avg_duration_seconds': db.avg_duration(),
        'gotenberg_health': check_gotenberg_health()
    }

def check_gotenberg_health():
    try:
        response = requests.get(
            f'{GOTENBERG_URL}/health',
            timeout=5
        )
        return response.status_code == 200
    except requests.exceptions.RequestException:
        return False

Cleanup Job (Cronjob):

def cleanup_old_conversions():
    # Remove PDFs from deleted files
    orphaned = db.get_orphaned_conversions()
    for conv in orphaned:
        if conv.target_path and os.path.exists(conv.target_path):
            os.remove(conv.target_path)
        db.delete_conversion(conv.id)

    # Clean up old failed conversions
    old_failed = db.get_old_failed_conversions(older_than_days=7)
    for conv in old_failed:
        db.delete_conversion(conv.id)
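
The job above works from database rows. The reverse check, finding PDFs on disk that no row references (for example after a crash between writing the file and updating the database), could look roughly like the sketch below. `find_untracked_pdfs` is a hypothetical helper, and the set of tracked paths is assumed to come from the `target_path` column.

```python
from pathlib import Path


def find_untracked_pdfs(converted_dir: str, tracked_paths: set[str]) -> list[Path]:
    """Return PDFs in `converted_dir` that no conversion row points to."""
    tracked = {Path(p).resolve() for p in tracked_paths}
    return sorted(
        pdf for pdf in Path(converted_dir).glob("*.pdf")
        if pdf.resolve() not in tracked
    )
```

The function only reports files. Deleting them is left to the caller, so that a dry run is possible before anything is removed.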

9. Advantages of Using Gotenberg

✅ Specialized Service: Optimized specifically for document conversion
✅ No LibreOffice Management: Gotenberg handles LibreOffice lifecycle internally
✅ Better Resource Management: Isolated conversion process
✅ HTTP API: Clean, standard interface
✅ Production Ready: Battle-tested, actively maintained
✅ Multiple Formats: Supports PPTX, DOCX, ODT, HTML, Markdown, etc.
✅ PDF Features: Merge, encrypt, watermark PDFs
✅ Health Checks: Built-in health endpoint
✅ Horizontal Scaling: Can run multiple Gotenberg instances
✅ Memory Safe: Automatic cleanup and restart on issues

10. Migration Path

Phase 1 (MVP):

1 worker process (same codebase and image as the API)
Redis for queue (separate container)
Gotenberg for conversion (separate container)
Basic DB schema
Shared volume for file exchange
Simple retry logic

Phase 2 (as needed):

Multiple worker instances
Multiple Gotenberg instances (load balancing)
Monitoring & alerting
Prioritization logic
Advanced caching strategies
PDF optimization/compression

Start simple, scale when needed!

11. Key Decisions Summary

Aspect	Decision	Reason
Conversion Location	Server-side (Gotenberg)	One conversion per file, consistent results
Conversion Service	Dedicated Gotenberg container	Specialized, production-ready, better isolation
Conversion Timing	Asynchronous (on upload)	No client waiting time, predictable performance
Data Storage	Database-tracked	Status visibility, robust error handling
File Exchange	Shared Docker volume (API and worker), HTTP upload to Gotenberg	Simple, no file copies between API and worker
Queue System	Redis (separate container)	Standard pattern, scalable, maintainable
Worker Architecture	Background worker sharing the API codebase and image	Simple start, easy to scale separately later

12. File Flow Diagram

┌─────────────┐
│ User Upload │
│   (PPTX)    │
└──────┬──────┘
       │
       ▼
┌──────────────────────┐
│   API Server         │
│ 1. Save to /shared   │
│ 2. Create DB record  │
│ 3. Enqueue job       │
└──────┬───────────────┘
       │
       ▼
┌──────────────────┐
│  Redis Queue     │
└──────┬───────────┘
       │
       ▼
┌──────────────────────┐
│  Worker Process      │
│ 1. Get job           │
│ 2. Call Gotenberg    │
│ 3. Update DB         │
└──────┬───────────────┘
       │
       ▼
┌──────────────────────┐
│  Gotenberg           │
│ 1. Receive via HTTP  │
│ 2. Convert PPTX      │
│ 3. Return PDF        │
└──────┬───────────────┘
       │
       ▼
┌──────────────────────┐
│  Worker saves PDF    │
│  to /shared/converted│
└──────┬───────────────┘
       │
       ▼
┌──────────────────────┐
│  Client Requests     │
│ 1. Check DB          │
│ 2. Download PDF      │
│ 3. Display           │
└──────────────────────┘
   (via impressive)

13. Implementation Checklist

Database Setup

Create media_files table
Create conversions table
Add indexes for performance
Set up foreign key constraints

Storage Setup

Create shared Docker volume
Set up directory structure (/shared/uploads, /shared/converted)
Configure proper permissions

API Changes

Modify upload endpoint to save to shared storage
Create DB records for uploads
Add conversion job enqueueing
Implement file download endpoint with status checking
Add status API for monitoring
Implement cache invalidation on file update

Worker Setup

Create worker script/module
Implement Gotenberg API calls
Add error handling and retry logic
Set up logging and monitoring
Handle timeouts and failures

Docker Configuration

Add Gotenberg container to docker-compose.yml
Add Redis container to docker-compose.yml
Configure worker container
Set up shared volume mounts
Configure environment variables
Set up container dependencies
Configure resource limits for Gotenberg

Client Updates

Modify client to check conversion status
Implement retry logic for pending conversions
Add loading/waiting screens
Implement error handling

Testing

Test upload → conversion → download flow
Test multiple concurrent conversions
Test error handling (corrupted PPTX, etc.)
Test Gotenberg timeout handling
Test cache invalidation on file update
Load test with multiple clients
Test Gotenberg health checks

Monitoring & Operations

Set up logging for conversions
Monitor Gotenberg health endpoint
Implement cleanup job for old files
Add metrics for conversion times
Set up alerts for failed conversions
Monitor shared storage disk usage
Document backup procedures

Security

Validate file types before conversion
Set file size limits
Sanitize filenames
Implement rate limiting
Secure inter-container communication
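
For the file type validation item above, checking the leading bytes in addition to the extension catches renamed files. The sketch below assumes the accepted extensions are .pptx, .odp, and .ppt; `is_allowed_presentation` is a hypothetical helper. PPTX and ODP files are ZIP containers, and legacy PPT files are OLE2 compound files.

```python
import os

ZIP_MAGIC = b"PK\x03\x04"                          # .pptx and .odp are ZIP archives
OLE2_MAGIC = b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1"   # legacy .ppt (OLE2 compound file)

EXPECTED_MAGIC = {".pptx": ZIP_MAGIC, ".odp": ZIP_MAGIC, ".ppt": OLE2_MAGIC}


def is_allowed_presentation(filename: str, head: bytes) -> bool:
    """Check that the extension is accepted and the first bytes match it.

    `head` is the beginning of the uploaded file (8 bytes are enough).
    """
    ext = os.path.splitext(filename)[1].lower()
    magic = EXPECTED_MAGIC.get(ext)
    return magic is not None and head.startswith(magic)
```

This only shows that the container format is plausible. A ZIP archive that is not a presentation would still pass, and Gotenberg would then report the conversion failure.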

14. Gotenberg Configuration Options

Command-Line Flags:

gotenberg:
  image: gotenberg/gotenberg:8
  # Gotenberg 8 takes its configuration as command-line flags
  command:
    - "gotenberg"
    # API Configuration
    - "--api-timeout=300s"
    - "--api-port=3000"

    # Logging: debug, info, warn, error
    - "--log-level=info"

    # LibreOffice
    - "--libreoffice-disable-routes=false"
    - "--libreoffice-auto-start=true"

    # Chromium (if needed for HTML/Markdown); disable if not needed
    - "--chromium-disable-routes=true"

    # Resource limits
    - "--libreoffice-max-queue-size=100"

Custom Gotenberg Configuration:

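For advanced configurations, the same settings could be kept in a gotenberg.yml. Gotenberg 8 documents command-line flags as its configuration mechanism, so confirm that your image reads a configuration file before relying on this: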

api:
  timeout: 300s
  port: 3000

libreoffice:
  autoStart: true
  maxQueueSize: 100

chromium:
  disableRoutes: true

Mount it in docker-compose:

gotenberg:
  image: gotenberg/gotenberg:8
  volumes:
    - ./gotenberg.yml:/etc/gotenberg/config.yml:ro
    - shared-storage:/shared

15. Troubleshooting

Common Issues:

Gotenberg timeout:

# Increase timeout for large files
response = requests.post(
    f'{GOTENBERG_URL}/forms/libreoffice/convert',
    files=files,
    timeout=600  # 10 minutes for large PPTX
)

Memory issues:

# Increase Gotenberg memory limit
gotenberg:
  deploy:
    resources:
      limits:
        memory: 4G

File permission issues:

# Ensure proper permissions on shared volume
chmod -R 755 /shared
chown -R 1000:1000 /shared

Gotenberg not responding:

# Check health before conversion
def ensure_gotenberg_healthy():
    try:
        response = requests.get(f'{GOTENBERG_URL}/health', timeout=5)
        if response.status_code != 200:
            raise Exception("Gotenberg unhealthy")
    except Exception as e:
        logger.error(f"Gotenberg health check failed: {e}")
        raise

This architecture provides a production-ready, scalable solution using Gotenberg as a specialized conversion service with efficient file sharing via Docker volumes!

16. Best Practices Specific to Infoscreen

Idempotency by content: Always compute a SHA‑256 of the uploaded source and include it in the unique key (source_event_media_id, target_format, file_hash). This prevents duplicate work for identical content and auto-busts cache on change.
Strict MIME/type validation: Accept only .ppt, .pptx, .odp for conversion. Reject unknown types early. Consider reading the first bytes (magic) for extra safety.
Bounded retries with jitter: Retry conversions on transient HTTP 5xx or timeouts up to N times with exponential backoff. Do not retry on 4xx or clear user errors.
Output naming: Derive deterministic output paths under media/converted/, e.g., .pdf. Ensure no path traversal and sanitize names.
Timeouts and size limits: Enforce server-side max upload size and per-job conversion timeout (e.g., 10 minutes). Return clear errors for oversized/long-running files.
Isolation and quotas: Set CPU/memory limits for Gotenberg; consider a concurrency cap per worker to avoid DB starvation.
Health probes before work: Check Gotenberg /health prior to enqueue spikes; fail-fast to avoid queue pile-ups when Gotenberg is down.
Observability: Log job IDs, file hashes, durations, and sizes. Expose a small /api/conversions/status summary for operational visibility.
Cleanup policy: Periodically delete orphaned conversions (media deleted) and failed jobs older than X days. Keep successful PDFs aligned with DB rows.
Security: Never trust client paths; always resolve relative to the known media root. Do not expose the shared volume directly; serve via API only.
Backpressure: If queue length exceeds a threshold, surface 503/“try later” on new uploads or pause enqueue to protect the system.
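
The bounded-retry item above can be sketched as a small helper. This is an illustration under assumptions: `TransientError` and `retry_with_backoff` are hypothetical names, the worker is expected to raise `TransientError` for HTTP 5xx responses and timeouts, and the delays use "full jitter" (a random wait between zero and the exponential cap).

```python
import random
import time


class TransientError(Exception):
    """Raised by the job for failures worth retrying (e.g. HTTP 5xx, timeout)."""


def retry_with_backoff(job, max_attempts=3, base_delay=1.0, max_delay=60.0,
                       sleep=time.sleep):
    """Call job() until it succeeds or max_attempts is reached.

    Only TransientError triggers a retry; any other exception (for example
    a 4xx mapped to ValueError) propagates immediately.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return job()
        except TransientError:
            if attempt == max_attempts:
                raise
            # Exponential backoff capped at max_delay, with full jitter.
            cap = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, cap))
```

The `sleep` parameter exists so that tests can replace the real delay; in the worker it would stay at its default.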
