infoscreen/pptx_conversion_guide_gotenberg.md
2025-10-10 15:20:14 +00:00

Recommended Implementation: PPTX-to-PDF Conversion System with Gotenberg

Architecture Overview

Asynchronous server-side conversion using Gotenberg with shared storage

User Upload → API saves PPTX → Job in Queue → Worker calls Gotenberg API
                                                      ↓
                                           Gotenberg converts (PPTX sent over HTTP)
                                                      ↓
Client requests → API checks DB status → PDF ready? → Download PDF from shared storage
                                       → Pending? → "Please wait"
                                       → Failed? → Retry/Error

1. Database Schema

CREATE TABLE media_files (
    id UUID PRIMARY KEY,
    filename VARCHAR(255),
    original_path VARCHAR(512),
    file_type VARCHAR(10),
    mime_type VARCHAR(100),
    uploaded_at TIMESTAMP,
    updated_at TIMESTAMP
);

CREATE TABLE conversions (
    id UUID PRIMARY KEY,
    source_file_id UUID REFERENCES media_files(id) ON DELETE CASCADE,
    target_format VARCHAR(10),          -- 'pdf'
    target_path VARCHAR(512),           -- Path to generated PDF
    status VARCHAR(20),                 -- 'pending', 'processing', 'ready', 'failed'
    started_at TIMESTAMP,
    completed_at TIMESTAMP,
    error_message TEXT,
    file_hash VARCHAR(64)               -- Hash of PPTX for cache invalidation
);

CREATE INDEX idx_conversions_source ON conversions(source_file_id, target_format);
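
The status column implies a small state machine. As a minimal sketch, the allowed transitions could be made explicit in Python. The names `ConversionStatus`, `ALLOWED_TRANSITIONS`, and `can_transition` are hypothetical, and treating both 'ready' and 'failed' as terminal (a retry creates a fresh conversion row) is an assumption, not something the schema enforces.

```python
from enum import Enum


class ConversionStatus(str, Enum):
    """Status values from the conversions.status column."""
    PENDING = "pending"
    PROCESSING = "processing"
    READY = "ready"
    FAILED = "failed"


# Hypothetical transition table: which status may follow which.
# Assumption: 'ready' and 'failed' are terminal, and a retry inserts a new row.
ALLOWED_TRANSITIONS = {
    ConversionStatus.PENDING: {ConversionStatus.PROCESSING},
    ConversionStatus.PROCESSING: {ConversionStatus.READY, ConversionStatus.FAILED},
    ConversionStatus.READY: set(),
    ConversionStatus.FAILED: set(),
}


def can_transition(current: ConversionStatus, new: ConversionStatus) -> bool:
    """Return True if moving from `current` to `new` is allowed."""
    return new in ALLOWED_TRANSITIONS[current]
```

A guard like this in the DB update helper would catch a worker that tries to overwrite a finished conversion.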

2. Components

API Server (existing)

  • Accepts uploads
  • Creates DB entries
  • Enqueues jobs
  • Delivers status and files

Background Worker (new)

  • Runs as separate process in same container as API
  • Processes conversion jobs from queue
  • Calls Gotenberg API for conversion
  • Updates database with results
  • Technology: Python RQ, Celery, or similar

Gotenberg Container (new)

  • Dedicated conversion service
  • HTTP API for document conversion
  • Handles LibreOffice conversions internally
  • Receives files as multipart uploads over HTTP (a shared volume mount is optional)

Message Queue

  • Redis (recommended to start with: simple and fast)
  • Alternative: RabbitMQ for more features

Redis Container (separate)

  • Handles job queue
  • Minimal resource footprint

Shared Storage

  • Docker volume mounted into every container that needs file access
  • API and worker read and write the same files; mounting it into Gotenberg is optional because conversions are sent over HTTP
  • Simplifies file exchange between services

3. Detailed Workflow

Upload Process:

@app.post("/upload")
async def upload_file(file):
    # 1. Save PPTX to shared volume
    file_path = save_to_disk(file)  # e.g., /shared/uploads/abc123.pptx

    # 2. DB entry for original file
    file_record = db.create_media_file({
        'filename': file.filename,
        'original_path': file_path,
        'file_type': 'pptx'
    })

    # 3. Create conversion record
    conversion = db.create_conversion({
        'source_file_id': file_record.id,
        'target_format': 'pdf',
        'status': 'pending',
        'file_hash': calculate_hash(file_path)
    })

    # 4. Enqueue job (asynchronous!)
    queue.enqueue(convert_to_pdf_via_gotenberg, conversion.id)

    # 5. Return immediately to user
    return {
        'file_id': file_record.id,
        'status': 'uploaded',
        'conversion_status': 'pending'
    }
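
The upload handler calls `calculate_hash` without defining it. A minimal sketch, assuming SHA-256 (which matches the 64-character `file_hash` column) and chunked reads so large presentations are not loaded into memory at once; the `chunk_size` parameter is an illustrative choice:

```python
import hashlib


def calculate_hash(file_path: str, chunk_size: int = 1024 * 1024) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(file_path, "rb") as f:
        # iter() with a sentinel stops when read() returns an empty bytes object
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```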

Worker Process (calls Gotenberg):

import requests
import os

GOTENBERG_URL = os.getenv('GOTENBERG_URL', 'http://gotenberg:3000')

def convert_to_pdf_via_gotenberg(conversion_id):
    conversion = db.get_conversion(conversion_id)
    source_file = db.get_media_file(conversion.source_file_id)

    # Status update: processing
    db.update_conversion(conversion_id, {
        'status': 'processing',
        'started_at': now()
    })

    try:
        # Prepare output path
        pdf_filename = f"{conversion.id}.pdf"
        pdf_path = f"/shared/converted/{pdf_filename}"

        # Call Gotenberg API
        # The file is uploaded to Gotenberg as multipart/form-data over HTTP
        with open(source_file.original_path, 'rb') as f:
            files = {
                'files': (os.path.basename(source_file.original_path), f)
            }

            response = requests.post(
                f'{GOTENBERG_URL}/forms/libreoffice/convert',
                files=files,
                timeout=300  # 5 minutes timeout
            )
            response.raise_for_status()

        # Save PDF to shared volume
        with open(pdf_path, 'wb') as pdf_file:
            pdf_file.write(response.content)

        # Success
        db.update_conversion(conversion_id, {
            'status': 'ready',
            'target_path': pdf_path,
            'completed_at': now()
        })

    except requests.exceptions.Timeout:
        db.update_conversion(conversion_id, {
            'status': 'failed',
            'error_message': 'Conversion timeout after 5 minutes',
            'completed_at': now()
        })
    except requests.exceptions.RequestException as e:
        db.update_conversion(conversion_id, {
            'status': 'failed',
            'error_message': f'Gotenberg API error: {str(e)}',
            'completed_at': now()
        })
    except Exception as e:
        db.update_conversion(conversion_id, {
            'status': 'failed',
            'error_message': str(e),
            'completed_at': now()
        })
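
The worker writes the PDF straight to its final path. If the process dies mid-write, a truncated file stays on the shared volume under the name the API will later serve. One possible hardening, sketched here under the assumption of a POSIX filesystem, is to write to a temporary file in the same directory and rename it into place. `write_atomically` is a hypothetical helper, not part of the original worker.

```python
import os
import tempfile


def write_atomically(target_path: str, data: bytes) -> None:
    """Write data to target_path so readers never see a partial file."""
    directory = os.path.dirname(target_path) or "."
    os.makedirs(directory, exist_ok=True)
    # Keep the temp file in the same directory so os.replace stays on one filesystem
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())
        os.replace(tmp_path, target_path)  # atomic rename on POSIX
    except BaseException:
        # Clean up the temp file if anything went wrong before the rename
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

In the worker, `write_atomically(pdf_path, response.content)` would take the place of the plain `open(...).write(...)` step.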

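Alternative: Compact Worker Variant

Gotenberg's LibreOffice route takes its input as a multipart upload rather than a filesystem path, so the PPTX still travels over HTTP in this variant. The shared volume is what lets the worker read the upload and write the PDF where the API can serve it. This shorter version lets requests derive the upload filename from the file object and collapses error handling into one catch-all:

def convert_to_pdf_via_gotenberg_shared(conversion_id):
    conversion = db.get_conversion(conversion_id)
    source_file = db.get_media_file(conversion.source_file_id)

    db.update_conversion(conversion_id, {
        'status': 'processing',
        'started_at': now()
    })

    try:
        pdf_filename = f"{conversion.id}.pdf"
        pdf_path = f"/shared/converted/{pdf_filename}"

        # The worker reads the PPTX from the shared volume and uploads it;
        # requests takes the upload filename from the file object's name
        with open(source_file.original_path, 'rb') as f:
            files = {'files': f}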

            response = requests.post(
                f'{GOTENBERG_URL}/forms/libreoffice/convert',
                files=files,
                timeout=300
            )
            response.raise_for_status()

        # Write result to shared volume
        with open(pdf_path, 'wb') as pdf_file:
            pdf_file.write(response.content)

        db.update_conversion(conversion_id, {
            'status': 'ready',
            'target_path': pdf_path,
            'completed_at': now()
        })

    except Exception as e:
        db.update_conversion(conversion_id, {
            'status': 'failed',
            'error_message': str(e),
            'completed_at': now()
        })

Client Download:

@app.get("/files/{file_id}/display")
async def get_display_file(file_id):
    file = db.get_media_file(file_id)

    # Only for PPTX: check PDF conversion
    if file.file_type == 'pptx':
        conversion = db.get_latest_conversion(file.id, target_format='pdf')

        if not conversion:
            # Shouldn't happen, but just to be safe
            trigger_new_conversion(file.id)
            return {'status': 'pending', 'message': 'Conversion is being created'}

        if conversion.status == 'ready':
            # Serve PDF from shared storage
            return FileResponse(conversion.target_path)

        elif conversion.status == 'failed':
            # Optional: Auto-retry
            trigger_new_conversion(file.id)
            return {'status': 'failed', 'error': conversion.error_message}

        else:  # pending or processing
            return {'status': conversion.status, 'message': 'Please wait...'}

    # Serve other file types directly
    return FileResponse(file.original_path)
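
The endpoint relies on `db.get_latest_conversion`, which is not shown. A minimal sketch of the lookup, using sqlite3 purely so it runs standalone (the guide itself uses PostgreSQL). The schema has no `created_at` column, so this sketch assumes that rows not yet started (`started_at IS NULL`) count as the newest; `create_demo_schema` is a hypothetical cut-down table for the demo only.

```python
import sqlite3


def create_demo_schema(conn: sqlite3.Connection) -> None:
    """Create a cut-down conversions table (demo only)."""
    conn.execute(
        """
        CREATE TABLE conversions (
            id TEXT PRIMARY KEY,
            source_file_id TEXT,
            target_format TEXT,
            target_path TEXT,
            status TEXT,
            started_at TEXT
        )
        """
    )


def get_latest_conversion(conn, source_file_id, target_format="pdf"):
    """Return (id, status, target_path) of the newest conversion, or None.

    Rows with started_at IS NULL have not been picked up yet, so they are
    treated as newer than any row that already started.
    """
    return conn.execute(
        """
        SELECT id, status, target_path
        FROM conversions
        WHERE source_file_id = ? AND target_format = ?
        ORDER BY (started_at IS NULL) DESC, started_at DESC
        LIMIT 1
        """,
        (source_file_id, target_format),
    ).fetchone()
```

The query is served by the existing `idx_conversions_source` index on `(source_file_id, target_format)`.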

4. Docker Setup

version: '3.8'

services:
  # Your API Server
  api:
    build: ./api
    command: uvicorn main:app --host 0.0.0.0 --port 8000
    ports:
      - "8000:8000"
    volumes:
      - shared-storage:/shared  # Shared volume
    environment:
      - REDIS_URL=redis://redis:6379
      - DATABASE_URL=postgresql://postgres:password@postgres:5432/infoscreen
      - GOTENBERG_URL=http://gotenberg:3000
    depends_on:
      - redis
      - postgres
      - gotenberg
    restart: unless-stopped

  # Worker (same codebase as API, different command)
  worker:
    build: ./api  # Same build as API!
    command: python worker.py  # or: rq worker
    volumes:
      - shared-storage:/shared  # Shared volume
    environment:
      - REDIS_URL=redis://redis:6379
      - DATABASE_URL=postgresql://postgres:password@postgres:5432/infoscreen
      - GOTENBERG_URL=http://gotenberg:3000
    depends_on:
      - redis
      - postgres
      - gotenberg
    restart: unless-stopped
    # Optional: Multiple workers
    deploy:
      replicas: 2

  # Gotenberg - Document Conversion Service
  gotenberg:
    image: gotenberg/gotenberg:8
    # Gotenberg doesn't need the shared volume if files are sent via HTTP
    # But mount it if you want direct file access
    volumes:
      - shared-storage:/shared  # Optional: for direct file access
    command:
      # Gotenberg 8 takes its configuration as command-line flags
      - "gotenberg"
      - "--api-timeout=300s"
      - "--log-level=info"
    restart: unless-stopped
    # Resource limits (optional but recommended)
    deploy:
      resources:
        limits:
          cpus: '2.0'
          memory: 2G
        reservations:
          cpus: '0.5'
          memory: 512M

  # Redis - separate container
  redis:
    image: redis:7-alpine
    volumes:
      - redis-data:/data
    command: redis-server --appendonly yes
    restart: unless-stopped

  # Your existing Postgres
  postgres:
    image: postgres:15
    environment:
      - POSTGRES_DB=infoscreen
      - POSTGRES_PASSWORD=password
    volumes:
      - postgres-data:/var/lib/postgresql/data
    restart: unless-stopped

  # Optional: Redis Commander (UI for debugging)
  redis-commander:
    image: rediscommander/redis-commander
    environment:
      - REDIS_HOSTS=local:redis:6379
    ports:
      - "8081:8081"
    depends_on:
      - redis

volumes:
  shared-storage:  # New: Shared storage for all file operations
  redis-data:
  postgres-data:

5. Storage Structure

/shared/
├── uploads/           # Original uploaded files (PPTX, etc.)
│   ├── abc123.pptx
│   ├── def456.pptx
│   └── ...
└── converted/         # Converted PDF files
    ├── uuid-1.pdf
    ├── uuid-2.pdf
    └── ...
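
Because every service addresses files under this one root, it helps to resolve any client-influenced name against the root and reject anything that escapes it. A minimal sketch, assuming Python 3.9+ for `Path.is_relative_to`; `SHARED_ROOT` and `resolve_under_root` are hypothetical names:

```python
from pathlib import Path

SHARED_ROOT = Path("/shared")  # assumed mount point, as in the tree above


def resolve_under_root(relative_name: str, root: Path = SHARED_ROOT) -> Path:
    """Resolve a client-supplied name inside root, rejecting path traversal."""
    root_resolved = root.resolve()
    # Joining with an absolute name discards the root, which the check below catches
    candidate = (root_resolved / relative_name).resolve()
    if not candidate.is_relative_to(root_resolved):
        raise ValueError(f"path escapes storage root: {relative_name!r}")
    return candidate
```

For example, `resolve_under_root("uploads/abc123.pptx")` stays inside the root, while `"../etc/passwd"` or an absolute path raises `ValueError`.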

6. Gotenberg Integration Details

Gotenberg API Endpoints:

Gotenberg provides various conversion endpoints:

# LibreOffice conversion (for PPTX, DOCX, ODT, etc.)
POST http://gotenberg:3000/forms/libreoffice/convert

# HTML to PDF
POST http://gotenberg:3000/forms/chromium/convert/html

# Markdown to PDF
POST http://gotenberg:3000/forms/chromium/convert/markdown

# Merge PDFs
POST http://gotenberg:3000/forms/pdfengines/merge

Example Conversion Request:

import os
import requests

def convert_with_gotenberg(input_file_path, output_file_path):
    """
    Convert document using Gotenberg
    """
    with open(input_file_path, 'rb') as f:
        files = {
            'files': (os.path.basename(input_file_path), f,
                     'application/vnd.openxmlformats-officedocument.presentationml.presentation')
        }

        # Optional: Add conversion parameters
        data = {
            'landscape': 'false',  # Portrait mode
            'nativePageRanges': '1-',  # All pages
        }

        response = requests.post(
            'http://gotenberg:3000/forms/libreoffice/convert',
            files=files,
            data=data,
            timeout=300
        )

        if response.status_code == 200:
            with open(output_file_path, 'wb') as out:
                out.write(response.content)
            return True
        else:
            raise Exception(f"Gotenberg error: {response.status_code} - {response.text}")

Advanced Options:

# With custom PDF properties
data = {
    'landscape': 'false',
    'nativePageRanges': '1-10',  # Only first 10 pages
    'pdfa': 'PDF/A-1b',          # PDF/A format; Gotenberg 8 names this field 'pdfa'
    'exportFormFields': 'false',
}

# With password protection
data = {
    'userPassword': 'secret123',
    'ownerPassword': 'admin456',
}

7. Client Behavior (Pi5)

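# On the Pi5 client
import subprocess
import time

def display_file(file_id, max_attempts=60):
    # Poll in a loop instead of recursing, so a long conversion
    # cannot exhaust the call stack
    for _ in range(max_attempts):
        response = api.get(f"/files/{file_id}/display")
        content_type = response.headers.get('Content-Type', '')

        if content_type.startswith('application/pdf'):
            # PDF is ready; download_and_display is assumed to return
            # the local path of the saved PDF
            downloaded_pdf = download_and_display(response)
            subprocess.run(['impressive', downloaded_pdf])
            return

        if response.json()['status'] in ['pending', 'processing']:
            # Wait and retry
            show_loading_screen("Presentation is being prepared...")
            time.sleep(5)
            continue

        break  # failed or unknown status

    # Error, or still not ready after max_attempts polls
    show_error_screen("Error loading presentation")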

8. Additional Features

Cache Invalidation on PPTX Update:

@app.put("/files/{file_id}")
async def update_file(file_id, new_file):
    # Delete old conversions and PDFs
    conversions = db.get_conversions_for_file(file_id)
    for conv in conversions:
        if conv.target_path and os.path.exists(conv.target_path):
            os.remove(conv.target_path)

    db.mark_conversions_as_obsolete(file_id)

    # Update file
    update_media_file(file_id, new_file)

    # Trigger new conversion
    trigger_conversion(file_id, 'pdf')

Status API for Monitoring:

@app.get("/admin/conversions/status")
async def get_conversion_stats():
    return {
        'pending': db.count(status='pending'),
        'processing': db.count(status='processing'),
        'failed': db.count(status='failed'),
        'avg_duration_seconds': db.avg_duration(),
        'gotenberg_health': check_gotenberg_health()
    }

def check_gotenberg_health():
    try:
        response = requests.get(
            f'{GOTENBERG_URL}/health',
            timeout=5
        )
        return response.status_code == 200
    except requests.exceptions.RequestException:
        return False

Cleanup Job (Cronjob):

def cleanup_old_conversions():
    # Remove PDFs from deleted files
    orphaned = db.get_orphaned_conversions()
    for conv in orphaned:
        if conv.target_path and os.path.exists(conv.target_path):
            os.remove(conv.target_path)
        db.delete_conversion(conv.id)

    # Clean up old failed conversions
    old_failed = db.get_old_failed_conversions(older_than_days=7)
    for conv in old_failed:
        db.delete_conversion(conv.id)
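
The cleanup job starts from database rows. The opposite case, PDFs on disk that no row points to (for example after a crash between writing the file and updating the DB), can be found by comparing the directory listing against the known `target_path` values. A minimal sketch; `find_untracked_pdfs` is a hypothetical helper, and how `known_paths` is fetched from the database is left open:

```python
from pathlib import Path


def find_untracked_pdfs(converted_dir, known_paths):
    """Return PDFs in converted_dir that no conversion row references."""
    known = {Path(p).resolve() for p in known_paths}
    return sorted(
        p for p in Path(converted_dir).glob("*.pdf") if p.resolve() not in known
    )
```

A cron run could log the result first and delete only files older than some grace period, so a PDF that a worker is just about to register is not removed.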

9. Advantages of Using Gotenberg

  • Specialized Service: Optimized specifically for document conversion
  • No LibreOffice Management: Gotenberg handles LibreOffice lifecycle internally
  • Better Resource Management: Isolated conversion process
  • HTTP API: Clean, standard interface
  • Production Ready: Battle-tested, actively maintained
  • Multiple Formats: Supports PPTX, DOCX, ODT, HTML, Markdown, etc.
  • PDF Features: Merge, encrypt, watermark PDFs
  • Health Checks: Built-in health endpoint
  • Horizontal Scaling: Can run multiple Gotenberg instances
  • Memory Safe: Automatic cleanup and restart on issues

10. Migration Path

Phase 1 (MVP):

  • 1 worker process in API container
  • Redis for queue (separate container)
  • Gotenberg for conversion (separate container)
  • Basic DB schema
  • Shared volume for file exchange
  • Simple retry logic

Phase 2 (as needed):

  • Multiple worker instances
  • Multiple Gotenberg instances (load balancing)
  • Monitoring & alerting
  • Prioritization logic
  • Advanced caching strategies
  • PDF optimization/compression

Start simple, scale when needed!

11. Key Decisions Summary

| Aspect              | Decision                            | Reason                                          |
|---------------------|-------------------------------------|-------------------------------------------------|
| Conversion Location | Server-side (Gotenberg)             | One conversion per file, consistent results     |
| Conversion Service  | Dedicated Gotenberg container       | Specialized, production-ready, better isolation |
| Conversion Timing   | Asynchronous (on upload)            | No client waiting time, predictable performance |
| Data Storage        | Database-tracked                    | Status visibility, robust error handling        |
| File Exchange       | Shared Docker volume                | Simple, efficient, no network overhead          |
| Queue System        | Redis (separate container)          | Standard pattern, scalable, maintainable        |
| Worker Architecture | Background process in API container | Simple start, easy to separate later            |

12. File Flow Diagram

┌─────────────┐
│ User Upload │
│   (PPTX)    │
└──────┬──────┘
       │
       ▼
┌──────────────────────┐
│   API Server         │
│ 1. Save to /shared   │
│ 2. Create DB record  │
│ 3. Enqueue job       │
└──────┬───────────────┘
       │
       ▼
┌──────────────────┐
│  Redis Queue     │
└──────┬───────────┘
       │
       ▼
┌──────────────────────┐
│  Worker Process      │
│ 1. Get job           │
│ 2. Call Gotenberg    │
│ 3. Update DB         │
└──────┬───────────────┘
       │
       ▼
┌──────────────────────┐
│  Gotenberg           │
│ 1. Receive PPTX      │
│ 2. Convert PPTX      │
│ 3. Return PDF        │
└──────┬───────────────┘
       │
       ▼
┌──────────────────────┐
│  Worker saves PDF    │
│  to /shared/converted│
└──────┬───────────────┘
       │
       ▼
┌──────────────────────┐
│  Client Requests     │
│ 1. Check DB          │
│ 2. Download PDF      │
│ 3. Display           │
└──────────────────────┘
   (via impressive)

13. Implementation Checklist

Database Setup

  • Create media_files table
  • Create conversions table
  • Add indexes for performance
  • Set up foreign key constraints

Storage Setup

  • Create shared Docker volume
  • Set up directory structure (/shared/uploads, /shared/converted)
  • Configure proper permissions

API Changes

  • Modify upload endpoint to save to shared storage
  • Create DB records for uploads
  • Add conversion job enqueueing
  • Implement file download endpoint with status checking
  • Add status API for monitoring
  • Implement cache invalidation on file update

Worker Setup

  • Create worker script/module
  • Implement Gotenberg API calls
  • Add error handling and retry logic
  • Set up logging and monitoring
  • Handle timeouts and failures

Docker Configuration

  • Add Gotenberg container to docker-compose.yml
  • Add Redis container to docker-compose.yml
  • Configure worker container
  • Set up shared volume mounts
  • Configure environment variables
  • Set up container dependencies
  • Configure resource limits for Gotenberg

Client Updates

  • Modify client to check conversion status
  • Implement retry logic for pending conversions
  • Add loading/waiting screens
  • Implement error handling

Testing

  • Test upload → conversion → download flow
  • Test multiple concurrent conversions
  • Test error handling (corrupted PPTX, etc.)
  • Test Gotenberg timeout handling
  • Test cache invalidation on file update
  • Load test with multiple clients
  • Test Gotenberg health checks

Monitoring & Operations

  • Set up logging for conversions
  • Monitor Gotenberg health endpoint
  • Implement cleanup job for old files
  • Add metrics for conversion times
  • Set up alerts for failed conversions
  • Monitor shared storage disk usage
  • Document backup procedures

Security

  • Validate file types before conversion
  • Set file size limits
  • Sanitize filenames
  • Implement rate limiting
  • Secure inter-container communication
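
For the file-type validation item, a minimal sketch that checks the extension against the first bytes of the upload. It assumes the accepted types are .ppt, .pptx, and .odp; `looks_like_presentation` is a hypothetical helper. The ZIP signature only shows that the file is a ZIP container, not that it is a valid presentation, so this is a cheap first filter rather than full validation.

```python
import os

ZIP_MAGIC = b"PK\x03\x04"  # .pptx and .odp are ZIP containers
OLE2_MAGIC = b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1"  # legacy .ppt (OLE2 compound file)

# Expected leading bytes per accepted extension
ALLOWED_SIGNATURES = {
    ".pptx": ZIP_MAGIC,
    ".odp": ZIP_MAGIC,
    ".ppt": OLE2_MAGIC,
}


def looks_like_presentation(filename: str, header: bytes) -> bool:
    """Return True if the extension is accepted and the header matches it."""
    ext = os.path.splitext(filename)[1].lower()
    expected = ALLOWED_SIGNATURES.get(ext)
    return expected is not None and header.startswith(expected)
```

The upload handler would read the first eight bytes of the stream, call this check, and reject the request before anything is written to shared storage.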

14. Gotenberg Configuration Options

Command-Line Flags (Gotenberg 8 is configured through flags rather than environment variables):

gotenberg:
  image: gotenberg/gotenberg:8
  command:
    - "gotenberg"
    # API Configuration
    - "--api-timeout=300s"
    - "--api-port=3000"

    # Logging
    - "--log-level=info"  # debug, info, warn, error

    # LibreOffice
    - "--libreoffice-disable-routes=false"
    - "--libreoffice-auto-start=true"

    # Chromium (if needed for HTML/Markdown)
    - "--chromium-disable-routes=true"  # Disable if not needed

    # Resource limits
    - "--libreoffice-max-queue-size=100"

Custom Gotenberg Configuration:

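For advanced configurations, a gotenberg.yml might look like this. Gotenberg 8 is documented as taking its settings from command-line flags, so confirm that your version actually reads a config file before relying on it: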

api:
  timeout: 300s
  port: 3000

libreoffice:
  autoStart: true
  maxQueueSize: 100

chromium:
  disableRoutes: true

Mount it in docker-compose:

gotenberg:
  image: gotenberg/gotenberg:8
  volumes:
    - ./gotenberg.yml:/etc/gotenberg/config.yml:ro
    - shared-storage:/shared

15. Troubleshooting

Common Issues:

Gotenberg timeout:

# Increase timeout for large files
response = requests.post(
    f'{GOTENBERG_URL}/forms/libreoffice/convert',
    files=files,
    timeout=600  # 10 minutes for large PPTX
)

Memory issues:

# Increase Gotenberg memory limit
gotenberg:
  deploy:
    resources:
      limits:
        memory: 4G

File permission issues:

# Ensure proper permissions on shared volume
chmod -R 755 /shared
chown -R 1000:1000 /shared

Gotenberg not responding:

# Check health before conversion
def ensure_gotenberg_healthy():
    try:
        response = requests.get(f'{GOTENBERG_URL}/health', timeout=5)
        if response.status_code != 200:
            raise Exception("Gotenberg unhealthy")
    except Exception as e:
        logger.error(f"Gotenberg health check failed: {e}")
        raise

This architecture provides a production-ready, scalable solution using Gotenberg as a specialized conversion service with efficient file sharing via Docker volumes!

16. Best Practices Specific to Infoscreen

  • Idempotency by content: Always compute a SHA256 of the uploaded source and include it in the unique key (source_event_media_id, target_format, file_hash). This prevents duplicate work for identical content and auto-busts cache on change.
  • Strict MIME/type validation: Accept only .ppt, .pptx, .odp for conversion. Reject unknown types early. Consider reading the first bytes (magic) for extra safety.
  • Bounded retries with jitter: Retry conversions on transient HTTP 5xx or timeouts up to N times with exponential backoff. Do not retry on 4xx or clear user errors.
  • Output naming: Derive deterministic output paths under media/converted/, e.g., .pdf. Ensure no path traversal and sanitize names.
  • Timeouts and size limits: Enforce server-side max upload size and per-job conversion timeout (e.g., 10 minutes). Return clear errors for oversized/long-running files.
  • Isolation and quotas: Set CPU/memory limits for Gotenberg; consider a concurrency cap per worker to avoid DB starvation.
  • Health probes before work: Check Gotenberg /health prior to enqueue spikes; fail-fast to avoid queue pile-ups when Gotenberg is down.
  • Observability: Log job IDs, file hashes, durations, and sizes. Expose a small /api/conversions/status summary for operational visibility.
  • Cleanup policy: Periodically delete orphaned conversions (media deleted) and failed jobs older than X days. Keep successful PDFs aligned with DB rows.
  • Security: Never trust client paths; always resolve relative to the known media root. Do not expose the shared volume directly; serve via API only.
  • Backpressure: If queue length exceeds a threshold, surface 503/“try later” on new uploads or pause enqueue to protect the system.
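
The bounded-retry item above can be sketched as follows. This is one possible shape, assuming "full jitter" (a random delay between zero and the exponential cap). `TransientError`, `backoff_delays`, and `run_with_retries` are hypothetical names, and the worker would need to map HTTP 5xx responses and timeouts onto `TransientError` itself.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for retryable failures (HTTP 5xx, timeouts)."""


def backoff_delays(max_attempts, base=2.0, cap=60.0, rng=random.random):
    """Yield the sleep before each retry: random in [0, min(cap, base * 2**n))."""
    for attempt in range(max_attempts - 1):
        yield rng() * min(cap, base * (2 ** attempt))


def run_with_retries(job, max_attempts=3, sleep=time.sleep, **backoff):
    """Call job() until it succeeds or max_attempts is reached.

    Only TransientError is retried; any other exception (for example a
    4xx-style user error) propagates immediately.
    """
    delays = backoff_delays(max_attempts, **backoff)
    for attempt in range(1, max_attempts + 1):
        try:
            return job()
        except TransientError:
            if attempt == max_attempts:
                raise
            sleep(next(delays))
```

Passing `sleep` and `rng` as parameters keeps the helper easy to test without real waiting.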