# Recommended Implementation: PPTX-to-PDF Conversion System with Gotenberg

## Architecture Overview

**Asynchronous server-side conversion using Gotenberg with shared storage**

```
User Upload → API saves PPTX → Job in Queue → Worker calls Gotenberg API
                                                      ↓
                                           Gotenberg converts the uploaded file, returns PDF
                                                      ↓
Client requests → API checks DB status → PDF ready? → Download PDF from shared storage
                                       → Pending? → "Please wait"
                                       → Failed? → Retry/Error
```

## 1. Database Schema

```sql
CREATE TABLE media_files (
    id UUID PRIMARY KEY,
    filename VARCHAR(255),
    original_path VARCHAR(512),
    file_type VARCHAR(10),
    mime_type VARCHAR(100),
    uploaded_at TIMESTAMP,
    updated_at TIMESTAMP
);

CREATE TABLE conversions (
    id UUID PRIMARY KEY,
    source_file_id UUID REFERENCES media_files(id) ON DELETE CASCADE,
    target_format VARCHAR(10),          -- 'pdf'
    target_path VARCHAR(512),           -- Path to generated PDF
    status VARCHAR(20),                 -- 'pending', 'processing', 'ready', 'failed'
    started_at TIMESTAMP,
    completed_at TIMESTAMP,
    error_message TEXT,
    file_hash VARCHAR(64)               -- Hash of PPTX for cache invalidation
);

CREATE INDEX idx_conversions_source ON conversions(source_file_id, target_format);
```

## 2. Components

### **API Server (existing)**
- Accepts uploads
- Creates DB entries
- Enqueues jobs
- Delivers status and files

### **Background Worker (new)**
- Runs as separate process in **same container** as API
- Processes conversion jobs from queue
- Calls Gotenberg API for conversion
- Updates database with results
- Technology: Python RQ, Celery, or similar

### **Gotenberg Container (new)**
- Dedicated conversion service
- HTTP API for document conversion
- Handles LibreOffice conversions internally
- Receives source files as multipart HTTP uploads and returns the PDF in the response

### **Message Queue**
- Redis (recommended for start - simple, fast)
- Alternative: RabbitMQ for more features

### **Redis Container (separate)**
- Handles job queue
- Minimal resource footprint

### **Shared Storage**
- Docker volume mounted into every container that needs file access
- API and Worker read and write the same files; Gotenberg receives its input over HTTP and does not need the mount
- Simplifies file exchange between API and Worker

## 3. Detailed Workflow

### **Upload Process:**

```python
@app.post("/upload")
async def upload_file(file):
    # 1. Save PPTX to shared volume
    file_path = save_to_disk(file)  # e.g., /shared/uploads/abc123.pptx

    # 2. DB entry for original file
    file_record = db.create_media_file({
        'filename': file.filename,
        'original_path': file_path,
        'file_type': 'pptx'
    })

    # 3. Create conversion record
    conversion = db.create_conversion({
        'source_file_id': file_record.id,
        'target_format': 'pdf',
        'status': 'pending',
        'file_hash': calculate_hash(file_path)
    })

    # 4. Enqueue job (asynchronous!)
    queue.enqueue(convert_to_pdf_via_gotenberg, conversion.id)

    # 5. Return immediately to user
    return {
        'file_id': file_record.id,
        'status': 'uploaded',
        'conversion_status': 'pending'
    }
```
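
The upload handler calls `calculate_hash`, which this guide does not define. A minimal sketch, assuming SHA-256 (its hex digest is 64 characters, which matches the `file_hash VARCHAR(64)` column) and a chunked read so large decks are not loaded into memory at once. The chunk size is an arbitrary choice:

```python
import hashlib


def calculate_hash(file_path, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(file_path, 'rb') as f:
        # iter() with a sentinel stops when read() returns b'' at end of file
        for chunk in iter(lambda: f.read(chunk_size), b''):
            digest.update(chunk)
    return digest.hexdigest()
```

Comparing this digest with the stored `file_hash` is enough to decide whether an existing PDF can be reused or must be regenerated.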

### **Worker Process (calls Gotenberg):**

```python
import requests
import os

GOTENBERG_URL = os.getenv('GOTENBERG_URL', 'http://gotenberg:3000')

def convert_to_pdf_via_gotenberg(conversion_id):
    conversion = db.get_conversion(conversion_id)
    source_file = db.get_media_file(conversion.source_file_id)

    # Status update: processing
    db.update_conversion(conversion_id, {
        'status': 'processing',
        'started_at': now()
    })

    try:
        # Prepare output path
        pdf_filename = f"{conversion.id}.pdf"
        pdf_path = f"/shared/converted/{pdf_filename}"

        # Call Gotenberg API
        # The worker reads the PPTX from the shared volume and uploads it over HTTP
        with open(source_file.original_path, 'rb') as f:
            files = {
                'files': (os.path.basename(source_file.original_path), f)
            }

            response = requests.post(
                f'{GOTENBERG_URL}/forms/libreoffice/convert',
                files=files,
                timeout=300  # 5 minutes timeout
            )
            response.raise_for_status()

        # Save PDF to shared volume
        with open(pdf_path, 'wb') as pdf_file:
            pdf_file.write(response.content)

        # Success
        db.update_conversion(conversion_id, {
            'status': 'ready',
            'target_path': pdf_path,
            'completed_at': now()
        })

    except requests.exceptions.Timeout:
        db.update_conversion(conversion_id, {
            'status': 'failed',
            'error_message': 'Conversion timeout after 5 minutes',
            'completed_at': now()
        })
    except requests.exceptions.RequestException as e:
        db.update_conversion(conversion_id, {
            'status': 'failed',
            'error_message': f'Gotenberg API error: {str(e)}',
            'completed_at': now()
        })
    except Exception as e:
        db.update_conversion(conversion_id, {
            'status': 'failed',
            'error_message': str(e),
            'completed_at': now()
        })
```
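
The worker above writes the PDF straight to its final path, so a crash in the middle of the write could leave a truncated file that the API later serves. One way to avoid that is to write to a temporary file in the same directory and rename it into place. This is a sketch, not part of the original design: `write_pdf_atomically` is a hypothetical helper, and it assumes the temporary file and the target live on the same filesystem (true for a single shared volume).

```python
import os
import tempfile


def write_pdf_atomically(chunks, target_path):
    """Write an iterable of byte chunks to target_path via temp file + rename."""
    target_dir = os.path.dirname(target_path) or '.'
    os.makedirs(target_dir, exist_ok=True)
    # Keep the temp file in the target directory so os.replace stays on one filesystem
    fd, tmp_path = tempfile.mkstemp(dir=target_dir, suffix='.part')
    try:
        with os.fdopen(fd, 'wb') as tmp:
            for chunk in chunks:
                if chunk:
                    tmp.write(chunk)
        # mkstemp creates the file with mode 0600; relax it so other containers can read
        os.chmod(tmp_path, 0o644)
        os.replace(tmp_path, target_path)  # atomic on POSIX within one filesystem
    except BaseException:
        # Remove the partial file so readers never see it
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

With `requests.post(..., stream=True)`, the call would be `write_pdf_atomically(response.iter_content(chunk_size=65536), pdf_path)`, which also avoids holding the whole PDF in memory.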

### **Alternative: Compact Worker Variant**

Gotenberg's `/forms/libreoffice/convert` route takes its input as a multipart upload, so it cannot be pointed at a path on the shared volume. The variant below still uploads the file over HTTP; it differs from the worker above only in its shorter upload call and single catch-all error handler:

```python
def convert_to_pdf_via_gotenberg_shared(conversion_id):
    conversion = db.get_conversion(conversion_id)
    source_file = db.get_media_file(conversion.source_file_id)

    db.update_conversion(conversion_id, {
        'status': 'processing',
        'started_at': now()
    })

    try:
        pdf_filename = f"{conversion.id}.pdf"
        pdf_path = f"/shared/converted/{pdf_filename}"

        # The worker reads the PPTX from the shared volume and uploads it to Gotenberg;
        # requests takes the multipart filename from the file object's name
        with open(source_file.original_path, 'rb') as f:
            files = {'files': f}

            response = requests.post(
                f'{GOTENBERG_URL}/forms/libreoffice/convert',
                files=files,
                timeout=300
            )
            response.raise_for_status()

        # Write result to shared volume
        with open(pdf_path, 'wb') as pdf_file:
            pdf_file.write(response.content)

        db.update_conversion(conversion_id, {
            'status': 'ready',
            'target_path': pdf_path,
            'completed_at': now()
        })

    except Exception as e:
        db.update_conversion(conversion_id, {
            'status': 'failed',
            'error_message': str(e),
            'completed_at': now()
        })
```

### **Client Download:**

```python
@app.get("/files/{file_id}/display")
async def get_display_file(file_id):
    file = db.get_media_file(file_id)

    # Only for PPTX: check PDF conversion
    if file.file_type == 'pptx':
        conversion = db.get_latest_conversion(file.id, target_format='pdf')

        if not conversion:
            # Shouldn't happen, but just to be safe
            trigger_new_conversion(file.id)
            return {'status': 'pending', 'message': 'Conversion is being created'}

        if conversion.status == 'ready':
            # Serve PDF from shared storage
            return FileResponse(conversion.target_path)

        elif conversion.status == 'failed':
            # Optional: Auto-retry
            trigger_new_conversion(file.id)
            return {'status': 'failed', 'error': conversion.error_message}

        else:  # pending or processing
            return {'status': conversion.status, 'message': 'Please wait...'}

    # Serve other file types directly
    return FileResponse(file.original_path)
```

## 4. Docker Setup

```yaml
version: '3.8'

services:
  # Your API Server
  api:
    build: ./api
    command: uvicorn main:app --host 0.0.0.0 --port 8000
    ports:
      - "8000:8000"
    volumes:
      - shared-storage:/shared  # Shared volume
    environment:
      - REDIS_URL=redis://redis:6379
      - DATABASE_URL=postgresql://postgres:password@postgres:5432/infoscreen
      - GOTENBERG_URL=http://gotenberg:3000
    depends_on:
      - redis
      - postgres
      - gotenberg
    restart: unless-stopped

  # Worker (same codebase as API, different command)
  worker:
    build: ./api  # Same build as API!
    command: python worker.py  # or: rq worker
    volumes:
      - shared-storage:/shared  # Shared volume
    environment:
      - REDIS_URL=redis://redis:6379
      - DATABASE_URL=postgresql://postgres:password@postgres:5432/infoscreen
      - GOTENBERG_URL=http://gotenberg:3000
    depends_on:
      - redis
      - postgres
      - gotenberg
    restart: unless-stopped
    # Optional: Multiple workers
    deploy:
      replicas: 2

  # Gotenberg - Document Conversion Service
  gotenberg:
    image: gotenberg/gotenberg:8
    # Files reach Gotenberg over HTTP, so the shared volume is not required here
    volumes:
      - shared-storage:/shared  # Optional: can be removed
    # Gotenberg is configured through command-line flags
    command:
      - "gotenberg"
      - "--api-timeout=300s"
      - "--log-level=info"
    restart: unless-stopped
    # Resource limits (optional but recommended)
    deploy:
      resources:
        limits:
          cpus: '2.0'
          memory: 2G
        reservations:
          cpus: '0.5'
          memory: 512M

  # Redis - separate container
  redis:
    image: redis:7-alpine
    volumes:
      - redis-data:/data
    command: redis-server --appendonly yes
    restart: unless-stopped

  # Your existing Postgres
  postgres:
    image: postgres:15
    environment:
      - POSTGRES_DB=infoscreen
      - POSTGRES_PASSWORD=password
    volumes:
      - postgres-data:/var/lib/postgresql/data
    restart: unless-stopped

  # Optional: Redis Commander (UI for debugging)
  redis-commander:
    image: rediscommander/redis-commander
    environment:
      - REDIS_HOSTS=local:redis:6379
    ports:
      - "8081:8081"
    depends_on:
      - redis

volumes:
  shared-storage:  # New: Shared storage for all file operations
  redis-data:
  postgres-data:
```

## 5. Storage Structure

```
/shared/
├── uploads/           # Original uploaded files (PPTX, etc.)
│   ├── abc123.pptx
│   ├── def456.pptx
│   └── ...
└── converted/         # Converted PDF files
    ├── uuid-1.pdf
    ├── uuid-2.pdf
    └── ...
```
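
Because every service builds paths under `/shared`, it helps to resolve them in one place and reject anything that escapes the root. The following is a sketch under assumptions: `resolve_under_root` and `SHARED_ROOT` are hypothetical names, and `Path.is_relative_to` requires Python 3.9 or newer.

```python
from pathlib import Path

SHARED_ROOT = Path('/shared')  # assumption: the mount point used in this guide


def resolve_under_root(relative_name, root=SHARED_ROOT):
    """Resolve relative_name under root, rejecting anything that escapes it."""
    root = Path(root).resolve()
    # Joining with an absolute path discards root, which the check below catches
    candidate = (root / relative_name).resolve()
    if not candidate.is_relative_to(root):
        raise ValueError(f'path escapes storage root: {relative_name!r}')
    return candidate
```

For example, `resolve_under_root('uploads/abc123.pptx')` returns a path inside `/shared/uploads`, while `resolve_under_root('../etc/passwd')` raises `ValueError`.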

## 6. Gotenberg Integration Details

### **Gotenberg API Endpoints:**

Gotenberg provides various conversion endpoints:

```
# LibreOffice conversion (for PPTX, DOCX, ODT, etc.)
POST http://gotenberg:3000/forms/libreoffice/convert

# HTML to PDF
POST http://gotenberg:3000/forms/chromium/convert/html

# Markdown to PDF
POST http://gotenberg:3000/forms/chromium/convert/markdown

# Merge PDFs
POST http://gotenberg:3000/forms/pdfengines/merge
```

### **Example Conversion Request:**

```python
import os
import requests

def convert_with_gotenberg(input_file_path, output_file_path):
    """
    Convert document using Gotenberg
    """
    with open(input_file_path, 'rb') as f:
        files = {
            'files': (os.path.basename(input_file_path), f,
                     'application/vnd.openxmlformats-officedocument.presentationml.presentation')
        }

        # Optional: Add conversion parameters
        data = {
            'landscape': 'false',  # Portrait mode
            'nativePageRanges': '1-',  # All pages
        }

        response = requests.post(
            'http://gotenberg:3000/forms/libreoffice/convert',
            files=files,
            data=data,
            timeout=300
        )

        if response.status_code == 200:
            with open(output_file_path, 'wb') as out:
                out.write(response.content)
            return True
        else:
            raise Exception(f"Gotenberg error: {response.status_code} - {response.text}")
```

### **Advanced Options:**

```python
# With custom PDF properties
data = {
    'landscape': 'false',
    'nativePageRanges': '1-10',  # Only first 10 pages
    'pdfa': 'PDF/A-1b',          # PDF/A format (form field name in Gotenberg 8)
    'exportFormFields': 'false',
}

# With password protection
data = {
    'userPassword': 'secret123',
    'ownerPassword': 'admin456',
}
```

## 7. Client Behavior (Pi5)

```python
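# On the Pi5 client
import subprocess
import time

def display_file(file_id, poll_seconds=5):
    # Poll in a loop rather than recursing, so long waits cannot exhaust the stack
    while True:
        response = api.get(f"/files/{file_id}/display")
        content_type = response.headers.get('Content-Type', '')

        if content_type.startswith('application/pdf'):
            # PDF is ready; download_and_display is assumed to return the local path
            downloaded_pdf = download_and_display(response)
            subprocess.run(['impressive', downloaded_pdf])
            return

        if response.json()['status'] in ['pending', 'processing']:
            # Wait and retry
            show_loading_screen("Presentation is being prepared...")
            time.sleep(poll_seconds)
            continue

        # Error
        show_error_screen("Error loading presentation")
        return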
```

## 8. Additional Features

### **Cache Invalidation on PPTX Update:**

```python
@app.put("/files/{file_id}")
async def update_file(file_id, new_file):
    # Delete old conversions and PDFs
    conversions = db.get_conversions_for_file(file_id)
    for conv in conversions:
        if conv.target_path and os.path.exists(conv.target_path):
            os.remove(conv.target_path)

    db.mark_conversions_as_obsolete(file_id)

    # Update file
    update_media_file(file_id, new_file)

    # Trigger new conversion
    trigger_conversion(file_id, 'pdf')
```

### **Status API for Monitoring:**

```python
@app.get("/admin/conversions/status")
async def get_conversion_stats():
    return {
        'pending': db.count(status='pending'),
        'processing': db.count(status='processing'),
        'failed': db.count(status='failed'),
        'avg_duration_seconds': db.avg_duration(),
        'gotenberg_health': check_gotenberg_health()
    }

def check_gotenberg_health():
    try:
        response = requests.get(
            f'{GOTENBERG_URL}/health',
            timeout=5
        )
        return response.status_code == 200
    except requests.exceptions.RequestException:
        return False
```

### **Cleanup Job (Cronjob):**

```python
def cleanup_old_conversions():
    # Remove PDFs from deleted files
    orphaned = db.get_orphaned_conversions()
    for conv in orphaned:
        if conv.target_path and os.path.exists(conv.target_path):
            os.remove(conv.target_path)
        db.delete_conversion(conv.id)

    # Clean up old failed conversions
    old_failed = db.get_old_failed_conversions(older_than_days=7)
    for conv in old_failed:
        db.delete_conversion(conv.id)
```
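
The cleanup job above starts from database rows. Files can also be orphaned in the other direction: a PDF on disk with no row pointing at it, for example after a crash between writing the file and updating the database. A minimal sketch of a filesystem sweep, assuming `known_paths` comes from a query over `conversions.target_path` (the query itself is not shown) and using a grace period so files still being written are left alone:

```python
import os
import time


def sweep_untracked_pdfs(converted_dir, known_paths, min_age_seconds=3600):
    """Delete PDFs in converted_dir that no DB row references; return removed paths."""
    known = {os.path.abspath(p) for p in known_paths}
    cutoff = time.time() - min_age_seconds
    removed = []
    for entry in os.scandir(converted_dir):
        if not entry.is_file() or not entry.name.endswith('.pdf'):
            continue
        path = os.path.abspath(entry.path)
        # Skip tracked files and anything recent enough to still be in flight
        if path in known or entry.stat().st_mtime > cutoff:
            continue
        os.remove(path)
        removed.append(path)
    return removed
```

The returned list is useful for logging how much each run reclaimed.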

## 9. Advantages of Using Gotenberg

- ✅ **Specialized Service**: Optimized specifically for document conversion
- ✅ **No LibreOffice Management**: Gotenberg handles LibreOffice lifecycle internally
- ✅ **Better Resource Management**: Isolated conversion process
- ✅ **HTTP API**: Clean, standard interface
- ✅ **Production Ready**: Battle-tested, actively maintained
- ✅ **Multiple Formats**: Supports PPTX, DOCX, ODT, HTML, Markdown, etc.
- ✅ **PDF Features**: Merge, encrypt, watermark PDFs
- ✅ **Health Checks**: Built-in health endpoint
- ✅ **Horizontal Scaling**: Can run multiple Gotenberg instances
- ✅ **Memory Safe**: Automatic cleanup and restart on issues

## 10. Migration Path

### **Phase 1 (MVP):**
- 1 worker process in API container
- Redis for queue (separate container)
- Gotenberg for conversion (separate container)
- Basic DB schema
- Shared volume for file exchange
- Simple retry logic

### **Phase 2 (as needed):**
- Multiple worker instances
- Multiple Gotenberg instances (load balancing)
- Monitoring & alerting
- Prioritization logic
- Advanced caching strategies
- PDF optimization/compression

**Start simple, scale when needed!**

## 11. Key Decisions Summary

| Aspect | Decision | Reason |
|--------|----------|--------|
| **Conversion Location** | Server-side (Gotenberg) | One conversion per file, consistent results |
| **Conversion Service** | Dedicated Gotenberg container | Specialized, production-ready, better isolation |
| **Conversion Timing** | Asynchronous (on upload) | No client waiting time, predictable performance |
| **Data Storage** | Database-tracked | Status visibility, robust error handling |
| **File Exchange** | Shared Docker volume (API and Worker); HTTP upload to Gotenberg | Simple handoff between API and Worker; Gotenberg needs no volume access |
| **Queue System** | Redis (separate container) | Standard pattern, scalable, maintainable |
| **Worker Architecture** | Background process in API container | Simple start, easy to separate later |

## 12. File Flow Diagram

```
┌─────────────┐
│ User Upload │
│   (PPTX)    │
└──────┬──────┘
       │
       ▼
┌──────────────────────┐
│   API Server         │
│ 1. Save to /shared   │
│ 2. Create DB record  │
│ 3. Enqueue job       │
└──────┬───────────────┘
       │
       ▼
┌──────────────────┐
│  Redis Queue     │
└──────┬───────────┘
       │
       ▼
┌──────────────────────┐
│  Worker Process      │
│ 1. Get job           │
│ 2. Call Gotenberg    │
│ 3. Update DB         │
└──────┬───────────────┘
       │
       ▼
┌──────────────────────┐
│  Gotenberg           │
│ 1. Receive via HTTP  │
│ 2. Convert PPTX      │
│ 3. Return PDF        │
└──────┬───────────────┘
       │
       ▼
┌──────────────────────┐
│  Worker saves PDF    │
│  to /shared/converted│
└──────┬───────────────┘
       │
       ▼
┌──────────────────────┐
│  Client Requests     │
│ 1. Check DB          │
│ 2. Download PDF      │
│ 3. Display           │
└──────────────────────┘
   (via impressive)
```

## 13. Implementation Checklist

### Database Setup
- [ ] Create `media_files` table
- [ ] Create `conversions` table
- [ ] Add indexes for performance
- [ ] Set up foreign key constraints

### Storage Setup
- [ ] Create shared Docker volume
- [ ] Set up directory structure (/shared/uploads, /shared/converted)
- [ ] Configure proper permissions

### API Changes
- [ ] Modify upload endpoint to save to shared storage
- [ ] Create DB records for uploads
- [ ] Add conversion job enqueueing
- [ ] Implement file download endpoint with status checking
- [ ] Add status API for monitoring
- [ ] Implement cache invalidation on file update

### Worker Setup
- [ ] Create worker script/module
- [ ] Implement Gotenberg API calls
- [ ] Add error handling and retry logic
- [ ] Set up logging and monitoring
- [ ] Handle timeouts and failures

### Docker Configuration
- [ ] Add Gotenberg container to docker-compose.yml
- [ ] Add Redis container to docker-compose.yml
- [ ] Configure worker container
- [ ] Set up shared volume mounts
- [ ] Configure environment variables
- [ ] Set up container dependencies
- [ ] Configure resource limits for Gotenberg

### Client Updates
- [ ] Modify client to check conversion status
- [ ] Implement retry logic for pending conversions
- [ ] Add loading/waiting screens
- [ ] Implement error handling

### Testing
- [ ] Test upload → conversion → download flow
- [ ] Test multiple concurrent conversions
- [ ] Test error handling (corrupted PPTX, etc.)
- [ ] Test Gotenberg timeout handling
- [ ] Test cache invalidation on file update
- [ ] Load test with multiple clients
- [ ] Test Gotenberg health checks

### Monitoring & Operations
- [ ] Set up logging for conversions
- [ ] Monitor Gotenberg health endpoint
- [ ] Implement cleanup job for old files
- [ ] Add metrics for conversion times
- [ ] Set up alerts for failed conversions
- [ ] Monitor shared storage disk usage
- [ ] Document backup procedures

### Security
- [ ] Validate file types before conversion
- [ ] Set file size limits
- [ ] Sanitize filenames
- [ ] Implement rate limiting
- [ ] Secure inter-container communication
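
The "validate file types" item can go beyond checking the extension by also inspecting the first bytes of the upload. PPTX and ODP files are ZIP containers, and legacy PPT files are OLE2 compound files, so each has a fixed signature. This sketch uses a hypothetical helper, `has_expected_signature`, and only confirms the container format; it does not prove the file is a valid presentation.

```python
import os

ZIP_MAGIC = b'PK\x03\x04'                         # .pptx and .odp are ZIP archives
OLE2_MAGIC = b'\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1'  # legacy .ppt (OLE2 compound file)

ALLOWED = {'.pptx': ZIP_MAGIC, '.odp': ZIP_MAGIC, '.ppt': OLE2_MAGIC}


def has_expected_signature(filename, head):
    """Return True if the extension is allowed and head starts with its signature."""
    ext = os.path.splitext(filename)[1].lower()
    expected = ALLOWED.get(ext)
    return expected is not None and head.startswith(expected)
```

In the upload handler, `head` would be the first eight bytes of the uploaded file, read before anything is written to the shared volume.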

## 14. Gotenberg Configuration Options

### **Command-Line Flags:**

Gotenberg 8 reads these settings from command-line flags passed in the service's `command`; it does not read them from environment variables or a mounted YAML file. Check the flag names against the documentation for the exact 8.x release you deploy.

```yaml
gotenberg:
  image: gotenberg/gotenberg:8
  command:
    - "gotenberg"
    # API configuration
    - "--api-timeout=300s"
    - "--api-port=3000"
    # Logging
    - "--log-level=info"  # debug, info, warn, error
    # LibreOffice
    - "--libreoffice-disable-routes=false"
    - "--libreoffice-auto-start=true"
    - "--libreoffice-max-queue-size=100"
    # Chromium (only needed for HTML/Markdown)
    - "--chromium-disable-routes=true"  # Disable if not needed
```

## 15. Troubleshooting

### **Common Issues:**

**Gotenberg timeout:**
```python
# Increase the client timeout for large files; Gotenberg's own
# --api-timeout must be at least as long, or the server gives up first
response = requests.post(
    f'{GOTENBERG_URL}/forms/libreoffice/convert',
    files=files,
    timeout=600  # 10 minutes for large PPTX
)
```

**Memory issues:**
```yaml
# Increase Gotenberg memory limit
gotenberg:
  deploy:
    resources:
      limits:
        memory: 4G
```

**File permission issues:**
```bash
# Ensure proper permissions on shared volume
chmod -R 755 /shared
chown -R 1000:1000 /shared
```

**Gotenberg not responding:**
```python
# Check health before conversion
def ensure_gotenberg_healthy():
    try:
        response = requests.get(f'{GOTENBERG_URL}/health', timeout=5)
        if response.status_code != 200:
            raise Exception("Gotenberg unhealthy")
    except Exception as e:
        logger.error(f"Gotenberg health check failed: {e}")
        raise
```
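
A single failed probe does not always mean Gotenberg is down; the container may still be starting. A small sketch that retries the probe a few times before giving up, assuming `probe` is any zero-argument callable returning a bool (for example `check_gotenberg_health` from section 8). The attempt count and delay are illustrative defaults:

```python
import time


def wait_until_healthy(probe, attempts=5, delay_seconds=2.0, sleep=time.sleep):
    """Call probe() until it returns True or attempts run out; return the result."""
    for attempt in range(1, attempts + 1):
        if probe():
            return True
        # Sleep between attempts, but not after the last one
        if attempt < attempts:
            sleep(delay_seconds)
    return False
```

The `sleep` parameter exists only so the function can be tested without real delays.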

---

**This architecture provides a production-ready, scalable solution using Gotenberg as a specialized conversion service with efficient file sharing via Docker volumes!**

## 16. Best Practices Specific to Infoscreen

- Idempotency by content: Always compute a SHA‑256 of the uploaded source and include it in the unique key (source_file_id, target_format, file_hash). This prevents duplicate work for identical content and auto-busts cache on change.
- Strict MIME/type validation: Accept only .ppt, .pptx, .odp for conversion. Reject unknown types early. Consider reading the first bytes (magic) for extra safety.
- Bounded retries with jitter: Retry conversions on transient HTTP 5xx or timeouts up to N times with exponential backoff. Do not retry on 4xx or clear user errors.
- Output naming: Derive deterministic output paths under media/converted/, e.g., <basename>.pdf. Ensure no path traversal and sanitize names.
- Timeouts and size limits: Enforce server-side max upload size and per-job conversion timeout (e.g., 10 minutes). Return clear errors for oversized/long-running files.
- Isolation and quotas: Set CPU/memory limits for Gotenberg; consider a concurrency cap per worker to avoid DB starvation.
- Health probes before work: Check Gotenberg /health prior to enqueue spikes; fail-fast to avoid queue pile-ups when Gotenberg is down.
- Observability: Log job IDs, file hashes, durations, and sizes. Expose a small /api/conversions/status summary for operational visibility.
- Cleanup policy: Periodically delete orphaned conversions (media deleted) and failed jobs older than X days. Keep successful PDFs aligned with DB rows.
- Security: Never trust client paths; always resolve relative to the known media root. Do not expose the shared volume directly; serve via API only.
- Backpressure: If queue length exceeds a threshold, surface 503/“try later” on new uploads or pause enqueue to protect the system.
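
The "bounded retries with jitter" practice above can be sketched as two small helpers. Both names are hypothetical, the base and cap values are illustrative, and the jitter strategy shown is "full jitter" (a uniform random delay below an exponentially growing ceiling), which is one common choice rather than the only one:

```python
import random


def backoff_delays(max_retries, base_seconds=2.0, cap_seconds=60.0, rng=random.random):
    """Return one delay per retry, each uniform in [0, min(cap, base * 2**n))."""
    delays = []
    for attempt in range(max_retries):
        ceiling = min(cap_seconds, base_seconds * (2 ** attempt))
        delays.append(rng() * ceiling)
    return delays


def is_retryable(status_code):
    """Retry on timeouts (represented as None) and 5xx responses; never on 4xx."""
    return status_code is None or 500 <= status_code < 600
```

A worker would sleep for each delay in turn between attempts and mark the conversion as `failed` once the list is exhausted or `is_retryable` returns `False`.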