# Recommended Implementation: PPTX-to-PDF Conversion System with Gotenberg

## Architecture Overview

**Asynchronous server-side conversion using Gotenberg with shared storage**

```
User Upload → API saves PPTX → Job in Queue → Worker calls Gotenberg API
                                                      ↓
                          Gotenberg converts, worker saves PDF to shared volume
                                                      ↓
Client requests → API checks DB status → PDF ready? → Download PDF from shared storage
                                       → Pending?   → "Please wait"
                                       → Failed?    → Retry/Error
```

## 1. Database Schema

```sql
CREATE TABLE media_files (
    id UUID PRIMARY KEY,
    filename VARCHAR(255),
    original_path VARCHAR(512),
    file_type VARCHAR(10),
    mime_type VARCHAR(100),
    uploaded_at TIMESTAMP,
    updated_at TIMESTAMP
);

CREATE TABLE conversions (
    id UUID PRIMARY KEY,
    source_file_id UUID REFERENCES media_files(id) ON DELETE CASCADE,
    target_format VARCHAR(10),   -- 'pdf'
    target_path VARCHAR(512),    -- Path to generated PDF
    status VARCHAR(20),          -- 'pending', 'processing', 'ready', 'failed'
    started_at TIMESTAMP,
    completed_at TIMESTAMP,
    error_message TEXT,
    file_hash VARCHAR(64)        -- Hash of PPTX for cache invalidation
);

CREATE INDEX idx_conversions_source ON conversions(source_file_id, target_format);
```

## 2. Components

### **API Server (existing)**
- Accepts uploads
- Creates DB entries
- Enqueues jobs
- Delivers status and files

### **Background Worker (new)**
- Runs as a separate process in the **same container** as the API for the MVP (the Docker Compose example in section 4 already shows it split out as its own service built from the same image)
- Processes conversion jobs from the queue
- Calls the Gotenberg API for conversion
- Updates the database with results
- Technology: Python RQ, Celery, or similar

### **Gotenberg Container (new)**
- Dedicated conversion service
- HTTP API for document conversion
- Handles LibreOffice conversions internally
- Receives files from the worker in the HTTP request (see the worker code in section 3)

### **Message Queue**
- Redis (recommended to start with: simple, fast)
- Alternative: RabbitMQ for more features

### **Redis Container (separate)**
- Handles the job queue
- Minimal resource footprint

### **Shared Storage**
- Docker volume mounted to all containers that need file access
- API and Worker read and write the same files; because the worker sends files to Gotenberg over HTTP, mounting the volume into Gotenberg is optional
- Simplifies file exchange between services

## 3. Detailed Workflow

### **Upload Process:**

```python
# `app`, `db`, `queue`, `save_to_disk`, and `calculate_hash` are application-specific helpers.
@app.post("/upload")
async def upload_file(file):
    # 1. Save PPTX to shared volume
    file_path = save_to_disk(file)  # e.g., /shared/uploads/abc123.pptx

    # 2. DB entry for original file
    file_record = db.create_media_file({
        'filename': file.filename,
        'original_path': file_path,
        'file_type': 'pptx'
    })

    # 3. Create conversion record
    conversion = db.create_conversion({
        'source_file_id': file_record.id,
        'target_format': 'pdf',
        'status': 'pending',
        'file_hash': calculate_hash(file_path)
    })

    # 4. Enqueue job (asynchronous!)
    queue.enqueue(convert_to_pdf_via_gotenberg, conversion.id)

    # 5. Return immediately to user
    return {
        'file_id': file_record.id,
        'status': 'uploaded',
        'conversion_status': 'pending'
    }
```

### **Worker Process (calls Gotenberg):**

```python
import os

import requests

GOTENBERG_URL = os.getenv('GOTENBERG_URL', 'http://gotenberg:3000')

def convert_to_pdf_via_gotenberg(conversion_id):
    conversion = db.get_conversion(conversion_id)
    source_file = db.get_media_file(conversion.source_file_id)

    # Status update: processing
    db.update_conversion(conversion_id, {
        'status': 'processing',
        'started_at': now()
    })

    try:
        # Prepare output path
        pdf_filename = f"{conversion.id}.pdf"
        pdf_path = f"/shared/converted/{pdf_filename}"
        os.makedirs(os.path.dirname(pdf_path), exist_ok=True)

        # Call Gotenberg API.
        # The worker reads the file from the shared volume and uploads it
        # to Gotenberg as multipart form data.
        with open(source_file.original_path, 'rb') as f:
            files = {
                'files': (os.path.basename(source_file.original_path), f)
            }
            response = requests.post(
                f'{GOTENBERG_URL}/forms/libreoffice/convert',
                files=files,
                timeout=300  # 5 minutes timeout
            )
        response.raise_for_status()

        # Save PDF to shared volume
        with open(pdf_path, 'wb') as pdf_file:
            pdf_file.write(response.content)

        # Success
        db.update_conversion(conversion_id, {
            'status': 'ready',
            'target_path': pdf_path,
            'completed_at': now()
        })

    except requests.exceptions.Timeout:
        db.update_conversion(conversion_id, {
            'status': 'failed',
            'error_message': 'Conversion timeout after 5 minutes',
            'completed_at': now()
        })
    except requests.exceptions.RequestException as e:
        db.update_conversion(conversion_id, {
            'status': 'failed',
            'error_message': f'Gotenberg API error: {str(e)}',
            'completed_at': now()
        })
    except Exception as e:
        db.update_conversion(conversion_id, {
            'status': 'failed',
            'error_message': str(e),
            'completed_at': now()
        })
```

### **Alternative: Direct File Access via Shared Volume**

The snippet below was intended for the case where Gotenberg reads from shared storage directly (more efficient for large files). As written, however, it still streams the file to Gotenberg in the multipart request body, exactly like the version above; the only practical difference is the single catch-all error handler.

```python
def convert_to_pdf_via_gotenberg_shared(conversion_id):
    conversion = db.get_conversion(conversion_id)
    source_file = db.get_media_file(conversion.source_file_id)

    db.update_conversion(conversion_id, {
        'status': 'processing',
        'started_at': now()
    })

    try:
        pdf_filename = f"{conversion.id}.pdf"
        pdf_path = f"/shared/converted/{pdf_filename}"

        # The worker opens the file on the shared volume and uploads it;
        # requests derives the multipart filename from the file object's name.
        with open(source_file.original_path, 'rb') as f:
            files = {'files': f}
            response = requests.post(
                f'{GOTENBERG_URL}/forms/libreoffice/convert',
                files=files,
                timeout=300
            )
        response.raise_for_status()

        # Write result to shared volume
        with open(pdf_path, 'wb') as pdf_file:
            pdf_file.write(response.content)

        db.update_conversion(conversion_id, {
            'status': 'ready',
            'target_path': pdf_path,
            'completed_at': now()
        })

    except Exception as e:
        db.update_conversion(conversion_id, {
            'status': 'failed',
            'error_message': str(e),
            'completed_at': now()
        })
```

### **Client Download:**

```python
@app.get("/files/{file_id}/display")
async def get_display_file(file_id):
    file = db.get_media_file(file_id)

    # Only for PPTX: check PDF conversion
    if file.file_type == 'pptx':
        conversion = db.get_latest_conversion(file.id, target_format='pdf')

        if not conversion:
            # Shouldn't happen, but just to be safe
            trigger_new_conversion(file.id)
            return {'status': 'pending', 'message': 'Conversion is being created'}

        if conversion.status == 'ready':
            # Serve PDF from shared storage
            return FileResponse(conversion.target_path)

        elif conversion.status == 'failed':
            # Optional: Auto-retry (this fires on every request for a failed
            # file, so cap the number of retries in production)
            trigger_new_conversion(file.id)
            return {'status': 'failed', 'error': conversion.error_message}

        else:  # pending or processing
            return {'status': conversion.status, 'message': 'Please wait...'}

    # Serve other file types directly
    return FileResponse(file.original_path)
```

## 4. Docker Setup

```yaml
version: '3.8'

services:
  # Your API Server
  api:
    build: ./api
    command: uvicorn main:app --host 0.0.0.0 --port 8000
    ports:
      - "8000:8000"
    volumes:
      - shared-storage:/shared  # Shared volume
    environment:
      - REDIS_URL=redis://redis:6379
      - DATABASE_URL=postgresql://postgres:password@postgres:5432/infoscreen
      - GOTENBERG_URL=http://gotenberg:3000
    depends_on:
      - redis
      - postgres
      - gotenberg
    restart: unless-stopped

  # Worker (same codebase as API, different command)
  worker:
    build: ./api  # Same build as API!
    command: python worker.py  # or: rq worker
    volumes:
      - shared-storage:/shared  # Shared volume
    environment:
      - REDIS_URL=redis://redis:6379
      - DATABASE_URL=postgresql://postgres:password@postgres:5432/infoscreen
      - GOTENBERG_URL=http://gotenberg:3000
    depends_on:
      - redis
      - postgres
      - gotenberg
    restart: unless-stopped
    # Optional: Multiple workers
    deploy:
      replicas: 2

  # Gotenberg - Document Conversion Service
  gotenberg:
    image: gotenberg/gotenberg:8
    # Gotenberg doesn't need the shared volume if files are sent via HTTP
    # (which is what the worker code in section 3 does)
    volumes:
      - shared-storage:/shared  # Optional
    environment:
      # Gotenberg configuration.
      # NOTE: verify these variable names against your Gotenberg version; its
      # documentation describes command-line flags such as --api-timeout=300s.
      - GOTENBERG_API_TIMEOUT=300s
      - GOTENBERG_LOG_LEVEL=info
    restart: unless-stopped
    # Resource limits (optional but recommended)
    deploy:
      resources:
        limits:
          cpus: '2.0'
          memory: 2G
        reservations:
          cpus: '0.5'
          memory: 512M

  # Redis - separate container
  redis:
    image: redis:7-alpine
    volumes:
      - redis-data:/data
    command: redis-server --appendonly yes
    restart: unless-stopped

  # Your existing Postgres
  postgres:
    image: postgres:15
    environment:
      - POSTGRES_DB=infoscreen
      - POSTGRES_PASSWORD=password
    volumes:
      - postgres-data:/var/lib/postgresql/data
    restart: unless-stopped

  # Optional: Redis Commander (UI for debugging)
  redis-commander:
    image: rediscommander/redis-commander
    environment:
      - REDIS_HOSTS=local:redis:6379
    ports:
      - "8081:8081"
    depends_on:
      - redis

volumes:
  shared-storage:  # New: Shared storage for all file operations
  redis-data:
  postgres-data:
```

## 5. Storage Structure

```
/shared/
├── uploads/          # Original uploaded files (PPTX, etc.)
│   ├── abc123.pptx
│   ├── def456.pptx
│   └── ...
└── converted/        # Converted PDF files
    ├── uuid-1.pdf
    ├── uuid-2.pdf
    └── ...
```

## 6. Gotenberg Integration Details

### **Gotenberg API Endpoints:**

Gotenberg provides various conversion endpoints:

```
# LibreOffice conversion (for PPTX, DOCX, ODT, etc.)
POST http://gotenberg:3000/forms/libreoffice/convert

# HTML to PDF
POST http://gotenberg:3000/forms/chromium/convert/html

# Markdown to PDF
POST http://gotenberg:3000/forms/chromium/convert/markdown

# Merge PDFs
POST http://gotenberg:3000/forms/pdfengines/merge
```

### **Example Conversion Request:**

```python
import os

import requests

def convert_with_gotenberg(input_file_path, output_file_path):
    """
    Convert document using Gotenberg
    """
    with open(input_file_path, 'rb') as f:
        files = {
            'files': (
                os.path.basename(input_file_path),
                f,
                'application/vnd.openxmlformats-officedocument.presentationml.presentation'
            )
        }

        # Optional: Add conversion parameters
        data = {
            'landscape': 'false',       # Portrait mode
            'nativePageRanges': '1-',   # All pages
        }

        response = requests.post(
            'http://gotenberg:3000/forms/libreoffice/convert',
            files=files,
            data=data,
            timeout=300
        )

    if response.status_code == 200:
        with open(output_file_path, 'wb') as out:
            out.write(response.content)
        return True
    else:
        raise Exception(f"Gotenberg error: {response.status_code} - {response.text}")
```

### **Advanced Options:**

Form field names and accepted values have changed between Gotenberg releases, so check the options below against the documentation for the version you deploy.

```python
# With custom PDF properties
data = {
    'landscape': 'false',
    'nativePageRanges': '1-10',   # Only first 10 pages
    'pdfFormat': 'PDF/A-1a',      # PDF/A format (newer releases document a `pdfa` field instead)
    'exportFormFields': 'false',
}

# With password protection (availability depends on the Gotenberg version)
data = {
    'userPassword': 'secret123',
    'ownerPassword': 'admin456',
}
```

## 7. Client Behavior (Pi5)

```python
import subprocess
import time

# On the Pi5 client. `api`, `download_and_display`, and the screen helpers are
# application-specific; `download_and_display` is assumed to return the local PDF path.
def display_file(file_id, max_attempts=60):
    for _ in range(max_attempts):
        response = api.get(f"/files/{file_id}/display")

        if response.headers.get('Content-Type', '').startswith('application/pdf'):
            # PDF is ready
            downloaded_pdf = download_and_display(response)
            subprocess.run(['impressive', downloaded_pdf])
            return

        if response.json()['status'] in ['pending', 'processing']:
            # Wait and retry (bounded loop instead of unbounded recursion)
            show_loading_screen("Presentation is being prepared...")
            time.sleep(5)
            continue

        break  # failed or unknown status

    # Error or retries exhausted
    show_error_screen("Error loading presentation")
```

## 8. Additional Features

### **Cache Invalidation on PPTX Update:**

```python
@app.put("/files/{file_id}")
async def update_file(file_id, new_file):
    # Delete old conversions and PDFs
    conversions = db.get_conversions_for_file(file_id)
    for conv in conversions:
        if conv.target_path and os.path.exists(conv.target_path):
            os.remove(conv.target_path)

    db.mark_conversions_as_obsolete(file_id)

    # Update file
    update_media_file(file_id, new_file)

    # Trigger new conversion
    trigger_conversion(file_id, 'pdf')
```

### **Status API for Monitoring:**

```python
@app.get("/admin/conversions/status")
async def get_conversion_stats():
    return {
        'pending': db.count(status='pending'),
        'processing': db.count(status='processing'),
        'failed': db.count(status='failed'),
        'avg_duration_seconds': db.avg_duration(),
        'gotenberg_health': check_gotenberg_health()
    }

def check_gotenberg_health():
    try:
        response = requests.get(
            f'{GOTENBERG_URL}/health',
            timeout=5
        )
        return response.status_code == 200
    except requests.exceptions.RequestException:
        # Connection errors and timeouts count as unhealthy
        return False
```

### **Cleanup Job (Cronjob):**

```python
def cleanup_old_conversions():
    # Remove PDFs from deleted files
    orphaned = db.get_orphaned_conversions()
    for conv in orphaned:
        if conv.target_path and os.path.exists(conv.target_path):
            os.remove(conv.target_path)
        db.delete_conversion(conv.id)

    # Clean up old failed conversions
    old_failed = db.get_old_failed_conversions(older_than_days=7)
    for conv in old_failed:
        db.delete_conversion(conv.id)
```

## 9. Advantages of Using Gotenberg

- ✅ **Specialized Service**: Optimized specifically for document conversion
- ✅ **No LibreOffice Management**: Gotenberg handles LibreOffice lifecycle internally
- ✅ **Better Resource Management**: Isolated conversion process
- ✅ **HTTP API**: Clean, standard interface
- ✅ **Production Ready**: Battle-tested, actively maintained
- ✅ **Multiple Formats**: Supports PPTX, DOCX, ODT, HTML, Markdown, etc.
- ✅ **PDF Features**: Merge, encrypt, watermark PDFs
- ✅ **Health Checks**: Built-in health endpoint
- ✅ **Horizontal Scaling**: Can run multiple Gotenberg instances
- ✅ **Memory Safe**: Automatic cleanup and restart on issues

## 10. Migration Path

### **Phase 1 (MVP):**
- 1 worker process in API container
- Redis for queue (separate container)
- Gotenberg for conversion (separate container)
- Basic DB schema
- Shared volume for file exchange
- Simple retry logic

### **Phase 2 (as needed):**
- Multiple worker instances
- Multiple Gotenberg instances (load balancing)
- Monitoring & alerting
- Prioritization logic
- Advanced caching strategies
- PDF optimization/compression

**Start simple, scale when needed!**

## 11. Key Decisions Summary

| Aspect | Decision | Reason |
|--------|----------|--------|
| **Conversion Location** | Server-side (Gotenberg) | One conversion per file, consistent results |
| **Conversion Service** | Dedicated Gotenberg container | Specialized, production-ready, better isolation |
| **Conversion Timing** | Asynchronous (on upload) | No client waiting time, predictable performance |
| **Data Storage** | Database-tracked | Status visibility, robust error handling |
| **File Exchange** | Shared Docker volume | Simple, efficient, no network overhead |
| **Queue System** | Redis (separate container) | Standard pattern, scalable, maintainable |
| **Worker Architecture** | Background process in API container | Simple start, easy to separate later |

## 12. File Flow Diagram

```
┌─────────────┐
│ User Upload │
│   (PPTX)    │
└──────┬──────┘
       │
       ▼
┌──────────────────────┐
│ API Server           │
│ 1. Save to /shared   │
│ 2. Create DB record  │
│ 3. Enqueue job       │
└──────┬───────────────┘
       │
       ▼
┌──────────────────┐
│ Redis Queue      │
└──────┬───────────┘
       │
       ▼
┌──────────────────────┐
│ Worker Process       │
│ 1. Get job           │
│ 2. Call Gotenberg    │
│ 3. Update DB         │
└──────┬───────────────┘
       │
       ▼
┌──────────────────────┐
│ Gotenberg            │
│ 1. Receive PPTX      │
│ 2. Convert PPTX      │
│ 3. Return PDF        │
└──────┬───────────────┘
       │
       ▼
┌──────────────────────┐
│ Worker saves PDF     │
│ to /shared/converted │
└──────┬───────────────┘
       │
       ▼
┌──────────────────────┐
│ Client Requests      │
│ 1. Check DB          │
│ 2. Download PDF      │
│ 3. Display           │
└──────────────────────┘
    (via impressive)
```

## 13. Implementation Checklist

### Database Setup
- [ ] Create `media_files` table
- [ ] Create `conversions` table
- [ ] Add indexes for performance
- [ ] Set up foreign key constraints

### Storage Setup
- [ ] Create shared Docker volume
- [ ] Set up directory structure (/shared/uploads, /shared/converted)
- [ ] Configure proper permissions

### API Changes
- [ ] Modify upload endpoint to save to shared storage
- [ ] Create DB records for uploads
- [ ] Add conversion job enqueueing
- [ ] Implement file download endpoint with status checking
- [ ] Add status API for monitoring
- [ ] Implement cache invalidation on file update

### Worker Setup
- [ ] Create worker script/module
- [ ] Implement Gotenberg API calls
- [ ] Add error handling and retry logic
- [ ] Set up logging and monitoring
- [ ] Handle timeouts and failures

### Docker Configuration
- [ ] Add Gotenberg container to docker-compose.yml
- [ ] Add Redis container to docker-compose.yml
- [ ] Configure worker container
- [ ] Set up shared volume mounts
- [ ] Configure environment variables
- [ ] Set up container dependencies
- [ ] Configure resource limits for Gotenberg

### Client Updates
- [ ] Modify client to check conversion status
- [ ] Implement retry logic for pending conversions
- [ ] Add loading/waiting screens
- [ ] Implement error handling

### Testing
- [ ] Test upload → conversion → download flow
- [ ] Test multiple concurrent conversions
- [ ] Test error handling (corrupted PPTX, etc.)
- [ ] Test Gotenberg timeout handling
- [ ] Test cache invalidation on file update
- [ ] Load test with multiple clients
- [ ] Test Gotenberg health checks

### Monitoring & Operations
- [ ] Set up logging for conversions
- [ ] Monitor Gotenberg health endpoint
- [ ] Implement cleanup job for old files
- [ ] Add metrics for conversion times
- [ ] Set up alerts for failed conversions
- [ ] Monitor shared storage disk usage
- [ ] Document backup procedures

### Security
- [ ] Validate file types before conversion
- [ ] Set file size limits
- [ ] Sanitize filenames
- [ ] Implement rate limiting
- [ ] Secure inter-container communication

## 14. Gotenberg Configuration Options

Gotenberg's documentation describes configuration through command-line flags (for example `--api-timeout=300s` or `--log-level=info`). Treat the environment variable names and the YAML config file in this section as unverified, and confirm that your Gotenberg version honors them before relying on them.

### **Environment Variables:**

```yaml
gotenberg:
  image: gotenberg/gotenberg:8
  environment:
    # API Configuration
    - GOTENBERG_API_TIMEOUT=300s
    - GOTENBERG_API_PORT=3000

    # Logging (debug, info, warn, error)
    - GOTENBERG_LOG_LEVEL=info

    # LibreOffice
    - GOTENBERG_LIBREOFFICE_DISABLE_ROUTES=false
    - GOTENBERG_LIBREOFFICE_AUTO_START=true

    # Chromium (if needed for HTML/Markdown); disable if not needed
    - GOTENBERG_CHROMIUM_DISABLE_ROUTES=true

    # Resource limits
    - GOTENBERG_LIBREOFFICE_MAX_QUEUE_SIZE=100
```

### **Custom Gotenberg Configuration:**

For advanced configurations, create a `gotenberg.yml`:

```yaml
api:
  timeout: 300s
  port: 3000

libreoffice:
  autoStart: true
  maxQueueSize: 100

chromium:
  disableRoutes: true
```

Mount it in docker-compose:

```yaml
gotenberg:
  image: gotenberg/gotenberg:8
  volumes:
    - ./gotenberg.yml:/etc/gotenberg/config.yml:ro
    - shared-storage:/shared
```

## 15. Troubleshooting

### **Common Issues:**

**Gotenberg timeout:**

The client-side `timeout` below only helps if Gotenberg's own API timeout is at least as long.

```python
# Increase timeout for large files
response = requests.post(
    f'{GOTENBERG_URL}/forms/libreoffice/convert',
    files=files,
    timeout=600  # 10 minutes for large PPTX
)
```

**Memory issues:**

```yaml
# Increase Gotenberg memory limit
gotenberg:
  deploy:
    resources:
      limits:
        memory: 4G
```

**File permission issues:**

```bash
# Ensure proper permissions on shared volume
chmod -R 755 /shared
chown -R 1000:1000 /shared
```

**Gotenberg not responding:**

```python
import logging

logger = logging.getLogger(__name__)

# Check health before conversion
def ensure_gotenberg_healthy():
    try:
        response = requests.get(f'{GOTENBERG_URL}/health', timeout=5)
        if response.status_code != 200:
            raise Exception("Gotenberg unhealthy")
    except Exception as e:
        logger.error(f"Gotenberg health check failed: {e}")
        raise
```

---

**This architecture provides a production-ready, scalable solution using Gotenberg as a specialized conversion service with efficient file sharing via Docker volumes!**

## 16. Best Practices Specific to Infoscreen

- Idempotency by content: Always compute a SHA-256 of the uploaded source and include it in the unique key (source_event_media_id, target_format, file_hash). This prevents duplicate work for identical content and auto-busts cache on change.
- Strict MIME/type validation: Accept only .ppt, .pptx, .odp for conversion. Reject unknown types early. Consider reading the first bytes (magic) for extra safety.
- Bounded retries with jitter: Retry conversions on transient HTTP 5xx or timeouts up to N times with exponential backoff. Do not retry on 4xx or clear user errors.
- Output naming: Derive deterministic output paths under media/converted/, e.g., .pdf. Ensure no path traversal and sanitize names.
- Timeouts and size limits: Enforce server-side max upload size and per-job conversion timeout (e.g., 10 minutes). Return clear errors for oversized/long-running files.
- Isolation and quotas: Set CPU/memory limits for Gotenberg; consider a concurrency cap per worker to avoid DB starvation.
- Health probes before work: Check Gotenberg /health prior to enqueue spikes; fail-fast to avoid queue pile-ups when Gotenberg is down.
- Observability: Log job IDs, file hashes, durations, and sizes. Expose a small /api/conversions/status summary for operational visibility.
- Cleanup policy: Periodically delete orphaned conversions (media deleted) and failed jobs older than X days. Keep successful PDFs aligned with DB rows.
- Security: Never trust client paths; always resolve relative to the known media root. Do not expose the shared volume directly; serve via API only.
- Backpressure: If queue length exceeds a threshold, surface 503/“try later” on new uploads or pause enqueue to protect the system.
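
Three of the practices above (content hashing, resolving paths under a known media root, and bounded retries with jitter) can be sketched with the standard library alone. This is a minimal illustration, not the project's implementation: the helper names (`sha256_of_file`, `resolve_under_root`, `backoff_delays`, `is_retryable`), the `MEDIA_ROOT` value, and the default retry parameters are all assumptions, and `Path.is_relative_to` requires Python 3.9 or newer.

```python
import hashlib
import random
from pathlib import Path

MEDIA_ROOT = Path("/shared")  # assumption: matches the shared volume mount point


def sha256_of_file(path, chunk_size=1024 * 1024):
    """Hash a file in chunks so large PPTX files are never fully loaded into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def resolve_under_root(relative_name, root=MEDIA_ROOT):
    """Resolve a client-supplied name below `root` and reject path traversal."""
    root = Path(root).resolve()
    candidate = (root / relative_name).resolve()
    if not candidate.is_relative_to(root):
        raise ValueError(f"path escapes media root: {relative_name!r}")
    return candidate


def backoff_delays(max_retries=4, base=2.0, cap=60.0, rng=random):
    """Yield one delay (in seconds) per retry: exponential growth with full jitter."""
    for attempt in range(max_retries):
        yield rng.uniform(0, min(cap, base * 2 ** attempt))


def is_retryable(status_code):
    """Retry transient failures only: HTTP 5xx, or None to represent a timeout."""
    return status_code is None or 500 <= status_code <= 599
```

A worker could loop over `backoff_delays()`, sleep for each delay, and stop as soon as a conversion succeeds or `is_retryable` returns `False`. Full jitter (a uniform draw between zero and the exponential ceiling) is one common choice; it spreads retries from several workers so they do not hit Gotenberg at the same moment.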