Recommended Implementation: PPTX-to-PDF Conversion System with Gotenberg
Architecture Overview
Asynchronous server-side conversion using Gotenberg with shared storage
User Upload → API saves PPTX → Job in Queue → Worker calls Gotenberg API
↓
Gotenberg converts (PPTX uploaded over HTTP, PDF returned in the response)
↓
Client requests → API checks DB status → PDF ready? → Download PDF from shared storage
→ Pending? → "Please wait"
→ Failed? → Retry/Error
1. Database Schema
CREATE TABLE media_files (
    id UUID PRIMARY KEY,
    filename VARCHAR(255),
    original_path VARCHAR(512),
    file_type VARCHAR(10),
    mime_type VARCHAR(100),
    uploaded_at TIMESTAMP,
    updated_at TIMESTAMP
);
CREATE TABLE conversions (
    id UUID PRIMARY KEY,
    source_file_id UUID REFERENCES media_files(id) ON DELETE CASCADE,
    target_format VARCHAR(10), -- 'pdf'
    target_path VARCHAR(512), -- Path to generated PDF
    status VARCHAR(20), -- 'pending', 'processing', 'ready', 'failed'
    started_at TIMESTAMP,
    completed_at TIMESTAMP,
    error_message TEXT,
    file_hash VARCHAR(64) -- Hash of PPTX for cache invalidation
);
CREATE INDEX idx_conversions_source ON conversions(source_file_id, target_format);
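The status column is free text in the schema above; the allowed values and the moves between them are only implied by the workflow. A sketch of how they might be pinned down in code, assuming a failed job may be re-queued (the retry path) and a ready one is final. The names ConversionStatus, ALLOWED_TRANSITIONS, and can_transition are illustrative, not part of the existing codebase:

```python
from enum import Enum


class ConversionStatus(str, Enum):
    """The four status values listed in the conversions table comment."""
    PENDING = "pending"
    PROCESSING = "processing"
    READY = "ready"
    FAILED = "failed"


# Assumed transition rules: failed -> pending models a retry; ready is terminal.
ALLOWED_TRANSITIONS = {
    ConversionStatus.PENDING: {ConversionStatus.PROCESSING},
    ConversionStatus.PROCESSING: {ConversionStatus.READY, ConversionStatus.FAILED},
    ConversionStatus.FAILED: {ConversionStatus.PENDING},
    ConversionStatus.READY: set(),
}


def can_transition(current: str, new: str) -> bool:
    """Return True if moving from `current` to `new` is allowed.

    Raises ValueError if either value is not a known status.
    """
    return ConversionStatus(new) in ALLOWED_TRANSITIONS[ConversionStatus(current)]
```

Checking the transition before each db.update_conversion call would catch, for example, a stale worker overwriting a row that is already ready.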
2. Components
API Server (existing)
- Accepts uploads
- Creates DB entries
- Enqueues jobs
- Delivers status and files
Background Worker (new)
- Runs as separate process in same container as API
- Processes conversion jobs from queue
- Calls Gotenberg API for conversion
- Updates database with results
- Technology: Python RQ, Celery, or similar
Gotenberg Container (new)
- Dedicated conversion service
- HTTP API for document conversion
- Handles LibreOffice conversions internally
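- Receives source files from the worker as multipart HTTP uploads and returns the PDF in the response (mounting the shared volume is optional for this container)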
Message Queue
- Redis (recommended for start - simple, fast)
- Alternative: RabbitMQ for more features
Redis Container (separate)
- Handles job queue
- Minimal resource footprint
Shared Storage
- Docker volume mounted to all containers that need file access
- API, Worker, and Gotenberg all access same files
- Simplifies file exchange between services
3. Detailed Workflow
Upload Process:
@app.post("/upload")
async def upload_file(file):
    # 1. Save PPTX to shared volume
    file_path = save_to_disk(file)  # e.g., /shared/uploads/abc123.pptx
    # 2. DB entry for original file
    file_record = db.create_media_file({
        'filename': file.filename,
        'original_path': file_path,
        'file_type': 'pptx'
    })
    # 3. Create conversion record
    conversion = db.create_conversion({
        'source_file_id': file_record.id,
        'target_format': 'pdf',
        'status': 'pending',
        'file_hash': calculate_hash(file_path)
    })
    # 4. Enqueue job (asynchronous!)
    queue.enqueue(convert_to_pdf_via_gotenberg, conversion.id)
    # 5. Return immediately to user
    return {
        'file_id': file_record.id,
        'status': 'uploaded',
        'conversion_status': 'pending'
    }
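The upload handler calls calculate_hash without defining it. A minimal sketch, assuming SHA-256 (its hex digest is 64 characters, which matches the file_hash column) and chunked reads so that large decks are not loaded into memory at once. The chunk size is an arbitrary choice:

```python
import hashlib


def calculate_hash(file_path: str, chunk_size: int = 1024 * 1024) -> str:
    """Return the hex SHA-256 digest of the file at file_path."""
    digest = hashlib.sha256()
    with open(file_path, "rb") as f:
        # Read fixed-size chunks to keep memory use flat for large PPTX files
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```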
Worker Process (calls Gotenberg):
import os
import requests
GOTENBERG_URL = os.getenv('GOTENBERG_URL', 'http://gotenberg:3000')
def convert_to_pdf_via_gotenberg(conversion_id):
    conversion = db.get_conversion(conversion_id)
    source_file = db.get_media_file(conversion.source_file_id)
    # Status update: processing
    db.update_conversion(conversion_id, {
        'status': 'processing',
        'started_at': now()
    })
    try:
        # Prepare output path (create the directory on first use)
        pdf_filename = f"{conversion.id}.pdf"
        pdf_path = f"/shared/converted/{pdf_filename}"
        os.makedirs(os.path.dirname(pdf_path), exist_ok=True)
        # Call Gotenberg API
        # The worker reads the PPTX from the shared volume and uploads it
        # to Gotenberg as multipart form data
        with open(source_file.original_path, 'rb') as f:
            files = {
                'files': (os.path.basename(source_file.original_path), f)
            }
            response = requests.post(
                f'{GOTENBERG_URL}/forms/libreoffice/convert',
                files=files,
                timeout=300  # 5 minutes timeout
            )
        response.raise_for_status()
        # Save PDF to shared volume
        with open(pdf_path, 'wb') as pdf_file:
            pdf_file.write(response.content)
        # Success
        db.update_conversion(conversion_id, {
            'status': 'ready',
            'target_path': pdf_path,
            'completed_at': now()
        })
    except requests.exceptions.Timeout:
        # Must come before RequestException, which is its base class
        db.update_conversion(conversion_id, {
            'status': 'failed',
            'error_message': 'Conversion timeout after 5 minutes',
            'completed_at': now()
        })
    except requests.exceptions.RequestException as e:
        db.update_conversion(conversion_id, {
            'status': 'failed',
            'error_message': f'Gotenberg API error: {str(e)}',
            'completed_at': now()
        })
    except Exception as e:
        db.update_conversion(conversion_id, {
            'status': 'failed',
            'error_message': str(e),
            'completed_at': now()
        })
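One gap in the worker: the PDF is written straight to its final path, so a crash mid-write leaves a truncated file there. A possible hardening, sketched under the assumption that the temporary file and the target live on the same filesystem (true for a single shared volume). write_pdf_atomically is a hypothetical helper name:

```python
import os
import tempfile


def write_pdf_atomically(pdf_bytes: bytes, target_path: str) -> None:
    """Write pdf_bytes to target_path so readers never see a partial file."""
    target_dir = os.path.dirname(target_path) or "."
    os.makedirs(target_dir, exist_ok=True)
    # Create the temp file next to the target so os.replace stays on one filesystem
    fd, tmp_path = tempfile.mkstemp(dir=target_dir, suffix=".part")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(pdf_bytes)
        # Rename over the target; on POSIX this is atomic
        os.replace(tmp_path, target_path)
    except BaseException:
        # Remove the leftover temp file if writing or renaming failed
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

In the worker this would stand in for the plain open(pdf_path, 'wb') block, for example write_pdf_atomically(response.content, pdf_path).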
Alternative: Direct File Access via Shared Volume
Gotenberg's LibreOffice route takes its input as a multipart upload, so the worker still streams the file over HTTP even when the volume is mounted in the Gotenberg container. Direct access applies to the worker only: it reads the source from and writes the PDF to the shared volume without an extra copy. The variant below is the same upload with a single catch-all error handler:
def convert_to_pdf_via_gotenberg_shared(conversion_id):
    conversion = db.get_conversion(conversion_id)
    source_file = db.get_media_file(conversion.source_file_id)
    db.update_conversion(conversion_id, {
        'status': 'processing',
        'started_at': now()
    })
    try:
        pdf_filename = f"{conversion.id}.pdf"
        pdf_path = f"/shared/converted/{pdf_filename}"
        # The worker opens the file on the shared volume and
        # uploads it; requests derives the filename from the handle
        with open(source_file.original_path, 'rb') as f:
            files = {'files': f}
            response = requests.post(
                f'{GOTENBERG_URL}/forms/libreoffice/convert',
                files=files,
                timeout=300
            )
        response.raise_for_status()
        # Write result to shared volume
        with open(pdf_path, 'wb') as pdf_file:
            pdf_file.write(response.content)
        db.update_conversion(conversion_id, {
            'status': 'ready',
            'target_path': pdf_path,
            'completed_at': now()
        })
    except Exception as e:
        db.update_conversion(conversion_id, {
            'status': 'failed',
            'error_message': str(e),
            'completed_at': now()
        })
Client Download:
@app.get("/files/{file_id}/display")
async def get_display_file(file_id):
    file = db.get_media_file(file_id)
    # Only for PPTX: check PDF conversion
    if file.file_type == 'pptx':
        conversion = db.get_latest_conversion(file.id, target_format='pdf')
        if not conversion:
            # Shouldn't happen, but just to be safe
            trigger_new_conversion(file.id)
            return {'status': 'pending', 'message': 'Conversion is being created'}
        if conversion.status == 'ready':
            # Serve PDF from shared storage
            return FileResponse(conversion.target_path)
        elif conversion.status == 'failed':
            # Optional: Auto-retry. Cap the number of attempts, otherwise a
            # permanently broken file is re-queued on every client poll.
            trigger_new_conversion(file.id)
            return {'status': 'failed', 'error': conversion.error_message}
        else:  # pending or processing
            return {'status': conversion.status, 'message': 'Please wait...'}
    # Serve other file types directly
    return FileResponse(file.original_path)
4. Docker Setup
version: '3.8'
services:
  # Your API Server
  api:
    build: ./api
    command: uvicorn main:app --host 0.0.0.0 --port 8000
    ports:
      - "8000:8000"
    volumes:
      - shared-storage:/shared # Shared volume
    environment:
      - REDIS_URL=redis://redis:6379
      - DATABASE_URL=postgresql://postgres:password@postgres:5432/infoscreen
      - GOTENBERG_URL=http://gotenberg:3000
    depends_on:
      - redis
      - postgres
      - gotenberg
    restart: unless-stopped
  # Worker (same codebase as API, different command)
  worker:
    build: ./api # Same build as API!
    command: python worker.py # or: rq worker
    volumes:
      - shared-storage:/shared # Shared volume
    environment:
      - REDIS_URL=redis://redis:6379
      - DATABASE_URL=postgresql://postgres:password@postgres:5432/infoscreen
      - GOTENBERG_URL=http://gotenberg:3000
    depends_on:
      - redis
      - postgres
      - gotenberg
    restart: unless-stopped
    # Optional: Multiple workers
    deploy:
      replicas: 2
  # Gotenberg - Document Conversion Service
  gotenberg:
    image: gotenberg/gotenberg:8
    # Gotenberg doesn't need the shared volume if files are sent via HTTP
    # But mount it if you want direct file access
    volumes:
      - shared-storage:/shared # Optional: for direct file access
    environment:
      # Gotenberg configuration
      - GOTENBERG_API_TIMEOUT=300s
      - GOTENBERG_LOG_LEVEL=info
    restart: unless-stopped
    # Resource limits (optional but recommended)
    deploy:
      resources:
        limits:
          cpus: '2.0'
          memory: 2G
        reservations:
          cpus: '0.5'
          memory: 512M
  # Redis - separate container
  redis:
    image: redis:7-alpine
    volumes:
      - redis-data:/data
    command: redis-server --appendonly yes
    restart: unless-stopped
  # Your existing Postgres
  postgres:
    image: postgres:15
    environment:
      - POSTGRES_DB=infoscreen
      - POSTGRES_PASSWORD=password
    volumes:
      - postgres-data:/var/lib/postgresql/data
    restart: unless-stopped
  # Optional: Redis Commander (UI for debugging)
  redis-commander:
    image: rediscommander/redis-commander
    environment:
      - REDIS_HOSTS=local:redis:6379
    ports:
      - "8081:8081"
    depends_on:
      - redis
volumes:
  shared-storage: # New: Shared storage for all file operations
  redis-data:
  postgres-data:
5. Storage Structure
/shared/
├── uploads/        # Original uploaded files (PPTX, etc.)
│   ├── abc123.pptx
│   ├── def456.pptx
│   └── ...
└── converted/      # Converted PDF files
    ├── uuid-1.pdf
    ├── uuid-2.pdf
    └── ...
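Because every service resolves files under /shared, it helps to build paths in one place and reject anything that escapes the root. A small sketch: resolve_under_root is a hypothetical helper, and the default root simply mirrors the layout shown here.

```python
from pathlib import Path

SHARED_ROOT = Path("/shared")  # assumed mount point, as in the layout above


def resolve_under_root(relative_path: str, root: Path = SHARED_ROOT) -> Path:
    """Resolve relative_path below root, refusing traversal outside of it."""
    root = root.resolve()
    # Joining with an absolute path discards root, so that case is caught below too
    candidate = (root / relative_path).resolve()
    try:
        candidate.relative_to(root)
    except ValueError:
        raise ValueError(f"path escapes storage root: {relative_path!r}") from None
    return candidate
```

A call such as resolve_under_root("converted/uuid-1.pdf") returns a path under /shared/converted, while "../etc/passwd" or an absolute path raises ValueError.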
6. Gotenberg Integration Details
Gotenberg API Endpoints:
Gotenberg provides various conversion endpoints:
# LibreOffice conversion (for PPTX, DOCX, ODT, etc.)
POST http://gotenberg:3000/forms/libreoffice/convert
# HTML to PDF
POST http://gotenberg:3000/forms/chromium/convert/html
# Markdown to PDF
POST http://gotenberg:3000/forms/chromium/convert/markdown
# Merge PDFs
POST http://gotenberg:3000/forms/pdfengines/merge
Example Conversion Request:
import os
import requests
def convert_with_gotenberg(input_file_path, output_file_path):
    """
    Convert document using Gotenberg
    """
    with open(input_file_path, 'rb') as f:
        files = {
            'files': (os.path.basename(input_file_path), f,
                      'application/vnd.openxmlformats-officedocument.presentationml.presentation')
        }
        # Optional: Add conversion parameters
        data = {
            'landscape': 'false',  # Portrait mode
            'nativePageRanges': '1-',  # All pages
        }
        response = requests.post(
            'http://gotenberg:3000/forms/libreoffice/convert',
            files=files,
            data=data,
            timeout=300
        )
    if response.status_code == 200:
        with open(output_file_path, 'wb') as out:
            out.write(response.content)
        return True
    else:
        raise Exception(f"Gotenberg error: {response.status_code} - {response.text}")
Advanced Options:
# With custom PDF properties
data = {
    'landscape': 'false',
    'nativePageRanges': '1-10',  # Only first 10 pages
    'pdfa': 'PDF/A-1b',  # PDF/A output; Gotenberg 8 field name (Gotenberg 7 used 'pdfFormat'), check the values your version accepts
    'exportFormFields': 'false',
}
# With password protection
data = {
    'userPassword': 'secret123',
    'ownerPassword': 'admin456',
}
7. Client Behavior (Pi5)
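# On the Pi5 client
import subprocess
import time
def display_file(file_id, max_attempts=60):
    # Poll in a loop instead of recursing, so a long conversion cannot
    # exhaust the call stack; give up after max_attempts polls
    for _ in range(max_attempts):
        response = api.get(f"/files/{file_id}/display")
        content_type = response.headers.get('Content-Type', '')
        if content_type.startswith('application/pdf'):
            # PDF is ready
            downloaded_pdf = download_and_display(response)  # assumed to return the local PDF path
            subprocess.run(['impressive', downloaded_pdf])
            return
        if response.json()['status'] in ['pending', 'processing']:
            # Wait and retry
            show_loading_screen("Presentation is being prepared...")
            time.sleep(5)
            continue
        break  # 'failed' or an unknown status
    # Error
    show_error_screen("Error loading presentation")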
8. Additional Features
Cache Invalidation on PPTX Update:
@app.put("/files/{file_id}")
async def update_file(file_id, new_file):
    # Delete old conversions and PDFs
    conversions = db.get_conversions_for_file(file_id)
    for conv in conversions:
        if conv.target_path and os.path.exists(conv.target_path):
            os.remove(conv.target_path)
    db.mark_conversions_as_obsolete(file_id)
    # Update file
    update_media_file(file_id, new_file)
    # Trigger new conversion
    trigger_conversion(file_id, 'pdf')
Status API for Monitoring:
@app.get("/admin/conversions/status")
async def get_conversion_stats():
    return {
        'pending': db.count(status='pending'),
        'processing': db.count(status='processing'),
        'failed': db.count(status='failed'),
        'avg_duration_seconds': db.avg_duration(),
        'gotenberg_health': check_gotenberg_health()
    }
def check_gotenberg_health():
    try:
        response = requests.get(
            f'{GOTENBERG_URL}/health',
            timeout=5
        )
        return response.status_code == 200
    except requests.exceptions.RequestException:
        # Connection errors and timeouts count as unhealthy
        return False
Cleanup Job (Cronjob):
def cleanup_old_conversions():
    # Remove PDFs from deleted files
    orphaned = db.get_orphaned_conversions()
    for conv in orphaned:
        if conv.target_path and os.path.exists(conv.target_path):
            os.remove(conv.target_path)
        db.delete_conversion(conv.id)
    # Clean up old failed conversions
    old_failed = db.get_old_failed_conversions(older_than_days=7)
    for conv in old_failed:
        db.delete_conversion(conv.id)
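The cleanup job starts from database rows. The opposite direction, PDFs on disk that no row references (for example after a crash between writing the file and updating the row), can be checked with a directory scan. A sketch, assuming the caller supplies the target_path values known to the database; find_unreferenced_pdfs is a hypothetical name:

```python
import os
from typing import Iterable, List


def find_unreferenced_pdfs(converted_dir: str, known_paths: Iterable[str]) -> List[str]:
    """Return PDFs in converted_dir whose path is not among known_paths."""
    known = {os.path.abspath(p) for p in known_paths}
    unreferenced = []
    for entry in os.scandir(converted_dir):
        # Only regular .pdf files are considered; subdirectories are skipped
        if entry.is_file() and entry.name.endswith(".pdf"):
            path = os.path.abspath(entry.path)
            if path not in known:
                unreferenced.append(path)
    return sorted(unreferenced)
```

The function only reports candidates; whether to delete them or log them first is left to the cleanup job.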
9. Advantages of Using Gotenberg
✅ Specialized Service: Optimized specifically for document conversion
✅ No LibreOffice Management: Gotenberg handles LibreOffice lifecycle internally
✅ Better Resource Management: Isolated conversion process
✅ HTTP API: Clean, standard interface
✅ Production Ready: Battle-tested, actively maintained
✅ Multiple Formats: Supports PPTX, DOCX, ODT, HTML, Markdown, etc.
✅ PDF Features: Merge, encrypt, watermark PDFs
✅ Health Checks: Built-in health endpoint
✅ Horizontal Scaling: Can run multiple Gotenberg instances
✅ Memory Safe: Automatic cleanup and restart on issues
10. Migration Path
Phase 1 (MVP):
- 1 worker process in API container
- Redis for queue (separate container)
- Gotenberg for conversion (separate container)
- Basic DB schema
- Shared volume for file exchange
- Simple retry logic
Phase 2 (as needed):
- Multiple worker instances
- Multiple Gotenberg instances (load balancing)
- Monitoring & alerting
- Prioritization logic
- Advanced caching strategies
- PDF optimization/compression
Start simple, scale when needed!
11. Key Decisions Summary
| Aspect | Decision | Reason |
|---|---|---|
| Conversion Location | Server-side (Gotenberg) | One conversion per file, consistent results |
| Conversion Service | Dedicated Gotenberg container | Specialized, production-ready, better isolation |
| Conversion Timing | Asynchronous (on upload) | No client waiting time, predictable performance |
| Data Storage | Database-tracked | Status visibility, robust error handling |
| File Exchange | Shared Docker volume | Simple, efficient, no network overhead |
| Queue System | Redis (separate container) | Standard pattern, scalable, maintainable |
| Worker Architecture | Background process in API container | Simple start, easy to separate later |
12. File Flow Diagram
┌──────────────────────┐
│ User Upload          │
│ (PPTX)               │
└──────────┬───────────┘
           │
           ▼
┌──────────────────────┐
│ API Server           │
│ 1. Save to /shared   │
│ 2. Create DB record  │
│ 3. Enqueue job       │
└──────────┬───────────┘
           │
           ▼
┌──────────────────────┐
│ Redis Queue          │
└──────────┬───────────┘
           │
           ▼
┌──────────────────────┐
│ Worker Process       │
│ 1. Get job           │
│ 2. Call Gotenberg    │
│ 3. Update DB         │
└──────────┬───────────┘
           │
           ▼
┌──────────────────────┐
│ Gotenberg            │
│ 1. Receive upload    │
│ 2. Convert PPTX      │
│ 3. Return PDF        │
└──────────┬───────────┘
           │
           ▼
┌──────────────────────┐
│ Worker saves PDF     │
│ to /shared/converted │
└──────────┬───────────┘
           │
           ▼
┌──────────────────────┐
│ Client Requests      │
│ 1. Check DB          │
│ 2. Download PDF      │
│ 3. Display           │
└──────────────────────┘
    (via impressive)
13. Implementation Checklist
Database Setup
- Create media_files table
- Create conversions table
- Add indexes for performance
- Set up foreign key constraints
Storage Setup
- Create shared Docker volume
- Set up directory structure (/shared/uploads, /shared/converted)
- Configure proper permissions
API Changes
- Modify upload endpoint to save to shared storage
- Create DB records for uploads
- Add conversion job enqueueing
- Implement file download endpoint with status checking
- Add status API for monitoring
- Implement cache invalidation on file update
Worker Setup
- Create worker script/module
- Implement Gotenberg API calls
- Add error handling and retry logic
- Set up logging and monitoring
- Handle timeouts and failures
Docker Configuration
- Add Gotenberg container to docker-compose.yml
- Add Redis container to docker-compose.yml
- Configure worker container
- Set up shared volume mounts
- Configure environment variables
- Set up container dependencies
- Configure resource limits for Gotenberg
Client Updates
- Modify client to check conversion status
- Implement retry logic for pending conversions
- Add loading/waiting screens
- Implement error handling
Testing
- Test upload → conversion → download flow
- Test multiple concurrent conversions
- Test error handling (corrupted PPTX, etc.)
- Test Gotenberg timeout handling
- Test cache invalidation on file update
- Load test with multiple clients
- Test Gotenberg health checks
Monitoring & Operations
- Set up logging for conversions
- Monitor Gotenberg health endpoint
- Implement cleanup job for old files
- Add metrics for conversion times
- Set up alerts for failed conversions
- Monitor shared storage disk usage
- Document backup procedures
Security
- Validate file types before conversion
- Set file size limits
- Sanitize filenames
- Implement rate limiting
- Secure inter-container communication
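For the file-type validation item, the extension alone is easy to spoof. A sketch of a first-bytes check, assuming only the three presentation formats are accepted: .pptx and .odp are ZIP containers, and legacy .ppt is an OLE2 compound file. It does not prove the file is a valid presentation, only that the container type matches the extension. looks_like_presentation is a hypothetical name:

```python
import os

ZIP_MAGIC = b"PK\x03\x04"                          # .pptx and .odp are ZIP archives
OLE2_MAGIC = b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1"   # legacy .ppt (OLE2 compound file)

# Assumed allow-list: extension -> expected leading bytes
EXPECTED_MAGIC = {
    ".pptx": ZIP_MAGIC,
    ".odp": ZIP_MAGIC,
    ".ppt": OLE2_MAGIC,
}


def looks_like_presentation(filename: str, head: bytes) -> bool:
    """Check that the first bytes of an upload match its extension.

    `head` is the first 8 or more bytes of the uploaded file.
    """
    ext = os.path.splitext(filename)[1].lower()
    magic = EXPECTED_MAGIC.get(ext)
    return magic is not None and head.startswith(magic)
```

The upload endpoint could read the first 8 bytes, call this check, and reject the request before anything is written to shared storage.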
14. Gotenberg Configuration Options
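Environment Variables (Gotenberg's documented configuration mechanism is command-line flags such as --api-timeout=300s and --log-level=info; verify that your image version honors the environment-variable form before relying on it):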
gotenberg:
  image: gotenberg/gotenberg:8
  environment:
    # API Configuration
    - GOTENBERG_API_TIMEOUT=300s
    - GOTENBERG_API_PORT=3000
    # Logging
    - GOTENBERG_LOG_LEVEL=info # debug, info, warn, error
    # LibreOffice
    - GOTENBERG_LIBREOFFICE_DISABLE_ROUTES=false
    - GOTENBERG_LIBREOFFICE_AUTO_START=true
    # Chromium (if needed for HTML/Markdown)
    - GOTENBERG_CHROMIUM_DISABLE_ROUTES=true # Disable if not needed
    # Resource limits
    - GOTENBERG_LIBREOFFICE_MAX_QUEUE_SIZE=100
Custom Gotenberg Configuration:
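For advanced configurations, a gotenberg.yml along these lines is one option, provided your Gotenberg version supports file-based configuration (check its documentation first):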
api:
  timeout: 300s
  port: 3000
libreoffice:
  autoStart: true
  maxQueueSize: 100
chromium:
  disableRoutes: true
Mount it in docker-compose:
gotenberg:
  image: gotenberg/gotenberg:8
  volumes:
    - ./gotenberg.yml:/etc/gotenberg/config.yml:ro
    - shared-storage:/shared
15. Troubleshooting
Common Issues:
Gotenberg timeout:
# Increase timeout for large files
response = requests.post(
    f'{GOTENBERG_URL}/forms/libreoffice/convert',
    files=files,
    timeout=600  # 10 minutes for large PPTX
)
Memory issues:
# Increase Gotenberg memory limit
gotenberg:
  deploy:
    resources:
      limits:
        memory: 4G
File permission issues:
# Ensure proper permissions on shared volume
chmod -R 755 /shared
chown -R 1000:1000 /shared
Gotenberg not responding:
# Check health before conversion
def ensure_gotenberg_healthy():
    try:
        response = requests.get(f'{GOTENBERG_URL}/health', timeout=5)
        if response.status_code != 200:
            raise Exception("Gotenberg unhealthy")
    except Exception as e:
        logger.error(f"Gotenberg health check failed: {e}")
        raise
This architecture provides a production-ready, scalable solution using Gotenberg as a specialized conversion service with efficient file sharing via Docker volumes!
16. Best Practices Specific to Infoscreen
- Idempotency by content: Always compute a SHA‑256 of the uploaded source and include it in the unique key (source_event_media_id, target_format, file_hash). This prevents duplicate work for identical content and auto-busts cache on change.
- Strict MIME/type validation: Accept only .ppt, .pptx, .odp for conversion. Reject unknown types early. Consider reading the first bytes (magic) for extra safety.
- Bounded retries with jitter: Retry conversions on transient HTTP 5xx or timeouts up to N times with exponential backoff. Do not retry on 4xx or clear user errors.
- Output naming: Derive deterministic output paths under media/converted/, e.g., .pdf. Ensure no path traversal and sanitize names.
- Timeouts and size limits: Enforce server-side max upload size and per-job conversion timeout (e.g., 10 minutes). Return clear errors for oversized/long-running files.
- Isolation and quotas: Set CPU/memory limits for Gotenberg; consider a concurrency cap per worker to avoid DB starvation.
- Health probes before work: Check Gotenberg /health prior to enqueue spikes; fail-fast to avoid queue pile-ups when Gotenberg is down.
- Observability: Log job IDs, file hashes, durations, and sizes. Expose a small /api/conversions/status summary for operational visibility.
- Cleanup policy: Periodically delete orphaned conversions (media deleted) and failed jobs older than X days. Keep successful PDFs aligned with DB rows.
- Security: Never trust client paths; always resolve relative to the known media root. Do not expose the shared volume directly; serve via API only.
- Backpressure: If queue length exceeds a threshold, surface 503/“try later” on new uploads or pause enqueue to protect the system.
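The bounded-retry item can be made concrete with a small delay calculator. This is a sketch of one common scheme ("full jitter": a uniform random delay between zero and an exponentially growing, capped ceiling), not necessarily what the project uses. The base, cap, and function names are assumptions, and None stands for a timeout or connection error:

```python
import random
from typing import Optional


def retry_delay(attempt: int, base: float = 2.0, cap: float = 60.0,
                rng: Optional[random.Random] = None) -> float:
    """Seconds to wait before retry number `attempt` (1-based), with full jitter."""
    if attempt < 1:
        raise ValueError("attempt must be >= 1")
    # Ceiling doubles with each attempt: base, 2*base, 4*base, ... up to cap
    ceiling = min(cap, base * (2 ** (attempt - 1)))
    uniform = rng.uniform if rng is not None else random.uniform
    return uniform(0.0, ceiling)


def is_retryable(status_code: Optional[int]) -> bool:
    """Retry on timeouts/connection errors (None) and HTTP 5xx, never on 4xx."""
    return status_code is None or 500 <= status_code < 600
```

A worker would stop after N attempts, for example by checking an attempt counter on the job before calling retry_delay and re-enqueueing.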