infoscreen/pptx_conversion_guide_gotenberg.md
2025-10-10 15:20:14 +00:00

# Recommended Implementation: PPTX-to-PDF Conversion System with Gotenberg
## Architecture Overview
**Asynchronous server-side conversion using Gotenberg with shared storage**
```
User Upload → API saves PPTX → Job in Queue → Worker calls Gotenberg API
Gotenberg converts via shared volume
Client requests → API checks DB status → PDF ready? → Download PDF from shared storage
→ Pending? → "Please wait"
→ Failed? → Retry/Error
```
## 1. Database Schema
```sql
CREATE TABLE media_files (
    id UUID PRIMARY KEY,
    filename VARCHAR(255),
    original_path VARCHAR(512),
    file_type VARCHAR(10),
    mime_type VARCHAR(100),
    uploaded_at TIMESTAMP,
    updated_at TIMESTAMP
);

CREATE TABLE conversions (
    id UUID PRIMARY KEY,
    source_file_id UUID REFERENCES media_files(id) ON DELETE CASCADE,
    target_format VARCHAR(10),   -- 'pdf'
    target_path VARCHAR(512),    -- Path to generated PDF
    status VARCHAR(20),          -- 'pending', 'processing', 'ready', 'failed'
    started_at TIMESTAMP,
    completed_at TIMESTAMP,
    error_message TEXT,
    file_hash VARCHAR(64)        -- Hash of PPTX for cache invalidation
);

CREATE INDEX idx_conversions_source ON conversions(source_file_id, target_format);
```
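The `status` column is a free-form string, so nothing in the schema stops a row from jumping from `ready` back to `processing`. A minimal sketch of how application code could guard the transitions is shown below. The transition table itself is an assumption (the schema only lists the four status values), and `ConversionStatus` and `can_transition` are hypothetical names.

```python
from enum import Enum


class ConversionStatus(str, Enum):
    PENDING = "pending"
    PROCESSING = "processing"
    READY = "ready"
    FAILED = "failed"


# Assumed transition rules; the schema itself does not define them.
ALLOWED_TRANSITIONS = {
    ConversionStatus.PENDING: {ConversionStatus.PROCESSING},
    ConversionStatus.PROCESSING: {ConversionStatus.READY, ConversionStatus.FAILED},
    ConversionStatus.FAILED: {ConversionStatus.PENDING},  # re-queue for a retry
    ConversionStatus.READY: set(),                        # terminal state
}


def can_transition(current: str, new: str) -> bool:
    """Return True if a conversion may move from `current` to `new`."""
    return ConversionStatus(new) in ALLOWED_TRANSITIONS[ConversionStatus(current)]
```

A helper like this could be called inside the database update function before writing a new status, so that a late or duplicated job cannot overwrite a finished conversion.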
## 2. Components
### **API Server (existing)**
- Accepts uploads
- Creates DB entries
- Enqueues jobs
- Delivers status and files
### **Background Worker (new)**
- Runs as separate process in **same container** as API
- Processes conversion jobs from queue
- Calls Gotenberg API for conversion
- Updates database with results
- Technology: Python RQ, Celery, or similar
### **Gotenberg Container (new)**
- Dedicated conversion service
- HTTP API for document conversion
- Handles LibreOffice conversions internally
- Accesses files via shared volume
### **Message Queue**
- Redis (recommended for start - simple, fast)
- Alternative: RabbitMQ for more features
### **Redis Container (separate)**
- Handles job queue
- Minimal resource footprint
### **Shared Storage**
- Docker volume mounted to all containers that need file access
- API, Worker, and Gotenberg all access same files
- Simplifies file exchange between services
## 3. Detailed Workflow
### **Upload Process:**
```python
# Assumes a FastAPI app; `save_to_disk`, `calculate_hash`, `db`, and `queue`
# are application helpers that are not shown here.
from fastapi import UploadFile


@app.post("/upload")
async def upload_file(file: UploadFile):
    # 1. Save PPTX to shared volume
    file_path = save_to_disk(file)  # e.g., /shared/uploads/abc123.pptx

    # 2. DB entry for original file
    file_record = db.create_media_file({
        'filename': file.filename,
        'original_path': file_path,
        'file_type': 'pptx'
    })

    # 3. Create conversion record
    conversion = db.create_conversion({
        'source_file_id': file_record.id,
        'target_format': 'pdf',
        'status': 'pending',
        'file_hash': calculate_hash(file_path)
    })

    # 4. Enqueue job (asynchronous!)
    queue.enqueue(convert_to_pdf_via_gotenberg, conversion.id)

    # 5. Return immediately to user
    return {
        'file_id': file_record.id,
        'status': 'uploaded',
        'conversion_status': 'pending'
    }
```
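The upload handler calls `calculate_hash(file_path)` without showing it. One possible implementation, assuming SHA-256 (its 64-character hex digest fits the `file_hash VARCHAR(64)` column), reads the file in chunks so that large presentations are not loaded into memory at once. The chunk size is an arbitrary choice.

```python
import hashlib


def calculate_hash(file_path: str, chunk_size: int = 1024 * 1024) -> str:
    """Return the hex SHA-256 digest of the file at `file_path`."""
    digest = hashlib.sha256()
    with open(file_path, "rb") as f:
        # Read fixed-size chunks until read() returns an empty bytes object
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Because the digest depends only on the file content, two uploads of the same presentation produce the same hash, which is what makes it usable as a cache key.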
### **Worker Process (calls Gotenberg):**
```python
import requests
import os

GOTENBERG_URL = os.getenv('GOTENBERG_URL', 'http://gotenberg:3000')


# `db` and `now()` are application helpers that are not shown here.
def convert_to_pdf_via_gotenberg(conversion_id):
    conversion = db.get_conversion(conversion_id)
    source_file = db.get_media_file(conversion.source_file_id)

    # Status update: processing
    db.update_conversion(conversion_id, {
        'status': 'processing',
        'started_at': now()
    })

    try:
        # Prepare output path
        pdf_filename = f"{conversion.id}.pdf"
        pdf_path = f"/shared/converted/{pdf_filename}"

        # Call Gotenberg API
        # The worker reads the PPTX from the shared volume and sends it
        # to Gotenberg as a multipart upload
        with open(source_file.original_path, 'rb') as f:
            files = {
                'files': (os.path.basename(source_file.original_path), f)
            }
            response = requests.post(
                f'{GOTENBERG_URL}/forms/libreoffice/convert',
                files=files,
                timeout=300  # 5 minutes timeout
            )
        response.raise_for_status()

        # Save PDF to shared volume
        with open(pdf_path, 'wb') as pdf_file:
            pdf_file.write(response.content)

        # Success
        db.update_conversion(conversion_id, {
            'status': 'ready',
            'target_path': pdf_path,
            'completed_at': now()
        })
    except requests.exceptions.Timeout:
        db.update_conversion(conversion_id, {
            'status': 'failed',
            'error_message': 'Conversion timeout after 5 minutes',
            'completed_at': now()
        })
    except requests.exceptions.RequestException as e:
        db.update_conversion(conversion_id, {
            'status': 'failed',
            'error_message': f'Gotenberg API error: {str(e)}',
            'completed_at': now()
        })
    except Exception as e:
        db.update_conversion(conversion_id, {
            'status': 'failed',
            'error_message': str(e),
            'completed_at': now()
        })
```
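The worker above marks a job as failed on the first error. Section 16 recommends bounded retries with jitter on timeouts and HTTP 5xx responses, and no retries on 4xx. The following sketch shows the two decisions involved: whether to retry, and how long to wait. The constants and the function names `is_retryable` and `backoff_delay` are assumptions chosen for illustration.

```python
import random

MAX_ATTEMPTS = 3          # assumed upper bound on attempts per job
BASE_DELAY_SECONDS = 5    # assumed delay ceiling for the first retry
MAX_DELAY_SECONDS = 120   # assumed cap so delays do not grow without bound


def is_retryable(status_code=None, timed_out=False):
    """Retry on timeouts and HTTP 5xx; never on 4xx or other responses."""
    if timed_out:
        return True
    return status_code is not None and 500 <= status_code < 600


def backoff_delay(attempt, rng=random.random):
    """Exponential backoff with full jitter.

    `attempt` starts at 0. The result is uniformly distributed between 0 and
    min(MAX_DELAY_SECONDS, BASE_DELAY_SECONDS * 2**attempt).
    """
    ceiling = min(MAX_DELAY_SECONDS, BASE_DELAY_SECONDS * (2 ** attempt))
    return rng() * ceiling
```

In the worker, the `except` branches would check `is_retryable(...)` and the attempt count against `MAX_ATTEMPTS`, then either re-enqueue the job after `backoff_delay(attempt)` seconds or mark it as failed. If RQ is used as the queue, its own retry support may be a simpler option; check the RQ documentation for the version in use.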
### **Alternative: Compact Worker Variant**
Gotenberg's LibreOffice route takes the document as a multipart upload, so the worker always reads the PPTX from the shared volume and sends it over HTTP; Gotenberg itself does not read from `/shared`. A more compact variant of the worker, with a single catch-all error handler:
```python
def convert_to_pdf_via_gotenberg_shared(conversion_id):
    conversion = db.get_conversion(conversion_id)
    source_file = db.get_media_file(conversion.source_file_id)

    db.update_conversion(conversion_id, {
        'status': 'processing',
        'started_at': now()
    })

    try:
        pdf_filename = f"{conversion.id}.pdf"
        pdf_path = f"/shared/converted/{pdf_filename}"

        # The worker opens the file on the shared volume and uploads it;
        # requests derives the upload filename from the file object's name
        with open(source_file.original_path, 'rb') as f:
            files = {'files': f}
            response = requests.post(
                f'{GOTENBERG_URL}/forms/libreoffice/convert',
                files=files,
                timeout=300
            )
        response.raise_for_status()

        # Write result to shared volume
        with open(pdf_path, 'wb') as pdf_file:
            pdf_file.write(response.content)

        db.update_conversion(conversion_id, {
            'status': 'ready',
            'target_path': pdf_path,
            'completed_at': now()
        })
    except Exception as e:
        db.update_conversion(conversion_id, {
            'status': 'failed',
            'error_message': str(e),
            'completed_at': now()
        })
```
### **Client Download:**
```python
@app.get("/files/{file_id}/display")
async def get_display_file(file_id):
    file = db.get_media_file(file_id)

    # Only for PPTX: check PDF conversion
    if file.file_type == 'pptx':
        conversion = db.get_latest_conversion(file.id, target_format='pdf')

        if not conversion:
            # Shouldn't happen, but just to be safe
            trigger_new_conversion(file.id)
            return {'status': 'pending', 'message': 'Conversion is being created'}

        if conversion.status == 'ready':
            # Serve PDF from shared storage
            return FileResponse(conversion.target_path)
        elif conversion.status == 'failed':
            # Optional: Auto-retry
            trigger_new_conversion(file.id)
            return {'status': 'failed', 'error': conversion.error_message}
        else:  # pending or processing
            return {'status': conversion.status, 'message': 'Please wait...'}

    # Serve other file types directly
    return FileResponse(file.original_path)
```
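The endpoint above returns the pending and failed states as JSON with the default 200 status, so a client has to inspect the body or the content type to tell them apart from a finished PDF. One option, sketched here as an assumption rather than as part of the original design, is to map each state to a distinct HTTP status code. `display_response` is a hypothetical helper, and the choice of 202 and 500 is a suggestion only.

```python
def display_response(status, error_message=None):
    """Map a non-ready conversion status to an (HTTP status code, JSON body) pair."""
    if status in ("pending", "processing"):
        # 202 Accepted: the request is valid but the PDF does not exist yet
        return 202, {"status": status, "message": "Please wait..."}
    if status == "failed":
        return 500, {"status": "failed", "error": error_message or "Conversion failed"}
    # 'ready' is served as a file, not as JSON, so it is not handled here
    raise ValueError(f"no JSON response defined for status {status!r}")
```

With FastAPI, the pair could be returned as `JSONResponse(status_code=code, content=body)`.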
## 4. Docker Setup
```yaml
version: '3.8'

services:
  # Your API Server
  api:
    build: ./api
    command: uvicorn main:app --host 0.0.0.0 --port 8000
    ports:
      - "8000:8000"
    volumes:
      - shared-storage:/shared  # Shared volume
    environment:
      - REDIS_URL=redis://redis:6379
      - DATABASE_URL=postgresql://postgres:password@postgres:5432/infoscreen
      - GOTENBERG_URL=http://gotenberg:3000
    depends_on:
      - redis
      - postgres
      - gotenberg
    restart: unless-stopped

  # Worker (same codebase as API, different command)
  worker:
    build: ./api  # Same build as API!
    command: python worker.py  # or: rq worker
    volumes:
      - shared-storage:/shared  # Shared volume
    environment:
      - REDIS_URL=redis://redis:6379
      - DATABASE_URL=postgresql://postgres:password@postgres:5432/infoscreen
      - GOTENBERG_URL=http://gotenberg:3000
    depends_on:
      - redis
      - postgres
      - gotenberg
    restart: unless-stopped
    # Optional: Multiple workers
    deploy:
      replicas: 2

  # Gotenberg - Document Conversion Service
  gotenberg:
    image: gotenberg/gotenberg:8
    # Gotenberg receives files via HTTP, so it does not need the shared volume.
    # The mount below is optional.
    volumes:
      - shared-storage:/shared  # Optional
    # Gotenberg is configured through command-line flags
    # (verify flag names against the docs for your Gotenberg version)
    command:
      - "gotenberg"
      - "--api-timeout=300s"
      - "--log-level=info"
    restart: unless-stopped
    # Resource limits (optional but recommended)
    deploy:
      resources:
        limits:
          cpus: '2.0'
          memory: 2G
        reservations:
          cpus: '0.5'
          memory: 512M

  # Redis - separate container
  redis:
    image: redis:7-alpine
    volumes:
      - redis-data:/data
    command: redis-server --appendonly yes
    restart: unless-stopped

  # Your existing Postgres
  postgres:
    image: postgres:15
    environment:
      - POSTGRES_DB=infoscreen
      - POSTGRES_PASSWORD=password
    volumes:
      - postgres-data:/var/lib/postgresql/data
    restart: unless-stopped

  # Optional: Redis Commander (UI for debugging)
  redis-commander:
    image: rediscommander/redis-commander
    environment:
      - REDIS_HOSTS=local:redis:6379
    ports:
      - "8081:8081"
    depends_on:
      - redis

volumes:
  shared-storage:  # New: Shared storage for all file operations
  redis-data:
  postgres-data:
```
## 5. Storage Structure
```
/shared/
├── uploads/ # Original uploaded files (PPTX, etc.)
│ ├── abc123.pptx
│ ├── def456.pptx
│ └── ...
└── converted/ # Converted PDF files
├── uuid-1.pdf
├── uuid-2.pdf
└── ...
```
## 6. Gotenberg Integration Details
### **Gotenberg API Endpoints:**
Gotenberg provides various conversion endpoints:
```
# LibreOffice conversion (for PPTX, DOCX, ODT, etc.)
POST http://gotenberg:3000/forms/libreoffice/convert
# HTML to PDF
POST http://gotenberg:3000/forms/chromium/convert/html
# Markdown to PDF
POST http://gotenberg:3000/forms/chromium/convert/markdown
# Merge PDFs
POST http://gotenberg:3000/forms/pdfengines/merge
```
### **Example Conversion Request:**
```python
import os

import requests


def convert_with_gotenberg(input_file_path, output_file_path):
    """
    Convert document using Gotenberg
    """
    with open(input_file_path, 'rb') as f:
        files = {
            'files': (os.path.basename(input_file_path), f,
                      'application/vnd.openxmlformats-officedocument.presentationml.presentation')
        }

        # Optional: Add conversion parameters
        data = {
            'landscape': 'false',      # Portrait mode
            'nativePageRanges': '1-',  # All pages
        }

        # The request must be sent while the file is still open
        response = requests.post(
            'http://gotenberg:3000/forms/libreoffice/convert',
            files=files,
            data=data,
            timeout=300
        )

    if response.status_code == 200:
        with open(output_file_path, 'wb') as out:
            out.write(response.content)
        return True
    else:
        raise Exception(f"Gotenberg error: {response.status_code} - {response.text}")
```
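The examples above hold the whole PDF in memory via `response.content` and write straight to the final path, so a crash in the middle of the write can leave a truncated PDF on the shared volume. A minimal sketch of a safer write is shown below: it consumes the body chunk by chunk and renames a temporary file into place only after the write has finished. `save_stream_atomically` is a hypothetical helper name, and the `.part` suffix is an arbitrary choice.

```python
import os
import tempfile


def save_stream_atomically(chunks, output_path):
    """Write an iterable of byte chunks to `output_path` via a temp file and rename."""
    directory = os.path.dirname(output_path) or "."
    os.makedirs(directory, exist_ok=True)
    # Create the temp file in the target directory so the rename stays on one filesystem
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".part")
    try:
        with os.fdopen(fd, "wb") as tmp:
            for chunk in chunks:
                if chunk:  # skip empty keep-alive chunks
                    tmp.write(chunk)
        os.replace(tmp_path, output_path)  # atomic when source and target share a filesystem
    except BaseException:
        # Remove the partial file, then re-raise the original error
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

With `requests`, this could be used by passing `stream=True` to `requests.post(...)` and then calling `save_stream_atomically(response.iter_content(chunk_size=65536), pdf_path)`.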
### **Advanced Options:**
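```python
# Form field names differ between Gotenberg versions; verify them against
# the documentation for the version you run.

# With custom PDF properties
data = {
    'landscape': 'false',
    'nativePageRanges': '1-10',  # Only first 10 pages
    'pdfa': 'PDF/A-1b',          # PDF/A output (Gotenberg 8 field; 7.x used 'pdfFormat')
    'exportFormFields': 'false',
}

# With password protection (only available in newer Gotenberg releases)
data = {
    'userPassword': 'secret123',
    'ownerPassword': 'admin456',
}
```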
## 7. Client Behavior (Pi5)
```python
import subprocess
import time


# On the Pi5 client
# (`api`, `download_and_display`, and the screen helpers are client-specific)
def display_file(file_id):
    while True:
        response = api.get(f"/files/{file_id}/display")
        content_type = response.headers.get('Content-Type', '')

        if content_type.startswith('application/pdf'):
            # PDF is ready; the helper is assumed to return the local file path
            downloaded_pdf = download_and_display(response)
            subprocess.run(['impressive', downloaded_pdf])
            return
        elif response.json()['status'] in ['pending', 'processing']:
            # Wait and retry
            show_loading_screen("Presentation is being prepared...")
            time.sleep(5)
        else:
            # Error
            show_error_screen("Error loading presentation")
            return
```
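The client above keeps polling for as long as the conversion is pending, which leaves a display stuck on the loading screen if a job never finishes. A minimal sketch of bounded polling follows. `wait_for_pdf` and its `fetch_status` callable are hypothetical, and the default of 60 attempts at 5 seconds each is an assumption.

```python
import time


def wait_for_pdf(fetch_status, max_attempts=60, delay_seconds=5, sleep=time.sleep):
    """Poll until the conversion is 'ready' or 'failed', or give up.

    `fetch_status` is a callable that returns the current status string.
    Returns 'ready', 'failed', or 'timeout'.
    """
    for attempt in range(max_attempts):
        status = fetch_status()
        if status in ("ready", "failed"):
            return status
        # Do not sleep after the final attempt
        if attempt < max_attempts - 1:
            sleep(delay_seconds)
    return "timeout"
```

The `sleep` parameter exists so that the function can be tested without real waiting; on the Pi it would keep its default.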
## 8. Additional Features
### **Cache Invalidation on PPTX Update:**
```python
@app.put("/files/{file_id}")
async def update_file(file_id, new_file):
    # Delete old conversions and PDFs
    conversions = db.get_conversions_for_file(file_id)
    for conv in conversions:
        if conv.target_path and os.path.exists(conv.target_path):
            os.remove(conv.target_path)
    db.mark_conversions_as_obsolete(file_id)

    # Update file
    update_media_file(file_id, new_file)

    # Trigger new conversion
    trigger_conversion(file_id, 'pdf')
```
### **Status API for Monitoring:**
```python
@app.get("/admin/conversions/status")
async def get_conversion_stats():
    return {
        'pending': db.count(status='pending'),
        'processing': db.count(status='processing'),
        'failed': db.count(status='failed'),
        'avg_duration_seconds': db.avg_duration(),
        'gotenberg_health': check_gotenberg_health()
    }


def check_gotenberg_health():
    try:
        response = requests.get(
            f'{GOTENBERG_URL}/health',
            timeout=5
        )
        return response.status_code == 200
    except requests.exceptions.RequestException:
        # Connection errors and timeouts count as unhealthy
        return False
```
### **Cleanup Job (Cronjob):**
```python
def cleanup_old_conversions():
    # Remove PDFs from deleted files
    orphaned = db.get_orphaned_conversions()
    for conv in orphaned:
        if conv.target_path and os.path.exists(conv.target_path):
            os.remove(conv.target_path)
        db.delete_conversion(conv.id)

    # Clean up old failed conversions
    old_failed = db.get_old_failed_conversions(older_than_days=7)
    for conv in old_failed:
        db.delete_conversion(conv.id)
```
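The cleanup job above starts from database rows. The opposite case, a PDF on disk with no matching row (for example after a crash between writing the file and updating the database), needs a scan of the directory. A minimal sketch, assuming the caller can supply the list of `target_path` values known to the database; `find_orphaned_pdfs` is a hypothetical helper.

```python
import os


def find_orphaned_pdfs(converted_dir, known_paths):
    """Return the PDFs in `converted_dir` whose path is not in `known_paths`."""
    known = {os.path.abspath(p) for p in known_paths}
    orphans = []
    for name in sorted(os.listdir(converted_dir)):
        path = os.path.abspath(os.path.join(converted_dir, name))
        if name.lower().endswith(".pdf") and os.path.isfile(path) and path not in known:
            orphans.append(path)
    return orphans
```

The function only reports candidates. Deleting them is left to the caller, ideally after checking that no conversion for that file is still in the `processing` state.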
## 9. Advantages of Using Gotenberg
- **Specialized Service**: Optimized specifically for document conversion
- **No LibreOffice Management**: Gotenberg handles LibreOffice lifecycle internally
- **Better Resource Management**: Isolated conversion process
- **HTTP API**: Clean, standard interface
- **Production Ready**: Battle-tested, actively maintained
- **Multiple Formats**: Supports PPTX, DOCX, ODT, HTML, Markdown, etc.
- **PDF Features**: Merge, encrypt, watermark PDFs
- **Health Checks**: Built-in health endpoint
- **Horizontal Scaling**: Can run multiple Gotenberg instances
- **Memory Safe**: Automatic cleanup and restart on issues
## 10. Migration Path
### **Phase 1 (MVP):**
- 1 worker process in API container
- Redis for queue (separate container)
- Gotenberg for conversion (separate container)
- Basic DB schema
- Shared volume for file exchange
- Simple retry logic
### **Phase 2 (as needed):**
- Multiple worker instances
- Multiple Gotenberg instances (load balancing)
- Monitoring & alerting
- Prioritization logic
- Advanced caching strategies
- PDF optimization/compression
**Start simple, scale when needed!**
## 11. Key Decisions Summary
| Aspect | Decision | Reason |
|--------|----------|--------|
| **Conversion Location** | Server-side (Gotenberg) | One conversion per file, consistent results |
| **Conversion Service** | Dedicated Gotenberg container | Specialized, production-ready, better isolation |
| **Conversion Timing** | Asynchronous (on upload) | No client waiting time, predictable performance |
| **Data Storage** | Database-tracked | Status visibility, robust error handling |
| **File Exchange** | Shared Docker volume | Simple, efficient, no network overhead |
| **Queue System** | Redis (separate container) | Standard pattern, scalable, maintainable |
| **Worker Architecture** | Background process in API container | Simple start, easy to separate later |
## 12. File Flow Diagram
```
┌─────────────┐
│ User Upload │
│   (PPTX)    │
└──────┬──────┘
       │
       ▼
┌──────────────────────┐
│ API Server           │
│ 1. Save to /shared   │
│ 2. Create DB record  │
│ 3. Enqueue job       │
└──────┬───────────────┘
       │
       ▼
┌──────────────────┐
│ Redis Queue      │
└──────┬───────────┘
       │
       ▼
┌──────────────────────┐
│ Worker Process       │
│ 1. Get job           │
│ 2. Call Gotenberg    │
│ 3. Update DB         │
└──────┬───────────────┘
       │
       ▼
┌──────────────────────┐
│ Gotenberg            │
│ 1. Receive PPTX      │
│    (via HTTP upload) │
│ 2. Convert PPTX      │
│ 3. Return PDF        │
└──────┬───────────────┘
       │
       ▼
┌──────────────────────┐
│ Worker saves PDF     │
│ to /shared/converted │
└──────┬───────────────┘
       │
       ▼
┌──────────────────────┐
│ Client Requests      │
│ 1. Check DB          │
│ 2. Download PDF      │
│ 3. Display           │
│    (via impressive)  │
└──────────────────────┘
```
## 13. Implementation Checklist
### Database Setup
- [ ] Create `media_files` table
- [ ] Create `conversions` table
- [ ] Add indexes for performance
- [ ] Set up foreign key constraints
### Storage Setup
- [ ] Create shared Docker volume
- [ ] Set up directory structure (/shared/uploads, /shared/converted)
- [ ] Configure proper permissions
### API Changes
- [ ] Modify upload endpoint to save to shared storage
- [ ] Create DB records for uploads
- [ ] Add conversion job enqueueing
- [ ] Implement file download endpoint with status checking
- [ ] Add status API for monitoring
- [ ] Implement cache invalidation on file update
### Worker Setup
- [ ] Create worker script/module
- [ ] Implement Gotenberg API calls
- [ ] Add error handling and retry logic
- [ ] Set up logging and monitoring
- [ ] Handle timeouts and failures
### Docker Configuration
- [ ] Add Gotenberg container to docker-compose.yml
- [ ] Add Redis container to docker-compose.yml
- [ ] Configure worker container
- [ ] Set up shared volume mounts
- [ ] Configure environment variables
- [ ] Set up container dependencies
- [ ] Configure resource limits for Gotenberg
### Client Updates
- [ ] Modify client to check conversion status
- [ ] Implement retry logic for pending conversions
- [ ] Add loading/waiting screens
- [ ] Implement error handling
### Testing
- [ ] Test upload → conversion → download flow
- [ ] Test multiple concurrent conversions
- [ ] Test error handling (corrupted PPTX, etc.)
- [ ] Test Gotenberg timeout handling
- [ ] Test cache invalidation on file update
- [ ] Load test with multiple clients
- [ ] Test Gotenberg health checks
### Monitoring & Operations
- [ ] Set up logging for conversions
- [ ] Monitor Gotenberg health endpoint
- [ ] Implement cleanup job for old files
- [ ] Add metrics for conversion times
- [ ] Set up alerts for failed conversions
- [ ] Monitor shared storage disk usage
- [ ] Document backup procedures
### Security
- [ ] Validate file types before conversion
- [ ] Set file size limits
- [ ] Sanitize filenames
- [ ] Implement rate limiting
- [ ] Secure inter-container communication
## 14. Gotenberg Configuration Options
### **Command-Line Flags:**
Gotenberg is configured through command-line flags passed to the `gotenberg` binary. Verify the flag names against the documentation for the version you run.
```yaml
gotenberg:
  image: gotenberg/gotenberg:8
  command:
    - "gotenberg"
    # API Configuration
    - "--api-timeout=300s"
    - "--api-port=3000"
    # Logging
    - "--log-level=info"  # debug, info, warn, error
    # LibreOffice
    - "--libreoffice-disable-routes=false"
    - "--libreoffice-auto-start=true"
    # Chromium (if needed for HTML/Markdown)
    - "--chromium-disable-routes=true"  # Disable if not needed
    # Resource limits
    - "--libreoffice-max-queue-size=100"
```
### **Custom Gotenberg Configuration:**
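For advanced configurations, the settings could be collected in a `gotenberg.yml`. Gotenberg's documented configuration mechanism is command-line flags, so treat the file below as an illustration of the intended settings and check whether your Gotenberg version reads a config file before relying on it:
```yaml
api:
  timeout: 300s
  port: 3000
libreoffice:
  autoStart: true
  maxQueueSize: 100
chromium:
  disableRoutes: true
```
Mount it in docker-compose:
```yaml
gotenberg:
  image: gotenberg/gotenberg:8
  volumes:
    - ./gotenberg.yml:/etc/gotenberg/config.yml:ro
    - shared-storage:/shared
```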
## 15. Troubleshooting
### **Common Issues:**
**Gotenberg timeout:**
```python
# Increase timeout for large files
response = requests.post(
f'{GOTENBERG_URL}/forms/libreoffice/convert',
files=files,
timeout=600 # 10 minutes for large PPTX
)
```
**Memory issues:**
```yaml
# Increase Gotenberg memory limit
gotenberg:
  deploy:
    resources:
      limits:
        memory: 4G
```
**File permission issues:**
```bash
# Ensure proper permissions on shared volume
chmod -R 755 /shared
chown -R 1000:1000 /shared
```
**Gotenberg not responding:**
```python
# Check health before conversion
def ensure_gotenberg_healthy():
    try:
        response = requests.get(f'{GOTENBERG_URL}/health', timeout=5)
        if response.status_code != 200:
            raise Exception("Gotenberg unhealthy")
    except Exception as e:
        logger.error(f"Gotenberg health check failed: {e}")
        raise
```
---
**This architecture provides a production-ready, scalable solution using Gotenberg as a specialized conversion service with efficient file sharing via Docker volumes!**
## 16. Best Practices Specific to Infoscreen
- Idempotency by content: Always compute a SHA256 of the uploaded source and include it in the unique key (source_event_media_id, target_format, file_hash). This prevents duplicate work for identical content and auto-busts cache on change.
- Strict MIME/type validation: Accept only .ppt, .pptx, .odp for conversion. Reject unknown types early. Consider reading the first bytes (magic) for extra safety.
- Bounded retries with jitter: Retry conversions on transient HTTP 5xx or timeouts up to N times with exponential backoff. Do not retry on 4xx or clear user errors.
- Output naming: Derive deterministic output paths under media/converted/, e.g., <basename>.pdf. Ensure no path traversal and sanitize names.
- Timeouts and size limits: Enforce server-side max upload size and per-job conversion timeout (e.g., 10 minutes). Return clear errors for oversized/long-running files.
- Isolation and quotas: Set CPU/memory limits for Gotenberg; consider a concurrency cap per worker to avoid DB starvation.
- Health probes before work: Check Gotenberg /health prior to enqueue spikes; fail-fast to avoid queue pile-ups when Gotenberg is down.
- Observability: Log job IDs, file hashes, durations, and sizes. Expose a small /api/conversions/status summary for operational visibility.
- Cleanup policy: Periodically delete orphaned conversions (media deleted) and failed jobs older than X days. Keep successful PDFs aligned with DB rows.
- Security: Never trust client paths; always resolve relative to the known media root. Do not expose the shared volume directly; serve via API only.
- Backpressure: If queue length exceeds a threshold, surface 503/“try later” on new uploads or pause enqueue to protect the system.
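
Two of the points above, type validation by magic bytes and resolving paths relative to a known media root, can be sketched as follows. The function names are hypothetical. The magic-byte check only identifies the container format (ZIP for `.pptx` and `.odp`, OLE2 for `.ppt`), so it rejects obviously wrong uploads but does not prove that the file is a valid presentation.

```python
import os

ZIP_MAGIC = b"PK\x03\x04"                         # .pptx and .odp are ZIP containers
OLE2_MAGIC = b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1"  # legacy .ppt (OLE2 compound file)
ALLOWED_EXTENSIONS = {".ppt", ".pptx", ".odp"}


def looks_like_presentation(filename, head):
    """Check the extension against the allow-list and the first bytes against the container format."""
    ext = os.path.splitext(filename)[1].lower()
    if ext not in ALLOWED_EXTENSIONS:
        return False
    if ext == ".ppt":
        return head.startswith(OLE2_MAGIC)
    return head.startswith(ZIP_MAGIC)


def resolve_under_root(media_root, relative_path):
    """Resolve `relative_path` below `media_root`; raise ValueError on traversal."""
    root = os.path.realpath(media_root)
    candidate = os.path.realpath(os.path.join(root, relative_path))
    # commonpath equals root only if candidate is root itself or lies below it
    if os.path.commonpath([root, candidate]) != root:
        raise ValueError(f"path escapes media root: {relative_path!r}")
    return candidate
```

In the upload handler, `head` would be the first few bytes read from the uploaded file, and every path taken from a request or a database row would pass through `resolve_under_root` before being opened or served.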