Pool sensor v2: VCC monitoring, database resilience, receiver improvements
- Added voltage monitoring table and storage pipeline
- Extended pool payload to 17 bytes with VCC field (protocol v2)
- Improved database connection pool resilience (reduced pool size, aggressive recycling, pool disposal on failure)
- Added environment variable support for database configuration
- Fixed receiver MQTT deprecation warning (CallbackAPIVersion.VERSION2)
- Silenced excessive RSSI status logging in receiver
- Added reset flag tracking and reporting
- Updated Docker compose with DB config and log rotation limits
DATABASE_CONNECTIVITY_FIXES.md (new file, 176 lines):
# Database Connectivity Issues - Analysis & Fixes

## Problem Summary

The NAS container experiences **intermittent database connectivity** failures with the error:

```
Exception during reset or similar
_mysql_connector.MySQLInterfaceError: Lost connection to MySQL server during query
```

Meanwhile, Docker Desktop runs the same stack reliably, and MySQL Workbench can connect without issues.

---
## Root Causes Identified

### 1. **Aggressive Connection Pool Settings**

- **Old config**: `pool_size=5` + `max_overflow=10` = up to 15 simultaneous connections
- **Problem**: Creates excessive connections that exhaust database resources or trigger connection limits
- **Result**: Pool reset failures when trying to return/reset dead connections

### 2. **Insufficient Connection Recycling**

- **Old config**: `pool_recycle=1800` (30 minutes)
- **Problem**: Connections are held too long; the database may time out and close them due to `wait_timeout` or network issues
- **Result**: When SQLAlchemy tries to reuse connections, they're already dead

### 3. **Conflicting autocommit Setting**

- **Old config**: `autocommit=True` in `connect_args`
- **Problem**: With autocommit enabled there is nothing to roll back, but SQLAlchemy still attempts a rollback during pool reset
- **Result**: Rollback fails on dead connections → traceback logged

### 4. **Pool Reset on Dead Connections**

- **Config**: `pool_reset_on_return="none"` (correct), but the pool **was not disposed on failure**
- **Problem**: When a connection dies, the pool keeps trying to reuse it
- **Result**: Repeated failures until the next retry window (30 seconds)

### 5. **Network/Database Timeout Issues (NAS-specific)**

- **Likely cause**: The NAS MariaDB has aggressive connection timeouts
- **Or**: The container network has higher packet loss/latency than Docker Desktop
- **Or**: Pool exhaustion prevents new connections from being established

---
## Applied Fixes

### ✅ Fix 1: Conservative Connection Pool (Lines 183-195)

```python
pool_size=3,        # Reduced from 5
max_overflow=5,     # Reduced from 10
pool_recycle=300,   # Reduced from 1800 (recycle every 5 minutes instead of 30)
autocommit=False,   # Was True - let SQLAlchemy manage transactions
```

**Why this works:**

- Fewer simultaneous connections = less resource contention
- Aggressive recycling = avoids stale connections killed by the database
- Proper transaction management = cleaner rollback handling
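The settings above could be gathered into a single helper so the `create_engine` call site stays readable. This is a sketch, not the project's actual code; the helper name is hypothetical, and the `DB_CONNECT_TIMEOUT` default mirrors the compose file's environment section.

```python
# Sketch: collect Fix 1's pool settings in one place. Hypothetical helper,
# not the actual datacollector.py code; pass the result to create_engine(url, **kwargs).
import os

def engine_kwargs() -> dict:
    """Conservative pool settings: small pool, fast recycle, quick connect timeout."""
    return {
        "pool_size": 3,                  # reduced from 5
        "max_overflow": 5,               # reduced from 10
        "pool_recycle": 300,             # recycle every 5 minutes, well under wait_timeout
        "pool_reset_on_return": "none",  # skip rollback-on-return for dead connections
        "connect_args": {
            "connect_timeout": int(os.getenv("DB_CONNECT_TIMEOUT", "5")),
        },
    }
```

Keeping the settings in one dict also makes it easy to log or assert on them at startup.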
### ✅ Fix 2: Pool Disposal on Connection Failure (Lines 530-533)

```python
except exc.OperationalError as e:
    sql_engine.dispose()  # ← CRITICAL: force all connections to be closed/recreated
    logger.warning(f"Lost database connectivity: {e}")
```

**Why this works:**

- When a connection fails, the entire pool is discarded
- The next connection attempt gets fresh connections
- Avoids repeated failures from trying to reuse dead connections
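The control flow behind Fix 2 can be sketched stand-alone with a stub engine; in the real collector the exception is `sqlalchemy.exc.OperationalError` and the caller queues the batch to SQLite, so the stub classes and `store_batch` name here are purely illustrative.

```python
# Minimal sketch of the dispose-on-failure pattern. Stub classes stand in
# for the SQLAlchemy engine so the flow is runnable on its own.
class OperationalError(Exception):
    pass

class StubEngine:
    def __init__(self):
        self.disposed = 0
        self.fail_next = True  # simulate one dead pooled connection

    def connect(self):
        if self.fail_next:
            self.fail_next = False
            raise OperationalError("Lost connection to MySQL server during query")
        return "connection"

    def dispose(self):
        self.disposed += 1  # all pooled connections closed; next connect is fresh

def store_batch(engine):
    try:
        return engine.connect()
    except OperationalError:
        engine.dispose()  # dump the whole pool, not just the dead connection
        return None       # caller queues the batch locally and retries later
```

The key point: after `dispose()`, the very next attempt succeeds instead of cycling through the remaining dead connections in the pool.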
### ✅ Fix 3: Environment Variable Support (Lines 169-175)

```python
DB_HOST = os.getenv("DB_HOST", "192.168.43.102")
DB_PORT = int(os.getenv("DB_PORT", "3306"))
# ... etc
```

**Why this matters:**

- Different deployments can now use different database hosts
- Docker Desktop can use `192.168.43.102`
- The NAS can use `mariadb` (Docker DNS) or a different IP if needed
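Building on those variables, the full connection URL might be assembled as follows; the `mysql+mysqlconnector` driver string and the `database_url` helper are assumptions, not confirmed project code.

```python
# Sketch: assemble a SQLAlchemy URL from the environment-driven settings.
# Driver string and helper name are assumptions.
import os

DB_HOST = os.getenv("DB_HOST", "192.168.43.102")
DB_PORT = int(os.getenv("DB_PORT", "3306"))
DB_USER = os.getenv("DB_USER", "weatherdata")
DB_NAME = os.getenv("DB_NAME", "weatherdata")

def database_url(password: str) -> str:
    """Combine host, port, user and database name into one connection URL."""
    return f"mysql+mysqlconnector://{DB_USER}:{password}@{DB_HOST}:{DB_PORT}/{DB_NAME}"
```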
---

## Recommended MariaDB Configuration

The NAS MariaDB should have appropriate timeout settings:

```sql
-- Check current settings
SHOW VARIABLES LIKE 'wait_timeout';
SHOW VARIABLES LIKE 'interactive_timeout';
SHOW VARIABLES LIKE 'max_connections';
SHOW VARIABLES LIKE 'max_allowed_packet';
```

Recommended settings (in `/etc/mysql/mariadb.conf.d/50-server.cnf`):

```ini
[mysqld]
wait_timeout        = 600   # 10 minutes (allow idle connections to live longer)
interactive_timeout = 600
max_connections     = 100   # Enough for the pool plus Workbench sessions
max_allowed_packet  = 64M
```
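The relationship between `pool_recycle` and `wait_timeout` is the crux of root cause 2 and can be sanity-checked in one line; this helper and its 50% safety margin are illustrative, not from the project.

```python
# Illustrative check: pool_recycle must sit safely below the server's
# wait_timeout, or the database kills idle connections before SQLAlchemy
# recycles them. The 0.5 margin is an arbitrary safety factor.
def recycle_is_safe(pool_recycle: int, wait_timeout: int, margin: float = 0.5) -> bool:
    return pool_recycle <= wait_timeout * margin
```

With the fixed settings (`pool_recycle=300` vs `wait_timeout=600`), recycling happens at half the server timeout; the old 30-minute recycle fails this check.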
---

## Deployment Instructions

### For Docker Desktop:

```bash
# Use the defaults or override them in your compose file
docker-compose -f docker-compose.yml up
```
### For NAS:

Update your docker-compose or environment file:

```yaml
environment:
  - DB_HOST=192.168.43.102   # or your NAS's actual IP/hostname
  - DB_PORT=3306
  - DB_USER=weatherdata
  - DB_PASSWORD=cfCU$swM!HfK82%*
  - DB_NAME=weatherdata
  - DB_CONNECT_TIMEOUT=5
```
---

## Monitoring

The application now logs the database configuration at startup:

```
DB config: host=192.168.43.102:3306, user=weatherdata, db=weatherdata
```

Monitor the logs for:

- **"Database reachable again"** → Connection recovered
- **"Lost database connectivity"** → Transient failure detected and pool disposed
- **"Stored batch locally"** → Data queued to SQLite while the DB is unavailable

---
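The startup line can be reproduced with the stdlib logger; the logger name `datacollector` and helper function are assumptions for illustration.

```python
# Reconstruction of the startup config log line using the stdlib logger.
# Logger name and function name are assumptions.
import logging

logger = logging.getLogger("datacollector")

def log_db_config(host: str, port: int, user: str, db: str) -> str:
    """Emit (and return) the one-line DB config summary logged at startup."""
    msg = f"DB config: host={host}:{port}, user={user}, db={db}"
    logger.info(msg)
    return msg
```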
## Testing

### Test 1: Verify Environment Variables

```bash
# Run the container with an override
docker run -e DB_HOST=test-host ... python datacollector.py
# Check the log for: "DB config: host=test-host:3306"
```

### Test 2: Simulate Connection Loss

```python
# In a Python shell connected to the container
import requests
requests.get('http://container:port/shutdown')  # Reconnect simulation
# Should see: "Database still unreachable" → "Database reachable again"
```

### Test 3: Monitor Pool State

Enable pool logging:

```python
echo_pool=True  # Line 195 in datacollector.py
```

---
## Expected Behavior After Fix

- ✅ Connection pool adapts to transient failures
- ✅ Stale connections are recycled frequently
- ✅ Pool is disposed on failure to prevent cascading errors
- ✅ Different environments can specify different hosts
- ✅ Data is cached locally if the database is temporarily unavailable