weatherstation-datacollector/DATABASE_CONNECTIVITY_FIXES.md
olaf f55c1fe6f1 Pool sensor v2: VCC monitoring, database resilience, receiver improvements
- Added voltage monitoring table and storage pipeline
- Extended pool payload to 17 bytes with VCC field (protocol v2)
- Improved database connection pool resilience (reduced pool size, aggressive recycling, pool disposal on failure)
- Added environment variable support for database configuration
- Fixed receiver MQTT deprecation warning (CallbackAPIVersion.VERSION2)
- Silenced excessive RSSI status logging in receiver
- Added reset flag tracking and reporting
- Updated Docker compose with DB config and log rotation limits
2026-01-25 11:25:15 +00:00


# Database Connectivity Issues - Analysis & Fixes
## Problem Summary
The NAS container experiences **intermittent database connectivity** failures with the error:
```
Exception during reset or similar
_mysql_connector.MySQLInterfaceError: Lost connection to MySQL server during query
```
Meanwhile, the same stack on Docker Desktop runs reliably, and MySQL Workbench can connect without issues.
---
## Root Causes Identified
### 1. **Aggressive Connection Pool Settings**
- **Old config**: `pool_size=5` + `max_overflow=10` = up to 15 simultaneous connections
- **Problem**: Creates excessive connections that exhaust database resources or trigger connection limits
- **Result**: Pool reset failures when trying to return/reset dead connections
### 2. **Insufficient Connection Recycling**
- **Old config**: `pool_recycle=1800` (30 minutes)
- **Problem**: Connections held too long; database may timeout/close them due to `wait_timeout` or network issues
- **Result**: When SQLAlchemy tries to reuse connections, they're already dead
### 3. **Conflicting autocommit Setting**
- **Old config**: `autocommit=True` in connect_args
- **Problem**: When autocommit is enabled, there's nothing to rollback, but SQLAlchemy still tries during pool reset
- **Result**: Rollback fails on dead connections → traceback logged
### 4. **Pool Reset on Dead Connections**
- **Config**: `pool_reset_on_return="none"` was correct, but the pool was **not disposed on failure**
- **Problem**: When a connection dies, the pool kept trying to reuse it
- **Result**: Repeated failures until the next retry window (30 seconds)
### 5. **Network/Database Timeout Issues (NAS-specific)**
- **Likely cause**: NAS MariaDB has aggressive connection timeouts
- **Or**: Container network has higher packet loss/latency than Docker Desktop
- **Or**: Pool exhaustion prevents new connections from being established
---
## Applied Fixes
### ✅ Fix 1: Conservative Connection Pool (Lines 183-195)
```python
pool_size=3,       # reduced from 5
max_overflow=5,    # reduced from 10
pool_recycle=300,  # reduced from 1800: recycle every 5 min instead of 30
# autocommit=True removed from connect_args; SQLAlchemy now manages transactions
```
**Why this works:**
- Fewer simultaneous connections = less resource contention
- Aggressive recycling = avoids stale connections killed by database
- Proper transaction management = cleaner rollback handling
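Putting the settings above together, the engine setup can be sketched roughly as follows. This is a minimal illustration, not the collector's actual code: the SQLite URL is a stand-in for the real MariaDB URL, and `pool_pre_ping` is an extra option (not part of the original fix) that additionally checks each connection for liveness at checkout.

```python
from sqlalchemy import create_engine, text
from sqlalchemy.pool import QueuePool

def make_engine(url: str):
    # Conservative pool: few connections, recycled aggressively so the
    # database never sees a connection older than five minutes.
    return create_engine(
        url,
        poolclass=QueuePool,
        pool_size=3,                  # reduced from 5
        max_overflow=5,               # reduced from 10
        pool_recycle=300,             # recycle every 5 minutes
        pool_pre_ping=True,           # extra suggestion: liveness check on checkout
        pool_reset_on_return="none",  # do not attempt rollback on return
    )

# Stand-in SQLite URL for demonstration; the real collector builds a
# MariaDB URL from the DB_* settings.
engine = make_engine("sqlite://")
with engine.connect() as conn:
    result = conn.execute(text("SELECT 1")).scalar()
```

`pool_pre_ping` trades one cheap round-trip per checkout for never handing a dead connection to application code, which complements the aggressive recycling.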
### ✅ Fix 2: Pool Disposal on Connection Failure (Lines 530-533)
```python
except exc.OperationalError as e:
    sql_engine.dispose()  # ← CRITICAL: force all connections to be closed/recreated
    logger.warning(f"Lost database connectivity: {e}")
```
**Why this works:**
- When connection fails, dump the entire pool
- Next connection attempt gets fresh connections
- Avoids repeated failures trying to reuse dead connections
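The dispose-on-failure pattern can be sketched as a small wrapper (a simplified stand-in for the collector's actual write path; the function name and the SQLite engine are illustrative only):

```python
from sqlalchemy import create_engine, exc, text

def run_query(engine, statement):
    """Execute a statement; on a connectivity error, dispose the whole
    pool so the next attempt starts from fresh connections."""
    try:
        with engine.begin() as conn:
            return conn.execute(text(statement)).scalar()
    except exc.OperationalError as e:
        engine.dispose()  # dump every pooled connection, dead or alive
        print(f"Lost database connectivity: {e}")
        return None

# Demonstration against an in-memory SQLite engine (stand-in for MariaDB).
engine = create_engine("sqlite://")
value = run_query(engine, "SELECT 41 + 1")
bad = run_query(engine, "SELECT * FROM missing_table")  # triggers dispose path
```

A failed statement returns `None` and leaves the pool empty, so the next call builds fresh connections instead of retrying dead ones.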
### ✅ Fix 3: Environment Variable Support (Lines 169-175)
```python
DB_HOST = os.getenv("DB_HOST", "192.168.43.102")
DB_PORT = int(os.getenv("DB_PORT", "3306"))
# ... etc
```
**Why this matters:**
- Different deployments can now use different database hosts
- Docker Desktop can use `192.168.43.102`
- NAS can use `mariadb` (Docker DNS) or different IP if needed
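A minimal sketch of such an environment-driven config loader, using the defaults named above (the dict shape and function name are illustrative, not the collector's actual API; the password lookup is omitted here):

```python
import os

def load_db_config():
    # Defaults mirror the Docker Desktop deployment; any value can be
    # overridden per environment (NAS, Desktop, CI) without code changes.
    return {
        "host": os.getenv("DB_HOST", "192.168.43.102"),
        "port": int(os.getenv("DB_PORT", "3306")),
        "user": os.getenv("DB_USER", "weatherdata"),
        "name": os.getenv("DB_NAME", "weatherdata"),
        "connect_timeout": int(os.getenv("DB_CONNECT_TIMEOUT", "5")),
    }

os.environ["DB_HOST"] = "mariadb"  # e.g. the Docker DNS name on the NAS
cfg = load_db_config()
```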
---
## Recommended MariaDB Configuration
The NAS MariaDB should have appropriate timeout settings:
```sql
-- Check current settings
SHOW VARIABLES LIKE 'wait_timeout';
SHOW VARIABLES LIKE 'interactive_timeout';
SHOW VARIABLES LIKE 'max_connections';
SHOW VARIABLES LIKE 'max_allowed_packet';
```
Recommended settings (in `/etc/mysql/mariadb.conf.d/50-server.cnf`):
```ini
[mysqld]
wait_timeout        = 600   # 10 minutes: allow idle connections to live longer
interactive_timeout = 600
max_connections     = 100   # enough headroom for the pool plus Workbench
max_allowed_packet  = 64M
```
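Checking the `SHOW VARIABLES` output against the recommended floor can be sketched as a small helper. This is an illustrative utility, not part of the collector; `rows` is assumed to be the `(Variable_name, Value)` pairs that `SHOW VARIABLES` returns:

```python
def low_timeouts(rows, floor=600):
    """Given (Variable_name, Value) pairs from SHOW VARIABLES, return
    the timeout settings that fall below the recommended floor (seconds)."""
    watched = {"wait_timeout", "interactive_timeout"}
    return {name: int(value) for name, value in rows
            if name in watched and int(value) < floor}

# Example: a NAS MariaDB with an aggressive 60-second wait_timeout.
flagged = low_timeouts([("wait_timeout", "60"),
                        ("interactive_timeout", "600")])
```

An aggressive `wait_timeout` shorter than `pool_recycle` would explain the "Lost connection" errors: the server kills idle connections before the pool recycles them.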
---
## Deployment Instructions
### For Docker Desktop:
```bash
# Use default or override in your compose
docker-compose -f docker-compose.yml up
```
### For NAS:
Update your docker-compose or environment file:
```yaml
environment:
  - DB_HOST=192.168.43.102  # or your NAS's actual IP/hostname
  - DB_PORT=3306
  - DB_USER=weatherdata
  - DB_PASSWORD=cfCU$swM!HfK82%*
  - DB_NAME=weatherdata
  - DB_CONNECT_TIMEOUT=5
```
---
## Monitoring
The application now logs database configuration at startup:
```
DB config: host=192.168.43.102:3306, user=weatherdata, db=weatherdata
```
Monitor the logs for:
- **"Database reachable again"** → Connection recovered
- **"Lost database connectivity"** → Transient failure detected and pool disposed
- **"Stored batch locally"** → Data queued to SQLite while DB unavailable
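For automated monitoring, the three messages above can be mapped to events with a trivial classifier (a sketch; the function name and event labels are made up here, only the matched strings come from the collector's logs):

```python
def classify_log_line(line):
    """Map the collector's connectivity log messages to monitoring events."""
    if "Database reachable again" in line:
        return "recovered"
    if "Lost database connectivity" in line:
        return "pool-disposed"
    if "Stored batch locally" in line:
        return "queued-to-sqlite"
    return None

event = classify_log_line("WARNING Lost database connectivity: (2013, ...)")
```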
---
## Testing
### Test 1: Verify Environment Variables
```bash
# Run container with override
docker run -e DB_HOST=test-host ... python datacollector.py
# Check log: "DB config: host=test-host:3306"
```
### Test 2: Simulate Connection Loss
```python
# In Python shell connected to container
import requests
requests.get('http://container:port/shutdown') # Reconnect simulation
# Should see: "Database still unreachable" → "Database reachable again"
```
### Test 3: Monitor Pool State
Enable pool logging:
```python
echo_pool=True # Line 195 in datacollector.py
```
---
## Expected Behavior After Fix
- ✅ Connection pool adapts to transient failures
- ✅ Stale connections are recycled frequently
- ✅ Pool is disposed on failure to prevent cascading errors
- ✅ Different environments can specify different hosts
- ✅ Data is cached locally if database is temporarily unavailable