Files
infoscreen/TV_POWER_CANARY_VALIDATION_CHECKLIST.md

191 lines
6.5 KiB
Markdown

# TV Power Coordination Canary Validation Checklist (Phase 1)
Manual verification checklist for Phase-1 server-side group-level power-intent publishing before production rollout.
## Preconditions
- Scheduler running with `POWER_INTENT_PUBLISH_ENABLED=true`
- One canary group selected for testing (example: group_id=1)
- Mosquitto broker running and accessible
- Database with seeded test data (canary group with events)
## Validation Scenarios
### 1. Baseline Payload Structure
**Goal**: Retained topic shows correct Phase-1 contract.
Instructions:
1. Subscribe to `infoscreen/groups/1/power/intent` (canary group, QoS 1)
2. Verify received payload contains:
- `schema_version: "v1"`
- `group_id: 1`
- `desired_state: "on"` or `"off"` (string)
- `reason: "active_event"` or `"no_active_event"` (string)
- `intent_id: "<uuid>"` (not empty, valid UUID v4 format)
- `issued_at: "2026-03-31T14:22:15Z"` (ISO 8601 with Z suffix)
- `expires_at: "2026-03-31T14:24:00Z"` (ISO 8601 with Z suffix, always > issued_at)
- `poll_interval_sec: 30` (integer, matches scheduler poll interval)
**Pass criteria**: All fields present, correct types and formats, no extra/malformed fields.
### 2. Event Start Transition
**Goal**: ON intent published immediately when event becomes active.
Instructions:
1. Create an event for canary group starting 2 minutes from now
2. Wait for event start time
3. Check retained topic immediately after event start
4. Verify `desired_state: "on"` and `reason: "active_event"`
5. Note the `intent_id` value
**Pass criteria**:
- `desired_state: "on"` appears within 30 seconds of event start
- No OFF in between (if a prior OFF existed)
### 3. Event End Transition
**Goal**: OFF intent published when last active event ends.
Instructions:
1. In setup from Scenario 2, wait for the event to end (< 5 min duration)
2. Check retained topic after end time
3. Verify `desired_state: "off"` and `reason: "no_active_event"`
**Pass criteria**:
- `desired_state: "off"` appears within 30 seconds of event end
- New `intent_id` generated (different from Scenario 2)
### 4. Adjacent Events (No OFF Blip)
**Goal**: When one event ends and next starts immediately after, no OFF is published.
Instructions:
1. Create two consecutive events for canary group, each 3 minutes:
- Event A: 14:00-14:03
- Event B: 14:03-14:06
2. Watch retained topic through both event boundaries
3. Capture all `desired_state` changes
**Pass criteria**:
- `desired_state: "on"` throughout both events
- No OFF at 14:03 (boundary between them)
- One or two transitions total (transition at A start only, or at A start + semantic change reasons)
### 5. Heartbeat Republish (Unchanged Intent)
**Goal**: Intent republishes each poll cycle with same intent_id if state unchanged.
Instructions:
1. Create a long-duration event (15+ minutes) for canary group
2. Subscribe to power intent topic
3. Capture timestamps and intent_ids for 3 consecutive poll cycles (90 seconds with default 30s polls)
4. Verify:
- Payload received at T, T+30s, T+60s
- Same `intent_id` across all three
- Different `issued_at` timestamps (should increment by ~30s)
**Pass criteria**:
- At least 3 payloads received within ~90 seconds
- Same `intent_id` for all
- Each `issued_at` is later than previous
- Each `expires_at` is 90 seconds after its `issued_at`
### 6. Scheduler Restart (Immediate Republish)
**Goal**: On scheduler process start, immediate published active intent.
Instructions:
1. Create and start an event for canary group (duration ≥ 5 minutes)
2. Wait for event to be active
3. Kill and restart scheduler process
4. Check retained topic within 5 seconds after restart
5. Verify `desired_state: "on"` and `reason: "active_event"`
**Pass criteria**:
- Correct ON intent retained within 5 seconds of restart
- No OFF published during restart/reconnect
### 7. Broker Reconnection (Retained Recovery)
**Goal**: On MQTT reconnect, scheduler republishes cached intents.
Instructions:
1. Create and start an event for canary group
2. Subscribe to power intent topic
3. Note the current `intent_id` and payload
4. Restart Mosquitto broker (simulates network interruption)
5. Verify retained topic is immediately republished after reconnect
**Pass criteria**:
- Correct ON intent reappears on retained topic within 5 seconds of broker restart
- Same `intent_id` (no new transition UUID)
- Publish metrics show `retained_republish_total` incremented
### 8. Feature Flag Disable
**Goal**: No power-intent publishes when feature disabled.
Instructions:
1. Set `POWER_INTENT_PUBLISH_ENABLED=false` in scheduler env
2. Restart scheduler
3. Create and start a new event for canary group
4. Subscribe to power intent topic
5. Wait 90 seconds
**Pass criteria**:
- No messages on `infoscreen/groups/1/power/intent`
- Scheduler logs show no `event=power_intent_publish*` lines
### 9. Scheduler Logs Inspection
**Goal**: Logs contain structured fields for observability.
Instructions:
1. Run canary with one active event
2. Collect scheduler logs for 60 seconds
3. Filter for `event=power_intent_publish` lines
**Pass criteria**:
- Each log line contains: `group_id`, `desired_state`, `reason`, `intent_id`, `issued_at`, `expires_at`, `transition_publish`, `heartbeat_publish`, `topic`, `qos`, `retained`
- No malformed JSON in payloads
- Error logs (if any) are specific and actionable
### 10. Expiry Validation
**Goal**: Payloads never published with `expires_at <= issued_at`.
Instructions:
1. Capture power-intent payloads for 120+ seconds
2. Parse `issued_at` and `expires_at` for each
3. Verify `expires_at > issued_at` for all
**Pass criteria**:
- All 100% of payloads have valid expiry window
- Typical delta is 90 seconds (min expiry)
## Summary Report Template
After running all scenarios, capture:
```
Canary Validation Report
Date: [date]
Scheduler version: [git commit hash]
Test group ID: [id]
Environment: [dev/test/prod]
Scenario Results:
1. Baseline Payload: ✓/✗ [notes]
2. Event Start: ✓/✗ [notes]
3. Event End: ✓/✗ [notes]
4. Adjacent Events: ✓/✗ [notes]
5. Heartbeat Republish: ✓/✗ [notes]
6. Restart: ✓/✗ [notes]
7. Broker Reconnect: ✓/✗ [notes]
8. Feature Flag: ✓/✗ [notes]
9. Logs: ✓/✗ [notes]
10. Expiry Validation: ✓/✗ [notes]
Overall: [Ready for production / Blockers found]
Issues: [list if any]
```
## Rollout Gate
Power-intent Phase 1 is ready for production rollout only when:
- All 10 scenarios pass
- Zero unintended OFF between adjacent events
- All log fields present and correct
- Feature flag default remains `false`
- Transition latency <= 30 seconds nominal case