Firmware Change Protocol

The ESP32 is the only thing standing between software bugs and dead plants. When the planner goes offline, when the ingestor crashes, when the network drops — the firmware keeps running, holding whatever setpoints and override logic it had at boot. That means firmware changes cannot be validated only by the greenhouse reacting to them in real time. By the time a regression is visible in the data, it has already misbehaved for hours.

The Firmware Change Protocol is the three-layer gate every firmware change must pass before OTA. Each layer answers a different question, and each layer exists because an earlier generation of this project did not have it and paid the price.

Layer 1 — Unit tests for control logic

firmware/test/test_greenhouse_logic.cpp runs 59 native C++ tests covering resolve_mode(), resolve_equipment(), evaluate_overrides(), hysteresis bands, heat stage thresholds, relief cycle counters, and sensor fault handling. Each test pins a specific decision path: given these sensors and setpoints, the controller must do this.

These tests run on the development host (no ESP32 required) in under two seconds. They catch logic regressions on the fastest possible feedback loop: a dev edits a .h file, runs make test-firmware, and knows immediately if the change broke a known-good decision.

What Layer 1 cannot prove: the changed code still behaves correctly against real-world sensor patterns it has never seen. A unit test asserts “at 52°F, heat2 fires” — it does not assert “across 8 months of telemetry, the override flag set we emit matches the old implementation’s override set.” That’s Layer 2’s job.

Layer 2 — Replay harness against a golden CSV

firmware/test/replay_overrides.cpp replays 180,000 rows of real greenhouse telemetry (8 months, re-exported every 2–3 sprints) through the current firmware’s decision functions and reports the distribution of override flags fired per hour. A checksum-like scorecard lets a reviewer compare this run to the previous release: if occupancy-blocks-moisture suddenly fires 4× more often, that’s a signal. The harness also runs a synthetic self-test — seven hand-built input scenarios that each force exactly one override flag. If a flag fails to fire on its synthetic input, the build exits nonzero and blocks OTA.
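The scorecard idea can be sketched as follows. This is not `replay_overrides.cpp` itself: the row layout, the flag bit assignments, and the 4× drift threshold are assumptions for illustration. The point is that the scorecard counts *hours in which each flag fired*, which is what makes a distribution shift (or a flag that never fires) visible to a reviewer.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <vector>

// One replayed telemetry row: which hour it belongs to, and which override
// flags the decision functions fired on it. Layout is illustrative.
struct Row { uint32_t hour; uint32_t flags; };
constexpr uint32_t kOccupancyBlocksMoisture = 1u << 0;  // hypothetical flag bits
constexpr uint32_t kVpdDryOverride          = 1u << 1;

// Scorecard: for each flag, the number of distinct hours it fired in.
std::map<uint32_t, std::size_t> scorecard(const std::vector<Row>& rows) {
    std::map<uint32_t, std::map<uint32_t, bool>> fired_in_hour;  // flag -> hour
    for (const Row& r : rows)
        for (uint32_t bit = 1; bit != 0; bit <<= 1)
            if (r.flags & bit) fired_in_hour[bit][r.hour] = true;
    std::map<uint32_t, std::size_t> out;
    for (auto& [bit, hours] : fired_in_hour) out[bit] = hours.size();
    return out;
}

// Drift check against the previous release's card. A 4x jump or a drop to
// zero is a signal for the reviewer, not an automatic failure.
bool drifted(std::size_t prev, std::size_t cur) {
    if (prev == 0) return cur > 0;
    return cur >= 4 * prev || cur == 0;
}
```

The synthetic self-test is the hard-error counterpart: each of the seven hand-built scenarios asserts its one expected flag fired, and any miss makes the harness exit nonzero, blocking OTA.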

Origin story: OBS-1e dead-code incident. During Sprint 16 the unit tests passed, the build passed, the OTA succeeded, and the dashboards looked fine. But buried in greenhouse_logic.h a vpd_dry_override path was unreachable — the enclosing determine_mode() had already mutated state so the inner guard !vpd_wants_seal could never be true. The unit tests did not catch it because they tested evaluate_overrides() in isolation with hand-constructed state. The replay harness did catch it: replaying real telemetry showed vpd_dry_override firing zero times in eight months, which is physically implausible. The fix was a one-line refactor in determine_mode() to set state.dry_override_active explicitly. We added the synthetic self-test to make the “never fires” failure mode a hard error rather than a quiet anomaly.
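The shape of the bug is worth seeing in miniature. The sketch below is a heavily simplified reconstruction — the state fields are named after the incident, but the bodies are illustrative, not the real firmware. It shows why the unit tests passed (calling `evaluate_overrides()` with hand-built state reaches the branch) while the real call sequence never could.

```cpp
#include <cassert>

// Simplified reconstruction of the OBS-1e dead-code shape. Field names come
// from the incident write-up; the logic here is illustrative only.
struct State {
    bool vpd_wants_seal     = false;
    bool dry_override_active = false;
};

// Buggy version: determine_mode() mutates state before overrides run...
void determine_mode(State& s, double vpd) {
    (void)vpd;
    s.vpd_wants_seal = true;  // <- unconditional mutation that kills the guard
}

// ...so by the time evaluate_overrides() checks !vpd_wants_seal, the guard
// can never be true on the real call path. The branch is dead code.
void evaluate_overrides(State& s, double vpd) {
    if (!s.vpd_wants_seal && vpd < 0.4) s.dry_override_active = true;
}
```

A unit test that constructs `State` by hand (with `vpd_wants_seal = false`) reaches the branch and passes; replaying real telemetry through `determine_mode()` then `evaluate_overrides()` shows the flag firing zero times, which is exactly the anomaly the harness surfaced.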

What Layer 2 cannot prove: the firmware that flashed is actually running. Layers 1 and 2 both run against source code on a dev machine. They do not touch the ESP32.

Layer 3 — Post-deploy sensor-health sweep + OTA auto-rollback

Every make firmware-deploy runs: compile → OTA upload → wait 60 seconds for reboot → scripts/sensor-health-sweep.sh. The sweep queries TimescaleDB for fresh readings on every sensor the firmware is supposed to be publishing. If 26 or more sensors report recent data and zero are stale, the new firmware is promoted to last-good.ota.bin. If any expected sensor is missing or stale, the deploy immediately OTAs the previous last-good.ota.bin back to the controller and opens a firmware_rollback alert. The greenhouse is never more than about 90 seconds away from a validated firmware state.
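The real gate is scripts/sensor-health-sweep.sh running a TimescaleDB query; the sketch below distills only its promote/rollback rule into C++ for clarity. The 26-sensor and zero-stale thresholds come from the protocol above; the 120-second staleness window and all names are assumptions. Note that a sensor that stopped publishing entirely simply never appears in the query result, so it is caught by the fresh count falling below 26.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Distilled promote/rollback rule of the post-deploy sweep. The staleness
// window and type names are assumptions; the thresholds are the protocol's.
enum class Verdict { Promote, Rollback };

struct SensorReading { int64_t last_seen_epoch_s; };  // newest row per sensor

Verdict sweep_verdict(const std::vector<SensorReading>& sensors,
                      int64_t now_epoch_s,
                      int64_t stale_after_s = 120) {
    int fresh = 0, stale = 0;
    for (const auto& s : sensors)
        (now_epoch_s - s.last_seen_epoch_s <= stale_after_s) ? ++fresh : ++stale;
    // Promote only if at least 26 sensors are fresh and none are stale;
    // anything else re-flashes last-good.ota.bin and raises firmware_rollback.
    return (fresh >= 26 && stale == 0) ? Verdict::Promote : Verdict::Rollback;
}
```

Keeping the verdict a pure function of query results is what bounds the exposure window: the decision itself takes no time, so the roughly 90 seconds is dominated by the reboot wait and the rollback OTA.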

Origin story: the silent south probe failure. The south temperature probe failed in a subtle way — it stopped publishing but the firmware still reported “4 probes active” because the aggregation lambda was summing old cached values. The planner averaged over the stale reading for a week before the discrepancy between south and the other three probes grew large enough to trigger a band-compliance alert. Layer 3 exists because by the time band compliance shifts, the sensor has been lying for hours to days. A post-deploy sweep that counts live publishing sensors, not cached values, catches this immediately. Layer 3 was retroactively extended to catch any firmware change that silently drops an entity — not just probe failures.

What Layer 3 cannot prove: the plant is happy. Layer 3 proves “every sensor we expected is publishing.” It does not prove “the greenhouse is still inside the band.” That belongs to the planner scorecard on a longer time scale.

What the three layers are for, together

  • Layer 1 catches logic bugs within the decision functions, in two seconds.
  • Layer 2 catches regressions across the full distribution of real-world inputs, including dead-code paths, in under a minute.
  • Layer 3 catches deployment-time entity or calibration drift before it runs for more than 90 seconds.

A firmware change that passes all three gates can still be wrong — in ways that require days of telemetry to surface — but it cannot be silently wrong in the ways the three origin incidents above were. That’s the point. The protocol is not a proof of correctness. It is a commitment that every class of failure we’ve been burned by has a dedicated gate, running automatically, on every change.

The command surface

make test-firmware          # Layer 1 + Layer 2 (native tests + replay)
make test-replay-overrides  # Layer 2 in isolation (refreshes CSV first)
make firmware-deploy        # Layer 3 (compile → OTA → sweep → promote/rollback)

make check runs all three as part of the preflight gate before any commit touching firmware/.