11. Near-real-time live frame from the chunks bucket¶
Status¶
Accepted
Context¶
The displayed radar ran ~5 min behind reality. A read-only lag investigation traced
~90% of that to a by-design choice (ADR-0001): we read the assembled bucket
(unidata-nexrad-level2), where one object is one complete volume scan. A volume must
finish scanning, assemble, and upload before that object appears — measured at a median
~309 s (≈5 min) after the volume's start, even though we only ever use the lowest
0.5° reflectivity cut, which is scanned first. No data/correctness bug was found; the
floor is structural. The only way under it is the real-time chunks bucket
(unidata-nexrad-level2-chunks), which RadarScope-class viewers read per-sweep.
Slice 26a proved the decode path: concatenating a volume's chunk objects in order
yields a partial AR2V stream that pyart.io.read_nexrad_archive(BytesIO) reads directly
(no MetPy), and at radar.nsweeps >= 2 the 0.5° cut is frozen and byte-identical to
the eventual assembled volume's tilt (max abs diff 0.0 dBZ; the live PNG is md5-equal
to the assembled render). 26b wires that assembler into collection and serving.
Decision¶
Add the live frame as an additive path; the assembled bucket stays the archive source of truth (backfill, retention, byte-for-byte storage).
- One frame, one row, a
sourceflag. Asourcecolumn ('assembled'default |'live') is added tovolumesby an idempotentALTER TABLEmigration (the Slice-5 pattern).UNIQUE(site, scan_time)is unchanged — a scan is one row regardless of how it arrived. The live frame is a first-class rendered frame; the serve/timeline API and the frontend are untouched (theySELECT *and never readsource). - Live attempt in the collect loop, on the existing cadence. After the assembled
attempt each cycle, a gated live attempt finds the active volume dir, assembles the
0.5° cut, renders it via
render_sweep, and indexes itsource='live'. The 0.5° cut completes ~73 s after volume start; the existing 30 s poll catches it within ~30 s → ~1–2 min displayed latency. No separate timer/thread. - Reconciliation as a dedicated per-cycle sweep, not inline in the fetch. The
assembled path only ever fetches the single latest volume per cycle, so upgrading a
live row only when that fetch happens to hit its scan would strand any scan polling
skipped as a permanent partial. Instead, each cycle a sweep takes every
source='live'row older thanLIVE_RECONCILE_DELAY(6 min — past the measured ~5 min S3 lag), checks the deterministic assembled key (naming.archive_key), and if present downloads the complete volume, overwrites the partial artifact in place, andUPDATEs the row tosource='assembled'(path/s3_key/size only). No re-render — the PNG is provably identical (26a), sorender_status/image_pathare untouched and the displayed frame never visibly changes. A missing object or any fetch error leaves the row untouched to retry next cycle; theUPDATEis the only mutation and runs only after a successful download, so a failure never corrupts the index. - Bounded cost. A per-site cursor rides the active dir across polls (one cheap LIST), and a decode is attempted only when chunks advanced and the scan isn't already indexed — a completed/indexed volume is never re-decoded. The active-dir scan (O(dirs), ~45 s serial for ~500 rotating dirs) is parallelized to ~2 s and run only at cold start or volume rollover. A still-incomplete volume older than a give-up age is abandoned to the assembled path so the cursor can't stick.
- Off switch.
BACKSCATTER_LIVE_CHUNKS(default on). Off = exactly the prior assembled-only behavior — a clean fallback.
Consequences¶
- Displayed live latency drops from ~5–6 min to ~1–2 min. Measured live run: a live frame 0.6 min old while the freshest assembled volume was 6.0 min old (5.4 min closed).
- The archive still ends as the complete assembled volume for every scan_time — live rows are upgraded in place, never left as permanent partials, and never duplicated. A live partial is written to the canonical path so the upgrade overwrites it (no orphan).
- The live write is a fourth concurrent writer across the same
(site, scan_time)rows (collect-assembled, backfill, API edits, now live). WAL +busy_timeout+ the UNIQUE backstop already cover it; proven by a three-writer (live + reconcile + backfill) concurrency test — no duplicate rows, no lock errors, index intact. - Still keyless/anonymous, no new dependency (pyart only).
Alternatives considered¶
- Inline upgrade in
fetch_key(the originally-sketched approach): smaller change, but only upgrades the latest scan the assembled fetch happens to hit — intermediate live scans skipped by polling would stay partial forever. Rejected: it can't deliver the "every scan ends complete" guarantee. The dedicated sweep does, for the same cost. - A separate faster live poll loop/thread. Lower latency floor, but a second cadence to reason about and another thread against the same DB for marginal gain over the 30 s cycle. Deferred — revisit only if ~1–2 min proves not fresh enough.
- Serve the live frame from a side channel (not a
volumesrow). Would duplicate the timeline/serve logic and special-case the UI. Rejected: making it a normal row with a flag reuses everything and keeps one code path.