10. Web-triggered backfill: an in-process async job
Status
Accepted
Context
A brand-new install opens to an empty archive: collection has just started, so there
are no frames yet (Slice 18 makes that state honest with a "collecting now…" card).
The fix for "I don't want to wait ~5 minutes for the first frame" is backfill —
pull recent assembled volumes from S3 and render them into the archive. That capability
already exists as the backscatter backfill CLI (Slice 12: plan_backfill /
run_backfill, idempotent, dedupes on (site, scan_time)). This decision is about
exposing it behind a one-click web button safely.
Two hard problems:
- It can't run in the request. A backfill over hours of data is dozens of sequential download+decode+render steps, which adds up to minutes of blocking work. It must not run inside the HTTP request/response or block the event loop.
- Two writers, one SQLite DB, two processes. serve (FastAPI) and collect (the live loop) run as separate OS processes sharing /data/backscatter.db (docker-entrypoint launches both; in dev they're two terminals). A backfill started inside the serve process is a third writer in a different process from the collector. An in-process lock cannot coordinate across processes, so it's the wrong tool here.
Decision
Run the job on a daemon thread inside the serve process, tracked by an in-memory
JobManager singleton. No external queue/broker.
- POST /api/backfill (body: optional location, hours) starts a job and returns 202 with the job's id + initial status.
- GET /api/backfill/{id} polls status (state, total, fetched/rendered/skipped/already_have, error).
- GET /api/backfill returns the current/last job so the UI can restore progress on reload.
- The job runs on one threading.Thread(daemon=True) per start. The worker owns its own SQLite connection (by default, sqlite3 connections are bound to their creating thread via check_same_thread) and its own unsigned S3 client (boto3 sessions and resources are not thread-safe, so the worker shares nothing S3-related with request handlers), both built inside the worker and closed when it ends.
- One job at a time. A concurrent start (double-click, two tabs) is rejected with 409 carrying the running job's status. The manager keeps a single slot holding the running job, then the last-finished one.
- The job reuses plan_backfill / run_backfill unchanged except for one additive progress_cb kwarg that feeds live counts to the manager. No pipeline is reimplemented; the same dedupe, skip-on-bad-volume, and per-volume render apply.
- Range is bounded: a click backfills the last 6 hours; the request is hard-capped at 24 hours server-side (400 over cap). 24h sits well inside the default 30-day retention window, so no prune warning is needed (the CLI, which allows arbitrary ranges, keeps its older_than_retention warning).
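The single-slot manager and the 202/409/400 responses described in the list above can be sketched without any web framework. This is a minimal illustration under stated assumptions, not the project's actual JobManager: the Job fields, the state names, the run_job callable, the post_backfill helper, and the rejection of non-positive hours are all hypothetical.

```python
import threading
import uuid
from dataclasses import asdict, dataclass, field
from typing import Optional

DEFAULT_HOURS = 6   # what a plain click backfills
MAX_HOURS = 24      # server-side hard cap


@dataclass
class Job:
    id: str
    hours: int
    state: str = "running"                      # assumed states: running | done | failed
    counts: dict = field(default_factory=dict)  # latest values from progress_cb
    error: Optional[str] = None


class JobManager:
    """Single slot: holds the running job, then the last finished one."""

    def __init__(self, run_job):
        self._run_job = run_job          # hypothetical callable(hours, progress_cb)
        self._lock = threading.Lock()    # guards the slot only, never the slow work
        self._job = None
        self._thread = None

    def start(self, hours):
        """Return (status_dict, started); started is False if a job is running."""
        with self._lock:
            if self._job is not None and self._job.state == "running":
                return asdict(self._job), False
            job = Job(id=uuid.uuid4().hex, hours=hours)
            self._job = job
            self._thread = threading.Thread(
                target=self._work, args=(job,), daemon=True
            )
            self._thread.start()
            return asdict(job), True

    def _work(self, job):
        def progress_cb(**counts):
            with self._lock:
                job.counts = dict(counts)

        try:
            self._run_job(job.hours, progress_cb)
            state, error = "done", None
        except Exception as exc:         # surface the failure through the status poll
            state, error = "failed", str(exc)
        with self._lock:
            job.state, job.error = state, error

    def snapshot(self, job_id=None):
        """Status of the slot's job; None if the slot is empty or the id is unknown."""
        with self._lock:
            job = self._job
            if job is None or (job_id is not None and job.id != job_id):
                return None
            return asdict(job)


def post_backfill(manager, hours=DEFAULT_HOURS):
    """Framework-free stand-in for POST /api/backfill; returns (status_code, body)."""
    if not 0 < hours <= MAX_HOURS:
        return 400, {"error": f"hours must be in 1..{MAX_HOURS}"}
    status, started = manager.start(hours)
    return (202 if started else 409), status
```

The lock protects only the slot and the job's status fields. The slow work runs outside it, so a status poll never waits on a download or render.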
Concurrent writes are made safe by SQLite itself, not an app-level lock. WAL +
busy_timeout + the UNIQUE(site, scan_time) dedupe — the exact mechanism the
collector already relies on:
- WAL: readers (API frame queries) never block writers, and vice-versa. Only writer-vs-writer contends.
- Writes are millisecond-scale, spaced seconds apart. In both collector and
backfill the slow work (download + Py-ART render) happens before any DB touch; the
actual writes are single-statement INSERT/UPDATE commits. No multi-statement
transaction is ever held open across the slow work, so the contended critical section
is sub-millisecond.
busy_timeout (raised 5s → 15s for three-writer headroom) cannot realistically be exhausted.
- Same-key race (backfill "last 6h" overlapping what collect just pulled): a volume_exists pre-check plus the UNIQUE backstop; the losing INSERT raises IntegrityError, which run_backfill's per-volume try/except already absorbs.
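The three mechanisms in the list above (WAL, busy_timeout, and the UNIQUE backstop behind a pre-check) can be sketched with the standard sqlite3 module. The table name, column names, and the helpers open_db, insert_volume, and record_volume are assumptions for illustration; only volume_exists and UNIQUE(site, scan_time) come from the text, and the real schema may differ.

```python
import sqlite3


def open_db(path, busy_timeout_ms=15_000):
    """Open a connection with WAL and a busy timeout; one per thread or process."""
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA journal_mode=WAL")
    conn.execute(f"PRAGMA busy_timeout={int(busy_timeout_ms)}")
    # Hypothetical schema: only the UNIQUE(site, scan_time) constraint is from the text.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS volumes ("
        " site TEXT NOT NULL,"
        " scan_time TEXT NOT NULL,"
        " path TEXT,"
        " UNIQUE(site, scan_time))"
    )
    conn.commit()
    return conn


def volume_exists(conn, site, scan_time):
    """Cheap pre-check so most duplicates never reach the INSERT."""
    row = conn.execute(
        "SELECT 1 FROM volumes WHERE site = ? AND scan_time = ?",
        (site, scan_time),
    ).fetchone()
    return row is not None


def insert_volume(conn, site, scan_time, path):
    """Single-statement write; returns False if another writer won the race."""
    try:
        with conn:  # short transaction: commit on success, rollback on error
            conn.execute(
                "INSERT INTO volumes (site, scan_time, path) VALUES (?, ?, ?)",
                (site, scan_time, path),
            )
        return True
    except sqlite3.IntegrityError:
        return False  # UNIQUE backstop fired; the row is already there


def record_volume(conn, site, scan_time, path):
    """Call only after the slow download and render work has finished."""
    if volume_exists(conn, site, scan_time):
        return False
    return insert_volume(conn, site, scan_time, path)
```

Two connections opened on the same file stand in for the collector and the backfill worker: whichever calls record_volume second for the same (site, scan_time) gets False, whether the pre-check or the constraint catches it.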
Job state lives only in memory and is lost on process restart. That's acceptable: a backfill is idempotent, so an interrupted run is simply re-run and re-skips whatever already landed.
Consequences
- No new runtime dependency, no broker, no extra process — fits the LAN-first, single-container model.
- A restart mid-backfill loses the job's status (and stops the run), never data. The UI treats a 404 on a previously-known job id as "lost on restart → just re-run."
- The "one job at a time" guarantee is per serve process. backscatter serves single-process (uvicorn.run, no --workers); if anyone runs multiple workers, each gets its own manager and the guard weakens to one-per-worker. The DB-level safety still holds (it's the same as collector + backfill). Keep serve single-process, or move the guard to a DB-backed lock.
- The progress bar updates every progress_every (25) volumes plus once at the end, which is coarse but fine for a ≤24h range.
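The progress cadence in the last point (a report every progress_every volumes plus one at the end) might look like the loop below. This is a sketch of the throttling idea only, not run_backfill itself: the function name, the process callable, and the count keys are assumptions.

```python
def run_with_progress(volumes, process, progress_cb=None, progress_every=25):
    """Process volumes in order, reporting counts every N volumes and at the end."""
    counts = {"total": len(volumes), "rendered": 0, "skipped": 0}
    for i, volume in enumerate(volumes, start=1):
        try:
            process(volume)              # hypothetical download + decode + render step
            counts["rendered"] += 1
        except Exception:
            counts["skipped"] += 1       # skip-on-bad-volume: keep going
        if progress_cb is not None and i % progress_every == 0:
            progress_cb(dict(counts))    # pass a copy so the caller can keep it
    if progress_cb is not None:
        progress_cb(dict(counts))        # final report, even if the last batch was partial
    return counts
```

For a range of 60 volumes this produces three reports: after volume 25, after volume 50, and once at the end.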
Alternatives considered
- FastAPI BackgroundTasks: fire-and-forget tied to a request, with no pollable handle and no one-at-a-time guard. Wrong shape for a start + status-poll job.
- asyncio.create_task + run_in_executor: viable but pulls the job's lifecycle into the event loop and default threadpool, entangling it with request handling for no benefit over a plain daemon thread.
- Subprocess (spawn backscatter backfill): cleanest isolation (a 4th process, no check_same_thread concern), but the two-writer DB safety makes that isolation unnecessary, and there is no in-process status to poll without parsing stdout or a status file. Kept as the escape hatch if render ever needs hard memory isolation.
- External task queue (Celery/RQ/Redis): overkill for one-at-a-time work on a home server; violates the no-extra-infra spirit.
- App-level write lock shared between the backfill thread and request handlers: ineffective, because it can't coordinate against the collector (a different process) and risks holding a Python lock across a multi-second render if mis-scoped. Rejected in favor of SQLite's own file locking.