Architecture Decision Records¶

This document records the significant architectural decisions behind Iris and the reasoning that produced them. Each ADR is intentionally short and follows a Context / Decision / Consequences format, with the consequences section listing real trade-offs — not just upsides.

The constraint that frames almost every decision: Iris targets a single box with one NVIDIA T4 (16 GB), self-hosted, co-located with other production workloads. That constraint is what makes the "obvious" answer wrong as often as not.

ADR-001: Face-anchored identity, appearance as a within-session helper¶

Status: Accepted

Context¶

Cross-camera re-identification is usually built appearance-first: a person/vehicle ReID embedding (clothing, body shape, colour) is the primary key, and identities are clustered on it. That works in academic benchmarks where the gallery and query are captured minutes apart, in the same clothes, often the same day.

In a real deployment it fails in two directions:

It under-merges across time. Clothing is not an identity. The same person on Tuesday and Thursday produces two appearance clusters, so a system keyed on appearance cannot answer "is this the same person as last week?"
It over-creates within a scene. A single person walking away from a camera — back of the head, then a side view, then a back view again — produces a stream of appearance crops that don't match each other well. An appearance-first creator happily mints a new identity for each, and one person becomes dozens of duplicates. This "back-view explosion" is the single most destructive failure mode for an unattended VMS.

Decision¶

Anchor identity on the face, the only cue that is stable across clothing change, viewpoint, lighting and days. The face pipeline is SCRFD-10G detection → 5-landmark affine alignment → ArcFace 512-d embedding (insightface buffalo_l).

A person is registered as a new identity only when a decent face is seen, gated on a det_score × frontalness quality score (reid_face_exemplar_min_quality, default 0.35). A faceless crop — a back or side view — can attach to an existing identity by appearance within a session, but can never spawn a new person identity (require_face_for_new_person, default true; see IdentityManager._quality_ok_for_new in app/reid/manager.py).

Appearance (OSNet-AIN x1.0, MSMT17, exported to ONNX) is kept strictly as a within-session helper: it is time-windowed (reid_app_window_seconds, default 600 s) and exponentially decayed (reid_app_decay_tau_seconds, default 12 h) so it links sightings minutes apart but is not trusted to link across days. Matching is class-scoped (Identity.object_class) and gated by a best-minus-second-best margin.
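The gates described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the code in app/reid/manager.py: the function names, the way the window and the decay are combined, and the margin value of 0.05 are hypothetical; only the defaults 0.35, 600 s and 12 h come from the text.

```python
import math
from typing import Optional, Sequence, Tuple

FACE_MIN_QUALITY = 0.35         # reid_face_exemplar_min_quality (default from the text)
APP_WINDOW_S = 600.0            # reid_app_window_seconds (default from the text)
APP_DECAY_TAU_S = 12 * 3600.0   # reid_app_decay_tau_seconds (default from the text)
MARGIN = 0.05                   # hypothetical best-minus-second-best margin


def may_create_person(face: Optional[Tuple[float, float]]) -> bool:
    """face is (det_score, frontalness), or None for a faceless crop."""
    if face is None:
        return False  # a back or side view can never spawn a person identity
    det_score, frontalness = face
    return det_score * frontalness >= FACE_MIN_QUALITY


def appearance_weight(age_s: float) -> float:
    """Assumed weighting: zero outside the window, exponential decay inside it."""
    if age_s > APP_WINDOW_S:
        return 0.0
    return math.exp(-age_s / APP_DECAY_TAU_S)


def passes_margin(scores: Sequence[float]) -> bool:
    """Accept the best candidate only if it clearly beats the runner-up."""
    ranked = sorted(scores, reverse=True)
    if not ranked:
        return False
    if len(ranked) == 1:
        return True  # assumption: a sole candidate has no rival to beat
    return ranked[0] - ranked[1] >= MARGIN
```

With these defaults a face with det_score 0.9 and frontalness 0.8 scores 0.72 and may enrol, while the same detection at frontalness 0.3 scores 0.27 and may not.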

Consequences¶

+ Eliminates back-view duplicate explosion by construction: a non-identifiable view cannot create an identity.
+ Cross-day re-identification works, because the anchor (face embedding) is the same vector on Tuesday and Thursday.
+ Honest about its own limits (see ADR-002).
− Vehicles and other non-person objects have no face. They fall back to appearance + a colour gate + class scope, which is genuinely weaker — two identical white sedans can merge. We accept this; the system is correct under its constraints rather than uniformly confident.
− A person who is never seen face-on at a given camera is simply not enrolled there. This is intentional, but it means coverage depends on camera placement.

ADR-002: Treat "no biometric in the frame" as a correctness requirement, not a bug¶

Status: Accepted

Context¶

There is constant pressure — from users and from one's own engineering pride — to "identify everyone who appears." A camera mounted above a doorway often only ever sees the back of a head. No re-ID method, ours or anyone's, can extract a stable cross-day identity from a crop that contains no face.

The tempting move is to lean harder on appearance and pretend a back-view crop is identifiable. That is exactly what produces fabricated and duplicate identities.

Decision¶

Make non-fabrication an explicit design goal. When the only evidence is a non-identifiable view, the system drops the crop rather than minting or guessing an identity (_create_new returns match_kind="dropped"). Provisional, faceless single-sightings are created only for non-person objects (which legitimately have no face) and are reaped by the maintenance pass if they never accrue a second sighting (reid_provisional_grace_seconds).
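As a rough sketch of that rule, the branch taken when a crop matches no existing identity might look like the following. Only match_kind="dropped" and the setting names are taken from the text; the Crop type, the "new" and "provisional" labels, and the should_reap helper are hypothetical names used for illustration.

```python
from dataclasses import dataclass
from typing import Optional

MIN_FACE_QUALITY = 0.35  # reid_face_exemplar_min_quality default (ADR-001)


@dataclass
class Crop:
    object_class: str               # e.g. "person" or "car"
    face_quality: Optional[float]   # None when no face was found in the crop


def create_new(crop: Crop) -> str:
    """Return a match_kind for a crop that matched no existing identity."""
    if crop.object_class == "person":
        good_face = (
            crop.face_quality is not None
            and crop.face_quality >= MIN_FACE_QUALITY
        )
        # A faceless or low-quality person crop is dropped, never guessed.
        return "new" if good_face else "dropped"
    # Non-person objects legitimately have no face: provisional until re-seen.
    return "provisional"


def should_reap(sightings: int, age_seconds: float, grace_seconds: float) -> bool:
    """Reap a provisional identity that never accrued a second sighting.

    grace_seconds stands in for reid_provisional_grace_seconds; the text
    does not state its default, so the caller must supply it.
    """
    return sightings < 2 and age_seconds > grace_seconds
```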

Consequences¶

+ No phantom identities, no duplicate-per-angle inflation, no false "matches" presented to an operator as fact.
+ The data model stays trustworthy: every person identity is backed by at least one real face exemplar.
− Recall is bounded by camera geometry. A back-only camera contributes recordings and tracks but few or no identities. This is communicated as a placement consideration, not hidden.

ADR-003: One OS process per camera, not async tasks in one process¶

Status: Accepted

Context¶

Each camera needs: an RTSP decode loop, GPU inference (detection + face + ReID), a rolling ffmpeg recorder, JPEG encoding, and DB writes. The two obvious topologies are (a) a single async process running every camera as a coroutine, or (b) one OS process per camera.

The hot path is CPU- and GIL-bound (OpenCV decode, NumPy, JPEG encode, ONNX pre/post-processing), so asyncio would not actually parallelise it; one slow camera would stall the event loop for all of them. A crashing decoder (corrupt stream, ffmpeg wedge) in a shared process risks every camera.

Decision¶

Spawn one subprocess per enabled camera using the multiprocessing "spawn" start method (WorkerManager, app/workers/manager.py; CameraWorker, app/workers/camera_worker.py). spawn (not fork) is required so each child gets a clean CUDA context — forking a process that has touched CUDA is unsafe. The manager reconciles running workers against the DB on a timer (worker_reconcile_seconds), restarting crashed workers and stopping orphaned ones. Components (detector, recognizer, embedder, gallery) are built inside the child so all import/CUDA cost is paid past the spawn boundary.
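A minimal sketch of the spawn-and-reconcile pattern, assuming a manager that holds a dict of camera id to process. The names plan_reconcile, reconcile, camera_worker and main are illustrative and are not the actual WorkerManager API; the timer value is a placeholder for worker_reconcile_seconds.

```python
import multiprocessing as mp
import time
from typing import Dict, Iterable, Set, Tuple


def plan_reconcile(enabled: Iterable[int], running: Iterable[int]) -> Tuple[Set[int], Set[int]]:
    """Pure diff: which cameras to start and which workers to stop."""
    enabled_set, running_set = set(enabled), set(running)
    return enabled_set - running_set, running_set - enabled_set


def camera_worker(camera_id: int) -> None:
    # Heavy imports and model construction would go here, inside the child,
    # so the parent never touches CUDA before spawning.
    pass


def reconcile(ctx, workers: Dict[int, object], enabled: Iterable[int]) -> None:
    # Forget workers that have died so the diff below restarts them.
    for cam_id in [c for c, p in workers.items() if not p.is_alive()]:
        workers.pop(cam_id).join()
    to_start, to_stop = plan_reconcile(enabled, workers)
    for cam_id in to_stop:
        proc = workers.pop(cam_id)
        proc.terminate()
        proc.join()
    for cam_id in to_start:
        proc = ctx.Process(target=camera_worker, args=(cam_id,), daemon=True)
        proc.start()
        workers[cam_id] = proc


def main() -> None:
    ctx = mp.get_context("spawn")  # spawn, not fork: clean CUDA context per child
    workers: Dict[int, object] = {}
    for _ in range(3):
        reconcile(ctx, workers, enabled={1, 2})  # enabled ids would come from the DB
        time.sleep(1.0)  # placeholder for worker_reconcile_seconds
```

Because the start method is spawn, main() should be called under an `if __name__ == "__main__":` guard in a real script so that children can re-import the module safely.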

Consequences¶

+ True parallelism across cameras; no shared GIL on the hot path.
+ Fault isolation: a wedged decoder or an OOM-killed worker takes down one camera, and the manager restarts it. The API process and other cameras are unaffected.
+ Each worker has its own DB connection and its own derived in-memory state (gallery, FAISS index), re-synced from the shared SQLite DB on a timer so workers converge.
− Higher baseline memory: each worker loads its own copy of the models. On a 16 GB T4 this caps practical camera count, and the ONNX models are deliberately small (ADR-006) partly for this reason.
− Cross-worker state is eventually consistent (gallery reload every reid_gallery_reload_seconds, default 30 s), so an identity minted on camera A is visible to camera B after a short lag rather than instantly.

ADR-004: Drop-to-latest decode + off-loop persistence¶

Status: Accepted

Context¶

Within a worker, the danger is the analysis pass falling behind real time. If frames are queued, a momentarily slow detection pass builds an ever-growing backlog and the live view drifts seconds into the past. Separately, the hot path must never block on slow I/O: ffmpeg clip assembly, JPEG encoding of thumbnails, and fsync on DB commits are each tens to hundreds of milliseconds.

Decision¶

A three-stage threaded pipeline inside each worker (CameraWorker._decode_loop / _loop / drain threads):

Decode thread (drop-to-latest). Owns the RTSP VideoCapture and keeps only the newest frame under a Condition; an unconsumed frame is overwritten, never queued. The capture is opened with CAP_PROP_BUFFERSIZE=1 and RTSP-over-TCP. So under load the system drops stale frames instead of accumulating latency.
Main loop. Consumes the latest frame, runs detection + tracking + re-ID. If it is slow, it simply processes fewer, newer frames — it never falls behind.
Off-loop drain threads. A clip-assembly thread and a persist thread drain bounded queues; all thumbnail/face-sample JPEG writes and their DB commits, plus all ffmpeg clip assembly, happen here, batched into one transaction per drain. The detect loop only enqueues.

Two further bounds keep the loop honest: adaptive detection cadence (full rate while objects are present, throttled to detect_interval_idle after the scene stays empty for active_grace_seconds) and a per-frame re-ID cap (max_reid_per_frame, default 4; fresh/unassigned tracks prioritised) so a crowd cannot stall the loop.
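The drop-to-latest hand-off between the decode thread and the main loop can be illustrated with a single-slot buffer guarded by a Condition. This is a sketch of the technique only; the class name LatestFrameSlot and its dropped counter are assumptions, not the CameraWorker implementation.

```python
import threading
from typing import Any, Optional


class LatestFrameSlot:
    """Holds at most one frame; a newer frame overwrites an unconsumed one."""

    def __init__(self) -> None:
        self._cond = threading.Condition()
        self._frame: Optional[Any] = None
        self.dropped = 0  # frames overwritten before the consumer saw them

    def put(self, frame: Any) -> None:
        """Called by the decode thread for every decoded frame."""
        with self._cond:
            if self._frame is not None:
                self.dropped += 1  # stale frame is discarded, never queued
            self._frame = frame
            self._cond.notify()

    def get(self, timeout: Optional[float] = None) -> Optional[Any]:
        """Called by the main loop; returns the newest frame or None on timeout."""
        with self._cond:
            if not self._cond.wait_for(lambda: self._frame is not None, timeout):
                return None
            frame, self._frame = self._frame, None
            return frame
```

If the producer calls put three times before the consumer calls get once, the consumer receives only the third frame and dropped is 2, which is the behaviour the decode thread relies on to keep latency bounded.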

Consequences¶

+ Live latency stays bounded regardless of analysis cost; the newest frame always wins.
+ The hot path never blocks on ffmpeg, JPEG encode, or fsync: those run on the drain threads, which also serialise DB writes (avoiding SQLite write contention within the worker).
+ GPU/CPU/disk drop sharply on idle cameras.
− Dropped frames mean detection is not run on every decoded frame. For event recording this is fine (the clip is stream-copied from the segment buffer, ADR-005, and is unaffected by detection cadence), but a very brief, fast-moving object that passes between detection passes on a throttled idle camera could be missed. Cadence defaults are chosen so that detection returns to full rate as soon as an object is detected in the scene.
− Bounded queues mean that under sustained overload, thumbnails/clips can be dropped (logged) rather than blocking — a deliberate choice to protect the loop. There is an inline fallback for the persist queue to avoid data loss in the common case.

ADR-005: Track-mode recording from a warm stream-copy segment buffer¶

Status: Accepted

Context¶

Re-encoding video to cut clips is expensive and would compete with inference for the same cores. We also want a clip to begin before the triggering object appeared (pre-roll), which is impossible if recording starts only at trigger time.

Decision¶

Each worker runs a continuous ffmpeg -f segment recorder with -c:v copy (stream copy, no decode/encode — near-zero CPU) writing short rolling segments (segment_seconds, default 2 s) to disk, pruned to a retention window (segment_retention_seconds, default 120 s). A stall watchdog restarts ffmpeg with backoff if it wedges (app/recording/segmenter.py).

Recording mode is track: one Event per presence, with the clip spanning [enter − pre, last + post], assembled by concatenating the relevant on-disk segments after the object leaves view. Because pre-roll already sits in the buffer, the clip starts before the trigger. Manual recording (the ● REC button in Live Monitoring) uses the same buffer with a stateless start/stop: the server returns a trusted start timestamp, the client echoes it back on stop, and the clip is concatenated from the buffer — no worker round-trip, no recording state held server-side (app/api/live.py).
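To make the clip window concrete, the sketch below picks the on-disk segments that overlap [enter − pre, last + post] and builds an input list for ffmpeg's concat demuxer. It assumes each segment is known by its start timestamp and lasts segment_seconds; the function names, the file names and the use of the concat demuxer are assumptions for illustration, since the text does not say how app/recording/segmenter.py concatenates.

```python
from typing import Iterable, List, Tuple

SEGMENT_SECONDS = 2.0  # segment_seconds default from the text


def segments_for_clip(
    segments: Iterable[Tuple[float, str]],  # (start_timestamp, path)
    enter_ts: float,
    last_ts: float,
    pre_s: float,
    post_s: float,
    segment_s: float = SEGMENT_SECONDS,
) -> List[str]:
    """Return paths of segments overlapping [enter - pre, last + post], in order."""
    start, end = enter_ts - pre_s, last_ts + post_s
    return [
        path
        for seg_start, path in sorted(segments)
        if seg_start < end and seg_start + segment_s > start
    ]


def concat_list(paths: Iterable[str]) -> str:
    """Text of a list file usable with `ffmpeg -f concat -i list.txt -c copy out.mp4`."""
    return "".join(f"file '{p}'\n" for p in paths)
```

Whole segments are selected, so the clip boundaries snap outward to segment edges rather than to exact frames.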

Consequences¶

+ Recording cost is dominated by disk writes, not CPU; inference keeps the cores.
+ Genuine pre-roll for free, since the buffer is always warm.
+ Manual record is stateless and survives worker restarts (footage is already on disk).
− Stream-copy means clips inherit the camera's GOP/keyframe structure; cut points snap to segment boundaries (±segment_seconds), not to exact frames. Acceptable for review footage.
− Retention is a fixed rolling window per camera; a clip can only reach as far back as the buffer holds (the manual endpoint clamps its request to retention − 5 s).

ADR-006: Small ONNX models on onnxruntime-gpu, sharing one T4¶

Status: Accepted

Context¶

One T4 (16 GB) must host detection, face detect+embed, and appearance ReID, for several cameras, and leave headroom because the box is shared with other production workloads. Large/heavy ReID backbones or a per-camera DeepStream pipeline would not fit, and the appetite to "use the biggest model available" has to be resisted.

Decision¶

Standardise on compact models run through onnxruntime-gpu with the CUDAExecutionProvider:

Detection: YOLOv8n exported to ONNX. Running it on the GPU costs roughly one CPU core per camera versus ~7 on CPU — the GPU offload is the whole point.
Face: insightface SCRFD-10G + ArcFace (buffalo_l).
Appearance: OSNet-AIN x1.0 (domain-generalisable, MSMT17) exported to ONNX, 512-d.

onnxruntime-gpu lets multiple sessions share the single GPU context cleanly. A subtle but critical build detail: insightface hard-depends on the CPU onnxruntime wheel, which installs into the same package directory and shadows onnxruntime-gpu, silently dropping everything back to CPU. The Dockerfile uninstalls the CPU build and force-reinstalls the GPU build to keep the CUDAExecutionProvider (Dockerfile). Every model degrades gracefully to CPU if the GPU provider is genuinely unavailable (_resolve_providers in app/reid/embedder.py), and the whole ReID/face layer degrades to "events without identities" if models are missing — it never crashes the loop.
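The graceful fallback can be sketched as a small provider-selection function. This is an illustration of the idea, not the actual _resolve_providers in app/reid/embedder.py; the function signatures here are assumptions. The onnxruntime calls used in open_session (get_available_providers and InferenceSession with a providers list) are part of the public onnxruntime Python API.

```python
from typing import List, Sequence

CUDA = "CUDAExecutionProvider"
CPU = "CPUExecutionProvider"


def resolve_providers(available: Sequence[str], prefer_gpu: bool = True) -> List[str]:
    """Prefer CUDA when the installed onnxruntime build offers it, else CPU only."""
    if prefer_gpu and CUDA in available:
        # CPU stays in the list as a fallback for anything CUDA cannot run.
        return [CUDA, CPU]
    return [CPU]


def open_session(model_path: str):
    """Open an inference session; onnxruntime is imported lazily, inside the worker."""
    import onnxruntime as ort

    providers = resolve_providers(ort.get_available_providers())
    return ort.InferenceSession(model_path, providers=providers)
```

If the CPU wheel has shadowed onnxruntime-gpu, get_available_providers() will not list CUDAExecutionProvider, so logging the resolved list at startup is a cheap way to notice the silent fallback described above.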

Consequences¶

+ All three model families fit on one T4 with room for several per-camera workers, plus headroom for the co-hosted workload.
+ A single inference runtime (onnxruntime) for detection and appearance simplifies deployment and device selection.
− YOLOv8n and OSNet are accuracy/throughput compromises; a heavier detector/backbone would recall more but would not fit the budget. The model paths are configurable, so a deployment with more GPU can swap in larger variants.
− An optional DeepStream backend exists for higher throughput, but it adds significant operational complexity and is off by default.

ADR-007: SQLite (WAL) as the single store, not a server database¶

Status: Accepted

Context¶

The system is multi-process (one API process plus N worker subprocesses) all reading and writing the same metadata: cameras, events, identities, sightings, exemplars. A server DB (Postgres/MySQL) would handle concurrent writers cleanly but adds a service to run, secure, back up, and keep alive — on a single box where "fewer moving parts" is itself a feature.

Decision¶

Use SQLite in WAL mode as the one source of truth (app/db/database.py), with derived in-memory state (the FAISS face index and the identity gallery) rebuilt from it per worker and re-synced on a timer. Each connection sets pragmas tuned for this access pattern: journal_mode=WAL (readers don't block the writer), synchronous=NORMAL (durable enough with WAL, far faster than FULL), foreign_keys=ON (FK cascades are enforced), and busy_timeout=5000 (a worker waits rather than erroring on a momentary write lock). Workers further serialise their own writes onto drain threads (ADR-004) and keep transactions short, so write contention stays low. Schema evolution is handled by idempotent PRAGMA table_info-guarded ALTER TABLE shims rather than a migration framework.
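A minimal sketch of the per-connection pragmas and a table_info-guarded shim, using only the standard sqlite3 module. The helper names, the events table, the note column and the iris.db file name are illustrative and are not taken from app/db/database.py.

```python
import os
import sqlite3
import tempfile


def connect(path: str) -> sqlite3.Connection:
    """Open a connection with the pragmas listed above."""
    conn = sqlite3.connect(path, timeout=5.0)
    conn.execute("PRAGMA journal_mode=WAL")    # readers do not block the writer
    conn.execute("PRAGMA synchronous=NORMAL")  # fewer fsyncs than FULL
    conn.execute("PRAGMA foreign_keys=ON")     # enforce FK cascades
    conn.execute("PRAGMA busy_timeout=5000")   # wait up to 5 s on a write lock
    return conn


def add_column_if_missing(conn: sqlite3.Connection, table: str, column: str, decl: str) -> bool:
    """Idempotent shim: add the column only if table_info does not list it.

    table, column and decl must be trusted constants; they are interpolated
    into SQL because identifiers cannot be bound as parameters.
    """
    existing = {row[1] for row in conn.execute(f"PRAGMA table_info({table})")}
    if column in existing:
        return False
    conn.execute(f"ALTER TABLE {table} ADD COLUMN {column} {decl}")
    return True


def demo() -> tuple:
    """Run the shim twice against a throwaway database file."""
    with tempfile.TemporaryDirectory() as tmp:
        conn = connect(os.path.join(tmp, "iris.db"))
        try:
            mode = conn.execute("PRAGMA journal_mode").fetchone()[0]
            conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY)")
            first = add_column_if_missing(conn, "events", "note", "TEXT")
            second = add_column_if_missing(conn, "events", "note", "TEXT")
        finally:
            conn.close()
    return mode, first, second
```

Running the shim a second time is a no-op, which is what makes it safe to execute at every startup.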

Consequences¶

+ Zero-ops: no DB server to run, secure, or monitor; the entire state is a single file (plus media) that backs up by copying a directory.
+ WAL gives concurrent readers + one writer, which matches the workload (each worker writes its own rows; reads are plentiful).
+ Fewer attack-surface services on a shared box.
− SQLite has a single writer. The architecture is built around that (short transactions, batched commits on drain threads, busy_timeout backoff), but it is a real ceiling: this does not scale to dozens of high-traffic cameras hammering writes simultaneously. At that point a server DB would be the right call.
− Hand-rolled ALTER TABLE shims are simpler than Alembic but require discipline; they are idempotent and run at startup before any worker writes.

ADR-008: Fail-closed auth behind an nginx cookie-SSO gateway¶

Status: Accepted

Context¶

This is a video surveillance system: unauthenticated access is unacceptable. We already run an nginx cookie-SSO gateway on the box for other services, which is the natural place to terminate TLS and validate sessions. The app should not re-implement session management, but it also must not be trivially spoofable.

Decision¶

Bind the app to loopback only (127.0.0.1:8120, published as such by compose) and reach it solely through nginx. nginx validates the SSO cookie (auth_request) and injects a trusted identity header (default X-Email), overriding any client-supplied value — so a request that reaches the app with that header is, by construction, authenticated by the gateway and cannot be spoofed from outside.

Authentication is fail-closed: auth_required defaults to true (app/config.py), so a request with neither the SSO header nor a valid API key is rejected with 401 (require_user in app/auth.py). A second credential path — Authorization: Bearer <API_KEY> — exists for SSH-tunnel/CLI use that bypasses nginx; the key is compared in constant time with hmac.compare_digest to avoid timing leaks. Only when auth_required is explicitly disabled (local dev) does the dependency degrade to an anonymous principal.
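The fail-closed order of checks can be sketched without any web framework. This is an assumption-laden illustration, not the require_user dependency in app/auth.py: the Unauthorized exception, the plain-dict headers argument and the "api-key" and "anonymous" principal names are hypothetical.

```python
import hmac
from typing import Mapping


class Unauthorized(Exception):
    """Stands in for an HTTP 401 response."""


def require_user(
    headers: Mapping[str, str],
    api_key: str,
    auth_required: bool = True,
    identity_header: str = "X-Email",
) -> str:
    # 1. Identity injected by the gateway (trusted only because of the loopback bind).
    email = headers.get(identity_header)
    if email:
        return email
    # 2. Bearer API key, compared in constant time to avoid timing leaks.
    scheme, _, token = headers.get("Authorization", "").partition(" ")
    if api_key and scheme.lower() == "bearer" and hmac.compare_digest(
        token.encode(), api_key.encode()
    ):
        return "api-key"
    # 3. Anonymous access only when auth is explicitly disabled (local dev).
    if not auth_required:
        return "anonymous"
    raise Unauthorized("missing SSO header or valid API key")
```

An empty configured api_key never matches, so leaving the key unset does not open the bearer path.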

Consequences¶

+ No session/credential code duplicated in the app; nginx owns TLS and SSO.
+ Header-injection spoofing is mitigated by nginx overriding the client header and the app binding to loopback (the header is only trustworthy because nothing but nginx can reach the port).
+ Fail-closed by default: a misconfiguration tends toward "locked out," not "wide open."
− Security depends on the deployment honouring the contract: the app must not be exposed on a public interface, because anything that can reach the port can set the trusted header. This is documented and enforced by the loopback bind + compose port pinning, but it is a deployment-time invariant, not a code-time one.
− The bearer key is a single shared secret (no per-user scoping); it is intended for operator/CLI access, not multi-tenant use.

ADR-009: Non-root container with world-accessible GPU device nodes¶

Status: Accepted

Context¶

The container processes untrusted input: RTSP streams from cameras, uploaded enrollment images, and operator-supplied strings. A container running as root multiplies the blast radius of any write primitive or RCE. But GPU workloads are commonly run as root because that is the path of least resistance for device access.

Decision¶

Run the container as the unprivileged host uid (1000:1000) with cap_drop: [ALL] and no-new-privileges: true, and mount the application source read-only (./app:/app/app:ro) so the container can never rewrite its own code (docker-compose.yml). The GPU still works because the NVIDIA device nodes (/dev/nvidia*) are world-accessible — no root or added capability is needed to use them. Defence-in-depth across the surfaces this image touches:

No pickle for vectors. Embeddings are (de)serialised with numpy.frombuffer, not pickle, eliminating a deserialisation-RCE class on the most-written hot data.
ffmpeg sandboxing. Both the recorder and the HLS transcoder pass an explicit -protocol_whitelist (rtsp/rtp/tcp/tls/… only, no file) plus rtsp:// scheme validation, blocking SSRF and local-file reads via crafted stream URLs (app/recording/segmenter.py, app/recording/hls.py).
Path-traversal containment. Every endpoint that serves a file under data/ checks os.path.commonpath against the data root before opening it (events, identities, people, face-groups APIs).
Upload guard. A decompression-bomb (pixel-count) guard on enrollment uploads.
Secrets masking. RTSP credentials are masked in every API response; the raw URL never leaves the server (app/api/cameras.py).
Bounded bodies. Request bodies and id-lists/strings have max-length limits.
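The path-traversal check in the list above can be sketched with os.path.commonpath. The helper name is_within and the example paths are hypothetical; resolving both paths with realpath before comparing is an added assumption, so that symlinks and ".." segments cannot escape the data root.

```python
import os


def is_within(root: str, candidate: str) -> bool:
    """True if candidate, resolved relative to root, stays under root."""
    root_real = os.path.realpath(root)
    # join() discards root_real when candidate is absolute, so "/etc/passwd"
    # is resolved as itself and then rejected by the comparison below.
    cand_real = os.path.realpath(os.path.join(root_real, candidate))
    return os.path.commonpath([root_real, cand_real]) == root_real
```

An endpoint would call this before opening the file and return a 404 or 403 when it is false.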

A hard cgroup mem_limit plus oom_score_adj ensure that the VMS can never trigger a host-wide OOM that would kill a co-hosted VM: if it exceeds its budget, a cgroup-scoped OOM kills one camera worker (which restarts), and the VMS is the preferred global victim.

Consequences¶

+ A write/RCE primitive inside the container cannot tamper with host-owned files, gain new privileges, or rewrite the app source.
+ GPU access with zero added privilege — the common "GPU needs root" assumption is simply not true for device-node access here.
+ The co-hosted production workload is protected from VMS memory blowups by the cgroup limits, not by hope.
− Read-only source and a fixed uid mean the container can only write under the bind-mounted data/ and models/ (owned by uid 1000 on the host); deployments must get host directory ownership right or writes fail.
− World-accessible GPU nodes are a host-level property; on a multi-tenant host this is a (pre-existing) consideration, but it is what makes unprivileged GPU use possible and is standard for NVIDIA container runtimes.
