Architecture

Iris is a self-hosted Video Management System (VMS) that runs cross-camera, face-anchored re-identification on a single NVIDIA T4 (16 GB). This document describes how the system is structured: the process and threading model, the end-to-end lifecycle of a frame, the data model, how the GPU is shared, the adaptive work-shaping that keeps the hot path real-time, the recording subsystem, and how the frontend is served.

The intended audience is an engineer who needs to reason about, operate, or extend the system. Everything below is traceable to source; concrete file and function references are given throughout.


1. System overview

There is a single FastAPI application process (the API process) and one spawned subprocess per enabled camera (the camera workers). The API process owns the HTTP surface, the SQLite database, the authoritative in-memory galleries, and the worker lifecycle. Each camera worker owns exactly one RTSP stream and runs the full detect → track → re-ID pipeline on it.

flowchart TB
    subgraph client["Browser (vanilla-JS SPA)"]
      UI["Live grid · History · Identities · People · Faces"]
    end

    nginx["nginx<br/>TLS · cookie-SSO (auth_request)<br/>injects X-Email, strips client copy"]
    client -->|HTTPS| nginx
    nginx -->|"127.0.0.1:8120"| api

    subgraph api["API process (FastAPI + uvicorn)"]
      routers["Routers: cameras · live · events ·<br/>identities · people · face-groups · system"]
      wm["WorkerManager<br/>(spawn / reconcile / status)"]
      gal["Authoritative IdentityGallery<br/>FaceIndex (FAISS)"]
      maint["ReID maintenance thread<br/>(decay · prune · merge)"]
      hls["HLS manager (on-demand)"]
      static["StaticFiles SPA mount"]
    end

    db[("SQLite (WAL)<br/>vms.db — single source of truth")]
    shm[["/dev/shm/vms_frames<br/>one JPEG slot per camera"]]
    seg[["data/segments/&lt;cam&gt;<br/>rolling ffmpeg buffer"]]

    api --- db
    wm -.spawn.-> w1
    wm -.spawn.-> w2

    subgraph w1["Camera worker N (subprocess, spawn)"]
      dec["decode thread<br/>(drop-to-latest)"]
      loop["main loop<br/>detect → track → re-ID"]
      clip["clip drain thread"]
      pers["persist drain thread"]
      ffmpeg["ffmpeg segmenter<br/>(-c:v copy)"]
      dec --> loop
      loop --> clip
      loop --> pers
    end
    w2["Camera worker M ..."]

    loop --> shm
    ffmpeg --> seg
    w1 --- db
    routers -->|read JPEG slot| shm
    routers -->|assemble manual clip| seg
    routers --- db

Cross-process state is deliberately minimal and uses the lowest-friction channel for each kind of data:

  • Metadata + media paths flow through the SQLite DB. The DB is the single source of truth (app/db/models.py); every in-memory structure (FAISS face index, identity gallery) is derived state rebuilt from it.
  • Live preview frames are passed as whole JPEG files via a per-camera frame slot on tmpfs (/dev/shm/vms_frames/cam_<id>.jpg), written atomically. This avoids the fragility of fixed-size shared_memory blocks for variable-size JPEGs while staying effectively as fast as shared memory (WorkerManager._resolve_frames_dir, read_frame).
  • Worker health (state, fps, last_seen, pid) is published into a multiprocessing.Manager dict-of-dicts (WorkerManager.status).
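The frame-slot channel described above can be sketched as a write-to-temp-then-rename pattern. This is a minimal illustration, not the project's code: the helper names write_frame_slot and read_frame_slot are hypothetical, and only the cam_<id>.jpg naming comes from the text.

```python
import os
import tempfile
from pathlib import Path
from typing import Optional


def write_frame_slot(frames_dir: Path, camera_id: int, jpeg_bytes: bytes) -> Path:
    """Publish one JPEG so a reader never observes a half-written file."""
    frames_dir.mkdir(parents=True, exist_ok=True)
    slot = frames_dir / f"cam_{camera_id}.jpg"
    # The temp file lives in the same directory, so the final rename stays
    # on one filesystem and is atomic on POSIX.
    fd, tmp_name = tempfile.mkstemp(
        dir=str(frames_dir), prefix=f".cam_{camera_id}.", suffix=".tmp"
    )
    try:
        with os.fdopen(fd, "wb") as fh:
            fh.write(jpeg_bytes)
        os.replace(tmp_name, slot)
    except BaseException:
        # Clean up the temp file if the write or rename failed.
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
    return slot


def read_frame_slot(frames_dir: Path, camera_id: int) -> Optional[bytes]:
    """Return the latest JPEG, or None if the worker has not written one yet."""
    try:
        return (frames_dir / f"cam_{camera_id}.jpg").read_bytes()
    except FileNotFoundError:
        return None
```

Because the slot is replaced rather than rewritten in place, a reader that opened the previous file keeps a consistent view of the old frame while the new one lands.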

2. Process and threading model

This is the core of the design, so it gets the deepest treatment. The guiding principle: the detection hot path must never block on I/O it does not strictly need — RTSP decode latency, ffmpeg clip assembly, JPEG encode, or fsync.

2.1 One subprocess per camera (spawn)

WorkerManager (app/workers/manager.py) spawns one OS process per enabled camera via multiprocessing.get_context("spawn"). Spawn — not fork — is mandatory because CUDA and onnxruntime contexts are not fork-safe; a forked child inherits a broken CUDA handle. The entrypoint _worker_entrypoint imports cv2/onnxruntime inside the child so the parent process never loads those heavy, GPU-touching libraries.

Process-per-camera buys:

  • Fault isolation. A camera with a wedged decoder, a corrupt stream, or a segfault in a native dependency takes down only its own worker. The manager's reconcile loop (_reconcile_loop, default every 20 s) and the cgroup OOM policy (see §9) restart it; the rest of the cameras and the API are unaffected.
  • A clean lifecycle boundary. Deleting or disabling a camera stops exactly one process. WorkerManager.sync is the architectural guarantee that running workers == enabled cameras: it starts workers for enabled cameras and terminates any orphaned worker whose camera is gone, so a deleted camera can never keep generating data.
  • True parallelism across cameras without the GIL: each worker is a separate interpreter.

Config crosses the spawn boundary as a plain, picklable dict (WorkerManager._camera_config) — per-camera tunables are resolved against global Settings defaults before the dict is built, so the worker needs the settings object only for model paths, not for every threshold.
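As a rough sketch of that resolution step, the function below merges nullable per-camera overrides with global defaults into a plain dict. The Settings and CameraRow classes are hypothetical stand-ins, and which tunables are overridable per camera is an assumption made for illustration.

```python
import pickle
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass
class Settings:  # hypothetical stand-in for the global Settings object
    detect_interval: float = 0.0
    detect_interval_idle: float = 0.5
    max_reid_per_frame: int = 4


@dataclass
class CameraRow:  # hypothetical stand-in for the Camera ORM row
    id: int
    rtsp_url: str
    detect_interval: Optional[float] = None
    detect_interval_idle: Optional[float] = None
    max_reid_per_frame: Optional[int] = None


# Assumed set of per-camera tunables; the real list may differ.
TUNABLES = ("detect_interval", "detect_interval_idle", "max_reid_per_frame")


def camera_config(cam: CameraRow, settings: Settings) -> Dict[str, Any]:
    """Build the plain dict that crosses the spawn boundary."""
    cfg: Dict[str, Any] = {"id": cam.id, "rtsp_url": cam.rtsp_url}
    for name in TUNABLES:
        override = getattr(cam, name)
        # None means "not set for this camera". 0.0 is a valid override,
        # so compare against None rather than testing truthiness.
        cfg[name] = getattr(settings, name) if override is None else override
    return cfg
```

A dict of primitives round-trips through pickle without importing any application classes in the child, which keeps worker start-up independent of ORM state.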

2.2 Inside a worker: a three-thread pipeline

Each CameraWorker (app/workers/camera_worker.py) runs three cooperating thread roles. The reasoning for splitting them is latency and back-pressure control.

sequenceDiagram
    participant RTSP as RTSP camera
    participant Dec as Decode thread<br/>(_decode_loop)
    participant Slot as latest-frame slot<br/>(Condition + seq)
    participant Loop as Main loop<br/>(_loop)
    participant GPU as GPU (YOLO/ArcFace/OSNet)
    participant DB as SQLite
    participant Clip as Clip drain<br/>(_clip_drain_loop)
    participant Pers as Persist drain<br/>(_persist_drain_loop)

    loop as fast as stream delivers
        Dec->>RTSP: cap.read()
        Dec->>Slot: overwrite latest_frame, ++seq, notify()
        Note over Slot: an unconsumed frame is DROPPED here
    end

    loop every newest frame
        Loop->>Slot: wait_for(seq != last_seq)
        Slot-->>Loop: newest frame only (stale ones skipped)
        Loop->>GPU: detect (adaptive cadence)
        Loop->>GPU: embed faces/bodies (≤ max_reid_per_frame)
        Loop->>DB: assign sightings, commit (cheap rows)
        Loop->>Clip: enqueue clip job on track close (put_nowait)
        Loop->>Pers: enqueue thumbnail/face-sample (put_nowait)
    end

    Clip->>DB: build_clip_from_track + update Event (off hot path)
    Pers->>DB: imwrite crops + batched commit (off hot path)

(1) Decode thread — drop-to-latest (_decode_loop)

A dedicated producer thread owns the OpenCV/FFMPEG VideoCapture and does nothing but cap.read() in a tight loop, plus reconnect handling with backoff. Each frame it reads overwrites the single _latest_frame slot under a threading.Condition, bumps _frame_seq, and notifies the consumer.

Why a separate decode thread that keeps only the newest frame: RTSP delivers frames at the camera's wall-clock rate. If analysis is momentarily slower than the stream (a crowd, a GC pause), a queue would grow and the worker would fall progressively behind real time — fatal for a live monitoring system. By keeping only the latest frame, a slow analysis pass simply drops the stale frames it missed; the worker always processes the freshest available image and stays anchored to real time. CAP_PROP_BUFFERSIZE=1 keeps the driver's own buffer shallow for the same reason.
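The drop-to-latest handoff can be illustrated with a single-slot structure guarded by a threading.Condition. The class and method names here (LatestFrameSlot, publish, wait_newer) are hypothetical; only the Condition-plus-sequence-counter idea is taken from the text.

```python
import threading
from typing import Any, Optional, Tuple


class LatestFrameSlot:
    """Single-slot producer/consumer handoff that keeps only the newest frame."""

    def __init__(self) -> None:
        self._cond = threading.Condition()
        self._frame: Any = None
        self._seq = 0

    def publish(self, frame: Any) -> None:
        """Producer side: overwrite the slot; an unconsumed frame is dropped."""
        with self._cond:
            self._frame = frame
            self._seq += 1
            self._cond.notify_all()

    def wait_newer(
        self, last_seq: int, timeout: Optional[float] = None
    ) -> Optional[Tuple[int, Any]]:
        """Consumer side: block until the sequence differs from last_seq.

        Returns (seq, frame), or None if the timeout expired first.
        """
        with self._cond:
            # wait_for takes a callable predicate and re-checks it on wake-up.
            arrived = self._cond.wait_for(
                lambda: self._seq != last_seq, timeout=timeout
            )
            if not arrived:
                return None
            return self._seq, self._frame
```

If the producer publishes three frames while the consumer is busy, the next wait_newer call returns only the third. The gap in sequence numbers is the count of dropped frames.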

(2) Main loop — synchronous detect → track → re-ID (_loop)

The consumer blocks on the condition until a frame newer than the one it last processed arrives (wait_for(lambda: self._frame_seq != last_seq)), then runs the full pipeline synchronously on that single frame:

  1. _safe_detect → YOLOv8n on the GPU, filtered to the camera's trigger_classes.
  2. _track_and_identify → advance the greedy IoU tracker (app/reid/tracker.py), finalize closed tracks, and (re-)embed/assign a bounded number of active tracks (§4).
  3. _handle_detections / _finalize_presence → birth events and enqueue clip jobs.
  4. _maybe_write_frame_slot → throttled JPEG encode of the annotated preview.

These steps are intentionally synchronous and single-threaded: detection, tracking, and identity assignment share frame state and ordering assumptions, and running them on one thread keeps the logic simple and the per-frame work bounded. The only thing the loop must never do is block on slow, non-essential I/O — which is what the drain threads are for.

(3) Off-loop drain threads — persist + clip assembly

Two daemon drain threads take the slow work off the hot path:

  • Clip drain (_clip_drain_loop): on track close the loop only writes a cheap Event row and put_nowaits a small job dict. The drain thread calls build_clip_from_track (which waits for post-roll segments to finalize and shells out to ffmpeg concat) and then updates the Event row with the clip path. All clip-thread DB writes go through this one thread, serializing them to avoid SQLite write contention. Bounded queue (maxsize=64); a full queue drops the clip with a warning rather than stalling detection.
  • Persist drain (_persist_drain_loop): all body-crop thumbnail and face-sample writes, meaning cv2.imwrite (JPEG encode) plus the Sighting/FaceSample DB commits, run here, batched (up to 64 jobs) into one transaction per drain. The loop only enqueues a job carrying a .copy() of the crop, because the frame buffer is overwritten by the next cap.read(). If the queue (maxsize=256) is saturated, _enqueue_persist falls back to an inline write so no data is lost.

Why this split: ffmpeg post-roll waits are seconds long; JPEG encode and fsync are milliseconds but unbounded under disk pressure. Either, inline, would let the worker drift behind the live stream and miss detections. Pushing them to drain threads means the hot path's only synchronous DB work is small, indexed row inserts/updates.
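The two back-pressure policies differ in what happens on a full queue: clip jobs are dropped, persist jobs are written inline. A minimal sketch of both, plus a batched drain, follows. The function names are hypothetical and the real drain loops also handle shutdown and task accounting.

```python
import queue
from typing import Any, Callable, Dict, List

Job = Dict[str, Any]


def enqueue_clip(q: "queue.Queue[Job]", job: Job) -> bool:
    """Never block the hot path: report False if the clip job was dropped."""
    try:
        q.put_nowait(job)
        return True
    except queue.Full:
        return False  # caller would log a warning here


def enqueue_persist(
    q: "queue.Queue[Job]", job: Job, write_inline: Callable[[Job], None]
) -> str:
    """Prefer the queue; on saturation, write inline so no data is lost."""
    try:
        q.put_nowait(job)
        return "queued"
    except queue.Full:
        write_inline(job)
        return "inline"


def drain_batch(q: "queue.Queue[Job]", max_batch: int = 64) -> List[Job]:
    """Block for one job, then take whatever else is ready, up to max_batch."""
    batch = [q.get()]
    while len(batch) < max_batch:
        try:
            batch.append(q.get_nowait())
        except queue.Empty:
            break
    return batch
```

The asymmetry reflects what each job costs to lose: a dropped clip still leaves its Event row behind, whereas a dropped persist job would lose a thumbnail or face sample outright.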

Teardown ordering (_teardown)

Shutdown is ordered to avoid losing in-flight work or hanging on dead threads: stop the decode producer first, flush() the tracker so open presences still record, drain the clip queue while the segmenter is still alive, then the persist queue, then close() the components. Joins on drain/persist threads only occur when those threads are actually alive (else queue.join() would block forever).


3. End-to-end frame lifecycle

A single frame's journey through one worker:

  1. Decode. The decode thread reads a frame from RTSP and publishes it as the sole latest frame (any unconsumed predecessor is dropped).
  2. Pickup. The main loop wakes, takes the latest frame + timestamp, ticks fps, and publishes an online heartbeat (~1 Hz throttle).
  3. Cadence gate. Adaptive cadence (§4) decides whether to detect this frame. If not, the loop reuses _last_boxes for the preview overlay and skips straight to the frame-slot write.
  4. Detect. YOLOv8n runs on the GPU (_safe_detect); boxes are filtered to trigger_classes. If any trigger object is present, _last_activity_ts is refreshed (keeping the camera in active mode).
  5. Track. ObjectTracker.update greedily associates this frame's boxes to existing tracks by same-class IoU, opens tracks for unmatched boxes, and closes tracks idle beyond track_gap_seconds.
  6. Finalize closed tracks. Each closed track births (in track mode) one Event for the whole presence and enqueues a clip job, then accrues its dwell time into a PresenceSegment and Identity.total_seconds (_finalize_presence).
  7. (Re-)identify active tracks. A bounded set of due tracks is embedded (IdentityPipeline.extract → ArcFace face vector + OSNet body vector), and IdentityManager.assign links each to an existing identity or mints a new one (§4, §6). Sighting rows are committed; thumbnails/face-samples are enqueued to the persist drain.
  8. Preview. _maybe_write_frame_slot re-encodes the annotated frame to JPEG at the active/idle preview fps and writes it atomically (tmp + replace) to the camera's frame slot, where the /api/live/{id}/stream MJPEG endpoint reads it.

Meanwhile, entirely independent of this loop, the per-camera ffmpeg segmenter (§5) is continuously writing 2-second -c:v copy segments to disk, so the pre-roll for any event already exists the instant a track opens.
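The tracking step (step 5) can be illustrated with a small greedy same-class IoU matcher. This is a sketch of the general technique, not ObjectTracker.update itself. The min_iou threshold of 0.3 and the data shapes are assumptions.

```python
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # x1, y1, x2, y2


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def greedy_match(
    tracks: Dict[int, Tuple[str, Box]],
    dets: List[Tuple[str, Box]],
    min_iou: float = 0.3,  # assumed threshold
) -> Tuple[Dict[int, int], List[int]]:
    """Return ({track_id: det_index}, [unmatched det indices])."""
    # Score every same-class (track, detection) pair.
    pairs = [
        (iou(tbox, dbox), tid, di)
        for tid, (tcls, tbox) in tracks.items()
        for di, (dcls, dbox) in enumerate(dets)
        if tcls == dcls
    ]
    pairs.sort(key=lambda p: p[0], reverse=True)  # best overlap first
    matches: Dict[int, int] = {}
    used = set()
    for score, tid, di in pairs:
        if score < min_iou:
            break  # remaining pairs overlap even less
        if tid in matches or di in used:
            continue
        matches[tid] = di
        used.add(di)
    unmatched = [i for i in range(len(dets)) if i not in used]
    return matches, unmatched
```

Unmatched detections would open new tracks. Tracks left without a match simply age until they exceed track_gap_seconds and close.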


4. GPU sharing, adaptive cadence, and the per-frame re-ID cap

4.1 GPU sharing model

A single T4 (16 GB) is shared by all models across all camera workers. There is no explicit GPU scheduler; sharing is achieved by keeping each model small and each worker's GPU demand bounded:

  • Detection: YOLOv8n exported to ONNX, run via onnxruntime-gpu (app/detect/yolo_onnx.py). Running detection on the GPU costs ~1 CPU core/camera versus ~7 on CPU — the dominant reason inference is offloaded.
  • Faces: insightface buffalo_l — SCRFD-10G detector + ArcFace embedder — shared via one FaceRecognizer per worker. Face detection runs once per frame on the whole frame, then each face is assigned to the smallest containing person box (IdentityPipeline.extract), so there is no second face model and no per-crop re-detection.
  • Appearance: OSNet-AIN x1.0 (MSMT17) exported to ONNX via ReIDEmbedder.
  • Vehicle attributes: optional NVIDIA TAO make/body-type classifiers, only invoked for vehicle-class crops.
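The face-to-person association in the Faces bullet can be sketched as follows. The containment test (the whole face box must lie inside the person box) is an assumption; the real IdentityPipeline.extract may use a different rule, such as the face centre.

```python
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # x1, y1, x2, y2


def contains(outer: Box, inner: Box) -> bool:
    """True if inner lies entirely within outer."""
    return (
        outer[0] <= inner[0]
        and outer[1] <= inner[1]
        and outer[2] >= inner[2]
        and outer[3] >= inner[3]
    )


def area(b: Box) -> float:
    return max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])


def assign_faces(faces: List[Box], persons: List[Box]) -> Dict[int, int]:
    """Map face index -> index of the smallest person box containing it."""
    out: Dict[int, int] = {}
    for fi, face in enumerate(faces):
        candidates = [pi for pi, p in enumerate(persons) if contains(p, face)]
        if candidates:
            out[fi] = min(candidates, key=lambda pi: area(persons[pi]))
    return out
```

Choosing the smallest container resolves the common overlap case where a person in the foreground sits inside a larger box belonging to someone behind them.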

The Dockerfile takes deliberate care that the CUDAExecutionProvider is the one that actually loads: insightface hard-depends on the CPU onnxruntime wheel, which shadows onnxruntime-gpu in the same package dir; the build uninstalls the CPU build and force-reinstalls the GPU build so inference lands on the T4.

Because every model is loaded lazily inside each child process (_build_components), VRAM grows with the number of cameras. The two mechanisms below are what keep aggregate GPU (and CPU, and disk) demand bounded as cameras scale.

4.2 Adaptive detection cadence

Each worker tracks whether its scene is active. A scene is active for active_grace_seconds after the last frame that contained a trigger object (active = (now - self._last_activity_ts) < self.active_grace_seconds). The detection interval follows that state:

  • Active: detect every detect_interval seconds (default 0.0 = every frame).
  • Idle: detect every detect_interval_idle seconds (default 0.5).

The live-preview JPEG encode rate follows the same active/idle state (active_preview_fps vs idle_preview_fps), because encoding an empty scene at full rate is pure waste. The net effect: a quiet camera consumes a fraction of the GPU/CPU/disk of a busy one, yet the moment an object enters, the camera snaps to full rate and never misses the entrance (the activity timestamp is set on the very detection that first sees the object).
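The cadence gate reduces to two timestamps and three settings. The sketch below is a hypothetical Cadence helper. The 10 s value for active_grace_seconds is an assumed default, while the two interval defaults are the ones stated above.

```python
from dataclasses import dataclass


@dataclass
class Cadence:
    detect_interval: float = 0.0        # active: every frame
    detect_interval_idle: float = 0.5   # idle: twice a second
    active_grace_seconds: float = 10.0  # assumed default, for illustration
    last_activity_ts: float = float("-inf")
    last_detect_ts: float = float("-inf")

    def is_active(self, now: float) -> bool:
        """Active while within the grace window of the last trigger object."""
        return (now - self.last_activity_ts) < self.active_grace_seconds

    def should_detect(self, now: float) -> bool:
        """Gate detection on the interval that matches the current state."""
        interval = (
            self.detect_interval if self.is_active(now) else self.detect_interval_idle
        )
        return (now - self.last_detect_ts) >= interval

    def record_detection(self, now: float, saw_trigger: bool) -> None:
        """Call after each detection pass; a trigger object refreshes activity."""
        self.last_detect_ts = now
        if saw_trigger:
            self.last_activity_ts = now
```

Because the activity timestamp is refreshed by the same pass that first sees an object, the very next frame is already evaluated at the active interval.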

4.3 Per-frame re-ID cap

Re-ID embedding (ArcFace + OSNet + optional TAO) is far more expensive than detection. Without a bound, a crowd of N people would force N embeddings per frame and stall the loop. _track_and_identify therefore:

  1. Selects only tracks due for (re-)identification. Fresh tracks with no identity yet use the fast cadence (reid_sample_seconds, default 3 s); already-identified tracks refresh on the slower "confident" cadence (reid_confident_sample_seconds, default 9 s), because identity is sticky.
  2. Prioritizes unassigned tracks, then oldest-waiting first.
  3. Caps the number actually embedded this frame at max_reid_per_frame (default 4).

So per-frame re-ID work is constant regardless of crowd size; the backlog just drains over subsequent frames. Identity stickiness (hysteresis in IdentityManager, IoU + time window) means a continuous track keeps its identity between embeds without re-evaluation.
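The three selection steps can be condensed into one function. The Track dataclass and select_for_reid name are hypothetical; the defaults mirror the values quoted above.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Track:  # hypothetical minimal track record
    track_id: int
    identity_id: Optional[int]  # None until an identity is assigned
    last_reid_ts: float         # float("-inf") if never embedded


def select_for_reid(
    tracks: List[Track],
    now: float,
    reid_sample_seconds: float = 3.0,
    reid_confident_sample_seconds: float = 9.0,
    max_reid_per_frame: int = 4,
) -> List[Track]:
    """Pick at most max_reid_per_frame tracks to embed on this frame."""

    def due(t: Track) -> bool:
        interval = (
            reid_sample_seconds
            if t.identity_id is None
            else reid_confident_sample_seconds
        )
        return (now - t.last_reid_ts) >= interval

    candidates = [t for t in tracks if due(t)]
    # Unassigned first (False sorts before True), then longest-waiting first.
    candidates.sort(key=lambda t: (t.identity_id is not None, t.last_reid_ts))
    return candidates[:max_reid_per_frame]
```

Tracks that miss the cut stay due, so they move up the ordering on the next frame as their wait grows.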


5. Recording

5.1 Warm rolling segment buffer

Each worker runs one long-lived ffmpeg process via Segmenter (app/recording/segmenter.py) that continuously writes fixed-length (segment_seconds, ~2 s) .mp4 segments named with a UTC strftime pattern (seg_YYYYMMDDThhmmss.mp4) into data/segments/<camera_id>/. Key properties:

  • Stream copy, no decode. -map 0:v:0 -c:v copy — ffmpeg never re-encodes, so CPU is negligible and GPU is zero. The worker decodes separately for detection; the segmenter is purely an I/O recorder.
  • Audio dropped (-an). IP-camera audio (often G.711 with broken timestamps) periodically hung the segment muxer and silently stopped clips; dropping it makes the buffer rock-solid. (Live-with-sound is handled separately via on-demand HLS.)
  • Anti-SSRF. -protocol_whitelist rtsp,rtsps,rtp,rtcp,udp,tcp,tls,crypto prevents a malicious RTSP URL from making ffmpeg read local files or reach internal HTTP.
  • UTC-pinned filenames. The subprocess runs with TZ=UTC so segment timestamps match the UTC timestamps the worker writes to the DB — clip selection and pruning depend on this.
  • Bounded retention + self-healing. A watchdog thread prunes segments older than retention_seconds (default 120), restarts ffmpeg with backoff if it dies, and detects a hung ffmpeg (alive but producing no new segments) by watching the newest segment's mtime — restarting it on stall. PR_SET_PDEATHSIG ensures ffmpeg is SIGKILLed if the worker dies even uncleanly.

Because a configurable amount of pre-roll is always already on disk, clips can include the seconds before an object appeared without buffering frames in memory.
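To make the UTC filename convention concrete, the sketch below parses a segment name back into a timestamp and picks out segments past retention. It assumes pruning is decided from the filename timestamp; the actual watchdog may use file mtime instead, and the function names are hypothetical.

```python
import re
from datetime import datetime, timedelta, timezone
from typing import Iterable, List, Optional

# Matches names such as seg_20240101T120000.mp4
SEG_RE = re.compile(r"seg_(\d{8}T\d{6})\.mp4")


def segment_start(name: str) -> Optional[datetime]:
    """Parse the UTC start time encoded in a segment filename."""
    m = SEG_RE.fullmatch(name)
    if m is None:
        return None
    try:
        naive = datetime.strptime(m.group(1), "%Y%m%dT%H%M%S")
    except ValueError:
        return None  # digits in the right shape but not a real date
    return naive.replace(tzinfo=timezone.utc)


def expired_segments(
    names: Iterable[str], now: datetime, retention_seconds: float = 120.0
) -> List[str]:
    """Return segment names whose start time is older than the retention window."""
    cutoff = now - timedelta(seconds=retention_seconds)
    expired = []
    for name in names:
        start = segment_start(name)
        if start is not None and start < cutoff:
            expired.append(name)
    return sorted(expired)
```

Files that do not match the pattern are ignored rather than deleted, which keeps a pruning pass from touching anything it did not create.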

5.2 Track-mode events

The default recording_mode is track: exactly one Event per object presence. On track close, _finalize_presence enqueues a clip job; build_clip_from_track (app/recording/clipper.py) assembles the clip spanning [enter - pre_seconds, last + post_seconds]:

  1. Wait for the post-roll segment(s) covering the window end to finalize (_wait_for_post_roll) — the segmenter writes a segment to its final name only when it finishes, so the presence of a newer segment proves the tail is complete.
  2. Select every segment overlapping the window, excluding the still-growing live tail (no moov atom yet → would fail concat) and any non-finalized file (_is_finalized probes via ffprobe).
  3. Concatenate via ffmpeg's concat demuxer with -c copy -movflags +faststart (no re-encode) into data/recordings/<camera_id>/<event_id>.mp4.
  4. Extract one thumbnail near the trigger instant.
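Step 2 above is an interval-overlap test. A minimal sketch, assuming each finalized segment is described by a (name, start_ts) pair in epoch seconds and has a fixed duration. The select_segments name and the data shape are hypothetical.

```python
from typing import List, Tuple


def select_segments(
    segments: List[Tuple[str, float]],  # (name, start_ts) of finalized segments
    enter_ts: float,
    last_ts: float,
    pre_seconds: float,
    post_seconds: float,
    segment_seconds: float = 2.0,
) -> List[str]:
    """Return names of segments overlapping [enter - pre, last + post], in time order."""
    win_start = enter_ts - pre_seconds
    win_end = last_ts + post_seconds
    ordered = sorted(segments, key=lambda s: s[1])
    # A segment [start, start + len) overlaps the window when it begins
    # before the window ends and ends after the window begins.
    return [
        name
        for name, start in ordered
        if start < win_end and start + segment_seconds > win_start
    ]
```

The result is already in the order the concat demuxer needs, so it can be written straight into the concat list file.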

The legacy fixed-window trigger mode (_trigger_event / _record_and_persist) still exists but is disabled when recording_mode == "track" to avoid double-recording.

5.3 Manual recording

The operator can press ● REC in Live Monitoring. This is stateless: POST /api/live/{id}/record/start returns a server-trusted start timestamp the client echoes back to record/stop, which assembles [started_at, now] from the same on-disk segment buffer (build_clip_from_track via a SimpleNamespace segmenter shim) and persists a manual-labelled Event. No worker round-trip is needed; the buffer is the shared substrate.


6. Re-ID and the identity model (summary)

Re-ID is documented in depth elsewhere; here is what the architecture needs to hold. Identity is anchored on the face — the only cue stable across clothing change, viewpoint, lighting, and days. IdentityManager.assign (app/reid/manager.py) evaluates each sighting in order: sticky/hysteresis → confident face (with a best-minus-second margin) → appearance within the session's time window (with a face-contradiction veto and, for non-person objects, a colour gate) → otherwise a new identity, gated by a face-quality floor and a per-camera new-identity rate limit. A faceless back/side view never spawns a new identity — it can only attach to an existing one by appearance, else it is dropped. That gate is what stops a person seen from behind from exploding into dozens of duplicates.

The in-memory IdentityGallery (app/reid/gallery.py) is derived state: a FAISS IndexFlatIP over per-identity ArcFace exemplars (faces are time-stable, not decayed) plus per-identity OSNet appearance exemplars (time-decayed). Each worker holds its own gallery, rebuilt from the DB at startup and re-synced on a timer (_maybe_reload_gallery, default 30 s), so identities created by other workers converge. The API process holds the authoritative gallery and a background maintenance thread (app/reid/maintenance.py) that decays/prunes exemplars, recomputes centroids, deletes provisional noise identities, and performs conservative face-only auto-merges.


7. Data model

Ten tables in app/db/models.py, all on a single SQLite database in WAL mode with per-connection PRAGMAs set in app/db/database.py (journal_mode=WAL, synchronous=NORMAL, foreign_keys=ON, busy_timeout=5000). Vectors are 512-d little-endian float32, L2-normalized, stored as BLOBs and (de)serialized with numpy.frombuffer/tobytes (never pickle, so there is no deserialization RCE path).
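For reference, the same four PRAGMAs applied through the standard-library sqlite3 module look like the sketch below. The project most likely applies them through its ORM's connection hook, so treat the connect helper as an illustrative stand-in.

```python
import sqlite3

PRAGMAS = (
    "PRAGMA journal_mode=WAL",    # persists in the database file
    "PRAGMA synchronous=NORMAL",  # per connection
    "PRAGMA foreign_keys=ON",     # per connection; off by default in SQLite
    "PRAGMA busy_timeout=5000",   # per connection, in milliseconds
)


def connect(path: str) -> sqlite3.Connection:
    """Open a connection and apply the per-connection PRAGMAs."""
    conn = sqlite3.connect(path)
    for stmt in PRAGMAS:
        conn.execute(stmt)
    return conn
```

Only journal_mode is stored in the file; the other three reset on every new connection. That is why they have to be set per connection and not once at database creation.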

erDiagram
    CAMERA ||--o{ EVENT : "has (cascade delete)"
    CAMERA ||--o{ SIGHTING : "captured on (cascade)"
    PERSON ||--o{ FACE_EMBEDDING : "enrolled (cascade)"
    PERSON |o--o{ EVENT : "best face match (SET NULL)"
    IDENTITY ||--o{ SIGHTING : "has (cascade)"
    IDENTITY ||--o{ FACE_EXEMPLAR : "has (cascade)"
    IDENTITY ||--o{ APPEARANCE_EXEMPLAR : "has (cascade)"
    IDENTITY ||--o{ PRESENCE_SEGMENT : "dwell (cascade)"
    IDENTITY |o--o{ EVENT : "auto identity (SET NULL / app-code)"
    EVENT |o--o{ SIGHTING : "links (SET NULL)"
    IDENTITY |o--|| SIGHTING : "rep_sighting (SET NULL, use_alter)"

    CAMERA {
      int id PK
      string rtsp_url
      bool enabled
      string status
      string trigger_classes "nullable per-cam tunables"
    }
    EVENT {
      int id PK
      int camera_id FK
      datetime ts
      string clip_path
      string thumb_path
      int person_id FK "manual match (SET NULL)"
      int identity_id "auto identity (plain INT on SQLite)"
      string label
    }
    PERSON {
      int id PK
      string name
    }
    FACE_EMBEDDING {
      int id PK
      int person_id FK
      blob vector "512 f32"
    }
    IDENTITY {
      int id PK
      string name
      bool is_named
      string object_class "class-scoped matching"
      float total_seconds
      bool is_provisional
      blob face_centroid
      blob appearance_centroid
    }
    SIGHTING {
      int id PK
      int identity_id FK
      int camera_id FK
      int event_id FK
      string match_kind "face|appearance|new"
      string thumb_path
    }
    FACE_EXEMPLAR {
      int id PK
      int identity_id FK
      blob vector
      float pose "signed yaw"
    }
    APPEARANCE_EXEMPLAR {
      int id PK
      int identity_id FK
      blob vector
      datetime ts "decay clock"
    }
    PRESENCE_SEGMENT {
      int id PK
      int identity_id FK
      float seconds
    }
    FACE_SAMPLE {
      int id PK
      blob vector "ArcFace"
      blob app_vector "OSNet"
      string label "named group"
    }

The ten tables:

| Table | Role |
|---|---|
| Camera | An RTSP source + its per-camera tunables (nullable → fall back to global Settings). Heartbeat fields (status, last_seen) updated by the worker. |
| Event | One recorded presence: clip path, thumbnail, denormalized manual (person_*) and auto (identity_*) match snapshots, track metadata. |
| Person | A manually enrolled known person (the "People" layer). |
| FaceEmbedding | A 512-d ArcFace vector belonging to a Person; the FAISS FaceIndex is derived from these. |
| Identity | An auto-discovered person/object built online from sightings, with no enrollment. Carries object_class (matching is class-scoped), total_seconds dwell, derived centroids, and is_provisional/is_named flags. |
| Sighting | One identified detection: bbox, scores, match_kind, body-crop thumbnail. |
| FaceExemplar | A representative ArcFace vector for an identity (cap ~8/12), bucketed by signed-yaw pose for the multi-view gallery. |
| AppearanceExemplar | A per-identity OSNet vector with a capture ts for time-decay (cap ~16). |
| PresenceSegment | One continuous appearance of an identity at one camera; summed into Identity.total_seconds (the dwell audit trail). |
| FaceSample | A captured face crop + ArcFace vector (+ optional OSNet app_vector) for unsupervised face grouping, independent of the body-Re-ID identities. |

7.1 Cascade and delete behavior

FK cascades are declared at the ORM level with passive_deletes=True (the DB enforces them, given foreign_keys=ON):

  • Delete a Camera → its Events and Sightings cascade-delete.
  • Delete a Person → its FaceEmbeddings cascade-delete; any Event.person_id referencing it is SET NULL (event history is preserved).
  • Delete an Identity → its Sightings, FaceExemplars, AppearanceExemplars, and PresenceSegments cascade-delete. Event.identity_id is a denormalized link that is nulled in application code (on SQLite it is a plain INTEGER column materialised by the schema shim, not a real FK — see below). The identities API delete_identity performs a recursive delete that also purges on-disk artifacts: the per-identity crop directory and the linked FaceSample rows + their crop files (FaceSample has no DB-level FK cascade).

7.2 Schema management

There is no migration framework. init_db (app/db/database.py) calls Base.metadata.create_all and then a set of idempotent shims (ensure_reid_schema, ensure_camera_schema, ensure_identity_object_schema, ensure_event_track_schema, ensure_face_pose_schema). These exist because SQLite cannot ALTER a column with an inline FK onto an existing table, so events.identity_id and friends are added as plain columns guarded by PRAGMA table_info checks (no-ops on re-run). The shims run in the API process at startup, before any worker writes a track-mode event. The identities.rep_sighting_id ↔ sightings.identity_id FK cycle is broken at create time with use_alter=True.
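An idempotent shim of this kind can be sketched in a few lines with the standard-library sqlite3 module. The ensure_column helper is hypothetical and is not one of the ensure_* functions named above.

```python
import sqlite3


def ensure_column(conn: sqlite3.Connection, table: str, column: str, decl: str) -> bool:
    """Add `column` to `table` if it is missing. Returns True when it was added.

    `table`, `column`, and `decl` are interpolated into SQL, so they must be
    trusted constants from the codebase, never user input.
    """
    # PRAGMA table_info yields one row per column; index 1 is the column name.
    existing = {row[1] for row in conn.execute(f"PRAGMA table_info({table})")}
    if column in existing:
        return False  # already migrated: re-running is a no-op
    conn.execute(f"ALTER TABLE {table} ADD COLUMN {column} {decl}")
    conn.commit()
    return True
```

Running it twice against the same database adds the column once and then does nothing, which is the property that makes it safe to call on every startup.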


8. Frontend and the live surface

The frontend is a vanilla-JS single-page app with no build step (app/static/: index.html, app.js, identities.js, people.js, faces.js, CSS, plus a vendored hls.min.js). It is mounted as the last route in create_app (app/main.py):

app.mount("/", StaticFiles(directory=str(STATIC_DIR), html=True), name="static")

The API routers (/api/*) and /health are registered before the mount, so they take precedence; everything else falls through to static assets, with index.html served at /.

Live viewing has three modes, all backed by the worker frame slots and segment buffer:

  • Low-latency MJPEG grid. Each tile is an <img src="/api/live/{id}/stream?fps=…"> holding one long-lived multipart/x-mixed-replace connection (app/api/live.py). The generator pushes a part only when the frame slot changes and honors a ?fps= cap so the grid can request a lower rate (many tiles) while a focused viewer requests the full live_mjpeg_fps. nginx runs with proxy_buffering off / X-Accel-Buffering: no for per-frame flush.
  • Single snapshot. /api/live/{id}/snapshot returns the latest annotated JPEG (or a "no signal" placeholder so the UI never breaks while a worker warms up).
  • Live with sound. MJPEG carries no audio, so on demand the SPA requests /api/live/{id}/hls/index.m3u8, which starts an on-demand RTSP→HLS session (app/recording/hls.py) played by hls.js (or Safari native), with strict ^seg\d{5}\.ts$ segment-name validation.
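The segment-name check from the last bullet can be written as a small validator. The is_valid_hls_segment name is hypothetical; the pattern is the one quoted above.

```python
import re

HLS_SEGMENT_RE = re.compile(r"^seg\d{5}\.ts$")


def is_valid_hls_segment(name: str) -> bool:
    """Accept only names like seg00042.ts; reject anything with path parts."""
    # fullmatch must consume the whole string, so a trailing newline
    # (which a bare `$` would tolerate) is rejected.
    return HLS_SEGMENT_RE.fullmatch(name) is not None
```

Validating against a fixed-shape allowlist means the requested name never needs sanitizing: anything containing a slash, dots, or extra digits simply fails to match.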

History/clip playback streams the recorded mp4 with full HTTP Range support (app/api/events.py, _iter_file_range) so <video> seeking works, with os.path.commonpath containment guarding against path traversal out of the data root.
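A containment check built on os.path.commonpath might look like the sketch below. The is_within_root helper is hypothetical, and resolving symlinks with realpath first is an assumption about how the guard is applied.

```python
import os


def is_within_root(root: str, candidate: str) -> bool:
    """True if `candidate`, resolved relative to `root`, stays inside `root`."""
    root_real = os.path.realpath(root)
    # An absolute candidate replaces root_real in the join, so it is
    # checked on its own merits rather than blindly trusted.
    cand_real = os.path.realpath(os.path.join(root_real, candidate))
    try:
        return os.path.commonpath([root_real, cand_real]) == root_real
    except ValueError:
        # Raised on Windows when the paths are on different drives.
        return False
```

Unlike a plain string-prefix test, commonpath compares whole path components, so a sibling directory whose name merely starts with the root's name is not treated as inside it.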

The focused monitor adds operator ergonomics on the client: mouse-wheel and two-finger pinch/pan zoom, double-tap, orientation-aware fullscreen, and the manual record button. The SPA is mobile-first responsive.


9. Security and deployment posture (architectural)

The relevant invariants the architecture depends on:

  • Trust boundary at nginx. The app binds 127.0.0.1:8120 only (compose publishes to loopback). nginx terminates TLS, performs cookie-SSO (auth_request), and injects a trusted identity header (X-Email) — and overrides any client-supplied copy (anti-spoof). require_user (app/auth.py) is fail-closed (auth_required defaults true) and accepts either the SSO header or an optional bearer API key compared in constant time (hmac.compare_digest).
  • Hardened container. Runs as non-root uid 1000 with cap_drop: [ALL], no-new-privileges, and a read-only app-code mount; the GPU still works because /dev/nvidia* are world-accessible.
  • Co-tenancy safety. A hard mem_limit (6 g) plus oom_score_adj: 800 make a runaway VMS the preferred OOM victim — a cgroup-scoped kill restarts one camera worker rather than letting the VMS push the shared host into a global OOM that would disrupt co-hosted VMs.
  • No secrets in the repo. .env.example documents AUTH_REQUIRED and API-key generation; RTSP credentials are masked in every API response (the raw URL stays server-side).
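The fail-closed check described in the first bullet can be sketched as a plain function over request headers. This is not require_user itself: the authenticate name, the returned principal strings, and the exact-case header lookup are assumptions made for illustration.

```python
import hmac
from typing import Mapping, Optional


def authenticate(
    headers: Mapping[str, str],
    api_key: Optional[str],
    auth_required: bool = True,
) -> Optional[str]:
    """Return a principal string, or None to reject the request."""
    if not auth_required:
        return "anonymous"
    # Trusted only because the proxy overwrites any client-supplied copy.
    email = headers.get("X-Email", "").strip()
    if email:
        return email
    auth = headers.get("Authorization", "")
    if api_key and auth.startswith("Bearer "):
        supplied = auth[len("Bearer "):]
        # Constant-time comparison avoids leaking the key via timing.
        if hmac.compare_digest(supplied.encode(), api_key.encode()):
            return "api-key"
    return None  # fail closed: no header and no valid key
```

The header path is only safe while the app is reachable exclusively through the proxy, which is why the loopback-only bind is part of the same invariant.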

10. Where to start reading the code

| Concern | Entry point |
|---|---|
| App wiring, lifespan, router mounts, SPA | app/main.py |
| Worker lifecycle, spawn, reconcile, frame slots | app/workers/manager.py |
| The three-thread pipeline (decode / loop / drains) | app/workers/camera_worker.py |
| Identity assignment (face → appearance → new) | app/reid/manager.py |
| Feature extraction (ArcFace + OSNet, pose) | app/reid/pipeline.py |
| Derived gallery (FAISS + appearance store) | app/reid/gallery.py |
| Tracker (greedy IoU, dwell timing) | app/reid/tracker.py |
| Rolling segment buffer / clip assembly | app/recording/segmenter.py, clipper.py |
| Data model + cascades | app/db/models.py, app/db/database.py |
| Auth / trust boundary | app/auth.py, app/config.py |
| Live MJPEG / HLS / manual record | app/api/live.py |