Deployment & Operations

This document covers deploying and operating Iris on a single-GPU host. It is written for the reference target: one NVIDIA T4 (16 GB) behind an nginx cookie-SSO gateway, on a box that may also host other workloads (the safety knobs in docker-compose.yml exist precisely because this GPU/host is shared).

The service is a single container (vms) defined in docker-compose.yml. It binds loopback only and is reached exclusively through nginx — never published on a public interface directly.

1. Prerequisites

Host

Linux host with a recent kernel and Docker Engine + Compose v2 (docker compose, not the legacy docker-compose).
An NVIDIA GPU (reference: Tesla T4, 16 GB) with a driver new enough for CUDA 12.4 (the image is built on nvidia/cuda:12.4.1-cudnn-runtime-ubuntu22.04, see Dockerfile).
The NVIDIA Container Toolkit installed and wired into Docker, so the deploy.resources.reservations.devices GPU request in the compose file resolves.

Verify the driver and that containers can see the GPU before deploying:

# Host driver / CUDA visible to the driver
nvidia-smi

# Container toolkit works end-to-end (this image tag is illustrative)
docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi

If the second command prints the GPU table, the toolkit is correctly configured.

Software dependencies (baked into the image)

You do not install Python or its dependencies on the host — they are built into the image from pyproject.toml:

FastAPI + uvicorn, SQLAlchemy 2.x over SQLite (WAL), pydantic-settings.
onnxruntime-gpu (provides the CUDAExecutionProvider for YOLOv8n, SCRFD, ArcFace and OSNet).
insightface (SCRFD detector + ArcFace r50, the buffalo_l pack), faiss-cpu, OpenCV-headless, Pillow.
ffmpeg, libgl1, libglib2.0-0 are installed at the OS layer in the image for RTSP capture, segment recording, clip concat and thumbnails.

Build note (do not "fix" it): the Dockerfile runs pip install ., then pip uninstall -y onnxruntime, and then force-reinstalls onnxruntime-gpu. This is deliberate. insightface hard-depends on the CPU onnxruntime, which installs into the same package directory and shadows the GPU build, silently dropping the CUDAExecutionProvider (inference would then fall back to CPU, at a cost of roughly 7 cores per camera). The uninstall/reinstall restores GPU inference.
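
One way to confirm the outcome of this build step is to ask onnxruntime which execution providers the installed package exposes. The sketch below is illustrative: cuda_available is a hypothetical helper, not part of Iris, while onnxruntime.get_available_providers() is the real library call.

```python
from typing import Iterable, Optional


def cuda_available(providers: Optional[Iterable[str]] = None) -> bool:
    """Return True if the CUDA execution provider is exposed.

    When `providers` is None, the installed onnxruntime package is queried.
    """
    if providers is None:
        # Imported lazily: onnxruntime is only expected inside the image.
        import onnxruntime as ort

        providers = ort.get_available_providers()
    return "CUDAExecutionProvider" in set(providers)
```

Run it inside the built container, where onnxruntime-gpu is installed. A True result shows that the GPU build is the one being imported; it does not by itself prove that a session is running on the GPU.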

2. Configuration: `.env`

All runtime configuration is environment-driven and read once at startup by app/config.py via get_settings(). Copy the template and edit:

cp .env.example .env
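
To make the "environment-driven, read once" model concrete, here is a minimal stdlib sketch of how such settings resolve. It is an assumption-laden illustration, not the contents of app/config.py (which uses pydantic-settings): the Settings class and load_settings function are hypothetical, and only the variable names and defaults come from this document.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


def _as_bool(raw: str) -> bool:
    """Interpret common truthy spellings used in .env files."""
    return raw.strip().lower() in {"1", "true", "yes", "on"}


@dataclass(frozen=True)
class Settings:
    # Defaults mirror the quick-reference table in this document.
    sso_header: str = "X-Email"
    auth_required: bool = True
    api_key: Optional[str] = None
    detector_backend: str = "onnx"
    detector_device: str = "cuda"
    detect_fps: int = 5


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Build Settings from an environment mapping; unset keys keep defaults."""
    d = Settings()
    return Settings(
        sso_header=env.get("SSO_HEADER", d.sso_header),
        auth_required=_as_bool(env["AUTH_REQUIRED"]) if "AUTH_REQUIRED" in env else d.auth_required,
        api_key=env.get("API_KEY") or None,  # empty string counts as unset
        detector_backend=env.get("DETECTOR_BACKEND", d.detector_backend),
        detector_device=env.get("DETECTOR_DEVICE", d.detector_device),
        detect_fps=int(env.get("DETECT_FPS", d.detect_fps)),
    )
```

Passing a plain dict instead of os.environ makes the override behaviour easy to check without touching the real environment.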

Every value has a sensible default in app/config.py; .env only needs the few you want to override. The settings that matter most for a real deployment:

Auth / SSO (security-critical)

# Trusted header injected by the nginx SSO gateway after it validates the session.
# Presence of this header == authenticated. Must match the proxy_set_header below.
SSO_HEADER=X-Email

# Fail closed: reject any request with neither a trusted SSO header nor a valid API key.
# Keep this true in production.
AUTH_REQUIRED=true

# Optional bearer token for SSH-tunnel / CLI access that bypasses the SSO proxy.
# Compared in constant time (hmac.compare_digest). Leave UNSET to disable key auth.
# API_KEY=...

Generate a strong API key only if you need direct (non-browser) access through an SSH tunnel:

openssl rand -hex 32

Auth precedence (see app/auth.py):

A trusted SSO header (X-Email) → authenticated as that user.
Otherwise an Authorization: Bearer <API_KEY> (constant-time compared).
Otherwise: 401 when AUTH_REQUIRED=true (default). Set AUTH_REQUIRED=false only for local dev with no proxy and no key.
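
The precedence above can be sketched as a single decision function. This is an illustration under stated assumptions, not the code in app/auth.py: the authenticate name and the "api-key" and "anonymous" identity labels are hypothetical, and a return value of None stands in for a 401 response.

```python
import hmac
from typing import Mapping, Optional


def authenticate(
    headers: Mapping[str, str],
    *,
    sso_header: str = "X-Email",
    api_key: Optional[str] = None,
    auth_required: bool = True,
) -> Optional[str]:
    """Return an identity string, or None when the request should get a 401."""
    lowered = {k.lower(): v for k, v in headers.items()}

    # 1. Trusted SSO header injected by the proxy.
    email = lowered.get(sso_header.lower(), "").strip()
    if email:
        return email

    # 2. Bearer key, compared in constant time. Disabled when api_key is unset.
    auth = lowered.get("authorization", "")
    if api_key and auth.startswith("Bearer "):
        token = auth[len("Bearer "):]
        if hmac.compare_digest(token.encode(), api_key.encode()):
            return "api-key"  # hypothetical identity label

    # 3. Fail closed unless auth is explicitly disabled (local dev only).
    return None if auth_required else "anonymous"
```

The header lookup is case-insensitive because HTTP header names are; the key comparison encodes both sides to bytes so hmac.compare_digest accepts arbitrary input.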

The container port is published only on the host's loopback interface (127.0.0.1:8120 in compose), so the app is reachable only through nginx. That is why the trusted-header model is safe, and why nginx must override any client-supplied header (see §4).

Detection / device

DETECTOR_BACKEND=onnx     # YOLOv8n via onnxruntime-gpu (default). 'cpu' forces CPU.
DETECTOR_DEVICE=cuda      # cuda (default) | cpu. Set cpu to survive GPU eviction.
DETECTOR_FP16=true        # T4 supports fp16; set false for fp32.
DETECT_FPS=5              # worker processing rate, NOT the stream's native fps.

DETECTOR_DEVICE=cpu (or DETECTOR_BACKEND=cpu) is the escape hatch if the GPU is fully committed to another workload — detection, faces and Re-ID transparently fall back to the CPUExecutionProvider at a much higher CPU cost.

Recording

SEGMENT_SECONDS=2     # rolling pre-roll segment length
PRE_SECONDS=5         # clip pre-roll around a trigger (per-camera override available)
POST_SECONDS=10       # clip post-roll
TRIGGER_COOLDOWN=15   # debounce between events on one camera
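
The arithmetic behind these knobs can be sketched as follows. The helper names are hypothetical, and the assumption that segments are aligned to multiples of SEGMENT_SECONDS is made only to keep the example simple; the real segmenter may index its files differently.

```python
import math
from typing import List, Optional, Tuple


def should_trigger(now: float, last_event_ts: Optional[float], cooldown: float = 15.0) -> bool:
    """Debounce: allow a new event only once the cooldown has elapsed."""
    return last_event_ts is None or now - last_event_ts >= cooldown


def clip_window(enter_ts: float, last_seen_ts: float,
                pre: float = 5.0, post: float = 10.0) -> Tuple[float, float]:
    """Clip spans from `pre` seconds before entry to `post` seconds after last sighting."""
    return (enter_ts - pre, last_seen_ts + post)


def segments_needed(window: Tuple[float, float], segment_seconds: float = 2.0) -> List[int]:
    """Indices of fixed-length segments that overlap the window.

    Assumes segment i covers [i * segment_seconds, (i + 1) * segment_seconds).
    """
    start, end = window
    first = math.floor(start / segment_seconds)
    last = math.ceil(end / segment_seconds) - 1
    return list(range(first, last + 1))
```

With the defaults, an object present from t=100 s to t=103 s yields an 18-second clip window, which spans ten 2-second segments because the window straddles segment boundaries at both ends.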

Storage layout

DATA_DIR and MODELS_DIR are container paths bind-mounted from ./data and ./models (see compose). Defaults are fine; the DB lives at DATA_DIR/vms.db. Under DATA_DIR the worker writes:

recordings/<camera_id>/<event_id>.mp4
thumbnails/<event_id>.jpg
faces/<person_id>/<file>.jpg
segments/<camera_id>/seg_*.mp4      # rolling pre-roll buffer
hls/                                 # on-demand live-with-sound transcodes
identities/                          # cross-camera identity crops
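
A small script can report how much space each of these subdirectories uses, which is handy when watching disk growth. This is a sketch: usage_by_subdir is a hypothetical helper, and the subdirectory names are taken from the layout above.

```python
import os
from pathlib import Path
from typing import Dict

SUBDIRS = ("recordings", "thumbnails", "faces", "segments", "hls", "identities")


def usage_by_subdir(data_dir: str) -> Dict[str, int]:
    """Return total file size in bytes for each known subdirectory of DATA_DIR."""
    root = Path(data_dir)
    totals: Dict[str, int] = {}
    for name in SUBDIRS:
        total = 0
        # os.walk yields nothing for a missing directory, so absent dirs report 0.
        for dirpath, _dirs, files in os.walk(root / name):
            for fname in files:
                try:
                    total += (Path(dirpath) / fname).stat().st_size
                except OSError:
                    # Files can vanish mid-walk (the segment buffer rotates).
                    pass
        totals[name] = total
    return totals
```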

3. Models

The container needs three model assets in ./models (bind-mounted, downloaded once, and preserved across rebuilds). The helper scripts/download_models.py fetches them:

| Asset | What it is | Source |
| --- | --- | --- |
| `yolov8n.onnx` | COCO person/object detector | Ultralytics release asset |
| `insightface/models/buffalo_l/` | SCRFD-10G detector + ArcFace r50 (512-d) | fetched by insightface on first use |
| `osnet_*_msmt17.onnx` | OSNet appearance Re-ID (within-session helper) | release asset (`REID_ONNX_URL`) or offline export |

Run the script on the host (it needs requests, plus insightface and onnxruntime for the buffalo_l step) or inside the built container:

# On the host
pip install requests insightface onnxruntime
python scripts/download_models.py

# Or inside the container (no host Python needed)
docker compose run --rm --entrypoint python vms scripts/download_models.py

Re-runs are idempotent (existing files are skipped; --force re-downloads; --only {yolo,insightface,reid} scopes it).

OSNet notes. The OSNet ONNX is fetched from a release asset when REID_ONNX_URL is set; otherwise the script falls back to the offline exporter scripts/export_reid_onnx.py (needs torch+torchreid, dev-box only — these never ship in the runtime image). The default filename is osnet_x0_25_msmt17.onnx; the higher-accuracy osnet_ain_x1_0_msmt17.onnx (REID_MODEL in config.py) is the recommended cross-camera variant where VRAM allows.

insightface caches under INSIGHTFACE_HOME=/app/models/insightface (set in the Dockerfile), so even if you skip the buffalo_l download step it is fetched automatically on first use into the bind-mounted models/ dir.
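
Before the first start it can be useful to check which assets are already present. The sketch below is a hypothetical pre-flight helper, not part of scripts/download_models.py; the paths follow the asset table above and the default OSNet filename.

```python
from pathlib import Path
from typing import List


def missing_model_assets(models_dir: str,
                         reid_name: str = "osnet_x0_25_msmt17.onnx") -> List[str]:
    """List the expected model assets that are not yet present under models_dir."""
    root = Path(models_dir)
    missing: List[str] = []
    if not (root / "yolov8n.onnx").is_file():
        missing.append("yolov8n.onnx")
    # buffalo_l is a directory of ONNX files; only its presence is checked here.
    if not (root / "insightface" / "models" / "buffalo_l").is_dir():
        missing.append("insightface/models/buffalo_l/")
    if not (root / reid_name).is_file():
        missing.append(reid_name)
    return missing
```

A missing buffalo_l entry is not fatal, since insightface fetches it on first use; the other two have to be in place before the detector and Re-ID can load.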

Optional — vehicle attributes. If you want NVIDIA TAO make/body-type classifiers, extract them from the DeepStream image with scripts/fetch_vehicle_models.sh (requires the deepstream:7.1-triton-multiarch image pulled). These are optional and gitignored.

4. The nginx SSO vhost

Iris has no login page of its own. It trusts a header injected by nginx after nginx validates the session against your SSO service using the auth_request subrequest pattern. The container publishes only to loopback (127.0.0.1:8120 in compose), so nginx is the only path in.

Two rules make this safe and they are non-negotiable:

nginx must override any client-supplied header. It sets the trusted header from the auth subrequest response, replacing whatever the client sent. Without this, a client could spoof X-Email and impersonate anyone. This is exactly why publishing loopback-only matters — there is no way to reach the app except through this header-rewriting proxy.
The published port is loopback-only. Even on the same host, nothing reaches :8120 without going through nginx.
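
The first rule can be modelled in a few lines. This is a sketch of the behaviour that nginx's proxy_set_header provides, not code that runs anywhere in Iris; upstream_headers is a hypothetical name. It also reflects that nginx does not forward a header whose value is an empty string.

```python
from typing import Dict, Mapping


def upstream_headers(client_headers: Mapping[str, str], sso_email: str,
                     trusted: str = "X-Email") -> Dict[str, str]:
    """Model of what the proxy forwards for an already-authorized request."""
    # Drop every client-sent variant of the trusted header (names are case-insensitive).
    out = {k: v for k, v in client_headers.items() if k.lower() != trusted.lower()}
    # Re-add it only from the value produced by the auth subrequest.
    if sso_email:
        out[trusted] = sso_email
    return out
```

Whatever the client sends as X-Email never reaches the app: the forwarded value comes solely from the SSO validator's response.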

Reference vhost (TLS termination omitted for brevity; the auth_request pattern is the load-bearing part):

server {
    listen 443 ssl;
    server_name vms.example.internal;

    # ---- TLS config omitted ----

    # Internal auth subrequest: your SSO service validates the cookie/session
    # and returns 200 + the authenticated identity (here as X-Auth-Email), or 401.
    location = /_sso_auth {
        internal;
        proxy_pass              http://127.0.0.1:9000/validate;  # your SSO validator
        proxy_pass_request_body off;
        proxy_set_header        Content-Length "";
        proxy_set_header        X-Original-URI $request_uri;
    }

    location / {
        auth_request /_sso_auth;

        # Pull the identity out of the auth subrequest response...
        auth_request_set $sso_email $upstream_http_x_auth_email;

        # ...and inject it as the TRUSTED header. This proxy_set_header REPLACES
        # any X-Email the client tried to send (anti-spoof). Must match SSO_HEADER.
        proxy_set_header X-Email $sso_email;

        proxy_pass http://127.0.0.1:8120;

        # Streaming endpoints (MJPEG grid, HLS, clip Range) need buffering off
        # and long read timeouts.
        proxy_buffering    off;
        proxy_read_timeout 3600s;
        proxy_http_version 1.1;
    }

    # /health is unauthenticated by design (external liveness probe) — you
    # may expose it without auth_request, or keep it internal.
}

The /health endpoint is intentionally unauthenticated: it returns a liveness + GPU/worker snapshot for external probes (see app/main.py). Everything under /api/*, and the SPA itself, requires the trusted header or a bearer key.

5. Bringing it up

docker compose up -d --build
docker compose logs -f vms          # watch startup
curl -fsS http://127.0.0.1:8120/health | jq .

A healthy startup logs (from app/main.py):

FaceIndex loaded (faces=…)
FaceRecognizer instantiated (lazy model load)
IdentityGallery loaded
ReID maintenance thread started
HLS (live-with-sound) manager started
WorkerManager started
Startup complete (backend=onnx device=cuda port=8120)

The compose healthcheck curls /health every 30s (60s start period). Confirm the GPU is actually being used:

curl -fsS http://127.0.0.1:8120/api/system \
  -H "Authorization: Bearer $API_KEY" | jq '.backend.onnx_providers, .gpu'

onnx_providers must include CUDAExecutionProvider. If it shows only CPUExecutionProvider, the GPU build was shadowed — rebuild (see §1 build note) and check the toolkit.
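
The same check can be scripted against /api/system. A minimal sketch, assuming the response carries the backend.onnx_providers field queried by the jq filter above; fetch_system and gpu_inference_active are hypothetical helper names.

```python
import json
import urllib.request
from typing import Any, Mapping


def fetch_system(base_url: str, api_key: str, timeout: float = 5.0) -> Mapping[str, Any]:
    """GET /api/system with a bearer key and decode the JSON body."""
    req = urllib.request.Request(
        f"{base_url.rstrip('/')}/api/system",
        headers={"Authorization": f"Bearer {api_key}"},
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp)


def gpu_inference_active(system: Mapping[str, Any]) -> bool:
    """True when the reported ONNX providers include the CUDA provider."""
    providers = (system.get("backend") or {}).get("onnx_providers") or []
    return "CUDAExecutionProvider" in providers
```

For example, gpu_inference_active(fetch_system("http://127.0.0.1:8120", key)) could gate a deploy script, failing the rollout when inference has silently fallen back to CPU.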

6. GPU / VRAM budgeting: how many cameras fit

Architecture and what consumes the GPU

Each camera runs in its own subprocess (multiprocessing spawn) with three kinds of thread: a drop-to-latest RTSP decode thread; the detection + tracking + Re-ID main loop, which operates only on the newest frame (stale frames are dropped, never queued); and off-loop drain threads that handle clip assembly, thumbnail/face-sample writes and DB commits, so the hot path never blocks on ffmpeg, JPEG encoding or fsync.

GPU work per camera comes from ONNX inference: YOLOv8n detection, plus SCRFD + ArcFace (faces) and OSNet (appearance) when a track is sampled. Several mechanisms bound that load:

Adaptive detection cadence — full rate (detect_interval) while objects are present, throttled to detect_interval_idle (0.5 s) after active_grace_seconds of an empty scene. Quiet cameras cost almost nothing.
DETECT_FPS / detect_every_n — the worker processes frames at the detection rate, not the stream's native fps.
max_reid_per_frame=4 — a hard cap on how many tracks are (re)identified per detection frame, so a crowd can't stall the loop (excess defers to the next frame).
reid_confident_sample_seconds=9 — already-identified tracks are re-embedded far less often than fresh ones, cutting GPU churn on a stable crowd.
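
Two of these mechanisms are simple enough to sketch. The functions below are illustrative, not the worker's code: the names are hypothetical, detect_interval=0.2 assumes the interval is 1/DETECT_FPS, and active_grace_seconds=3.0 is a placeholder because this document does not state its default.

```python
from typing import List, Optional, Sequence, Tuple, TypeVar

T = TypeVar("T")


def next_detect_interval(now: float, last_object_ts: Optional[float],
                         detect_interval: float = 0.2,
                         detect_interval_idle: float = 0.5,
                         active_grace_seconds: float = 3.0) -> float:
    """Full rate while objects were seen recently, idle rate otherwise."""
    if last_object_ts is not None and now - last_object_ts <= active_grace_seconds:
        return detect_interval
    return detect_interval_idle


def split_reid_batch(pending: Sequence[T],
                     max_reid_per_frame: int = 4) -> Tuple[List[T], List[T]]:
    """Take at most max_reid_per_frame tracks now; defer the rest to the next frame."""
    return list(pending[:max_reid_per_frame]), list(pending[max_reid_per_frame:])
```

The cap bounds the per-frame Re-ID cost regardless of crowd size: six pending tracks become four processed now and two deferred.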

Rough VRAM budget

The model footprint is small: YOLOv8n + buffalo_l (SCRFD + ArcFace) + OSNet under onnxruntime-gpu. According to the notes in the optional DeepStream override, a ~2 GB VRAM budget handles a handful of cameras with the default ONNX backend and lets Iris coexist with a co-hosted generative stack on the same T4.

Practical guidance for a single T4 (16 GB), with the GPU shared:

Reserve VRAM for any co-hosted GPU workload first; budget about 2 GB for Iris's default ONNX pipeline at a handful of cameras.
Throughput, not VRAM, is usually the limit. Each camera at DETECT_FPS=5 is light, and Re-ID, the core of the pipeline, is gated by max_reid_per_frame and the adaptive cadence. To fit more cameras, lower DETECT_FPS and lean on idle throttling.
Check live headroom any time:

curl -fsS http://127.0.0.1:8120/health | jq '.gpu'   # {used_mb, total_mb}
nvidia-smi

If you need many high-FPS streams beyond what ONNX comfortably handles, the optional DeepStream backend (docker-compose.deepstream.yml) sources detections from a DeepStream container over the internal compose network — set DETECTOR_BACKEND=deepstream and bring it up with docker compose -f docker-compose.yml -f docker-compose.deepstream.yml up -d. The DeepStream graph itself is supplied separately; the default MVP does not use it.

Honest Re-ID limit. A camera that only ever sees the back of a head carries no biometric any method can use. Iris is designed to be correct under that constraint: a faceless back/side view never spawns a new identity (it can only attach to an existing one by appearance within a session, else is dropped). This avoids the classic duplicate-identity explosion — but it does mean back-only cameras yield fewer enrolled identities by design, not by bug. Plan camera placement to capture faces if cross-camera identity coverage matters.

7. Host-safety: cgroup memory cap & OOM priority

Because the GPU and host are shared with a production VM, the compose file hard-bounds Iris so it can never take the host down:

mem_limit:      6g     # cgroup RAM cap
memswap_limit:  6g     # no extra swap headroom beyond the cap
oom_score_adj:  800    # make VMS the PREFERRED OOM victim
shm_size:       512m   # POSIX shm for OpenCV/ffmpeg + frame slots

How this protects a co-hosted VM:

If Iris exceeds mem_limit, the kernel triggers a cgroup-scoped OOM that kills a camera worker inside the container (which then restarts) — it does not become a global, host-wide OOM that could kill another VM.
If host-wide memory pressure ever does occur, the high oom_score_adj (800) makes Iris the preferred victim. The protected VM should be given a strongly negative oom_score_adj so it is the last thing the kernel kills; the minimum value, -1000, exempts a process from the OOM killer entirely.
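
To verify that the setting actually took effect, the value can be read back from procfs on the host. This is a sketch: read_oom_score_adj is a hypothetical helper, and the proc_root parameter exists only so the function can be exercised against a fake directory.

```python
from pathlib import Path
from typing import Union


def read_oom_score_adj(pid: Union[int, str] = "self", proc_root: str = "/proc") -> int:
    """Return the oom_score_adj of a process as reported by procfs (Linux only)."""
    return int((Path(proc_root) / str(pid) / "oom_score_adj").read_text().strip())
```

The container's main PID on the host is available from docker inspect --format '{{.State.Pid}}' <container>; passing that PID should return 800 for Iris with the compose settings above.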

Container hardening (defense-in-depth, also in compose):

Runs as the non-root host uid 1000:1000 (GPU still works — /dev/nvidia* are world-accessible).
cap_drop: [ALL] and no-new-privileges:true (blocks setuid escalation).
App source mounted read-only (./app:/app/app:ro); the container never writes its own code.

(Additional in-app hardening lives in the code: RTSP credentials are masked in every API response; ffmpeg runs with -protocol_whitelist and rtsp-scheme validation against SSRF and local-file reads; uploads pass a pixel-bomb guard; file paths are contained via commonpath to block path traversal; and vectors are (de)serialized with numpy.frombuffer rather than pickle.)

8. Operations

Logs & status

docker compose logs -f vms                 # follow
docker compose logs --since 1h vms         # recent
docker compose ps                          # container + health state

# Authenticated introspection (per-camera worker state, fps, providers, GPU)
curl -fsS http://127.0.0.1:8120/api/system -H "Authorization: Bearer $API_KEY" | jq .

Per-camera connection state comes from the live worker heartbeat (published ~every 1s into the manager registry, see app/api/cameras.py), surfaced as online/offline/error. A camera reads as offline when disabled, when no worker is running, or when its last heartbeat is older than status_stale_seconds (15 s).
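
The status rule described above can be written as a small pure function. It is a sketch rather than the code in app/api/cameras.py: camera_status is a hypothetical name, and since this document does not say what produces the error state, a worker_error flag stands in for it.

```python
from typing import Optional


def camera_status(enabled: bool, worker_running: bool,
                  last_heartbeat_ts: Optional[float], now: float,
                  worker_error: bool = False,
                  status_stale_seconds: float = 15.0) -> str:
    """Derive online/offline/error from worker state and heartbeat age."""
    if not enabled or not worker_running or last_heartbeat_ts is None:
        return "offline"
    if now - last_heartbeat_ts > status_stale_seconds:
        return "offline"  # heartbeat too old: treat the worker as gone
    if worker_error:
        return "error"  # assumption: the worker reports its own error condition
    return "online"
```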

Adding / editing a camera

Cameras are managed through the SPA (or the REST API) — there is no config file for them; they live in the cameras table and mutations drive the WorkerManager:

# Create a camera (worker starts immediately when enabled). RTSP creds stay
# server-side and are masked in every response.
curl -fsS -X POST http://127.0.0.1:8120/api/cameras \
  -H "Authorization: Bearer $API_KEY" -H "Content-Type: application/json" \
  -d '{"name":"Front Gate","rtsp_url":"rtsp://USER:PASS@10.0.0.10:554/stream1","enabled":true}'

# List cameras + live status (rtsp_url comes back masked: rtsp://***@host:554/...)
curl -fsS http://127.0.0.1:8120/api/cameras -H "Authorization: Bearer $API_KEY" | jq .

Behaviour to know (from app/api/cameras.py):

Create with enabled:true spawns the worker immediately.
Update restarts the worker when the RTSP URL changes, starts it on enable, stops it on disable, and restarts it when tuning fields (detect_conf, pre_seconds, post_seconds, trigger_classes, etc.) change.
A blank or masked (***) rtsp_url on an edit is treated as "keep the current one" — so echoing back the redacted value won't wipe the stored secret.
Repointing the RTSP URL purges that camera's prior events + their files (it's effectively a different physical camera).
Delete stops the worker, stops any HLS session, then recursively removes clips/thumbnails/segments/HLS/face-samples, cascades events + sightings, and drops any identity that existed only because of this camera (cross-camera identities survive).
Available trigger classes (COCO) for the camera form: GET /api/cameras/detect/classes.
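
The masking and "keep the current one" behaviours can be sketched together. These helpers are hypothetical illustrations of the rules above, not the functions in app/api/cameras.py.

```python
from typing import Optional
from urllib.parse import urlsplit


def mask_rtsp_url(url: str) -> str:
    """Replace any user:password part of the URL with ***."""
    parts = urlsplit(url)
    if "@" not in parts.netloc:
        return url  # no credentials to hide
    host = parts.netloc.rsplit("@", 1)[1]  # rsplit tolerates '@' inside a password
    return parts._replace(netloc=f"***@{host}").geturl()


def resolve_rtsp_update(submitted: Optional[str], stored: str) -> str:
    """Blank or masked input means 'keep the stored URL'; anything else replaces it."""
    if submitted is None or not submitted.strip() or "***" in submitted:
        return stored
    return submitted.strip()
```

For example, mask_rtsp_url("rtsp://USER:PASS@10.0.0.10:554/stream1") gives rtsp://***@10.0.0.10:554/stream1, and feeding that masked value back through resolve_rtsp_update leaves the stored secret untouched.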

The WorkerManager also runs a reconcile loop (worker_reconcile_seconds, 20 s) that keeps running workers in sync with the enabled cameras in the DB — restarting crashed workers and stopping orphaned ones, so transient RTSP drops self-heal.
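
At its core, a reconcile pass is a set difference between what should run and what is running. The sketch below shows only that idea; reconcile is a hypothetical function, and the real WorkerManager also has to detect crashed processes and perform the actual start and stop calls.

```python
from typing import Iterable, Set, Tuple


def reconcile(enabled_ids: Iterable[int],
              alive_ids: Iterable[int]) -> Tuple[Set[int], Set[int]]:
    """Return (to_start, to_stop) for one reconcile pass.

    enabled_ids: cameras marked enabled in the DB.
    alive_ids:   cameras that currently have a live worker process.
    """
    enabled, alive = set(enabled_ids), set(alive_ids)
    to_start = enabled - alive  # enabled but missing or crashed
    to_stop = alive - enabled   # orphaned: running but disabled or deleted
    return to_start, to_stop
```

Running this every worker_reconcile_seconds means a crashed worker stays down for at most one interval before it is restarted.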

Recording & retention

Track mode (default, recording_mode=track): one Event per presence, clip = [enter − pre, last + post], assembled with -c:v copy (near-zero CPU) from the warm rolling segment buffer. Manual recording (the ● REC button in Live Monitoring) assembles a clip from the same on-disk buffer via stateless record/start / record/stop.
Rolling segment buffer: SEGMENT_SECONDS (2 s) segments per camera, pruned by the segmenter's watchdog to segment_retention_seconds (120 s) — old segments past the retention window are deleted automatically (app/recording/segmenter.py). This buffer is bounded and self-cleaning; it is not your event archive.
Event clip retention: recorded event clips/thumbnails are not auto-expired by a time policy — they accumulate under data/recordings and data/thumbnails until deleted. Reclaim space via the events API:

# Delete one event (removes the row + clip + thumbnail files)
curl -fsS -X DELETE http://127.0.0.1:8120/api/events/123 -H "Authorization: Bearer $API_KEY"

The events API also supports filtering (camera_id, person_id, from/to, label, limit, offset) and bulk/clear delete with on-disk file cleanup; deleting an identity recursively removes its crop dirs, face-sample files and solo event clips. Monitor data/ growth and prune events on a schedule appropriate to your disk budget — on a shared pool, keeping the data dir bounded is part of keeping the host healthy.
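
Because event clips are not expired automatically, a scheduled job has to choose what to delete. The sketch below is a starting point under loud assumptions: the shape of an event record (an id plus an ISO-8601 started_at field) is hypothetical and should be checked against the real events API response, while DELETE /api/events/{id} is the endpoint shown above.

```python
import urllib.request
from datetime import datetime, timedelta
from typing import Any, Iterable, List, Mapping


def expired_event_ids(events: Iterable[Mapping[str, Any]], now: datetime,
                      max_age_days: int = 30,
                      ts_field: str = "started_at") -> List[Any]:
    """Ids of events older than the cutoff.

    Assumes each event has an 'id' and an ISO-8601 timestamp in ts_field
    (hypothetical field name). `now` and the timestamps must both be
    timezone-aware or both naive, otherwise the comparison raises.
    """
    cutoff = now - timedelta(days=max_age_days)
    return [e["id"] for e in events if datetime.fromisoformat(e[ts_field]) < cutoff]


def build_delete_request(base_url: str, event_id: Any, api_key: str) -> urllib.request.Request:
    """Prepare (but do not send) an authenticated DELETE for one event."""
    return urllib.request.Request(
        f"{base_url.rstrip('/')}/api/events/{event_id}",
        method="DELETE",
        headers={"Authorization": f"Bearer {api_key}"},
    )
```

Each prepared request can be sent with urllib.request.urlopen from a cron job or systemd timer. The 30-day default is an arbitrary placeholder; pick a window that matches your disk budget.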

Startup self-cleaning

On boot, lifespan runs purge_orphans to remove on-disk artifacts (thumbnails/clips/segments/frame slots) left by cameras/events that no longer exist in the DB, so a deleted camera never leaves stale footage behind.

Updating / restarting

git pull
docker compose up -d --build      # rebuild + recreate; data/ and models/ persist
docker compose restart vms        # restart without rebuild

data/ (DB + media) and models/ are bind-mounted and survive rebuilds. Back up by snapshotting the ./data directory. The SQLite DB runs in WAL mode, so either stop the container and copy vms.db, vms.db-wal and vms.db-shm together, or take an online backup with sqlite3 vms.db ".backup <dest-file>" (the .backup command needs a destination file name).
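
The online backup can also be taken from Python using the standard library's sqlite3 backup API, which produces a consistent copy even while the database is in use. A minimal sketch; backup_sqlite is a hypothetical helper and the destination path is up to you.

```python
import sqlite3


def backup_sqlite(src_path: str, dest_path: str) -> None:
    """Copy a live SQLite database to dest_path using the online backup API."""
    src = sqlite3.connect(src_path)
    dest = sqlite3.connect(dest_path)
    try:
        # Connection.backup copies page by page and copes with concurrent writers.
        src.backup(dest)
    finally:
        dest.close()
        src.close()
```

The result is a single self-contained file, so there are no -wal or -shm companions to keep in step with it.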

9. Quick reference

| Knob (`.env`) | Default | Purpose |
| --- | --- | --- |
| `AUTH_REQUIRED` | `true` | Fail-closed auth gate |
| `API_KEY` | unset | Bearer key for tunnel/CLI (constant-time compared) |
| `SSO_HEADER` | `X-Email` | Trusted header injected by nginx |
| `DETECTOR_BACKEND` / `DETECTOR_DEVICE` | `onnx` / `cuda` | Detection backend & device (`cpu` to survive GPU eviction) |
| `DETECT_FPS` | `5` | Worker processing rate (not stream fps) |
| `SEGMENT_SECONDS` | `2` | Rolling pre-roll segment length |
| `PRE_SECONDS` / `POST_SECONDS` | `5` / `10` | Clip pre/post-roll |
| `TRIGGER_COOLDOWN` | `15` | Event debounce per camera |
| `mem_limit` (compose) | `6g` | cgroup RAM cap (host-OOM protection) |
| `oom_score_adj` (compose) | `800` | Iris is preferred OOM victim |

| Endpoint | Auth | Use |
| --- | --- | --- |
| `GET /health` | none | Liveness + GPU/worker snapshot (monitoring) |
| `GET /api/system` | yes | Providers, device, GPU mem, worker states |
| `GET/POST /api/cameras` | yes | Camera CRUD (drives workers) |
| `GET /api/cameras/{id}/status` | yes | Live status / fps / detector |
| `DELETE /api/events/{id}` | yes | Delete event + clip + thumbnail |