
Analytics — audit, cross-camera design & critical-sound warnings (2026-06)

Synthesis of a multi-auditor workflow that verified the existing analytics against the code and the live DB, then designed the cross-camera person/voice analytics and the critical-sound warning system. The product thesis: a person-centric, cross-camera, multimodal archive that beats a DVR — and alerts on critical sounds.

Audit verdict (verified, not assumed)

  • Face / cross-camera identity: BROKEN. DB proof: every busy identity is pinned to ONE camera; 769/788 sightings matched by appearance and only 10 by face, so cross-camera re-ID is never exercised. The operator's "дублируется" ("it duplicates") is identity COLLAPSE (many people glued into one identity per camera), not explosion. Root cause: face_match=0.42 is far too low for buffalo_l surveillance faces (it merges different people), and require_face_for_new_person=True forces the 90% of crops that are faceless to attach by appearance to the freshest identity on a near-static camera → one mega-identity/cam. The maintenance merge is face-only + temporally gated (it never links across cameras and never splits the contaminated bins). The per-worker gallery rebuild on a 30 s timer biases toward per-camera identities.
  • STT / translate — PARTIAL, not duplicating. The STT loop is atomic + idempotent (_claim_next), so there is no re-transcription; the "broken" surface is clips skipped by silence/no-speech gates with no persisted reason. (The duplication the operator sees is identity collapse + over-claimed event linkage from NULL event_ids.)
  • Voiceprint biometrics — DEAD scaffolding. 0 exemplars: enrollment is triple-gated (voice off + face-grade-only + consent NULL). Voice is modeled as columns ON Identity, so a voice-discovered human can never fuse with a face-discovered one.
  • Sound tagging — WORKS, but no alerting. YAMNet runs + populates events.sound_tags and the full critical class map is present, BUT it mean-pools over the whole clip (averaging away a sub-second gunshot) and there is zero warning/severity/alert layer.
  • NAMING TRAP: a persons table + Person ORM already exist (legacy insightface enrollment) — the fusion layer must NOT reuse that name.
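
The sound-tagging finding above turns on how per-frame scores are pooled into one clip-level score. A minimal sketch of the effect, assuming made-up per-frame scores for a single class; the frame count, the score values and the 0.5 threshold are illustrative and not taken from this deployment:

```python
from statistics import mean


def pool_scores(frame_scores, mode):
    """Collapse per-frame scores for one class into a single clip-level score."""
    if not frame_scores:
        return 0.0
    return max(frame_scores) if mode == "max" else mean(frame_scores)


# Hypothetical clip: 20 frames, a transient is loud in exactly one of them.
frame_scores = [0.02] * 19 + [0.90]
THRESHOLD = 0.5  # illustrative per-class threshold, not a tuned value

clip_mean = pool_scores(frame_scores, "mean")
clip_max = pool_scores(frame_scores, "max")

print(f"mean-pool={clip_mean:.3f} fires={clip_mean >= THRESHOLD}")
print(f"max-pool={clip_max:.3f} fires={clip_max >= THRESHOLD}")
```

The single loud frame is diluted by the 19 quiet ones under mean-pooling, while max-pooling keeps it. The same property is why max-pooling also lets through more short false positives.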

Roadmap

  • Phase 0 — stop the collapse (cheap, config-led): raise reid face thresholds; require face corroboration for cross-camera appearance links; stamp event_id on the track path; a maintenance auto-split to repair existing mega-identities.
  • Phase 1 — critical-sound warnings (SHIPPED, first slice; default-OFF). The headline differentiator; fully additive + fail-open.
  • Phase 2 — unblock voice biometrics: source voice PCM from the gapless audio archive, decode once, write ≤1 exemplar per (identity,event), relax the enrollment gates (config-fail-closed), persist STT skip reason.
  • Phase 3 — cross-camera multimodal fusion: a NEW person_cluster + person_member layer ABOVE reid Identity (additive, fallback to identity_id); conservative cross-camera face-centroid merge + voice-centroid merge (gated by measured EER) + face↔voice bridge, all operator-confirmable; a person-centric timeline (GET /api/people/{cluster}/timeline) + person filter + voice-match chips.

Critical-sound warnings — what shipped (Phase 1)

Gunshot / scream-cry / glass / alarm-siren → an operator Warning with severity.

  • app/audio/sound_tagger.py: tag_pcm_critical() max-pools YAMNet per-frame scores (vs the ambient mean-pool) so a 0.2 s transient crosses threshold; CRITICAL_CLASSES maps 18 AudioSet names → (severity high|medium, per-class threshold).
  • app/db/models.py: Warning table (camera, ts, critical_class, severity, score, event_id, clip_path, acknowledged, ack_by), auto-created by create_all. Not named Person/persons.
  • app/transcribe/manager.py: in the existing sound-tag pass (PCM already decoded once) it also runs tag_pcm_critical and writes a Warning per class over threshold, debounced per (camera, class) via a DB lookup (restart-safe); best-effort, never regresses STT.
  • app/api/warnings.py: GET /api/warnings (unack, newest-first), POST /{id}/ack, POST /ack-all; mounted via _OPTIONAL_ROUTERS.
  • app/static: polls /api/warnings every 15 s → a persistent red banner (pulsing for high severity) with Просмотр ("View", opens the clip) + OK (ack) + Скрыть все ("Hide all").
  • Config: critical_sound_enabled=False (default-OFF until thresholds are validated), critical_sound_debounce_seconds=30.
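
The restart-safe debounce described above can be sketched as a lookup against the warnings table itself rather than an in-memory timestamp. This is an illustration only: it uses sqlite3 directly instead of the project's ORM, and the table layout, the function names and the "gunshot" label are hypothetical.

```python
import sqlite3

DEBOUNCE_SECONDS = 30  # mirrors critical_sound_debounce_seconds=30


def open_db():
    """In-memory stand-in for the real database (hypothetical, reduced schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE warnings ("
        " id INTEGER PRIMARY KEY, camera TEXT, critical_class TEXT,"
        " severity TEXT, score REAL, ts REAL)"
    )
    return conn


def write_warning_debounced(conn, camera, critical_class, severity, score, ts,
                            debounce=DEBOUNCE_SECONDS):
    """Insert a warning unless the same (camera, class) fired within the window.

    The last firing time is read from the DB, so the debounce state survives
    a process restart.
    """
    last_ts = conn.execute(
        "SELECT MAX(ts) FROM warnings WHERE camera = ? AND critical_class = ?",
        (camera, critical_class),
    ).fetchone()[0]
    if last_ts is not None and ts - last_ts < debounce:
        return False  # suppressed: still inside the debounce window
    conn.execute(
        "INSERT INTO warnings (camera, critical_class, severity, score, ts)"
        " VALUES (?, ?, ?, ?, ?)",
        (camera, critical_class, severity, score, ts),
    )
    conn.commit()
    return True


conn = open_db()
first = write_warning_debounced(conn, "cam1", "gunshot", "high", 0.91, 100.0)
repeat = write_warning_debounced(conn, "cam1", "gunshot", "high", 0.88, 110.0)
other_cam = write_warning_debounced(conn, "cam2", "gunshot", "high", 0.80, 110.0)
later = write_warning_debounced(conn, "cam1", "gunshot", "high", 0.95, 131.0)
```

Keying the lookup on (camera, class) means a scream on cam1 does not suppress a gunshot on cam1, and the same class on another camera still fires.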

Coverage note (first slice): detection piggybacks the clip transcribe pass (video-triggered, queue latency). The seconds-latency analyzer on the continuous audio archive (camera_worker thread over AudioSegmenter, fires even with no video) is the Phase-1 follow-up. Risk: max-pool raises more false positives (TV gunfire, door slams) — tune per-class thresholds on this deployment's audio before trusting high-severity automation.

Cross-camera fusion + soft-biometrics + voice-only (designed; first slice shipped)

A second multi-architect + red-team workflow designed the "one person across cameras" layer. Strict signal hierarchy (mirrors the collapse fix): FACE anchors and is the only cross-camera authority; VOICE corroborates cross-camera only when EER-calibrated + margin-gated; APPEARANCE/soft-bio links ONLY within a camera/session, never mints, never crosses days.

  • Fusion (Phase 3, designed): a NEW reversible grouping layer person_cluster + person_member (NOT the legacy persons/Person) over un-named identities; an API-process pass links by face-centroid (auto ≥0.66 + margin; 0.55–0.62 = operator suggestion only), voice-centroid (≥ measured EER), and a face↔voice bridge; detach is non-destructive (Identity survives, unlike merge); chain-verify against the cluster centroid; person-centric multimodal timeline.
  • Back-view / soft-biometrics (SHIPPED, first slice, default-OFF): _decide_by_appearance gains a within-camera band: a faceless person whose OSNet clothing similarity sits in [app_gate, app_match) with margin is accepted only if colour (veto) + build (aspect) + height-proxy corroborate (softbio_score ≥ floor). Per-sighting geometry (height_frac/aspect/area_frac/foot_y_frac) is stored on sightings, with a running summary in Identity.attributes["softbio"]. Honest reliability: colour is the trustworthy signal; build is a weak nudge; height is aspirational without per-camera ground-plane calibration (foot_y_frac is stored for a future homography). The cross-camera ban + require-face-for-new checks run FIRST (unit-tested order guard). Config: softbio_corroboration=False etc.
  • Voice-only attribution (designed): on a faceless label="audio" event with no identity, if a voiceprint matches a person's voice_centroid ≥ EER (+margin, ≥2 s speech, consent), write a match_kind="voice" Sighting / stamp the event — so a person shows on the timeline even with NO video. Voice never mints/merges; OFF while voice_match_threshold<=0.

Key false-positive guards (red-team): collapse floor sacred (never lower face/app bars); soft-bio never cross-camera/never mints/colour-veto; voice needs calibration + margin + consent + faceless-only; fusion auto-bar ≥0.66 with margin + chain-verify + operator-confirm + rejection memory; everything additive, default-OFF, fail-open, reversible.
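
The face-centroid bands from the fusion design can be read as a three-way decision. The sketch below is one conservative reading and not the designed implementation: the margin value is hypothetical, and scores that fall between the suggestion band and the auto bar (a range the design does not describe) are treated as "no action" by assumption.

```python
AUTO_MIN = 0.66     # auto-link bar from the design
SUGGEST_MIN = 0.55  # operator-suggestion band, lower edge
SUGGEST_MAX = 0.62  # operator-suggestion band, upper edge
MARGIN = 0.05       # hypothetical: required lead of the best match over the runner-up


def face_link_decision(best, runner_up):
    """Classify a cross-camera face-centroid match.

    best / runner_up are cosine similarities of the top two candidate clusters.
    Returns "auto", "suggest" or "none".
    """
    if best >= AUTO_MIN and best - runner_up >= MARGIN:
        return "auto"      # unambiguous: link without asking
    if SUGGEST_MIN <= best <= SUGGEST_MAX:
        return "suggest"   # surfaced to the operator, never applied automatically
    return "none"          # assumption: everything else abstains


for best, runner_up in [(0.70, 0.50), (0.70, 0.68), (0.58, 0.10), (0.64, 0.10)]:
    print(best, runner_up, face_link_decision(best, runner_up))
```

Abstaining when the auto bar is met but the margin is not keeps the decision consistent with the "never lower the bars" guard: a near-tie between two candidates is treated as ambiguity, not as evidence.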

Same-person-different-angle (anti-fragmentation) — appearance-corroborated merge

After the collapse fix, the opposite failure appeared: one person seen frontal-then-profile becomes two identities (ArcFace frontal-vs-profile cosine ≈ 0, indistinguishable from two strangers; the face is uninformative across very different angles). Fix (SHIPPED, default-OFF, enabled here): maintenance._auto_merge_pass gains an appearance-corroborated SAME-CAMERA merge. Two un-named identities merge when CLOTHING (appearance_centroid) cosine ≥ reid_appearance_merge_threshold (0.80) AND they share a camera AND their sighting ranges fall in the same session (reid_appearance_merge_max_gap_seconds) AND they were never in the same frame (_same_camera_simultaneous: ≈ same timestamp ⇒ two boxes ⇒ two people). NEVER cross-camera (that stays face-only, per the collapse fix). The maintenance stats include a merges_by_appearance counter.

Precision is deliberately conservative — it only auto-merges unambiguous cases. When the evidence is mixed (e.g. two distinct boxes co-occur in a frame ⇒ possibly two people, or clothing only moderately similar) it abstains rather than risk a wrong merge. For those, the operator uses the existing manual merge (POST /api/identities/merge; Identities view → select → "Merge selected") — human judgement resolves what the algorithm shouldn't guess. (The reversible operator-confirmable person_cluster suggestion layer is the planned upgrade so borderline pairs are surfaced for one-click confirm instead of needing manual selection.)
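
The four merge conditions can be expressed as a single predicate. This is a sketch under stated assumptions and not the code in maintenance._auto_merge_pass: the Ident structure, the 600 s session gap and the 0.5 s "same frame" tolerance are hypothetical; only the 0.80 clothing threshold comes from the text above.

```python
from dataclasses import dataclass
from math import sqrt
from typing import List, Optional, Tuple

MERGE_THRESHOLD = 0.80   # reid_appearance_merge_threshold from the text
MAX_GAP_SECONDS = 600.0  # hypothetical value for reid_appearance_merge_max_gap_seconds
SAME_FRAME_EPS = 0.5     # hypothetical tolerance for "≈ same timestamp"


@dataclass
class Ident:
    name: Optional[str]                 # None = un-named identity
    centroid: List[float]               # clothing (appearance) centroid
    sightings: List[Tuple[str, float]]  # (camera, timestamp in seconds)


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na, nb = sqrt(sum(x * x for x in a)), sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def should_merge(a, b):
    """True only when every condition holds; any doubt means abstain."""
    if a.name is not None or b.name is not None:
        return False  # only un-named identities are auto-merged
    shared = {cam for cam, _ in a.sightings} & {cam for cam, _ in b.sightings}
    if not shared:
        return False  # never cross-camera
    if cosine(a.centroid, b.centroid) < MERGE_THRESHOLD:
        return False  # clothing not similar enough
    for cam in shared:
        ta = [t for c, t in a.sightings if c == cam]
        tb = [t for c, t in b.sightings if c == cam]
        if any(abs(x - y) <= SAME_FRAME_EPS for x in ta for y in tb):
            return False  # seen together in one frame: two people
        gap = max(min(ta), min(tb)) - min(max(ta), max(tb))
        if gap <= MAX_GAP_SECONDS:
            return True   # same camera, same session, never co-present
    return False


frontal = Ident(None, [1.0, 0.0, 0.1], [("cam5", 100.0), ("cam5", 101.0)])
profile = Ident(None, [0.95, 0.05, 0.1], [("cam5", 110.0)])
print(should_merge(frontal, profile))
```

Every early return is an abstention, which matches the stated policy that mixed evidence is left to the operator's manual merge.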

Per-identity DOSSIER (SHIPPED) + people-count/activity (roadmap)

A workflow designed per-person analytics; the dossier first slice is shipped (read-only, zero GPU, zero migration, no hot-path touch):

  • GET /api/identities/{id}/dossier merges EVERY recognized speech line across the person's clips into ONE chronological original+translation stream (reuses the on-disk WebVTT via events._parse_vtt_cues, no STT re-run; abs ts = clip_start_ts|ts + cue), plus speech stats (utterances/words/languages/busiest camera), aggregate stats (dwell, sightings, cameras, events, age/gender/colour, voice-exemplars, top sound tags), occupancy (Event.num_objects peak/avg), and an ESTIMATED kinematic activity summary from PresenceSegments. Fail-open per sub-block. UI: a "📋 Досье — текстовая картина" ("Dossier: a text picture") section in identities.js openDetail (stats strip + chips + day-grouped clickable transcript). Verified: id2 = 254 lines/1528 words. Honesty: speech is "recorded while present" (per-clip), NOT voice-verified attribution.
  • Roadmap (touches the hot path → second/third): per-event PEAK people count (peak_people column set at track-finalize from simultaneous tracks) for true occupancy; per-track kinematic activity (add a small position buffer to the tracker → speed in body-heights/s → stationary/loitering/walking/running/approaching/leaving, stored per sighting/event, labelled estimated). Fall/fight/gesture = a NEW lightweight pose model, gated on GPU headroom (T4 is at ~12.8/15.3 GB, not safe to add now).
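
The transcript merge described above (absolute timestamp = clip start + cue offset, then one chronological stream) can be sketched as follows. The parser here is a tiny stand-in written for this example and is not events._parse_vtt_cues; the clip dictionary keys are likewise hypothetical.

```python
import re

# WebVTT cue timing line; the hours field is optional in WebVTT.
CUE_RE = re.compile(r"(?:(\d+):)?(\d{2}):(\d{2})\.(\d{3})\s+-->")


def parse_vtt_cues(vtt_text):
    """Stand-in parser: returns [(start_offset_seconds, text), ...]."""
    cues, lines, i = [], vtt_text.splitlines(), 0
    while i < len(lines):
        m = CUE_RE.match(lines[i].strip())
        if not m:
            i += 1
            continue
        hours, minutes, seconds, millis = m.groups()
        offset = int(hours or 0) * 3600 + int(minutes) * 60 + int(seconds) + int(millis) / 1000
        i += 1
        text = []
        while i < len(lines) and lines[i].strip():  # cue payload ends at a blank line
            text.append(lines[i].strip())
            i += 1
        cues.append((offset, " ".join(text)))
    return cues


def build_transcript(clips):
    """Merge the cues of many clips into one chronological stream."""
    rows = []
    for clip in clips:
        # Fall back to the event timestamp when clip_start_ts is missing.
        start = clip["clip_start_ts"] if clip.get("clip_start_ts") is not None else clip["ts"]
        for offset, text in parse_vtt_cues(clip["vtt"]):
            rows.append({"ts": start + offset, "camera": clip["camera"], "text": text})
    rows.sort(key=lambda r: r["ts"])
    return rows


vtt_a = "WEBVTT\n\n00:00:01.000 --> 00:00:02.000\nhello\n\n00:00:05.500 --> 00:00:06.000\nbye\n"
vtt_b = "WEBVTT\n\n00:02.000 --> 00:03.000\nmiddle\n"
transcript = build_transcript([
    {"clip_start_ts": 1000.0, "ts": 1000.0, "camera": "cam1", "vtt": vtt_a},
    {"clip_start_ts": None, "ts": 1001.0, "camera": "cam2", "vtt": vtt_b},
])
for row in transcript:
    print(row["ts"], row["camera"], row["text"])
```

Because the rows are sorted on absolute time, lines from overlapping clips on different cameras interleave naturally, which is what makes the stream readable as one conversation timeline.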

Analytics DASHBOARD (SHIPPED) — charts + reports

A workflow designed a comprehensive adaptive dashboard; the MVP is shipped (read-only, zero GPU, zero migration, no hot-path touch, no external CDN/lib):

  • GET /api/analytics/summary?from=&to=&camera_id=&bucket=day|hour (app/api/analytics.py, mounted via _OPTIONAL_ROUTERS): ONE bounded SELECT-only JSON powering charts+report+CSV: kpis, events_per_bucket (strftime), per_camera, occupancy (num_objects), match_kind, warnings_by_class + warnings_per_bucket, sound_tags (CSV tally), transcript_coverage, leaderboard. Each dataset in its own try/except → [] (fail-open); hour auto-coarsens to day past 31 days so a query never scans unbounded.
  • app/static/charts.js: hand-rolled inline-SVG chart helper window.Charts (bar/barH/line/area/donut/sparkline): viewBox-responsive, theme-aware via CSS custom properties (auto light/dark), <title> tooltips, empty-state guard, NO dependency.
  • New Analytics tab + renderAnalytics() (app.js): sticky filter bar (date range + camera + day/hour), a KPI strip (events/transcribed/persons/warnings/busiest camera+day with a trend sparkline), and an adaptive .an-grid (auto-fill minmax(320px,1fr); wide cards span 2 ≥1000px) of 9 charts. A 🖨 Отчёт ("Report") button → window.print() with a print stylesheet (hides chrome, ink-friendly, no break-inside) → operator "Save as PDF", zero-dep. Verified live: 819 events, per-camera/occupancy/match_kind/warnings/sound-tags/coverage/leaderboard all populate; cam5 hour-filter = 413 events, peak hour 14:00.
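
Two of the endpoint's safety properties, the hour-to-day coarsening and the per-dataset fail-open, are easy to show in isolation. A minimal sketch, assuming hypothetical function names and dataset builders; the real handler in app/api/analytics.py may be structured differently.

```python
from datetime import datetime, timedelta

MAX_HOURLY_SPAN = timedelta(days=31)  # past this, hourly buckets are coarsened
VALID_BUCKETS = ("day", "hour")


def effective_bucket(start, end, requested):
    """Return the bucket actually used for a query window."""
    if requested not in VALID_BUCKETS:
        raise ValueError(f"unknown bucket: {requested!r}")
    if requested == "hour" and end - start > MAX_HOURLY_SPAN:
        return "day"
    return requested


def fail_open(build):
    """Run one dataset builder; a failure yields [] instead of failing the response."""
    try:
        return build()
    except Exception:
        return []


def summary(start, end, requested, builders):
    """builders: {dataset_name: callable(bucket) -> list} (hypothetical shape)."""
    bucket = effective_bucket(start, end, requested)
    out = {"bucket": bucket}
    for name, build in builders.items():
        out[name] = fail_open(lambda b=build: b(bucket))
    return out


result = summary(
    datetime(2026, 1, 1), datetime(2026, 6, 1), "hour",
    {"ok": lambda bucket: [bucket], "broken": lambda bucket: 1 / 0},
)
print(result)
```

One broken dataset costs the dashboard a single empty chart, and a five-month window requested at hourly resolution is served at daily resolution instead.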

Data enrichment + CSV/report (SHIPPED)

  • CSV export + printable report. Analytics view: ⬇ CSV exports any aggregate (events-by-period, per-camera, occupancy, activity, warnings, sound-tags, leaderboard) as RFC-4180 + UTF-8-BOM CSV (client-side from the loaded JSON via csvDownload, zero backend). 🖨 Отчёт ("Report") → window.print() with a print stylesheet (hides chrome, ink-friendly, vector SVG, @page margins) + a print-only title/period/timestamp header → Save-as-PDF.
  • Hot-path data ENRICHMENT (estimated, GPU-free, additive, fail-open). Computed at track finalize, NOT on the detection inference path; written through the existing _create_event_row(**meta) hasattr filter; nullable columns via ensure_event_kinematics_schema (no backfill; legacy rows stay NULL):
      • events.peak_people: max same-class tracks alive together during the presence (real co-presence; num_objects was always 1). _track_and_identify tallies concurrent tracks per frame → each track keeps its peak.
      • events.activity / activity_speed: Track keeps a bounded position buffer (filled O(1) in the tracker's matched branch); Track.kinematics() derives speed in body-heights/sec (scale-robust, no calibration) + straightness + height-trend; classify_activity() → stationary / loitering / walking / running / approaching / leaving (conservative cfg thresholds), or NULL when too few samples. Verified live: event 1164 = peak_people 2, activity "walking", 0.282 bh/s. Surfaced in the dashboard (occupancy now = peak_people; new "Активность (оценка)" ("Activity (estimate)") donut) and in the dossier (activity-mix chips). Unit-tested (tests/test_tracker_kinematics.py).
      • Honesty: monocular/uncalibrated speed; PTZ motion can inject false motion (documented); fall/fight/gesture still need a pose model (GPU-gated; T4 ~12.8/15.3 GB).
  • Activity heatmap (hour × weekday). /api/analytics/summary returns activity_heatmap (24×7 event counts; %w remapped to Mon..Sun) drawn by a new Charts.heatmap inline-SVG primitive (intensity = accent opacity, theme-aware).
  • Loitering watch-list. loitering (recent events with activity="loitering", newest-first) → a clickable panel (open the clip).
  • Deep-link. Clicking a "по камерам" ("by camera") bar jumps to the Events view filtered by that camera (Charts.barH gained an onBar callback + _analyticsJumpToCamera).
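
The two enrichment values described in the list above, peak co-presence and speed in body-heights per second, can be illustrated with plain data. This is a sketch and not the tracker code: the input shapes, and the choice of path length divided by mean box height as the speed definition, are assumptions made for the example.

```python
from math import hypot
from statistics import mean


def peak_people(frames):
    """frames: iterable of sets of same-class track ids alive in each frame.

    Returns {track_id: largest number of tracks alive together while it was alive}.
    """
    peaks = {}
    for alive in frames:
        count = len(alive)
        for track_id in alive:
            if count > peaks.get(track_id, 0):
                peaks[track_id] = count
    return peaks


def speed_body_heights_per_sec(samples):
    """samples: [(t_seconds, x_px, y_px, box_height_px), ...] in time order.

    Dividing the travelled distance by the box height makes the value roughly
    independent of how far the person is from the camera. Returns None when
    there are too few samples to estimate anything.
    """
    if len(samples) < 2:
        return None
    elapsed = samples[-1][0] - samples[0][0]
    avg_height = mean(s[3] for s in samples)
    if elapsed <= 0 or avg_height <= 0:
        return None
    path = sum(
        hypot(b[1] - a[1], b[2] - a[2]) for a, b in zip(samples, samples[1:])
    )
    return path / avg_height / elapsed


frames = [{1}, {1, 2}, {1, 2, 3}, {2}, {4}]
peaks = peak_people(frames)

track = [(0.0, 100, 400, 200), (1.0, 160, 400, 200), (2.0, 220, 400, 200)]
speed = speed_body_heights_per_sec(track)
print(peaks, speed)
```

Returning None for short tracks corresponds to the NULL activity value for events with too few samples; any mapping from speed to an activity label would need the deployment's own thresholds.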

Zones + roles + spatial heatmap (SHIPPED — first slice)

The "sellable" layer: understand WHERE (bar / hall / staff area / entrance) and WHO (staff vs visitor), so analytics is per-place and per-role.

  • Zones (per-camera ROIs). zones table (camera_id, name, kind ∈ bar/staff_area/public/entrance/checkout/custom, polygon = normalized [x,y] list) auto-created by create_all; CRUD at /api/cameras/{id}/zones + /api/zones/{id} (app/api/zones.py). A draw-on-snapshot rectangle editor in the Cameras view ("🗺 Зоны" ("Zones") per row) over the live /api/live/{id}/snapshot. Attribution: a sighting's foot point (new Sighting.foot_x_frac + existing foot_y_frac, computed in the pipeline) is tested point-in-polygon (app/geom.py) at analytics time → per-zone people / staff / visitor / visits. (Analytics-time attribution = no hot-path/worker change; legacy sightings lack foot_x and simply aren't attributed.)
  • Roles (staff vs visitor). Authoritative Identity.role (operator-set via POST /api/identities/{id}/role, buttons in the dossier). Analytics roles block = distinct people by role; per-zone staff/visitor split. (Zone-dwell and uniform-colour SUGGESTIONS, the other two operator-chosen methods, are the next increment on top of this authoritative base.)
  • Spatial heatmap (где стоят люди, "where people stand"). Single-camera 28×16 foot-point density grid over the window → Charts.heatmap. Honest: foot-point density, no ground-plane calibration. Shown when one camera is selected in the filter.
  • Why it sells: footfall + dwell-by-area + staff-coverage + queue/loiter + heatmaps = the hospitality/retail analytics buyers pay for.
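
Zone attribution as described above reduces to a point-in-polygon test on the normalized foot point, followed by a distinct-people tally per role. The sketch uses the standard ray-casting test; it is not the code in app/geom.py, and the tuple layout of a sighting is a hypothetical simplification.

```python
def point_in_polygon(x, y, polygon):
    """Ray casting: count how many polygon edges a ray to the right of (x, y) crosses."""
    inside = False
    j = len(polygon) - 1
    for i in range(len(polygon)):
        xi, yi = polygon[i]
        xj, yj = polygon[j]
        if (yi > y) != (yj > y):  # edge straddles the horizontal line through y
            x_cross = (xj - xi) * (y - yi) / (yj - yi) + xi
            if x < x_cross:
                inside = not inside
        j = i
    return inside


def zone_role_counts(sightings, zones):
    """sightings: [(identity_id, role, foot_x_frac, foot_y_frac), ...]
    zones: {zone_name: polygon as a list of normalized [x, y] points}

    Returns distinct people per zone, split by role.
    """
    seen = {name: {} for name in zones}  # zone -> {identity_id: role}
    for identity_id, role, fx, fy in sightings:
        if fx is None or fy is None:
            continue  # legacy sighting without a foot point: not attributed
        for name, polygon in zones.items():
            if point_in_polygon(fx, fy, polygon):
                seen[name][identity_id] = role
    return {
        name: {
            "people": len(ids),
            "staff": sum(1 for r in ids.values() if r == "staff"),
            "visitor": sum(1 for r in ids.values() if r == "visitor"),
        }
        for name, ids in seen.items()
    }


bar_zone = [[0.0, 0.5], [0.5, 0.5], [0.5, 1.0], [0.0, 1.0]]
sightings = [
    (1, "staff", 0.2, 0.8),
    (1, "staff", 0.3, 0.9),     # same person again: counted once
    (2, "visitor", 0.4, 0.6),
    (3, "visitor", 0.9, 0.9),   # outside the zone
    (4, "visitor", None, 0.7),  # legacy row without foot_x_frac
]
counts = zone_role_counts(sightings, {"bar": bar_zone})
print(counts)
```

Running the test at analytics time, as the design does, means zones can be redrawn later and historical sightings are re-attributed without touching stored rows.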

Open questions

  • Per-class thresholds need tuning on real audio; do we want a default-OFF high-severity webhook (push) for gunshot/scream?
  • Voice fusion needs a measured EER for voice_match_threshold (currently 0 = disabled).
  • Phase-0 threshold raise may fragment some existing identities — acceptable vs collapse?