Analytics: audit, cross-camera design & critical-sound warnings (2026-06)
Synthesis of a multi-auditor workflow that verified the existing analytics against the code and the live DB, then designed the cross-camera person/voice analytics and the critical-sound warning system. The product thesis: a person-centric, cross-camera, multimodal archive that beats a DVR — and alerts on critical sounds.
Audit verdict (verified, not assumed)
- Face / cross-camera identity: BROKEN. DB proof: every busy identity is pinned to ONE camera; 769/788 sightings matched by appearance and only 10 by face, so cross-camera re-ID is never exercised. The operator's "duplicating" ("дублируется") complaint is identity COLLAPSE (many people glued into one identity per camera), not explosion. Root cause: face_match=0.42 is far too low for buffalo_l surveillance faces (it merges different people), and require_face_for_new_person=True forces the faceless crops (90%) to attach by appearance to the freshest identity on a near-static camera → one mega-identity/cam. The maintenance merge is face-only + temporally gated (it never links across cameras and never splits the contaminated bins). The per-worker gallery rebuild on a 30 s timer biases toward per-camera identities.
- STT / translate: PARTIAL, not duplicating. The STT loop is atomic + idempotent (_claim_next), so there is no re-transcription; the "broken" surface is clips skipped by silence/no-speech gates with no persisted reason. (The duplication the operator sees is identity collapse + over-claimed event linkage from NULL event_id values.)
- Voiceprint biometrics: DEAD scaffolding. 0 exemplars: enrollment is triple-gated (voice off + face-grade-only + consent NULL). Voice is modeled as columns ON Identity, so a voice-discovered human can never fuse with a face-discovered one.
- Sound tagging: WORKS, but no alerting. YAMNet runs + populates events.sound_tags and the full critical class map is present, BUT it mean-pools over the whole clip (averaging away a sub-second gunshot) and there is zero warning/severity/alert layer.
- NAMING TRAP: a persons table + Person ORM already exist (legacy insightface enrollment); the fusion layer must NOT reuse that name.
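The "pinned to ONE camera" and "matched by appearance vs face" findings can be re-checked with two aggregate queries. The sketch below is illustrative only: it assumes a SQLite database and a reduced, hypothetical `sightings(identity_id, camera_id, match_kind)` layout; the real table has more columns and the helper name `audit_identity_spread` is invented here.

```python
import sqlite3

def audit_identity_spread(conn):
    """Return (sightings per match_kind, count of identities seen on >1 camera)."""
    by_kind = dict(conn.execute(
        "SELECT match_kind, COUNT(*) FROM sightings GROUP BY match_kind"
    ).fetchall())
    multi_cam = conn.execute(
        "SELECT COUNT(*) FROM ("
        "  SELECT identity_id FROM sightings"
        "  GROUP BY identity_id HAVING COUNT(DISTINCT camera_id) > 1"
        ")"
    ).fetchone()[0]
    return by_kind, multi_cam

# Tiny in-memory stand-in for the live DB (hypothetical rows).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sightings (identity_id INT, camera_id INT, match_kind TEXT)"
)
conn.executemany("INSERT INTO sightings VALUES (?, ?, ?)", [
    (1, 1, "appearance"), (1, 1, "appearance"), (1, 1, "face"),
    (2, 2, "appearance"), (2, 2, "appearance"),
])
by_kind, multi_cam = audit_identity_spread(conn)
print(by_kind, multi_cam)
```

A `multi_cam` of zero alongside an appearance-dominated `by_kind` is the collapse signature described above: no identity ever spans two cameras.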
Roadmap
- Phase 0: stop the collapse (cheap, config-led). Raise reid face thresholds; require face corroboration for cross-camera appearance links; stamp event_id on the track path; add a maintenance auto-split to repair existing mega-identities.
- Phase 1: critical-sound warnings (SHIPPED, first slice; default-OFF). The headline differentiator; fully additive + fail-open.
- Phase 2: unblock voice biometrics. Source voice PCM from the gapless audio archive, decode once, write ≤1 exemplar per (identity, event), relax the enrollment gates (config-fail-closed), persist the STT skip reason.
- Phase 3: cross-camera multimodal fusion. A NEW person_cluster + person_member layer ABOVE reid Identity (additive, fallback to identity_id); conservative cross-camera face-centroid merge + voice-centroid merge (gated by measured EER) + face↔voice bridge, all operator-confirmable; a person-centric timeline (GET /api/people/{cluster}/timeline) + person filter + voice-match chips.
Critical-sound warnings: what shipped (Phase 1)
Gunshot / scream-cry / glass / alarm-siren → an operator Warning with severity.
- app/audio/sound_tagger.py: tag_pcm_critical() max-pools YAMNet per-frame scores
(vs the ambient mean-pool) so a 0.2 s transient crosses threshold; CRITICAL_CLASSES
maps 18 AudioSet names → (severity high|medium, per-class threshold).
- app/db/models.py: Warning table (camera, ts, critical_class, severity, score,
event_id, clip_path, acknowledged, ack_by) — auto-created by create_all. Not named
Person/persons.
- app/transcribe/manager.py: in the existing sound-tag pass (PCM already decoded once)
it also runs tag_pcm_critical and writes a Warning per class over threshold, debounced
per (camera, class) via a DB lookup (restart-safe); best-effort, never regresses STT.
- app/api/warnings.py: GET /api/warnings (unack, newest-first), POST /{id}/ack,
POST /ack-all; mounted via _OPTIONAL_ROUTERS.
- app/static: polls /api/warnings every 15 s → a persistent red banner (pulsing for
high severity) with Просмотр ("View", opens the clip) + OK (ack) + Скрыть все ("Hide all").
- Config: critical_sound_enabled=False (default-OFF until thresholds are validated),
critical_sound_debounce_seconds=30.
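Why max-pooling matters for transients can be shown with plain arithmetic. The following is a minimal sketch, not the shipped tag_pcm_critical(): the two-entry class map, its thresholds, and the helper names `pool_scores` and `critical_hits` are assumptions made for illustration, and per-frame scores are modelled as plain dicts rather than YAMNet output arrays.

```python
from statistics import fmean

# Hypothetical subset of the class map: name -> (severity, threshold).
# The threshold values are illustrative, not the deployed ones.
CRITICAL_CLASSES_DEMO = {
    "Gunshot, gunfire": ("high", 0.30),
    "Glass": ("medium", 0.40),
}

def pool_scores(frame_scores, mode):
    """Pool a list of per-frame {class_name: score} dicts with max or mean."""
    agg = max if mode == "max" else fmean
    names = frame_scores[0].keys()
    return {name: agg([frame[name] for frame in frame_scores]) for name in names}

def critical_hits(frame_scores, class_map=CRITICAL_CLASSES_DEMO):
    """Return (class, severity, score) for every class whose max-pooled score
    reaches its per-class threshold."""
    pooled = pool_scores(frame_scores, "max")
    return [
        (name, severity, pooled[name])
        for name, (severity, threshold) in class_map.items()
        if pooled.get(name, 0.0) >= threshold
    ]

# 20 frames of near-silence with a single loud transient frame.
quiet = {"Gunshot, gunfire": 0.01, "Glass": 0.02}
burst = {"Gunshot, gunfire": 0.90, "Glass": 0.02}
frames = [quiet] * 10 + [burst] + [quiet] * 9

mean_pooled = pool_scores(frames, "mean")
hits = critical_hits(frames)
print(mean_pooled["Gunshot, gunfire"], hits)
```

Here the mean-pooled gunshot score stays far below the 0.30 bar while the max-pooled score is the transient's own 0.90, which is the behaviour the first-slice design relies on.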
Coverage note (first slice): detection piggybacks the clip transcribe pass
(video-triggered, queue latency). The seconds-latency analyzer on the continuous audio
archive (camera_worker thread over AudioSegmenter, fires even with no video) is the
Phase-1 follow-up. Risk: max-pool raises more false positives (TV gunfire, door slams) —
tune per-class thresholds on this deployment's audio before trusting high-severity automation.
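The restart-safe debounce mentioned in the list above (a DB lookup per (camera, class) rather than in-memory state) could look roughly like this. It is a sketch under assumptions: SQLite, a cut-down `warnings` table, timestamps as epoch seconds, and a hypothetical helper `maybe_write_warning`; the shipped manager code may differ.

```python
import sqlite3

DEBOUNCE_SECONDS = 30  # mirrors critical_sound_debounce_seconds

def maybe_write_warning(conn, camera, critical_class, severity, score, ts,
                        debounce=DEBOUNCE_SECONDS):
    """Insert a warning unless one for the same (camera, class) was written
    less than `debounce` seconds earlier. Returns True if a row was written."""
    last_ts = conn.execute(
        "SELECT MAX(ts) FROM warnings WHERE camera = ? AND critical_class = ?",
        (camera, critical_class),
    ).fetchone()[0]
    if last_ts is not None and ts - last_ts < debounce:
        return False  # still inside the debounce window
    conn.execute(
        "INSERT INTO warnings (camera, ts, critical_class, severity, score) "
        "VALUES (?, ?, ?, ?, ?)",
        (camera, ts, critical_class, severity, score),
    )
    conn.commit()
    return True

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE warnings (camera TEXT, ts REAL, critical_class TEXT, "
    "severity TEXT, score REAL)"
)
results = [
    maybe_write_warning(conn, "cam5", "Gunshot, gunfire", "high", 0.9, 1000.0),
    maybe_write_warning(conn, "cam5", "Gunshot, gunfire", "high", 0.8, 1010.0),
    maybe_write_warning(conn, "cam5", "Glass", "medium", 0.6, 1010.0),
    maybe_write_warning(conn, "cam5", "Gunshot, gunfire", "high", 0.7, 1031.0),
]
print(results)
```

Because the last-written timestamp lives in the table, a process restart does not reset the window; suppressed detections are not stored, so they do not extend it either.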
Cross-camera fusion + soft-biometrics + voice-only (designed; first slice shipped)
A second multi-architect + red-team workflow designed the "one person across cameras" layer. Strict signal hierarchy (mirrors the collapse fix): FACE anchors and is the only cross-camera authority; VOICE corroborates cross-camera only when EER-calibrated + margin-gated; APPEARANCE / soft-bio links ONLY within a camera/session, never mints, and never links cross-day.
- Fusion (Phase 3, designed): a NEW reversible grouping layer person_cluster + person_member (NOT the legacy persons / Person) over un-named identities; an API-process pass links by face-centroid (auto ≥0.66 + margin; 0.55–0.62 = operator suggestion only), voice-centroid (≥ measured EER), and a face↔voice bridge; detach is non-destructive (Identity survives, unlike merge); chain-verify against the cluster centroid; person-centric multimodal timeline.
- Back-view / soft-biometrics (SHIPPED, first slice, default-OFF): _decide_by_appearance gains a within-camera band: a faceless person whose OSNet clothing score sits in [app_gate, app_match) with margin is accepted only if colour (veto) + build (aspect) + height-proxy corroborate (softbio_score ≥ floor). Per-sighting geometry (height_frac / aspect / area_frac / foot_y_frac) is stored on sightings; a running summary lives in Identity.attributes["softbio"]. Honest reliability: colour is the trustworthy signal; build is a weak nudge; height is aspirational without per-camera ground-plane calibration (foot_y_frac is stored for a future homography). Cross-camera ban + require-face-for-new run FIRST (unit-tested order guard). Config: softbio_corroboration=False etc.
- Voice-only attribution (designed): on a faceless label="audio" event with no identity, if a voiceprint matches a person's voice_centroid ≥ EER (+margin, ≥2 s speech, consent), write a match_kind="voice" Sighting / stamp the event, so a person shows on the timeline even with NO video. Voice never mints/merges; OFF while voice_match_threshold<=0.
Key false-positive guards (red-team): collapse floor sacred (never lower face/app bars); soft-bio never cross-camera/never mints/colour-veto; voice needs calibration + margin + consent + faceless-only; fusion auto-bar ≥0.66 with margin + chain-verify + operator-confirm + rejection memory; everything additive, default-OFF, fail-open, reversible.
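The face-centroid bands quoted above (auto-link at ≥0.66 with margin, operator suggestion at 0.55–0.62) can be expressed as a small decision function. This is a sketch of the designed behaviour, not shipped code: the margin value, the function name `decide_face_link`, and the choice to treat the unspecified 0.62–0.66 gap and margin failures as "none" are all assumptions.

```python
AUTO_BAR = 0.66                      # from the design: auto-link bar
SUGGEST_LO, SUGGEST_HI = 0.55, 0.62  # from the design: suggestion-only band
MARGIN = 0.05                        # hypothetical gap required over the runner-up

def decide_face_link(best, runner_up, previously_rejected=False):
    """Classify a candidate cluster link from face-centroid cosine scores.

    best / runner_up are the top two cosine similarities against candidate
    clusters. Returns "auto", "suggest" or "none".
    """
    if previously_rejected:
        return "none"      # rejection memory: never re-propose a refused pair
    if best >= AUTO_BAR and best - runner_up >= MARGIN:
        return "auto"
    if SUGGEST_LO <= best <= SUGGEST_HI:
        return "suggest"   # surfaced for operator confirmation only
    return "none"          # includes the unspecified 0.62-0.66 gap (assumption)

decisions = [
    decide_face_link(0.70, 0.50),
    decide_face_link(0.70, 0.68),
    decide_face_link(0.58, 0.20),
    decide_face_link(0.64, 0.10),
    decide_face_link(0.70, 0.50, previously_rejected=True),
]
print(decisions)
```

Abstaining whenever the evidence is ambiguous keeps the layer consistent with the "collapse floor is sacred" guard: a missed link costs an operator click, a wrong link contaminates a person.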
Same-person-different-angle (anti-fragmentation): appearance-corroborated merge
After the collapse fix, the opposite failure appeared: one person seen frontal-then-profile becomes
two identities (ArcFace frontal-vs-profile cosine ≈ 0, indistinguishable from two strangers; the
face is uninformative across very different angles). Fix (SHIPPED, default-OFF, enabled here):
maintenance._auto_merge_pass gains an appearance-corroborated SAME-CAMERA merge — two
un-named identities merge when CLOTHING (appearance_centroid) cosine ≥ reid_appearance_merge_threshold
(0.80) AND they share a camera AND their sighting ranges fall within the same session
(reid_appearance_merge_max_gap_seconds) AND they were never in the same frame
(_same_camera_simultaneous, ≈ same timestamp ⇒ two boxes ⇒ two people). NEVER cross-camera
(that stays face-only — the collapse fix). merges_by_appearance counter in the maintenance stats.
Precision is deliberately conservative — it only auto-merges unambiguous cases. When the
evidence is mixed (e.g. two distinct boxes co-occur in a frame ⇒ possibly two people, or clothing
only moderately similar) it abstains rather than risk a wrong merge. For those, the operator
uses the existing manual merge (POST /api/identities/merge; Identities view → select →
"Merge selected") — human judgement resolves what the algorithm shouldn't guess. (The reversible
operator-confirmable person_cluster suggestion layer is the planned upgrade so borderline pairs
are surfaced for one-click confirm instead of needing manual selection.)
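The four merge conditions above compose into a single predicate. The sketch below is illustrative: `IdentitySummary`, `should_merge_by_appearance`, and the 600-second gap are hypothetical stand-ins (only the 0.80 threshold comes from the text), and the same-frame check is passed in as a boolean rather than re-implementing _same_camera_simultaneous.

```python
import math
from dataclasses import dataclass

@dataclass
class IdentitySummary:
    """Hypothetical per-identity summary used only for this example."""
    cameras: frozenset
    first_ts: float
    last_ts: float
    appearance_centroid: tuple
    named: bool = False

def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def should_merge_by_appearance(a, b, seen_simultaneously,
                               threshold=0.80, max_gap_seconds=600.0):
    """True only when every condition holds; any doubt means abstain."""
    if a.named or b.named:
        return False                      # un-named identities only
    if not (a.cameras & b.cameras):
        return False                      # never cross-camera
    if seen_simultaneously:
        return False                      # two boxes in one frame: two people
    gap = max(a.first_ts, b.first_ts) - min(a.last_ts, b.last_ts)
    if gap > max_gap_seconds:
        return False                      # not the same session
    return cosine(a.appearance_centroid, b.appearance_centroid) >= threshold

frontal = IdentitySummary(frozenset({5}), 0.0, 100.0, (1.0, 0.0))
profile = IdentitySummary(frozenset({5}), 160.0, 300.0, (0.9, 0.1))
print(should_merge_by_appearance(frontal, profile, seen_simultaneously=False))
```

Ordering the cheap vetoes first means the cosine is computed only for pairs that could legitimately merge.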
Per-identity DOSSIER (SHIPPED) + people-count/activity (roadmap)
A workflow designed per-person analytics; the dossier first slice is shipped (read-only,
zero GPU, zero migration, no hot-path touch):
- GET /api/identities/{id}/dossier — merges EVERY recognized speech line across the person's
clips into ONE chronological original+translation stream (reuses the on-disk WebVTT via
events._parse_vtt_cues, no STT re-run; abs ts = clip_start_ts|ts + cue), + speech stats
(utterances/words/languages/busiest camera), + aggregate stats (dwell, sightings, cameras,
events, age/gender/colour, voice-exemplars, top sound tags), + occupancy (Event.num_objects
peak/avg), + an ESTIMATED kinematic activity summary from PresenceSegments. Fail-open per
sub-block. UI: a "📋 Досье" section (full label translates as "Dossier: text picture") in identities.js openDetail (stats
strip + chips + day-grouped clickable transcript). Verified: id2 = 254 lines/1528 words.
Honesty: speech is "recorded while present" (per-clip), NOT voice-verified attribution.
- Roadmap (touch the hot path → second/third): per-event PEAK people count
(peak_people column set at track-finalize from simultaneous tracks) for true occupancy;
per-track kinematic activity (add a small position buffer to the tracker → speed in
body-heights/s → stationary/loitering/walking/running/approaching/leaving, stored per
sighting/event, labelled estimated). Fall/fight/gesture = a NEW lightweight pose model, gated
on GPU headroom (T4 is at ~12.8/15.3 GB — not safe to add now).
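The dossier's chronological stream rests on one piece of arithmetic: absolute time = clip start (falling back to the event ts) + cue offset. A minimal sketch follows, assuming cues arrive as (WebVTT timestamp, text) pairs; `vtt_ts_to_seconds` and `merge_clip_cues` are hypothetical helpers and do not reproduce events._parse_vtt_cues.

```python
def vtt_ts_to_seconds(stamp):
    """Convert a WebVTT timestamp (HH:MM:SS.mmm or MM:SS.mmm) to seconds."""
    parts = stamp.split(":")
    seconds = float(parts[-1])
    minutes = int(parts[-2])
    hours = int(parts[-3]) if len(parts) == 3 else 0
    return hours * 3600 + minutes * 60 + seconds

def merge_clip_cues(clips):
    """Flatten per-clip cues into one list of (absolute_ts, text), oldest first.

    Each clip is a dict with "ts", an optional "clip_start_ts", and "cues",
    a list of (vtt_timestamp, text) pairs.
    """
    lines = []
    for clip in clips:
        base = clip.get("clip_start_ts")
        if base is None:
            base = clip["ts"]  # fallback when the clip start was not recorded
        for stamp, text in clip["cues"]:
            lines.append((base + vtt_ts_to_seconds(stamp), text))
    lines.sort(key=lambda line: line[0])
    return lines

clips = [
    {"ts": 2000.0, "clip_start_ts": None, "cues": [("00:03.000", "second clip")]},
    {"ts": 1005.0, "clip_start_ts": 1000.0,
     "cues": [("00:00:01.500", "hello"), ("00:00:04.000", "goodbye")]},
]
stream = merge_clip_cues(clips)
print(stream)
```

Since the input is existing transcript files, this path needs no STT re-run and no GPU, matching the read-only constraint of the first slice.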
Analytics DASHBOARD (SHIPPED): charts + reports
A workflow designed a comprehensive adaptive dashboard; the MVP is shipped (read-only,
zero GPU, zero migration, no hot-path touch, no external CDN/lib):
- GET /api/analytics/summary?from=&to=&camera_id=&bucket=day|hour (app/api/analytics.py,
mounted via _OPTIONAL_ROUTERS): ONE bounded SELECT-only JSON powering charts+report+CSV —
kpis, events_per_bucket (strftime), per_camera, occupancy (num_objects), match_kind,
warnings_by_class + warnings_per_bucket, sound_tags (CSV tally), transcript_coverage,
leaderboard. Each dataset in its own try/except → [] (fail-open); hour auto-coarsens to
day past 31 days so a query never scans unbounded.
- app/static/charts.js — hand-rolled inline-SVG chart helper window.Charts
(bar/barH/line/area/donut/sparkline): viewBox-responsive, theme-aware via CSS custom
properties (auto light/dark), <title> tooltips, empty-state guard, NO dependency.
- New Analytics tab + renderAnalytics() (app.js): sticky filter bar (date range +
camera + day/hour), a KPI strip (events/transcribed/persons/warnings/busiest camera+day with
a trend sparkline), and an adaptive .an-grid (auto-fill minmax(320px,1fr); wide cards span 2
≥1000px) of 9 charts. A 🖨 Отчёт ("Report") button → window.print() with a print stylesheet
(hides chrome, ink-friendly, no break-inside) → operator "Save as PDF", zero-dep.
Verified live: 819 events, per-camera/occupancy/match_kind/warnings/sound-tags/coverage/
leaderboard all populate; cam5 hour-filter = 413 events, peak hour 14:00.
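The "hour auto-coarsens to day past 31 days" guard from the summary endpoint can be sketched as below. The function names and the strftime patterns are assumptions for illustration; the real endpoint buckets in SQL, whereas this example buckets Python datetimes to keep it self-contained.

```python
from collections import Counter
from datetime import datetime, timedelta

MAX_HOURLY_SPAN = timedelta(days=31)
# Hypothetical bucket label formats.
BUCKET_FORMATS = {"day": "%Y-%m-%d", "hour": "%Y-%m-%d %H:00"}

def effective_bucket(start, end, requested):
    """Return the bucket actually used: hourly requests over a window longer
    than 31 days are coarsened to daily so the result size stays bounded."""
    if requested not in BUCKET_FORMATS:
        raise ValueError(f"unsupported bucket: {requested!r}")
    if requested == "hour" and end - start > MAX_HOURLY_SPAN:
        return "day"
    return requested

def events_per_bucket(timestamps, bucket):
    """Count events per bucket label, returned in chronological order."""
    fmt = BUCKET_FORMATS[bucket]
    counts = Counter(ts.strftime(fmt) for ts in timestamps)
    return sorted(counts.items())

events = [
    datetime(2026, 6, 1, 14, 5),
    datetime(2026, 6, 1, 14, 40),
    datetime(2026, 6, 2, 9, 0),
]
bucket = effective_bucket(datetime(2026, 6, 1), datetime(2026, 6, 8), "hour")
print(bucket, events_per_bucket(events, bucket))
```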
Data enrichment + CSV/report (SHIPPED)
- CSV export + printable report. Analytics view: ⬇ CSV exports any aggregate (events-by-period, per-camera, occupancy, activity, warnings, sound-tags, leaderboard) as RFC-4180 + UTF-8-BOM CSV (client-side from the loaded JSON via csvDownload, zero backend). 🖨 Отчёт ("Report") → window.print() with a print stylesheet (hides chrome, ink-friendly, vector SVG, @page margins) + a print-only title/period/timestamp header → Save-as-PDF.
- Hot-path data ENRICHMENT (estimated, GPU-free, additive, fail-open). Computed at track finalize, NOT on the detection inference path; written through the existing _create_event_row(**meta) hasattr filter; nullable columns via ensure_event_kinematics_schema (no backfill; legacy rows stay NULL):
  - events.peak_people: max same-class tracks alive together during the presence (real co-presence; num_objects was always 1). _track_and_identify tallies concurrent tracks per frame → each track keeps its peak.
  - events.activity / activity_speed: Track keeps a bounded position buffer (filled O(1) in the tracker's matched branch); Track.kinematics() derives speed in body-heights/sec (scale-robust, no calibration) + straightness + height-trend; classify_activity() → stationary / loitering / walking / running / approaching / leaving (conservative cfg thresholds), or NULL when too few samples.
  - Verified live: event 1164 = peak_people 2, activity "walking", 0.282 bh/s. Surfaced in the dashboard (occupancy now = peak_people; new "Активность (оценка)" ("Activity (estimate)") donut) and in the dossier (activity-mix chips). Unit-tested (tests/test_tracker_kinematics.py).
- Honesty: monocular/uncalibrated speed; PTZ motion can inject false motion (documented); fall/fight/gesture still need a pose model (GPU-gated; T4 ~12.8/15.3 GB).
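Speed in body-heights per second is what makes the activity estimate independent of how far the person is from the camera. The sketch below illustrates the idea only: `PositionBuffer`, the minimum sample count, every threshold, and the precedence of the rules are assumptions, and the local `classify_activity` is a stand-in that merely shares its name with the shipped function.

```python
import math
from collections import deque

MIN_SAMPLES = 5  # hypothetical: below this the result is None (NULL in the DB)

class PositionBuffer:
    """Bounded buffer of (t_seconds, cx, cy, box_height) samples in pixels."""

    def __init__(self, maxlen=64):
        self.samples = deque(maxlen=maxlen)  # O(1) append; oldest samples drop

    def add(self, t, cx, cy, h):
        self.samples.append((t, cx, cy, h))

    def kinematics(self):
        s = list(self.samples)
        if len(s) < MIN_SAMPLES:
            return None
        duration = s[-1][0] - s[0][0]
        mean_h = sum(p[3] for p in s) / len(s)
        if duration <= 0 or mean_h <= 0:
            return None
        path = sum(math.hypot(b[1] - a[1], b[2] - a[2]) for a, b in zip(s, s[1:]))
        net = math.hypot(s[-1][1] - s[0][1], s[-1][2] - s[0][2])
        return {
            "speed_bh_s": path / mean_h / duration,           # body-heights per second
            "straightness": net / path if path > 0 else 0.0,  # 1.0 = straight line
            "height_trend": (s[-1][3] - s[0][3]) / mean_h,    # >0 = box growing
        }

def classify_activity(k):
    """Map kinematics to a coarse label; all cut-offs here are illustrative."""
    if k is None:
        return None
    if abs(k["height_trend"]) >= 0.3:
        return "approaching" if k["height_trend"] > 0 else "leaving"
    if k["speed_bh_s"] < 0.05:
        return "stationary"
    if k["speed_bh_s"] < 0.6 and k["straightness"] < 0.5:
        return "loitering"
    return "walking" if k["speed_bh_s"] < 1.5 else "running"

walker = PositionBuffer()
for i in range(10):
    walker.add(float(i), 10.0 * i, 0.0, 100.0)  # 10 px/s at 100 px tall
walk_k = walker.kinematics()
print(walk_k, classify_activity(walk_k))
```

Dividing by the mean box height means a person walking near the camera and one walking far away produce comparable speeds, at the cost of the monocular caveats noted above.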
Analytics polish: heatmaps + watch-list + deep-link (SHIPPED)
- Activity heatmap (hour × weekday). /api/analytics/summary returns activity_heatmap (24×7 event counts; %w remapped to Mon..Sun) drawn by a new Charts.heatmap inline-SVG primitive (intensity = accent opacity, theme-aware).
- Loitering watch-list. loitering (recent events with activity="loitering", newest-first) → a clickable panel (opens the clip).
- Deep-link. Clicking a "по камерам" ("by camera") bar jumps to the Events view filtered by that camera (Charts.barH gained an onBar callback + _analyticsJumpToCamera).
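The "%w remapped to Mon..Sun" detail is a one-line index shift: strftime's %w counts from Sunday = 0, so adding 6 modulo 7 moves Monday to column 0. A small sketch, assuming a 24-row by 7-column layout (the orientation and the function names are assumptions):

```python
from datetime import datetime

def monday_first(w):
    """Map a strftime %w value (0 = Sunday .. 6 = Saturday)
    to a Monday-first index (0 = Monday .. 6 = Sunday)."""
    return (w + 6) % 7

def activity_heatmap(timestamps):
    """Build a 24 x 7 grid of event counts: rows are hours, columns Mon..Sun."""
    grid = [[0] * 7 for _ in range(24)]
    for ts in timestamps:
        column = monday_first(int(ts.strftime("%w")))
        grid[ts.hour][column] += 1
    return grid

events = [
    datetime(2026, 6, 1, 14, 5),
    datetime(2026, 6, 1, 14, 40),
    datetime(2026, 6, 7, 23, 59),
]
grid = activity_heatmap(events)
print(grid[14], grid[23])
```

The remapped index agrees with Python's own `datetime.weekday()`, which is a convenient cross-check when the bucketing is done in SQL instead.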
Zones + roles + spatial heatmap (SHIPPED, first slice)
The "sellable" layer: understand WHERE (bar / hall / staff area / entrance) and WHO
(staff vs visitor), so analytics is per-place and per-role.
- Zones (per-camera ROIs). zones table (camera_id, name, kind ∈
bar/staff_area/public/entrance/checkout/custom, polygon = normalized [x,y] list)
auto-created by create_all; CRUD at /api/cameras/{id}/zones + /api/zones/{id}
(app/api/zones.py). A draw-on-snapshot rectangle editor in the Cameras view
("🗺 Зоны", i.e. "Zones", per row) over the live /api/live/{id}/snapshot. Attribution: a
sighting's foot point (new Sighting.foot_x_frac + existing foot_y_frac,
computed in the pipeline) point-in-polygon (app/geom.py) at analytics time →
per-zone people / staff / visitor / visits. (Analytics-time attribution = no
hot-path/worker change; legacy sightings lack foot_x and simply aren't attributed.)
- Roles (staff vs visitor). Authoritative Identity.role (operator-set via
POST /api/identities/{id}/role, buttons in the dossier). Analytics roles
block = distinct people by role; per-zone staff/visitor split. (Zone-dwell and
uniform-colour SUGGESTIONS — the other two operator-chosen methods — are the next
increment on top of this authoritative base.)
- Spatial heatmap (where people stand). Single-camera 28×16 foot-point density grid
over the window → Charts.heatmap. Honest: foot-point density, no ground-plane
calibration. Shown when one camera is selected in the filter.
- Why it sells: footfall + dwell-by-area + staff-coverage + queue/loiter +
heatmaps = the hospitality/retail analytics buyers pay for.
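Zone attribution and the spatial heatmap both reduce to simple geometry on normalized foot points. The sketch below uses the standard ray-casting test and a fixed-size binning grid; it does not reproduce app/geom.py, and the function names plus the 28-columns-by-16-rows orientation are assumptions.

```python
def point_in_polygon(x, y, polygon):
    """Ray-casting test: True if (x, y) lies inside the polygon, given as a
    list of normalized (x, y) vertices."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > y) != (y2 > y):  # edge crosses the horizontal line through y
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

def zone_for_point(x, y, zones):
    """Return the name of the first zone containing the foot point, or None."""
    for name, polygon in zones:
        if point_in_polygon(x, y, polygon):
            return name
    return None

def density_grid(points, cols=28, rows=16):
    """Bin normalized foot points into a rows x cols count grid."""
    grid = [[0] * cols for _ in range(rows)]
    for x, y in points:
        if 0.0 <= x <= 1.0 and 0.0 <= y <= 1.0:
            c = min(int(x * cols), cols - 1)  # clamp x == 1.0 into the last cell
            r = min(int(y * rows), rows - 1)
            grid[r][c] += 1
    return grid

zones = [("bar", [(0.1, 0.5), (0.4, 0.5), (0.4, 0.9), (0.1, 0.9)])]
foot_points = [(0.2, 0.7), (0.6, 0.7), (0.5, 0.5), (1.0, 1.0)]
print([zone_for_point(x, y, zones) for x, y in foot_points])
grid = density_grid(foot_points)
```

Running attribution at analytics time over stored foot points is what lets zones be redrawn later without reprocessing video, as noted above.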
Open questions
- Per-class thresholds need tuning on real audio; do we want a default-OFF high-severity webhook (push) for gunshot/scream?
- Voice fusion needs a measured EER for voice_match_threshold (currently 0 = disabled).
- Phase-0 threshold raise may fragment some existing identities: acceptable vs collapse?
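For the EER question, one common way to measure it is to sweep a threshold over labelled same-speaker ("genuine") and different-speaker ("impostor") similarity scores and pick the point where the false-accept and false-reject rates are closest. The sketch below shows that procedure on made-up scores; the function name and the data are illustrative, and a real calibration would need scores from this deployment's audio.

```python
def equal_error_rate(genuine, impostor):
    """Return (threshold, eer) at the candidate threshold where the false-accept
    rate (impostor scores >= t) and false-reject rate (genuine scores < t)
    are closest. Scores are similarities: higher means more alike."""
    best = None
    for t in sorted(set(genuine) | set(impostor)):
        frr = sum(score < t for score in genuine) / len(genuine)
        far = sum(score >= t for score in impostor) / len(impostor)
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, t, (far + frr) / 2)
    return best[1], best[2]

# Made-up scores purely to exercise the function.
genuine_scores = [0.9, 0.8, 0.7, 0.6]
impostor_scores = [0.1, 0.2, 0.3, 0.65]
threshold, eer = equal_error_rate(genuine_scores, impostor_scores)
print(threshold, eer)
```

The returned threshold is a candidate value for voice_match_threshold, with the margin and consent gates still applied on top.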