PoinCLIP-VAD: a hyperbolic cross-modal fusion framework for video anomaly detection

AI Summary

Performs cross-modal fusion in Poincaré ball hyperbolic space, yielding more expressive semantic distances and clearer distinction between normal and anomalous patterns.
Dual-block design: a classification block produces coarse anomaly scores; a video-text alignment block refines correspondence, using negative Poincaré distance for fine-grained matching.
Demonstrates strong performance: 90.62% AUC on UCF-Crime and 86.93% AP on XD-Violence, confirming improved anomaly discrimination under weak supervision.

Basic summary

Sci Rep. 2026 Jun 8. doi: 10.1038/s41598-026-56792-z. Online ahead of print.

ABSTRACT

Weakly supervised video anomaly detection (WSVAD) is fundamentally constrained by the absence of frame-level annotations, which leads to noisy instance selection in Multiple Instance Learning (MIL) and weak correspondence between temporal video segments and semantic descriptions. Vision-language models address this by enabling cross-modal alignment between visual features and textual labels, but when these representations are learned in Euclidean space, they struggle to capture subtle semantic variations and often produce ambiguous instance ranking under weak supervision. To address this limitation, we propose PoinCLIP-VAD, a vision-language framework that performs cross-modal fusion in hyperbolic space. The model embeds visual and textual features into a shared Poincaré ball geometry, where non-linear distance scaling provides a more expressive representation of latent semantic relationships induced by cross-modal interactions, without relying on predefined hierarchical structures. This geometry-consistent formulation enables more reliable similarity estimation and better preserves distinctions between normal and anomalous patterns. The framework adopts a dual-block architecture consisting of a classification block for coarse anomaly scoring and a video-text alignment block for fine-grained correspondence using negative Poincaré distance. Extensive experiments on benchmark datasets demonstrate that PoinCLIP-VAD achieves an AUC of 90.62% on UCF-Crime and an AP of 86.93% on XD-Violence, confirming improved anomaly discrimination and more consistent cross-modal alignment under weak supervision.
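To make the "negative Poincaré distance" in the abstract concrete, the following is a minimal sketch of how a similarity score of that kind could be computed. It is not the authors' implementation. It assumes a unit Poincaré ball (curvature -1) and assumes that Euclidean features are projected into the ball with the exponential map at the origin, a common choice that the abstract does not confirm. The function names `exp_map_origin`, `poincare_distance`, and `alignment_scores` are hypothetical.

```python
import math


def exp_map_origin(v, eps=1e-9):
    """Project a Euclidean vector into the unit Poincaré ball.

    Assumption: exponential map at the origin with curvature -1,
    exp_0(v) = tanh(||v||) * v / ||v||.
    """
    norm = math.sqrt(sum(x * x for x in v))
    if norm < eps:
        return [0.0 for _ in v]
    scale = math.tanh(norm) / norm
    return [scale * x for x in v]


def poincare_distance(u, v, eps=1e-9):
    """Geodesic distance between two points inside the unit Poincaré ball.

    d(u, v) = arcosh(1 + 2 * ||u - v||^2 / ((1 - ||u||^2) * (1 - ||v||^2)))
    """
    sq_diff = sum((a - b) ** 2 for a, b in zip(u, v))
    sq_u = sum(a * a for a in u)
    sq_v = sum(b * b for b in v)
    # Clamp the denominator so points numerically on the boundary do not divide by zero.
    denom = max((1.0 - sq_u) * (1.0 - sq_v), eps)
    return math.acosh(1.0 + 2.0 * sq_diff / denom)


def alignment_scores(segment_feature, label_features):
    """Score one video-segment feature against each text-label feature.

    The score is the negative Poincaré distance, so a closer pair scores higher.
    """
    seg = exp_map_origin(segment_feature)
    return [-poincare_distance(seg, exp_map_origin(t)) for t in label_features]


if __name__ == "__main__":
    # Toy 2-D features standing in for encoder outputs (illustrative values only).
    segment = [0.5, 0.0]
    labels = [[0.5, 0.0], [-0.5, 0.0]]
    print(alignment_scores(segment, labels))
```

Because the distance grows rapidly as points approach the boundary of the ball, small Euclidean differences between embeddings far from the origin translate into large hyperbolic distances. This is one way to read the "non-linear distance scaling" the abstract refers to.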

PMID:42260062 | DOI:10.1038/s41598-026-56792-z
