Abstract: Thanks to the development of cross-modal models, text-to-video retrieval (T2VR) is advancing rapidly, but its robustness remains largely unexamined. Existing attacks against T2VR are ...
Video Temporal Grounding (VTG) localizes moments in untrimmed videos using natural language queries. Most VTG datasets focus on short videos, and existing approaches excel in short-term cross-modal ...