Neuro Symbolic Video Search

The surge in video data necessitates advanced tools for extracting frames of interest. Foundation models such as VideoLLaMA and ViCLIP falter at long-horizon reasoning because they conflate per-frame perception with temporal analysis. We propose decoupling the two: vision-language models handle semantic understanding of individual frames, while state machines and temporal logic handle reasoning across frames. This separation substantially improves the identification of complex events, raising F1 scores by 9-15% over GPT-4-based reasoning on self-driving datasets such as Waymo and nuScenes, and underscores the value of decoupling perception from temporal reasoning for effective scene identification.
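To make the decoupling concrete, here is a minimal sketch of the two-stage idea under stated assumptions: per-frame predicate extraction is stubbed out (in practice this would come from a vision-language model queried per frame), and the state machine and example event pattern are illustrative, not the exact formulation used in the paper. The names detect_propositions, SequenceMachine, and search are hypothetical.

```python
"""Sketch of decoupled video search: per-frame semantic perception
feeding a symbolic temporal-reasoning stage. All names and the toy
event are illustrative stand-ins, not the actual implementation."""

from typing import FrozenSet, List


def detect_propositions(frame_index: int) -> FrozenSet[str]:
    """Stand-in for a vision-language model queried on a single frame.
    Returns the set of atomic propositions judged true in that frame.
    Here we fake a short clip: a pedestrian appears, then the vehicle
    stops a few frames later."""
    toy_clip = [
        frozenset(),                                           # frame 0
        frozenset({"pedestrian_visible"}),                     # frame 1
        frozenset({"pedestrian_visible"}),                     # frame 2
        frozenset({"pedestrian_visible", "vehicle_stopped"}),  # frame 3
        frozenset({"vehicle_stopped"}),                        # frame 4
    ]
    return toy_clip[frame_index]


class SequenceMachine:
    """Tiny state machine for the temporal pattern 'eventually a
    pedestrian is visible, then eventually the vehicle stops'
    (roughly F(pedestrian_visible AND F(vehicle_stopped)) in LTL)."""

    def __init__(self) -> None:
        self.state = "WAIT_PEDESTRIAN"

    def step(self, props: FrozenSet[str]) -> bool:
        """Advance one frame; return True once the pattern is satisfied."""
        if self.state == "WAIT_PEDESTRIAN" and "pedestrian_visible" in props:
            self.state = "WAIT_STOP"
        if self.state == "WAIT_STOP" and "vehicle_stopped" in props:
            self.state = "ACCEPT"
        return self.state == "ACCEPT"


def search(num_frames: int) -> List[int]:
    """Return frame indices at or after which the target event holds."""
    machine = SequenceMachine()
    hits = []
    for i in range(num_frames):
        if machine.step(detect_propositions(i)):
            hits.append(i)
    return hits


if __name__ == "__main__":
    # Perception labels each frame independently; the state machine
    # alone decides when the temporal pattern is satisfied.
    print(search(5))  # -> [3, 4]
```

The point of the sketch is the division of labor: the frame-level model never reasons about time, and the temporal layer never looks at pixels, which is the separation the approach credits for its gains.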
