Minkyu Choi, Harsh Goel, Mohammad Omama, Yunhao Yang, Sahil Shah, and Sandeep Chinchali
European Conference on Computer Vision (ECCV), 2024
Introduction
Imagine if I asked you to locate the iconic “I’m flying” scene from the 3-hour-long Titanic movie. This scene is a complex symphony of multiple semantic events and their long-term temporal relations. Modern state-of-the-art (SOTA) activity recognition networks, which couple semantic and temporal reasoning inside a single network, surprisingly fail at long-term reasoning across frames. Is there a way to decouple the two for effective long-term video understanding?
Introducing NSVS: Neuro-Symbolic Video Search. Our recent paper, accepted at ECCV 2024, tackles this problem and outperforms competing baselines by 9-15% on state-of-the-art datasets such as Waymo and NuScenes.
Why do we need long-term reasoning in videos?
There has been a significant increase in video data production, with platforms such as YouTube receiving 500 hours of uploads every minute. Additionally, autonomous vehicle companies like Waymo generate 10-100 TB of data daily, and worldwide security cameras record around 500 PB daily. Consequently, we require tools with sophisticated query capabilities to navigate this immense volume of video content.
For instance, a query such as “Find me all scenes where event A happened, event B did not occur, and event C occurs hours later” requires advanced methods capable of long-term temporal reasoning. Such reasoning is a common requirement in surveillance, video analytics, and similar fields, yet it is precisely what existing video foundation models fail to address.
Why do existing methods fail at long-term reasoning in videos?
Our key insight is that video foundation models intertwine per-frame perception and temporal reasoning into a single deep network. This makes it difficult for them to understand temporal nuances over the long term. Hence, decoupling but co-designing semantic understanding and temporal reasoning is essential for efficient scene identification. We propose a system that leverages vision-language models for semantic understanding of individual frames but effectively reasons about the long-term evolution of events using state machines and temporal logic (TL) formulae that inherently capture memory.
The figure below shows comparative performance on the event identification tasks. The accuracy of event identification with Video Language Models (Blue/Green) drops as video length or query complexity increases. On the other hand, NSVS (Orange) shows consistent performance irrespective of video length or query complexity.
NSVS - Demystified
We attribute the consistent performance of NSVS observed in the above figure to the decoupling of per-frame semantic understanding and temporal reasoning. While we use off-the-shelf foundation models like YOLO, CLIP, or LLaVA in a plug-and-play fashion for semantic understanding, we build upon the extensive Formal Methods literature, using state machines and temporal logic (TL) formulae for temporal reasoning.
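To give a flavor of the plug-and-play semantic backbone, below is a minimal sketch of per-frame atomic-proposition detection with CLIP via Hugging Face Transformers. The proposition list, prompt handling, threshold, and frame-sampling stride are illustrative assumptions on our part, not the exact configuration used in the paper.

```python
# Sketch: zero-shot per-frame proposition detection with CLIP.
# Propositions, threshold, and sampling stride below are illustrative assumptions.
import cv2  # pip install opencv-python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

PROPOSITIONS = ["man hugging woman", "ship on the sea", "kiss"]  # atomic propositions
THRESHOLD = 0.25  # cutoff on the softmax score; tune per model/dataset

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def detect_propositions(frame_bgr) -> set[str]:
    """Return the set of atomic propositions detected in a single frame."""
    image = Image.fromarray(cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB))
    inputs = processor(text=PROPOSITIONS, images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        scores = model(**inputs).logits_per_image.softmax(dim=-1)[0]
    # Softmax over the candidate texts is a simplification; a true multi-label
    # setup might threshold raw similarities per proposition instead.
    return {p for p, s in zip(PROPOSITIONS, scores.tolist()) if s > THRESHOLD}

def annotate_video(path: str, stride: int = 30) -> list[set[str]]:
    """Label every `stride`-th frame with the propositions it contains."""
    cap, labels, idx = cv2.VideoCapture(path), [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % stride == 0:
            labels.append(detect_propositions(frame))
        idx += 1
    cap.release()
    return labels
```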
Formal Methods are mathematical techniques used to specify, verify, and prove the correctness of systems. Temporal logic (TL) is a family of formal methods for describing sequences of events or states over time: it extends classical logic with temporal operators to express propositions about the flow of time, providing a structured framework for reasoning about the temporal properties of sequences or processes. To the best of our knowledge, this is the first work to adapt TL for long-term activity recognition. Although it is not necessary for following this blog post, we recommend that interested readers refer to this crash course for an in-depth understanding of TL.
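As a concrete taste of what a TL specification looks like, here is one possible (untimed) encoding of the earlier example query, “event A happened, event B did not occur, and event C occurs hours later”, using the standard “eventually” (F) and “always” (G) operators. Expressing the “hours later” gap precisely would require a timed variant of TL, so treat this as a sketch:

```latex
% A holds at some point, C holds at or after that point, and B never holds.
\varphi \;=\; \mathbf{F}\bigl(A \,\wedge\, \mathbf{F}\,C\bigr) \;\wedge\; \mathbf{G}\,\lnot B
```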
The NSVS Pipeline
Coming back to our example of locating the “I’m flying” scene from the 3-hour-long Titanic movie, how does NSVS solve it? The high-level user query “I’m flying” is first decomposed into semantically meaningful atomic propositions such as “man hugging woman”, “ship on the sea”, and “kiss”. SOTA vision and vision-language models are then employed to annotate the presence of these atomic propositions in each video frame. Subsequently, we construct an automaton, or state machine, that models the video’s temporal evolution based on the per-frame atomic propositions detected in the video. Finally, we evaluate when and where this automaton satisfies the user’s query. Formal verification also yields confidence measures, which enable the user to further assess the specific scenes pertaining to a complex query in a long video. We evaluate this pipeline on a suite of experiments covering long videos and queries of varying complexity.
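To make the stages concrete, here is a deliberately simplified sketch of the temporal-reasoning step on top of the per-frame annotations from the earlier snippet. The tiny state machine below checks only one hand-written ordering pattern rather than a general TL formula, and all names are ours, not the paper’s API.

```python
# Sketch: temporal reasoning over per-frame propositions with a tiny state machine.
# NSVS-TL builds a general automaton from a TL formula and formally verifies it;
# this hard-codes a single "A, then B, then C" ordering as an illustration only.

def find_satisfying_window(frame_props: list[set[str]], sequence: list[str]):
    """Return (start, end) frame indices of the first window in which the
    propositions in `sequence` occur in order, or None if no such window exists."""
    state = 0      # index of the next proposition we are waiting for
    start = None   # frame at which the first proposition was observed
    for t, props in enumerate(frame_props):
        if sequence[state] in props:
            if state == 0:
                start = t
            state += 1
            if state == len(sequence):   # accepting state: all events seen in order
                return (start, t)
    return None

# Usage: plug in annotations, e.g. frame_props = annotate_video("titanic.mp4").
frame_props = [set(), {"ship on the sea"}, {"ship on the sea", "man hugging woman"},
               {"kiss", "man hugging woman"}, set()]
window = find_satisfying_window(frame_props,
                                ["ship on the sea", "man hugging woman", "kiss"])
print(window)  # (1, 3) -> candidate "I'm flying" scene between frames 1 and 3
```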
Long-term Video Understanding Results
As shown previously, current video-language foundation models such as Video-LLaMA and ViCLIP excel at scene identification and description in short videos; however, they struggle with long-term and complex temporal queries. Hence, we crafted stronger benchmarks that couple Large Language Models (LLMs) like GPT for reasoning with per-frame annotations from a CV model. Essentially, we replace the state machines in NSVS that reason about temporal logic queries with an LLM. This allows us to see how video length impacts scene identification performance when utilizing LLMs.
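Conceptually, these LLM baselines amount to serializing the per-frame annotations into text and asking the model to do the temporal reasoning. The sketch below illustrates the idea; the prompt wording, model choice, and OpenAI client usage are assumptions for illustration, not our benchmark code.

```python
# Sketch: an LLM-as-temporal-reasoner baseline over per-frame annotations.
# Prompt wording and model choice are illustrative assumptions.
from openai import OpenAI  # pip install openai; requires OPENAI_API_KEY

def llm_baseline(frame_props: list[set[str]], query: str, model: str = "gpt-4") -> str:
    lines = [f"frame {t}: {sorted(props) or ['nothing detected']}"
             for t, props in enumerate(frame_props)]
    prompt = (
        "Below are per-frame object/event annotations of a video.\n"
        + "\n".join(lines)
        + f"\n\nQuestion: over which frame range does the following happen? {query}\n"
        "Answer with a start and end frame index."
    )
    client = OpenAI()
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

# Example (using the annotations from the previous sketch):
# print(llm_baseline(frame_props, "a ship on the sea, then a hug, then a kiss"))
```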
Our comprehensive evaluations include scene identification tasks in multi-event sequences with extended temporal events. Specifically, these tasks focus on scenarios where event A persists from the beginning until event B occurs at the end. Therefore, these tasks provide crucial insights into the long-term reasoning capabilities of Large Language Models (LLMs), especially as the temporal distances between events increase. We found that GPT-3.5 and GPT-3.5 Turbo Instruct struggle with videos longer than 500 seconds and GPT-4’s performance declines sharply beyond 1000 seconds, whereas our NSVS method maintains consistent accuracy even for videos up to 40 minutes long. This demonstrates NSVS’s robust capability in handling complex, temporally extended video content, potentially opening new avenues for video analysis and understanding.
The TLV Datasets
Existing datasets only provide video annotations for events over short durations. To address this gap in state-of-the-art video datasets for temporally extended activity, we introduce the Temporal Logic Video (TLV) datasets. These datasets come in two flavors: synthetic and real-world. Our synthetic TLV datasets are crafted by stitching together static images from popular collections like COCO and ImageNet, allowing us to inject a wide array of temporal logic specifications. We have also created two video datasets with TL specifications based on real-world autonomous driving footage from the open-source NuScenes and Waymo datasets. We believe these datasets will enable researchers to benchmark their methods on long-term video understanding and temporal reasoning tasks.
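To give a flavor of how a synthetic TLV clip could be assembled, here is a simplified sketch that stitches still images into a frame sequence whose ground truth satisfies an “eventually car” specification. The local folder layout, frame size, and specification are assumptions for illustration; the released generation code handles richer TL specifications and the actual COCO/ImageNet annotations.

```python
# Sketch: stitching still images into a synthetic clip satisfying F(car).
# Assumed (hypothetical) folder layout:
#   images/car/*.jpg    -- frames containing a car
#   images/other/*.jpg  -- frames without a car
import glob
import random
from PIL import Image

def make_tlv_clip(n_frames: int = 20, size=(224, 224)):
    car_imgs = glob.glob("images/car/*.jpg")
    other_imgs = glob.glob("images/other/*.jpg")
    # Choose when the "car" proposition first becomes true, so the clip satisfies F(car).
    car_start = random.randrange(1, n_frames)
    frames, labels = [], []
    for t in range(n_frames):
        has_car = t >= car_start
        path = random.choice(car_imgs if has_car else other_imgs)
        frames.append(Image.open(path).convert("RGB").resize(size))
        labels.append({"car"} if has_car else set())
    spec = "F(car)"  # ground-truth TL specification this clip satisfies
    return frames, labels, spec

frames, labels, spec = make_tlv_clip()
print(spec, "first satisfied at frame", labels.index({"car"}))
```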
More on NSVS-TL
For more information, come see us at the upcoming ECCV 2024 conference. You can find the paper here, visit the project webpage, and play with our open-sourced datasets and code.
Citation
@inproceedings{