Minkyu Choi*, S P Sharan*, Sahil Shah, Harsh Goel, Mohammad Omama, and Sandeep Chinchali
Computer Vision and Pattern Recognition (CVPR), 2025
Introduction
In recent years, generative models that create videos from text prompts—known as text-to-video (T2V) models—have advanced significantly. Models like Sora, Gen-3, MovieGen, and CogVideoX are now being used in diverse fields ranging from entertainment to critical applications like autonomous driving and robotics. However, while these models produce visually impressive results, they often fail to accurately capture the temporal sequences described in their prompts.
Why Temporal Fidelity Matters
For many applications, especially those involving safety-critical tasks, temporal fidelity—the correct timing and sequencing of events—is as important as visual realism. Imagine retraining an autonomous vehicle’s motion planning system using synthetic videos generated from a prompt like, “A truck appears in frame 10 and veers in front of me onto my lane within 2 seconds.” If the timing and sequence of the truck’s movements aren’t accurately represented, it could lead to dangerous decisions in real-world scenarios.
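Requirements like this can be stated precisely in Temporal Logic. As a hedged sketch (the proposition names and time bounds below are illustrative; NeuS-V's PULS module may produce a different formula), the truck prompt might translate to:

```latex
\varphi \;=\; \mathbf{F}\,\big(\text{truck\_in\_frame}\big)
\;\wedge\;
\mathbf{G}\,\Big(\text{truck\_in\_frame} \;\rightarrow\; \mathbf{F}_{[0,\,2\text{s}]}\,\text{truck\_in\_my\_lane}\Big)
```

Here $\mathbf{F}$ ("eventually") requires the truck to appear at some point, and the bounded $\mathbf{F}_{[0,2\text{s}]}$ requires the lane change to follow within two seconds of the truck's appearance.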
Unveiling NeuS-V: Neuro-Symbolic Verification

To address the limitations of existing evaluation methods, we introduce NeuS-V, a groundbreaking evaluation technique that employs neuro-symbolic formal verification. NeuS-V translates natural language prompts into formal Temporal Logic (TL) specifications and generates an automaton representing the video’s events and sequences. By formally verifying this automaton against the TL specification, NeuS-V rigorously evaluates text-to-video alignment, significantly outperforming existing metrics.
How NeuS-V Works
NeuS-V operates through several key steps:
- Prompt Translation: Text prompts are converted into Temporal Logic specifications using Prompt Understanding via temporal Logic Specification (PULS).
- Semantic Scoring: Vision-language models assign confidence scores to each atomic proposition extracted from the prompt.
- Automaton Construction: Videos are represented as automata, capturing the sequential and probabilistic nature of frame-to-frame transitions.
- Formal Verification: Using probabilistic model checking, NeuS-V computes the satisfaction probability of the automaton meeting the prompt’s TL specification.
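The steps above can be sketched for the simplest possible specification, "eventually p" (F p). The following is a minimal illustration under stated assumptions: the per-frame confidence scores for the atomic proposition p (which NeuS-V obtains from a vision-language model) are taken as given and treated as independent per-frame probabilities; the paper's actual automaton construction and probabilistic model checking are more general.

```python
# Minimal sketch: satisfaction probability of an "eventually p" (F p)
# specification over a video, given per-frame confidence scores for the
# atomic proposition p. Scores below are hypothetical, not from a real VLM.

def satisfaction_probability(frame_confidences):
    """Probability that p holds in at least one frame, treating each
    frame's confidence as an independent probability of p holding there."""
    p_never = 1.0
    for c in frame_confidences:
        p_never *= (1.0 - c)  # p fails to hold in this frame
    return 1.0 - p_never

# Example: per-frame confidences that "a truck is in my lane".
scores = [0.05, 0.10, 0.20, 0.85, 0.90]
print(round(satisfaction_probability(scores), 3))  # → 0.99
```

A real implementation would instead compose the per-proposition scores into a probabilistic automaton and hand it to a model checker, which also handles sequencing constraints (e.g., "p before q") that this one-liner cannot express.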

Superior Evaluation Results
NeuS-V’s formal approach achieves over five times higher correlation with human evaluations than traditional visual-quality metrics such as VBench. This indicates NeuS-V’s strength in capturing temporal fidelity, making it particularly effective for scenarios where timing and sequence accuracy are critical.
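Agreement with human judgment is typically measured with a rank correlation such as Spearman's ρ. A self-contained sketch (the scores below are made up for illustration, not NeuS-V's actual results):

```python
# Spearman rank correlation between a metric's scores and human ratings.
# Pure-Python implementation with average ranks for ties.

def rank(values):
    """Return 1-based average ranks of values (ties share their mean rank)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    """Pearson correlation of the rank vectors of x and y."""
    rx, ry = rank(x), rank(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

# Illustrative: metric scores vs. human ratings for five generated videos.
metric_scores = [0.91, 0.40, 0.75, 0.20, 0.66]
human_ratings = [4.5, 2.0, 3.0, 1.5, 4.0]
print(round(spearman(metric_scores, human_ratings), 3))  # → 0.9
```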

Benchmark and Dataset Contributions
To further the development of reliable T2V models, NeuS-V includes a comprehensive benchmark suite with temporally complex prompts designed to rigorously test temporal coherence and event sequencing. These benchmarks, along with the NeuS-V framework and dataset, are publicly available to facilitate continued advancements in the field.

Conclusion and Future Directions
NeuS-V’s integration of neuro-symbolic verification and Temporal Logic addresses a crucial gap in text-to-video generation evaluation. It opens a path toward more reliable generative models, particularly in domains where precise temporal alignment is paramount. Future directions involve refining the NeuS-V framework to penalize unintended elements in generated videos and further exploring prompt optimization and model training methodologies.
Citation
@InProceedings{Choi_2025_CVPR,